What do You Meant by DataWarehouse.

Data Warehouse is a central managed and integrated database containing data from the operational sources in an organization (such as SAP, CRM, ERP system). It may gather manual inputs from users determining criteria and parameters for grouping or classifying records.
A source for the data warehouse is a data extract from operational databases. The data is validated, cleansed, transformed and finally aggregated and it becomes ready to be loaded into the data warehouse.

That database contains structured data for query analysis and can be accessed by users. The data warehouse can be created or updated at any time, with minimum disruption to operational systems. It is ensured by a strategy implemented in a ETL process.
Data warehouse is a dedicated database which contains detailed, stable, non-volatile and consistent data which can be analyzed in the time variant.
Sometimes, where only a portion of detailed data is required, it may be worth considering using a data mart. A data mart is generated from the data warehouse and contains data focused on a given subject and data that is frequently accessed or summarized.

Business Intelligence - Data Warehouse - ETL:

Keeping the data warehouse filled with very detailed and not efficiently selected data may lead to growing the database to a huge size, which may be difficult to manage and unusable. To significantly reduce number of rows in the data warehouse, the data is aggregated which leads to easier data maintenance and efficiency.

Key Data Warehouse systems and the most widely used database engines for storing and serving data for the enterprise business intelligence and performance management:

· Teradata

· Oracle

· SAP BW - Business Information Warehouse (SAP Net weaver BI)

· Microsoft SQL Server

· IBM DB2 (Info sphere Warehouse)

· SAS

Data Warehouse Tools and Utilities Functions

The following are the functions of Data Warehouse tools and Utilities:

Data Extraction - Data Extraction involves gathering the data from multiple heterogeneous sources.

Data Cleaning - Data Cleaning involves finding and correcting the errors in data.

Data Transformation - Data Transformation involves converting data from legacy format to warehouse format.

Data Loading - Data Loading involves sorting, summarizing, consolidating, checking integrity and building indices and partitions.

Refreshing - Refreshing involves updating from data sources to warehouse.

Note: Data Cleaning and Data Transformation are important steps in improving the quality of data and data mining results.

Data warehouse is subject Oriented, Integrated, Time-Variant and nonvolatile collection of data that support of management's decision making process. Let's explore this Definition of data warehouse.

Subject Oriented - The Data warehouse is subject oriented because it provide us the information around a subject rather the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue etc. The data warehouse does not focus on the ongoing operations rather it focuses on modelling and analysis of data for decision making.

Integrated - Data Warehouse is constructed by integration of data from heterogeneous sources such as relational databases, flat files etc. This integration enhance the effective analysis of data.

Time-Variant - The Data in Data Warehouse is identified with a particular time period. The data in data warehouse provide information from historical point of view.

Non Volatile - Non volatile means that the previous data is not removed when new data is added to it. The data warehouse is kept separate from the operational database therefore frequent changes in operational database is not reflected in data warehouse.

Metadata - Metadata is simply defined as data about data. The data that are used to represent other data is known as metadata. For example the index of a book serve as metadata for the contents in the book.In other words we can say that metadata is the summarized data that lead us to the detailed data.

In terms of data warehouse we can define metadata as following:

Metadata is a road map to data warehouse.

Metadata in data warehouse define the warehouse objects.

The metadata act as a directory.This directory helps the decision support system to locate the contents of data warehouse.

DataWarehouse Architecture

The main difference between the database architecture in a standard, on-line transaction processing oriented system (usually ERP or CRM system) and a DataWarehouse is that the system’s relational model is usually de-normalized into dimension and fact tables which are typical to a data warehouse database design.
The differences in the database architectures are caused by different purposes of their existence.

In a typical OLTP system the database performance is crucial, as end-user interface responsiveness is one of the most important factors determining usefulness of the application. That kind of a database needs to handle inserting thousands of new records every hour. To achieve this usually the database is optimized for speed of Inserts, Updates and Deletes and for holding as few records as possible. So from a technical point of view most of the SQL queries issued will be INSERT, UPDATE and DELETE.

Opposite to OLTP systems, a Data Warehouse is a system that should give response to almost any question regarding company performance measure. Usually the information delivered from a data warehouse is used by people who are in charge of making decisions. So the information should be accessible quickly and easily but it doesn't need to be the most recent possible and in the lowest detail level.
Usually the data warehouses are refreshed on a daily basis (very often the ETL processes run overnight) or once a month (data is available for the end users around 5th working day of a new month). Very often the two approaches are combined.

The main challenge of a Data Warehouse architecture is to enable business to access historical, summarized data with a read-only access of the end-users. Again, from a technical standpoint the most SQL queries would start with a SELECT statement.

In Data Warehouse environments, the relational model can be transformed into the following architectures:

1.STAR SCHEMA

2.SNOWFLAKE SCHEMA

3.CONSTELLATION SCHEMA

STAR SCHEMA

Star schema architecture is the simplest data warehouse design. The main feature of a star schema is a table at the center, called the fact table and the dimension tables which allow browsing of specific categories, summarizing, drill-downs and specifying criteria.
Typically, most of the fact tables in a star schema are in database third normal form, while dimensional tables are de-normalized (second normal form).
Despite the fact that the star schema is the simpliest datawarehouse architecture, it is most commonly used in the datawarehouse implementations across the world today (about 90-95% cases).

Fact table

The fact table is not a typical relational database table as it is de-normalized on purpose - to enhance query response times. The fact table typically contains records that are ready to explore, usually with ad hoc queries. Records in the fact table are often referred to as events, due to the time-variant nature of a data warehouse environment.
The primary key for the fact table is a composite of all the columns except numeric values / scores (like QUANTITY, TURNOVER, exact invoice date and time).

Typical fact tables in a global enterprise data warehouse are (usually there may be additional company or business specific fact tables):

· sales fact table - contains all details regarding sales

· orders fact table - in some cases the table can be split into open orders and historical orders. Sometimes the values for historical orders are stored in a sales fact table.

· budget fact table - usually grouped by month and loaded once at the end of a year.

· forecast fact table - usually grouped by month and loaded daily, weekly or monthly.

· inventory fact table - report stocks, usually refreshed daily

Dimension table

Nearly all of the information in a typical fact table is also present in one or more dimension tables. The main purpose of maintaining Dimension Tables is to allow browsing the categories quickly and easily.
The primary keys of each of the dimension tables are linked together to form the composite primary key of the fact table. In a star schema design, there is only one de-normalized table for a given dimension.

Typical dimension tables in a data warehouse are:

· time dimension table

· customers dimension table

· products dimension table

· key account managers (KAM) dimension table

· sales office dimension table

Star schema DW architecture:

SNOWFLAKE SCHEMA

Snowflake schema architecture is a more complex variation of a star schema design. The main difference is that dimensional tables in a snowflake schema are normalized, so they have a typical relational database design. Snowflake schemas are generally used when a dimensional table becomes very big and when a star schema can’t represent the complexity of a data structure. For example if a PRODUCT dimension table contains millions of rows, the use of snowflake schemas should significantly improve performance by moving out some data to other table (with BRANDS for instance). The problem is that the more normalized the dimension table is, the more complicated SQL joins must be issued to query them. This is because in order for a query to be answered, many tables need to be joined and aggregates generated. An example of a snowflake schema architecture is depicted below.

Snowflake schema DW architecture:

Fact constellation schema

For each star schema or snowflake schema it is possible to construct a fact constellation schema. This schema is more complex than star or snowflake architecture, which is because it contains multiple fact tables. This allows dimension tables to be shared amongst many fact tables.

That solution is very flexible, however it may be hard to manage and support.
The main disadvantage of the fact constellation schema is a more complicated design because many variants of aggregation must be considered.
In a fact constellation schema, different fact tables are explicitly assigned to the dimensions, which are for given facts relevant. This may be useful in cases when some facts are associated with a given dimension level and other facts with a deeper dimension level.
Use of that model should be reasonable when for example, there is a sales fact table (with details down to the exact date and invoice header id) and a fact table with sales forecast which is calculated based on month, client id and product id.
In that case using two different fact tables on a different level of grouping is realized through a fact constellation model.
An example of a constellation schema architecture is depicted below.

Fact constellation schema DW architecture:

Data mart

Data mart contains the subset of organisation-wide data. This subset of data is valuable to specific group of an organisation. in other words we can say that data mart contains only that data which is specific to a particular group. For example the marketing data mart may contain only data related to item, customers and sales. The data mart are confined to subjects.

Points to remember about data marts:

window based or Unix/Linux based servers are used to implement data marts. They are implemented on low cost server.

The implementation cycle of data mart is measured in short period of time i.e. in weeks rather than months or years.

The life cycle of a data mart may be complex in long run if it's planning and design are not organisation-wide.

Data mart are small in size.

Data mart are customized by department.

The source of data mart is departmentally structured data warehouse.

Data mart are flexible.

Graphical Representation of data mart.

Online Analytical Processing Server (OLAP) is based on multidimensional data model. It allows the managers , analysts to get insight the information through fast, consistent, interactive access to information. In this chapter we will discuss about types of OLAP, operations on OLAP, Difference between OLAP and Statistical Databases and OLTP.

Types of OLAP Servers

We have four types of OLAP servers that are listed below.

         Relational OLAP(ROLAP)

         Multidimensional OLAP (MOLAP)

         Hybrid OLAP (HOLAP)

         Specialized SQL Servers

Relational OLAP(ROLAP)

The Relational OLAP servers are placed between relational back-end server and client front-end tools. To store and manage warehouse data the Relational OLAP use relational or extended-relational DBMS.

ROLAP includes the following.

                  implementation of aggregation navigation logic.

                  optimization for each DBMS back end.

                  additional tools and services.

Multidimensional OLAP (MOLAP)

Multidimensional OLAP (MOLAP) uses the array-based multidimensional storage engines for multidimensional views of data.With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore many MOLAP Server uses the two level of data storage representation to handle dense and sparse data sets.

Hybrid OLAP (HOLAP)

The hybrid OLAP technique combination of ROLAP and MOLAP both. It has both the higher scalability of ROLAP and faster computation of MOLAP. HOLAP server allows to store the large data volumes of detail data. the aggregations are stored separated in MOLAP store

OLAP vs OLTP

SN

Data Warehouse (OLAP)

Operational Database(OLTP)

1

This involves historical processing of information.

This involves day to day processing.

2

OLAP systems are used by knowledge workers such as executive, manager and analyst.

OLTP system are used by clerk, DBA, or database professionals.

3

This is used to analysis the business.

This is used to run the business.

4

It focuses on Information out.

It focuses on Data in.

5

This is based on Star Schema, Snowflake Schema and Fact Constellation Schema.

This is based on Entity Relationship Model.

6

It focuses on Information out.

This is application oriented.

7

This contains historical data.

This contains current data.

8

This provides summarized and consolidated data.

This provide primitive and highly detailed data.

9

This provide summarized and multidimensional view of data.

This provides detailed and flat relational view of data.

10

The number or users are in Hundreds.

The number of users are in thousands.

11

The number of records accessed are in millions.

The number of records accessed are in tens.

12

The database size is from 100GB to TB

The database size is from 100 MB to GB.

13

This are highly flexible.

This provide high performance.

SCD - Slowly changing dimensions

Slowly changing dimensions (SCD) determine how the historical changes in the dimension tables are handled. Implementing the SCD mechanism enables users to know to which category an item belonged to in any given date

Types of Slowly Changing Dimensions in the Data Warehouse architectures:

Type 0 SCD is not used frequently, as it is classified as when no effort has been made to deal with the changing dimensions issues. So, some dimension data may be overwritten and other may stay unchanged over the time and it can result in confusing end-users.

Type 1 SCD DW architecture applies when no history is kept in the database. The new, changed data simply overwrites old entries. This approach is used quite often with data which change over the time and it is caused by correcting data quality errors (misspells, data consolidations, trimming spaces, language specific characters).
Type 1 SCD is easy to maintain and used mainly when losing the ability to track the old history is not an issue.

In the Type 2 SCD model the whole history is stored in the database. An additional dimension record is created and the segmenting between the old record values and the new (current) value is easy to extract and the history is clear. The fields 'effective date' and 'current indicator' are very often used in that dimension.

Type 3 SCD - only the information about a previous value of a dimension is written into the database. An 'old 'or 'previous' column is created which stores the immediate previous attribute. In Type 3 SCD users are able to describe history immediately and can report both forward and backward from the change.
However, that model can't track all historical changes, such as when a dimension changes twice or more. It would require creating next columns to store historical data and could make the whole data warehouse schema very complex.

Type 4 SCD idea is to store all historical changes in a separate historical data table for each of the dimensions.

DIMENSION TABLE

A dimension table is a table in a star schema of a data warehouse. A dimension table stores attributes, or dimensions, that describe the objects in a fact table.

A data warehouse organizes descriptive attributes as columns in dimension tables. For example, a customer dimension’s attributes could include first and last name, birth date, gender, etc., or a website dimension would include site name and URL attributes.

A dimension table consists of the attributes about the facts. Dimensions store the textual descriptions of the business. With out the dimensions, we cannot measure the facts. The different types of dimension tables are explained in detail below.
Conformed Dimension:
Conformed dimensions mean the exact same thing with every possible fact table to which they are joined.

Eg: The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.

Junk Dimension:
A junk dimension is a collection of random transactional codes flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.

Eg: Assume that we have a gender dimension and marital status dimension. In the fact table we need to maintain two keys referring to these dimensions. Instead of that create a junk dimension which has all the combinations of gender and marital status (cross join gender and marital status table and create a junk table). Now we can maintain only one key in the fact table.

Degenerated Dimension:

A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table.

Eg: A transactional code in a fact table.

Role-playing dimension:

Dimensions which are often used for multiple purposes within the same database are called role-playing dimensions. For example, a date dimension can be used for “date of sale", as well as "date of delivery", or "date of hire".

FACT TABLE

A fact table is the one which consists of the measurements, metrics or facts of business process. These measurable facts are used to know the business value and to forecast the future business. The different types of facts are explained in detail below.

Additive:

Additive facts are facts that can be summed up through all of the dimensions in the fact table. A sales fact is a good example for additive fact.

Semi-Additive:

Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Eg: Daily balances fact can be summed up through the customers dimension but not through the time dimension.

Non-Additive:

Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Eg: Facts which have percentages, ratios calculated.

Factless Fact Table:

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called "Factless Fact tables".

Eg: A fact table which has only product key and date key is a factless fact. There are no measures in this table. But still you can get the number products sold over a period of time.

A fact tables that contain aggregated facts are often called summary tables

Based on the above classifications, fact tables are categorized into two:

Cumulative:

This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.

Snapshot:

This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Learn Informatica

Wednesday, June 18, 2014

DATA WAREHOUSE CONCEPT

Data Warehouse Tools and Utilities Functions

STAR SCHEMA

Fact table

Dimension table

Data mart

Points to remember about data marts:

window based or Unix/Linux based servers are used to implement data marts. They are implemented on low cost server.

The implementation cycle of data mart is measured in short period of time i.e. in weeks rather than months or years.

The life cycle of a data mart may be complex in long run if it's planning and design are not organisation-wide.

Types of OLAP Servers

We have four types of OLAP servers that are listed below.

         Relational OLAP(ROLAP)

         Multidimensional OLAP (MOLAP)

         Hybrid OLAP (HOLAP)

         Specialized SQL Servers

Relational OLAP(ROLAP)

Multidimensional OLAP (MOLAP)

Hybrid OLAP (HOLAP)

The hybrid OLAP technique combination of ROLAP and MOLAP both. It has both the higher scalability of ROLAP and faster computation of MOLAP. HOLAP server allows to store the large data volumes of detail data. the aggregations are stored separated in MOLAP store

OLAP vs OLTP

SCD - Slowly changing dimensions

DIMENSION TABLE

1 comment:

Text Widget

Labels

Learn Informatica

Wednesday, June 18, 2014

DATA WAREHOUSE CONCEPT

Data Warehouse Tools and Utilities Functions

STAR SCHEMA

Fact table

Dimension table

Data mart

Points to remember about data marts:

window based or Unix/Linux based servers are used to implement data marts. They are implemented on low cost server.

The implementation cycle of data mart is measured in short period of time i.e. in weeks rather than months or years.

The life cycle of a data mart may be complex in long run if it's planning and design are not organisation-wide.

Types of OLAP Servers

We have four types of OLAP servers that are listed below. Relational OLAP(ROLAP) Multidimensional OLAP (MOLAP) Hybrid OLAP (HOLAP) Specialized SQL Servers

Relational OLAP(ROLAP)

Multidimensional OLAP (MOLAP)

Hybrid OLAP (HOLAP)

The hybrid OLAP technique combination of ROLAP and MOLAP both. It has both the higher scalability of ROLAP and faster computation of MOLAP. HOLAP server allows to store the large data volumes of detail data. the aggregations are stored separated in MOLAP store

OLAP vs OLTP

SCD - Slowly changing dimensions

DIMENSION TABLE

1 comment:

Text Widget

Labels

We have four types of OLAP servers that are listed below.

Relational OLAP(ROLAP)

Multidimensional OLAP (MOLAP)

Hybrid OLAP (HOLAP)

Specialized SQL Servers