What do You Meant by DataWarehouse.
Data
Warehouse is a central managed and integrated database containing data from the
operational sources in an organization (such as SAP, CRM, ERP system). It may
gather manual inputs from users determining criteria and parameters for
grouping or classifying records.
A source for the data warehouse is a data extract from operational databases. The data is validated, cleansed, transformed and finally aggregated and it becomes ready to be loaded into the data warehouse.
A source for the data warehouse is a data extract from operational databases. The data is validated, cleansed, transformed and finally aggregated and it becomes ready to be loaded into the data warehouse.
That database contains structured data for query analysis and can be accessed by users. The data warehouse can be created or updated at any time, with minimum disruption to operational systems. It is ensured by a strategy implemented in a ETL process.
Data warehouse is a dedicated database which contains detailed, stable, non-volatile and consistent data which can be analyzed in the time variant.
Sometimes, where only a portion of detailed data is required, it may be worth considering using a data mart. A data mart is generated from the data warehouse and contains data focused on a given subject and data that is frequently accessed or summarized.
Business
Intelligence - Data Warehouse - ETL:
Keeping the data warehouse filled
with very detailed and not efficiently selected data may lead to growing the
database to a huge size, which may be difficult to manage and unusable. To
significantly reduce number of rows in the data warehouse, the data is
aggregated which leads to easier data maintenance and efficiency.
Key Data Warehouse systems and the most widely used database engines for storing and serving data for the enterprise business intelligence and performance management:
Key Data Warehouse systems and the most widely used database engines for storing and serving data for the enterprise business intelligence and performance management:
·
Teradata
·
Oracle
·
SAP BW - Business Information Warehouse (SAP Net weaver BI)
·
Microsoft SQL Server
·
IBM DB2 (Info sphere Warehouse)
·
SAS
Data warehouse is subject Oriented, Integrated, Time-Variant and nonvolatile collection of data that support of management's decision making process. Let's explore this Definition of data warehouse.
Data Warehouse Tools and Utilities Functions
The following are the functions
of Data Warehouse tools and Utilities:
- Data Extraction - Data Extraction involves gathering the data from multiple heterogeneous sources.
- Data Cleaning - Data Cleaning involves finding and correcting the errors in data.
- Data Transformation - Data Transformation involves converting data from legacy format to warehouse format.
- Data Loading - Data Loading involves sorting, summarizing, consolidating, checking integrity and building indices and partitions.
- Refreshing - Refreshing involves updating from data sources to warehouse.
Note: Data Cleaning and Data Transformation are important steps in improving the
quality of data and data mining results.
Data warehouse is subject Oriented, Integrated, Time-Variant and nonvolatile collection of data that support of management's decision making process. Let's explore this Definition of data warehouse.
- Subject Oriented - The Data warehouse is subject oriented because it provide us the information around a subject rather the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue etc. The data warehouse does not focus on the ongoing operations rather it focuses on modelling and analysis of data for decision making.
- Integrated - Data Warehouse is constructed by integration of data from heterogeneous sources such as relational databases, flat files etc. This integration enhance the effective analysis of data.
- Time-Variant - The Data in Data Warehouse is identified with a particular time period. The data in data warehouse provide information from historical point of view.
- Non Volatile - Non volatile means that the previous data is not removed when new data is added to it. The data warehouse is kept separate from the operational database therefore frequent changes in operational database is not reflected in data warehouse.
- Metadata - Metadata is simply defined as data about data. The data that are used to represent other data is known as metadata. For example the index of a book serve as metadata for the contents in the book.In other words we can say that metadata is the summarized data that lead us to the detailed data.
In terms of data warehouse we can
define metadata as following:
- Metadata is a road map to data warehouse.
- Metadata in data warehouse define the warehouse objects.
- The metadata act as a directory.This directory helps the decision support system to locate the contents of data warehouse.
DataWarehouse Architecture
The main difference between the database
architecture in a standard, on-line transaction processing oriented system
(usually ERP or CRM system) and a DataWarehouse is that the system’s relational
model is usually de-normalized into dimension and fact tables which are typical
to a data warehouse database design.
The differences in the database architectures are caused by different purposes of their existence.
In a typical OLTP system the database performance is crucial, as end-user interface responsiveness is one of the most important factors determining usefulness of the application. That kind of a database needs to handle inserting thousands of new records every hour. To achieve this usually the database is optimized for speed of Inserts, Updates and Deletes and for holding as few records as possible. So from a technical point of view most of the SQL queries issued will be INSERT, UPDATE and DELETE.
Opposite to OLTP systems, a Data Warehouse is a system that should give response to almost any question regarding company performance measure. Usually the information delivered from a data warehouse is used by people who are in charge of making decisions. So the information should be accessible quickly and easily but it doesn't need to be the most recent possible and in the lowest detail level.
Usually the data warehouses are refreshed on a daily basis (very often the ETL processes run overnight) or once a month (data is available for the end users around 5th working day of a new month). Very often the two approaches are combined.
The main challenge of a Data Warehouse architecture is to enable business to access historical, summarized data with a read-only access of the end-users. Again, from a technical standpoint the most SQL queries would start with a SELECT statement.
The differences in the database architectures are caused by different purposes of their existence.
In a typical OLTP system the database performance is crucial, as end-user interface responsiveness is one of the most important factors determining usefulness of the application. That kind of a database needs to handle inserting thousands of new records every hour. To achieve this usually the database is optimized for speed of Inserts, Updates and Deletes and for holding as few records as possible. So from a technical point of view most of the SQL queries issued will be INSERT, UPDATE and DELETE.
Opposite to OLTP systems, a Data Warehouse is a system that should give response to almost any question regarding company performance measure. Usually the information delivered from a data warehouse is used by people who are in charge of making decisions. So the information should be accessible quickly and easily but it doesn't need to be the most recent possible and in the lowest detail level.
Usually the data warehouses are refreshed on a daily basis (very often the ETL processes run overnight) or once a month (data is available for the end users around 5th working day of a new month). Very often the two approaches are combined.
The main challenge of a Data Warehouse architecture is to enable business to access historical, summarized data with a read-only access of the end-users. Again, from a technical standpoint the most SQL queries would start with a SELECT statement.
In Data Warehouse environments, the relational
model can be transformed into the following architectures:
1.STAR SCHEMA
2.SNOWFLAKE SCHEMA
3.CONSTELLATION SCHEMA
STAR SCHEMA
Star schema architecture is the
simplest data warehouse design. The main feature of a star schema is a table at
the center, called the fact table and the dimension tables which
allow browsing of specific categories, summarizing, drill-downs and specifying
criteria.
Typically, most of the fact tables in a star schema are in database third normal form, while dimensional tables are de-normalized (second normal form).
Despite the fact that the star schema is the simpliest datawarehouse architecture, it is most commonly used in the datawarehouse implementations across the world today (about 90-95% cases).
Typically, most of the fact tables in a star schema are in database third normal form, while dimensional tables are de-normalized (second normal form).
Despite the fact that the star schema is the simpliest datawarehouse architecture, it is most commonly used in the datawarehouse implementations across the world today (about 90-95% cases).
Fact table
The primary key for the fact table is a composite of all the columns except numeric values / scores (like QUANTITY, TURNOVER, exact invoice date and time).
Typical fact tables in a global enterprise data warehouse are (usually there may be additional company or business specific fact tables):
·
sales fact table - contains all details regarding sales
·
orders fact table - in some cases the table can be split into
open orders and historical orders. Sometimes the values for historical orders
are stored in a sales fact table.
·
budget fact table - usually grouped by month and loaded once at
the end of a year.
·
forecast fact table - usually grouped by month and loaded daily,
weekly or monthly.
·
inventory fact table - report stocks, usually refreshed daily
Dimension table
Nearly all of the information in a typical fact table is also present in one or
more dimension tables. The main purpose of maintaining Dimension Tables is to
allow browsing the categories quickly and easily.
The primary keys of each of the dimension tables are linked together to form the composite primary key of the fact table. In a star schema design, there is only one de-normalized table for a given dimension.
Typical dimension tables in a data warehouse are:
The primary keys of each of the dimension tables are linked together to form the composite primary key of the fact table. In a star schema design, there is only one de-normalized table for a given dimension.
Typical dimension tables in a data warehouse are:
·
time dimension table
·
customers dimension table
·
products dimension table
·
key account managers (KAM) dimension table
·
sales office dimension table
Star schema DW architecture:
SNOWFLAKE SCHEMA
Snowflake schema architecture is a more complex
variation of a star schema design.
The main difference is that dimensional tables in a snowflake schema are
normalized, so they have a typical relational database design.
Snowflake schemas are generally used when a dimensional table becomes
very big and when a star schema can’t represent the complexity of a data
structure.
For example if a PRODUCT dimension table contains millions of rows, the use of
snowflake schemas should significantly improve performance by moving out some
data to other table (with BRANDS for instance).
The problem is that the more normalized the dimension table is, the more
complicated SQL joins must be issued to query them. This is because in order
for a query to be answered, many tables need to be joined and aggregates generated.
An example of a snowflake schema architecture is depicted below.
Snowflake schema DW architecture:
Fact constellation schema
For each star schema or snowflake schema it is
possible to construct a fact constellation schema.
This schema is more complex than star or snowflake architecture, which
is because it contains multiple fact tables. This allows dimension tables to be
shared amongst many fact tables.
That
solution is very flexible, however it may be hard to manage and support.
The main disadvantage of the fact constellation schema is a more complicated design because many variants of aggregation must be considered.
In a fact constellation schema, different fact tables are explicitly assigned to the dimensions, which are for given facts relevant. This may be useful in cases when some facts are associated with a given dimension level and other facts with a deeper dimension level.
Use of that model should be reasonable when for example, there is a sales fact table (with details down to the exact date and invoice header id) and a fact table with sales forecast which is calculated based on month, client id and product id.
In that case using two different fact tables on a different level of grouping is realized through a fact constellation model.
An example of a constellation schema architecture is depicted below.
The main disadvantage of the fact constellation schema is a more complicated design because many variants of aggregation must be considered.
In a fact constellation schema, different fact tables are explicitly assigned to the dimensions, which are for given facts relevant. This may be useful in cases when some facts are associated with a given dimension level and other facts with a deeper dimension level.
Use of that model should be reasonable when for example, there is a sales fact table (with details down to the exact date and invoice header id) and a fact table with sales forecast which is calculated based on month, client id and product id.
In that case using two different fact tables on a different level of grouping is realized through a fact constellation model.
An example of a constellation schema architecture is depicted below.
Fact
constellation schema DW architecture:
Data mart
Data mart contains the subset of
organisation-wide data. This subset of data is valuable to specific group of an
organisation. in other words we can say that data mart contains only that data
which is specific to a particular group. For example the marketing data mart
may contain only data related to item, customers and sales. The data mart are
confined to subjects.
Points to remember about data marts:
window based or Unix/Linux based servers are used to implement data marts. They are implemented on low cost server.
The implementation cycle of data mart is measured in short period of time i.e. in weeks rather than months or years.
The life cycle of a data mart may be complex in long run if it's planning and design are not organisation-wide.
Data
mart are small in size.
Data mart are customized by
department.
The source of data mart is
departmentally structured data warehouse.
Data mart are flexible.
Graphical Representation of data
mart.
Online Analytical Processing
Server (OLAP) is based on multidimensional data model. It allows the managers ,
analysts to get insight the information through fast, consistent, interactive
access to information. In this chapter we will discuss about types of OLAP,
operations on OLAP, Difference between OLAP and Statistical Databases and OLTP.
Types of OLAP Servers
We have four types of OLAP
servers that are listed below.
Relational OLAP(ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Specialized SQL Servers
Relational OLAP(ROLAP)
The Relational OLAP servers are
placed between relational back-end server and client front-end tools. To store
and manage warehouse data the Relational OLAP use relational or
extended-relational DBMS.
ROLAP includes the following.
implementation of aggregation
navigation logic.
optimization for each DBMS back
end.
additional tools and services.
Multidimensional OLAP (MOLAP)
Multidimensional OLAP (MOLAP)
uses the array-based multidimensional storage engines for multidimensional
views of data.With multidimensional data stores, the storage utilization may be
low if the data set is sparse. Therefore many MOLAP Server uses the two level
of data storage representation to handle dense and sparse data sets.
Hybrid OLAP (HOLAP)
The hybrid OLAP technique
combination of ROLAP and MOLAP both. It has both the higher scalability of
ROLAP and faster computation of MOLAP. HOLAP server allows to store the large
data volumes of detail data. the aggregations are stored separated in MOLAP
store
OLAP vs OLTP
SN
|
Data Warehouse (OLAP)
|
Operational Database(OLTP)
|
1
|
This involves historical processing of
information.
|
This involves day to day processing.
|
2
|
OLAP systems are used by knowledge
workers such as executive, manager and analyst.
|
OLTP system are used by clerk, DBA, or
database professionals.
|
3
|
This is used to analysis the business.
|
This is used to run the business.
|
4
|
It focuses on Information out.
|
It focuses on Data in.
|
5
|
This is based on Star Schema, Snowflake
Schema and Fact Constellation Schema.
|
This is based on Entity Relationship
Model.
|
6
|
It focuses on Information out.
|
This is application oriented.
|
7
|
This contains historical data.
|
This contains current data.
|
8
|
This provides summarized and
consolidated data.
|
This provide primitive and highly
detailed data.
|
9
|
This provide summarized and
multidimensional view of data.
|
This provides detailed and flat
relational view of data.
|
10
|
The number or users are in Hundreds.
|
The number of users are in thousands.
|
11
|
The number of records accessed are in
millions.
|
The number of records accessed are in
tens.
|
12
|
The database size is from 100GB to TB
|
The database size is from 100 MB to GB.
|
13
|
This are highly flexible.
|
This provide high performance.
|
SCD - Slowly changing dimensions
Slowly changing dimensions (SCD) determine how the historical changes in the dimension tables are handled. Implementing the SCD mechanism enables users to know to which category an item belonged to in any given date
Types of Slowly Changing Dimensions in the Data Warehouse architectures:
- Type 0 SCD is not used frequently, as it is classified as when no effort has been made to deal with the changing dimensions issues. So, some dimension data may be overwritten and other may stay unchanged over the time and it can result in confusing end-users.
- Type 1 SCD DW architecture applies when no history is kept in the database. The new, changed data simply overwrites old entries. This approach is used quite often with data which change over the time and it is caused by correcting data quality errors (misspells, data consolidations, trimming spaces, language specific characters).
Type 1 SCD is easy to maintain and used mainly when losing the ability to track the old history is not an issue. - In the Type 2 SCD model the whole history is stored in the database. An additional dimension record is created and the segmenting between the old record values and the new (current) value is easy to extract and the history is clear. The fields 'effective date' and 'current indicator' are very often used in that dimension.
- Type 3 SCD - only the information about a previous value of a dimension is written into the database. An 'old 'or 'previous' column is created which stores the immediate previous attribute. In Type 3 SCD users are able to describe history immediately and can report both forward and backward from the change.
However, that model can't track all historical changes, such as when a dimension changes twice or more. It would require creating next columns to store historical data and could make the whole data warehouse schema very complex. - Type 4 SCD idea is to store all historical changes in a separate historical data table for each of the dimensions.
DIMENSION TABLE
A
dimension table is a table in a star schema of a data warehouse. A dimension
table stores attributes, or dimensions, that describe the objects in a fact
table.
A
data warehouse organizes descriptive attributes as columns in dimension
tables. For example, a customer dimension’s attributes could include
first and last name, birth date, gender, etc., or a website dimension would
include site name and URL attributes.
A dimension table consists of the attributes about the facts. Dimensions store
the textual descriptions of the business. With out the dimensions, we cannot
measure the facts. The different types of dimension tables are explained in
detail below.
Conformed Dimension:
Conformed dimensions mean the exact same thing with every possible fact table to which they are joined.
Eg: The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.
Junk Dimension:
A junk dimension is a collection of random transactional codes flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.
Eg: Assume that we have a gender dimension and marital status dimension. In the fact table we need to maintain two keys referring to these dimensions. Instead of that create a junk dimension which has all the combinations of gender and marital status (cross join gender and marital status table and create a junk table). Now we can maintain only one key in the fact table.
Degenerated Dimension:
A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table.
Eg: A transactional code in a fact table.
Role-playing dimension:
Dimensions which are often used for multiple purposes within the same database are called role-playing dimensions. For example, a date dimension can be used for “date of sale", as well as "date of delivery", or "date of hire".
Conformed Dimension:
Conformed dimensions mean the exact same thing with every possible fact table to which they are joined.
Eg: The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.
Junk Dimension:
A junk dimension is a collection of random transactional codes flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes.
Eg: Assume that we have a gender dimension and marital status dimension. In the fact table we need to maintain two keys referring to these dimensions. Instead of that create a junk dimension which has all the combinations of gender and marital status (cross join gender and marital status table and create a junk table). Now we can maintain only one key in the fact table.
Degenerated Dimension:
A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table.
Eg: A transactional code in a fact table.
Role-playing dimension:
Dimensions which are often used for multiple purposes within the same database are called role-playing dimensions. For example, a date dimension can be used for “date of sale", as well as "date of delivery", or "date of hire".
FACT TABLE
A fact table is the one which
consists of the measurements, metrics or facts of business process. These
measurable facts are used to know the business value and to forecast the future
business. The different types of facts are explained in detail below.
Additive facts are facts that can be summed up through all of the dimensions in the fact table. A sales fact is a good example for additive fact.
Semi-Additive:
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Eg: Daily balances fact can be summed up through the customers dimension but not through the time dimension.
Non-Additive:
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Eg: Facts which have percentages, ratios calculated.
Factless Fact Table:
In the real world, it is
possible to have a fact table that contains no measures or facts. These tables
are called "Factless Fact tables".
Eg: A fact table which has only product key and date key is a factless fact. There are no measures in this table. But still you can get the number products sold over a period of time.
A fact tables that contain aggregated facts are
often called summary tables
Cumulative:
This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.
Snapshot:
This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.
Super! very easy to understand thanks Informatica Online Course
ReplyDelete