Data Warehouse Concepts Presentation


Data Warehousing Concepts

• What is data warehousing?
• Why do we need data warehousing?
• What is a data mart?
• Data warehouse vs. data mart
• Data warehousing architecture
• Data modeling
Data Warehousing
What is Data Warehousing?

An enterprise-wide collection of historical data that helps management make the right decision at the right time
or
A process of transforming data into information and making it available to users in a timely enough manner for organizational decision making.
Data Warehousing

Operational Data Store


Data Warehousing

Organization Hierarchy and Decision Making

Top Management

Middle Management

Low-Level Management

• Decision making: killer instinct / quick in action
• “Speed of thought”
Data Warehousing

• High query performance
• Queries not visible outside the warehouse
• Local processing at sources unaffected
• Can operate when sources are unavailable
• Extra information at the warehouse
  ◦ Modify, summarize (store aggregates)
  ◦ Add historical information
Definition
• Business oriented / subject oriented
• Integrated
• Time variant
• Non-volatile
• Aimed at executives and decision makers
• Often a copy of operational data
• Business oriented / subject oriented
• Data warehouses are designed to help you analyze data.
• For example, to learn more about your company's sales data, you can build a data warehouse that concentrates on sales.
• This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
• Example: using this data warehouse, you can answer questions such as
  ◦ "Who was our best customer for this item last year?"
  or
  ◦ "Who is likely to be our best customer next year?"
• Integrated
• Data warehouses must put data from heterogeneous sources into a consistent format.
• They must resolve such problems as naming conflicts and inconsistencies among units of measure.
• When they achieve this, they are said to be integrated.
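
As a rough illustration of integration, the sketch below maps rows from two hypothetical source systems into one consistent format. The field names (`cust_id`, `customer_number`, `weight_lb`, `weight_kg`) and the sample rows are assumptions made up for this example, not part of the original slides.

```python
# Hypothetical source rows: two systems describe the same measure
# with different field names and different units of measure.
CRM_ROWS = [{"cust_id": 1, "weight_lb": 10.0}]
ERP_ROWS = [{"customer_number": 2, "weight_kg": 3.0}]

LB_TO_KG = 0.45359237  # definition of the international pound in kilograms


def from_crm(row):
    """Resolve the naming conflict and convert pounds to kilograms."""
    return {
        "customer_id": row["cust_id"],
        "weight_kg": round(row["weight_lb"] * LB_TO_KG, 3),
    }


def from_erp(row):
    """Resolve the naming conflict; the unit is already kilograms."""
    return {
        "customer_id": row["customer_number"],
        "weight_kg": round(row["weight_kg"], 3),
    }


# Every integrated row now has the same keys and the same unit.
integrated = [from_crm(r) for r in CRM_ROWS] + [from_erp(r) for r in ERP_ROWS]
```

One mapping function per source keeps each system's quirks in one place, so adding a new source does not touch the existing mappings.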


• Non-volatile
• Non-volatile means that, once entered into the data warehouse, data should not change.
• This is logical because the purpose of a data warehouse is to enable you to analyze what has occurred over time.
• Time variant
• A data warehouse's focus on change over time is what is meant by the term time variant.
• In order to discover trends and identify hidden patterns and relationships in business, analysts need large amounts of data.
• This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive.
Data Warehousing

• Aimed at executives and decision makers

Data Warehousing

• A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications.
• An independent data mart is created directly from source systems.
• A dependent data mart is populated from a data warehouse.
Data Warehousing

[Figure: Data Warehouse, Metadata Repository and Data Mart. Elements shown: Enterprise Data Warehouse, Metadata Repository, and the Finance, Marketing, Sales, and Production data marts.]
Data Warehousing

• A data mart is a subset of a data warehouse.
• A mart is subject centric; a warehouse is business centric.
• A data mart is a physical or logical part of a data warehouse.
Data Warehousing

6. Data Warehouse Architecture


Data Warehousing

• Online Transaction Processing (OLTP)
• Online Analytical Processing (OLAP)
Data Warehousing

OLTP:
• Operational system
• Highly volatile data
• Generally normalized to third normal form (3NF)
• Direct user interaction system
• Very little redundancy (almost none)
Data Warehousing

OLAP:
• Decision support system
• Non-volatile
• Integrated
• Denormalized to second or first normal form
• Deals with a larger number of records
• Fast retrieval of records
• Redundant data
Data Warehousing

Metadata: “data about data”.

• Any data maintained to support the operations or use of a data warehouse, similar to an encyclopedia for the data warehouse.
• Examples:
  ◦ Data extraction/transformation history
  ◦ Column aliases
  ◦ Data warehouse table sizes
  ◦ Data summarization/modeling algorithms
  ◦ Data usage statistics
Data Warehousing

[Figure: Source Systems -> ETL -> DQ -> Data Warehouse -> BI Tools (Query Tools, OLAP Tools, Data Mining, Data Visualization)]
Data Warehousing

[Figure: layered data warehouse architecture]

• Source Systems
  ◦ Execution systems: CRM, ERP, Legacy, e-Commerce
  ◦ External data: Purchased Market Data, Spreadsheets
• ETL Layer: Extract, Transformation, and Load (ETL) Layer
  ◦ Cleanse Data
  ◦ Standardize Values
  ◦ Apply Business Rules
  ◦ Merge Records
• Data and Metadata Repository Layer
  ◦ ODS
  ◦ Enterprise Data Warehouse
  ◦ Data Marts
  ◦ Metadata Repository
• Presentation Layer
  ◦ Reporting Tools
  ◦ OLAP Tools
  ◦ Ad Hoc Query Tools
  ◦ Data Mining Tools
Data Warehousing

[Figure: example products for each architecture layer]

• Source Systems
  ◦ Oracle, DB2, Excel, Flat Files, Informix, dbase, SQL Server, Access
• ETL Layer
  ◦ Oracle Warehouse Builder
  ◦ Informatica
  ◦ IBM Visual Warehouse
  ◦ Microsoft DTS (SSIS)
  ◦ SAS Warehouse Administrator
• Data and Metadata Repository Layer
  ◦ IBM DB2
  ◦ Informix Dynamic Server
  ◦ Microsoft SQL Server
  ◦ Oracle 9i
  ◦ Teradata DBS
• Presentation Layer
  ◦ Reporting Tools: Business Objects, Crystal Reports, Cognos
  ◦ OLAP Tools: Cognos, Business Objects, Microstrategy
  ◦ Data Mining Tools: SAS Enterprise Miner
Data Warehousing

• ETL Process
  ◦ The extraction of data from many heterogeneous systems
  ◦ The transformation of this extracted data into structures and types that follow the business rules of the data warehouse
  ◦ The loading of this transformed (cleansed) data into the data warehouse structures in preparation for data analysis
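
The three ETL steps above can be sketched in a few lines. This is only a minimal illustration using Python's built-in `sqlite3` module as a stand-in warehouse; the source rows, the `sales` table, and the cleaning rules are hypothetical and not taken from the slides.

```python
import sqlite3


def extract():
    """Extract: hypothetical rows as they might arrive from two source systems."""
    return [
        {"source": "crm", "customer": " alice ", "amount": "120.50"},
        {"source": "erp", "customer": "BOB", "amount": "80"},
    ]


def transform(rows):
    """Transform: trim and title-case names, cast amounts to numbers."""
    return [(r["customer"].strip().title(), float(r["amount"])) for r in rows]


def load(conn, rows):
    """Load: write the cleansed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database standing in for the warehouse
load(conn, transform(extract()))
loaded = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
```

Keeping the three steps as separate functions mirrors the E, T, and L stages and makes each one testable on its own.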
Data Warehousing

• ETL Process
  ◦ The ETL design process is perhaps the most time-consuming stage of a data warehouse project.
  ◦ It is often the case that over 50% of the time dedicated to a data warehousing project is spent on designing and developing the ETL processes.
  ◦ Your ETL processes will determine the quality of the data that ends up in your data warehouse.
Data Warehousing

• ETL Process Sequence
  ◦ Data Extraction
  ◦ Verification
  ◦ Data Cleansing
  ◦ Integration
  ◦ Aggregation
  ◦ Loading
Data Warehousing

• Data Extraction

This is the process that captures the data from the source systems and moves it to the staging database.

The extraction can be
  ◦ Full refresh, where all the specified data from the source system is copied
  ◦ Incremental update, which copies only the specified data that has changed since the last time the ETL process was run
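
The difference between the two extraction modes can be sketched as follows. The sketch assumes each source row carries an `updated_at` timestamp and that the ETL job remembers when it last ran; both are assumptions for illustration, since the slides do not say how changes are detected.

```python
from datetime import datetime

# Hypothetical source table with a last-modified timestamp per row.
SOURCE = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 9)},
]


def full_refresh(source):
    """Copy all specified data from the source system."""
    return list(source)


def incremental_update(source, last_run):
    """Copy only the rows that changed since the last ETL run."""
    return [row for row in source if row["updated_at"] > last_run]


last_run = datetime(2024, 1, 4)  # assumed time of the previous ETL run
changed = incremental_update(SOURCE, last_run)
```

With these sample rows, the full refresh copies three rows while the incremental update copies only the two modified after `last_run`.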
Data Warehousing

• Data Cleaning
  ◦ Find and replace: for instance, to synchronize a building name where the same building is referred to under different abbreviations, e.g. 'London Health Centre', 'London HC', 'London Hth Cen'
  ◦ Convert case: for example, on the 'title' column of a 'customers' table, converting all instances of 'MRS', 'mrs' and 'MRs' to 'Mrs'
  ◦ Merging data from different data sources
  ◦ NULL value handling: conversion to a default value
  ◦ Data type conversion: to synchronize data from different systems, e.g. the CustomerId in one
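
A few of the cleaning operations listed above can be sketched as small helper functions. The alias table, the default value `"Unknown"`, and the idea that one system stores the customer ID as padded text are assumptions chosen for this example only.

```python
# Find and replace: map known abbreviations to one canonical building name.
BUILDING_ALIASES = {
    "London HC": "London Health Centre",
    "London Hth Cen": "London Health Centre",
}


def clean_building(name):
    """Return the canonical building name for a known abbreviation."""
    name = name.strip()
    return BUILDING_ALIASES.get(name, name)


def clean_title(title):
    """Convert case: 'MRS', 'mrs' and 'MRs' all become 'Mrs'."""
    return title.strip().capitalize()


def default_if_null(value, default="Unknown"):
    """NULL value handling: replace a missing value with a default."""
    return default if value is None else value


def to_int_id(value):
    """Data type conversion: assume one system stores the ID as text."""
    return int(str(value).strip())
```

For example, `clean_title("MRs")` gives `'Mrs'` and `to_int_id(" 0042 ")` gives `42`.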
Data Warehousing

• Integration
[Figure: Integration. Labels shown, bottom to top: Sources (Oracle, MS SQL, DB2); Integration (periodic, on-demand); Warehouse and Metadata; Query & Analysis and Paper Reports; Clients.]
Data Warehousing

• Aggregation
Once we have tables that are ready for loading into the DW, we can perform summary calculations (aggregations) and store this summary data so that queries run faster.
When creating our dimensional data model, it is essential that good aggregation paths form part of the design of the dimension tables.
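
As a small illustration of storing summary data, the sketch below rolls detailed sales rows up to one row per month and product. The sample rows and the month/product aggregation path are hypothetical; a real warehouse would choose its paths from the dimension design.

```python
from collections import defaultdict

# Hypothetical detail rows: (sale_date, product, units, amount)
SALES = [
    ("2024-01-03", "Widget", 2, 20.0),
    ("2024-01-17", "Widget", 1, 10.0),
    ("2024-02-02", "Widget", 4, 40.0),
    ("2024-02-09", "Gadget", 1, 15.0),
]


def summarize_by_month(rows):
    """Aggregate detail rows along the day -> month path, per product."""
    summary = defaultdict(lambda: {"units": 0, "amount": 0.0})
    for sale_date, product, units, amount in rows:
        key = (sale_date[:7], product)  # 'YYYY-MM' prefix identifies the month
        summary[key]["units"] += units
        summary[key]["amount"] += amount
    return dict(summary)


# The summary can be stored so that monthly queries avoid scanning the detail.
monthly = summarize_by_month(SALES)
```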
Data Warehousing

• Loading Data
  ◦ Incremental vs. refresh
  ◦ Off-line vs. on-line
  ◦ Frequency of loading (at night, once a week/month, continuously)
• ETL Tools
  ◦ Informatica
  ◦ Ab Initio
  ◦ Oracle Express / Warehouse Builder
  ◦ MS-DTS from Microsoft (SSIS)
  ◦ DataStage from Ascential Software
  ◦ SAS System from SAS Institute
Data Modeling

Definition:

• The specification of data structures and business rules to represent business requirements.
Data Modeling

Different Data Models:

[Figure: the same three entities shown at three levels of modeling]

• Subject Area Model (Conceptual): CUSTOMER, SALES DETAIL, PRODUCT
• Logical Model
  ◦ Sales Detail: Sales Record ID
  ◦ Customer: Customer ID
  ◦ Product: Product SKU
• Physical Model
  ◦ SALES_DETAIL: SALES_RECORD_ID, CUSTOMER_ID, PRODUCT_SKU
  ◦ CUSTOMER: CUSTOMER_ID
  ◦ PRODUCT: PRODUCT_SKU
Data Modeling

Types of Data Models

• High-level or conceptual data models
  ◦ Provide concepts that are close to the way many users perceive data
  ◦ Use concepts such as entities, attributes and relationships
• Logical (also called representational, implementation, or record-based) data models
  ◦ Provide concepts that end users can understand and that can also be implemented on a computer system
  ◦ Business terminology is used; independent of the physical database implementation
  ◦ Examples: relational, network, hierarchical data models
• Low-level or physical data models
  ◦ Provide concepts that describe the details of how data is stored in the computer
Data Modeling

Data Modeling Approaches in DWH

• Relational Approach (E-R Modeling)
  ◦ Traditional modeling technique
  ◦ Technique of choice for OLTP
  ◦ Suited to a corporate data warehouse
  ◦ Goals: eliminate redundancy, transaction efficiency
• Dimensional Approach (Dimensional Modeling)
  ◦ Analyzes business measures in a specific business context
  ◦ Helps visualize very abstract business questions
  ◦ End users can easily understand and navigate the data structure
  ◦ Goals: query performance, ease of understanding / use
Data Modeling

ER vs. Dimensional Models

ER model
• Used for transactional processing.
• Removes data redundancy and keeps data consistent.
• More tables, more relations; hence querying becomes difficult.
• Master, child, sub-type and associative tables are found.
• Normalized structures.

Dimensional model
• Used for analytical processing.
• Not concerned with redundancy; all history is maintained.
• Fewer tables, fewer relations; good for querying.
• Dimension and fact tables are found.
• Denormalized structures.
Data Modeling
Dimension Model Overview

[Figure: a central Fact table surrounded by Dimension 1, Dimension 2, Dimension 3, ..., Dimension n]
Data Modeling
Dimension Model
• Same information as a “transactional” database in 3NF

• Goals
  ◦ User understandability
  ◦ Query performance

• Data should be atomic
  ◦ At the lowest level, users should not need to “break apart” data structures to dig in deeper
Data Modeling
Dimension Model
• Dimension
  ◦ A category of information
  ◦ Gives meaning to facts / measures
  Ex: Time, Product, Region
Data Modeling
Dimension Model

• Measure / Fact:
  ◦ A business performance measurement, typically numeric and additive, that is stored in a fact table.

• Types of Fact
  ◦ Additive
  ◦ Semi-additive
  ◦ Non-additive
Dimension Model
• Conforming the dimensions
  ◦ Dimensions shared across facts / data marts have to be exactly the same as, or a subset of, the main dimension table
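
One way to read the conformance rule is as a subset check: every row of a data mart's dimension must appear, unchanged, in the main dimension table. The sketch below illustrates that reading with hypothetical product dimensions keyed by SKU; the names and rows are made up for the example.

```python
# Hypothetical main (enterprise-wide) product dimension, keyed by SKU.
MASTER_PRODUCT_DIM = {
    "SKU-1": {"name": "Widget", "category": "Hardware"},
    "SKU-2": {"name": "Gadget", "category": "Hardware"},
    "SKU-3": {"name": "Manual", "category": "Books"},
}

# Hypothetical product dimension used by a sales data mart.
SALES_MART_PRODUCT_DIM = {
    "SKU-1": {"name": "Widget", "category": "Hardware"},
    "SKU-2": {"name": "Gadget", "category": "Hardware"},
}


def is_conformed(mart_dim, master_dim):
    """True if every mart row exists in the master with identical attributes."""
    return all(
        key in master_dim and master_dim[key] == row
        for key, row in mart_dim.items()
    )
```

Here `is_conformed(SALES_MART_PRODUCT_DIM, MASTER_PRODUCT_DIM)` is true, while a mart row whose category differs from the master would make the check fail.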
Dimension Model
 Conforming the dimensions in Matrix View
Dimension Model
• Types of Facts
  ◦ Additive
    Numeric facts / measures that can be summed across all dimensions.
    For example, Unit Quantity and Sales Amount.

  ◦ Semi-additive
    Numeric facts / measures that can be summed across some dimensions but not others (typically not across time). For example, bank balances and inventory levels.

  ◦ Non-additive
    Facts that must be calculated at each level of aggregation; that is, they cannot be directly summed.
    For example, Age and Temperature.
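
The three kinds of facts behave differently under aggregation, which a short sketch can make concrete. The account snapshots and the margin ratio below are hypothetical examples chosen for illustration; the ratio is not one of the examples named in the slides.

```python
# Hypothetical daily account snapshots: (day, account, balance, deposits)
SNAPSHOTS = [
    ("Mon", "A", 100.0, 10.0),
    ("Mon", "B", 50.0, 0.0),
    ("Tue", "A", 130.0, 30.0),
    ("Tue", "B", 45.0, 0.0),
]

# Additive: deposits can be summed across every dimension.
total_deposits = sum(row[3] for row in SNAPSHOTS)

# Semi-additive: balances can be summed across accounts for a single day...
balance_on_tue = sum(row[2] for row in SNAPSHOTS if row[0] == "Tue")
# ...but not across days; an average over time is used here instead.
a_balances = [row[2] for row in SNAPSHOTS if row[1] == "A"]
avg_balance_a = sum(a_balances) / len(a_balances)

# Non-additive: a ratio must be recomputed from its components at each level.
SALES = [(100.0, 60.0), (300.0, 240.0)]  # (revenue, cost), hypothetical
row_margins = [(rev - cost) / rev for rev, cost in SALES]
total_rev = sum(rev for rev, _ in SALES)
total_cost = sum(cost for _, cost in SALES)
overall_margin = (total_rev - total_cost) / total_rev
```

Summing `row_margins` would give roughly 0.6, which is not a meaningful margin; recomputing from total revenue and total cost gives 0.25.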
Schema and Types

What is a Schema?
The logical or physical design of a set of database tables, indicating the relationships among the tables.

Schema Types
• STAR SCHEMA
• SNOW FLAKE SCHEMA
• HYBRID SCHEMA
Star Schema
Data Modeling
Star Schema
STAR SCHEMA
• Simple and easy overview -> ease of use
• Relatively flexible
• Fact table is normalized
• Dimension tables are often relatively small
• “Recognized” by many RDBMSs -> good performance
• Hierarchies are “hidden” in the columns
• Dimension tables are denormalized
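
A tiny star schema can be sketched with Python's built-in `sqlite3` module: one fact table joined directly to denormalized dimension tables. The table names, columns, and rows are hypothetical and only meant to show the shape of the design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimensions: the hierarchy (category, month, year)
-- is "hidden" in ordinary columns of the dimension table.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- Fact table: foreign keys to each dimension plus numeric measures.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    units INTEGER,
    amount REAL
);

INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_date VALUES (20240103, '2024-01-03', '2024-01', 2024),
                            (20240202, '2024-02-02', '2024-02', 2024);
INSERT INTO fact_sales VALUES (1, 20240103, 2, 20.0),
                              (2, 20240103, 1, 15.0),
                              (1, 20240202, 4, 40.0);
""")

# A typical star query: one join per dimension, then group and sum.
star_result = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
```

Every dimension is one join away from the fact table, which is what keeps star queries simple to write and to read.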
Data Modeling
Snow Flake Schema
Data Modeling
Data Modeling
Hybrid Schema
Data Modeling
Snow Flake Schema

• SNOW FLAKE SCHEMA
  ◦ Hierarchies are made explicit/visible
  ◦ Very flexible
  ◦ Dimension tables use less space
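
For contrast with a star design, the sketch below normalizes a product dimension into a separate category table, which makes the hierarchy explicit at the cost of an extra join. It uses Python's built-in `sqlite3` module; all table names and rows are hypothetical.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
-- The category level of the hierarchy gets its own table,
-- so each category name is stored only once.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount REAL
);

INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Books');
INSERT INTO dim_product VALUES (1, 'Widget', 1), (2, 'Gadget', 1), (3, 'Manual', 2);
INSERT INTO fact_sales VALUES (1, 20.0), (2, 30.0), (3, 15.0);
""")

# Reaching the category now takes two joins instead of one.
snow_result = snow.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
```

The trade-off shown here is the usual one: less repeated text in the dimension tables, but longer join paths in queries.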
Data Modeling
Thank You !
