02 DWH Part 1

Buffl

Business Intelligence TU Wien

by nils K.

Typical use cases of DWH

Retail

customers loyaltiy. market planning

Manufacturing

cost reduction
logistics management

Finance

Risk management, fraud detection

Utilities

Asset mangement
Resource management

Airlines

Route profitablity
Yield management

Government

manpower planing
Cost controle

Recap: Data Warehouse Reference Architecture

What is Source DB

– DB of an application which supports one or more types of business transactions.

– Also: Operational Application, OLTP Application

Recap: Data Warehouse Reference Architecture

Landing Area (LA)

– DB that is able to store a single data extract of a subset of one Source DB.

– Data are typically accessible via APIs to the Source DB and follow a pre-defined schema of a subset of data.

Recap: Data Warehouse Reference Architecture

Staging Area (SA)

– DB that is able to store matching data extracts from various Landing Areas in an integrated format

– Upload to the DW from SA once data from all Landing Areas are available.

– Schema basically corresponds 1:1 with the DW schema

Recap: Data Warehouse Reference Architecture

Data Warehouse (DW)

– DB containing the history of all complete Staging Areas

– Its integrated schema is frequently still more or less in Third Normal Form (3NF)

Note: 3NF is a slight contradiction to the criterion “subject-orientation

Data Warehouse Reference Architecture

Data Mart (DM) – also: OLAP Application

– DB containing data describing the (present and past) performance of one or more types of business transactions, taken form the DW

– Schema of a Data Mart often has the form of one or more denormalized “stars”.

What is the Master Data in DWH?

„Factored out“ information required by many applications (OLTP and OLAP)

„Structured“ master data:

customers, vendors, maybe products, categorization attributes / codes

Metadata componenet

Name content &

name three types of metadata

Content :

Using

Administrterin

Complexity, size, - > impossible to adminsiter without metadata

Building

source siystem, source to target mappin, transformation rules etc.

1. Operational metadata

2. Extraction and transformation metadata

3. End-user metadata

Conformed Constellation of Data Marts

(Bus Architecture)

Source DB -> Landing Area -> ??

Characteristics

§ No “explicit” DW, no common schema

§ DW bus: conformed facts & dimensions

§ Common mapping to dimensions through Master Data store(s)

Integrated Data Warehouse Architecture

Only Source DB -> Staging Area + DW integrated

Master Data just another source DB"

All Data Marts fed from the integrated Data Warehouse

Whats the Process steps and milestones in ETL PRocess?

Extract: Data is extracted from one or more sources, such as databases, file systems, or APIs.
Transform: Data is transformed into a format that is suitable for the target system, often through processes such as cleansing, standardization, and aggregation.
Load: Data is loaded into the target system, such as a data warehouse or data lake.

Milestones: Extract, Clean, Conform, Deliver

ETL Extrqaction compoenent, what are monitoring strategies?

Snapshot-based:

– periodically copy complete data sets into a file

– compare snapshots to detect changes

Timestamp-based:

– update timestamps of tuples upon change

– compare timestamps and transfer changed tuples

Log-based: analyze the DMBS’s transaction log to detect changes

Trigger-based: active mechanisms lead to extraction upon change in source data

Replication based: replication mechanism to transfer changed data

Transformation component in ETL

Prepare the data

Structural preparation: schema integration

Content: Data Cleaning and Integration

Steps in Transformation ETL

1. Transform data into a consistent format

2. Data Cleaning: handle

§ missing values

§ redundant values

§ inconsistent values

Data Scrubbing: use domain knowledge (business rules) to detect “dirty”
Data Auditing
§ Judge the quality of the resulting data

Name the 3 Classes of Filtering in the four transofmration layer: Filtering, Harmonization, Aggregation, Enrichment

Class1 - automatic detection, automatic

correction

CLass2 - automatic detection, manual

correction

Class 3 - manual detection, manual

correction

What are the Four Transformation Layers in ETL

Filtering, Harmonization, Aggregation, Enrichment

Data Marts

“stand-alone”
Analyse the performacen of a single business process
less cokplex and easier to underand
faster queries

Does the fact table of a data mart has many rows or many columsn?

Many columns with data which is easy to understand.

Difference between Fact and Dimensions in a Datamart?

Im Allgemeinen stellen Fakten den Kern des Data Mart dar und bilden die Basis für Analyse- und Berichtszwecke. Dimensionen dienen hingegen dazu, Fakten in einen größeren Kontext zu setzen und sie auf verschiedene Weise zu segmentieren und zu kategorisieren.

Fakten sind Daten

quantitative Informationen enthalten
gemessen, aggregiert oder berechnet werden können
Beispiele für Fakten in einem Data Mart sind Verkaufszahlen, Umsätze und Gewinne.

Dimensionen sind Daten

die kontextualisierende Informationen bereitstellen
häufig zur Segmentierung oder Klassifizierung von Fakten verwendet werden.
Beispiele für Dimensionen in einem Data Mart sind Kunden, Produkte, Zeit und Standorte.

Question: What is the snowflake schema? Pros and Cons also pls

Answer: The snowflake schema is a type of database schema that is used to organize large amounts of data in a data warehouse. It is called a "snowflake" schema because the diagram of the schema resembles a snowflake.

Pros:

+ Small savings in storage space

+ Normalized structures are easier to update and maintain

Cons:

- Schema becomes less intuitive and end-users are put off by the complexity

- Browsing through contents becomes more difficult

Name the 5 desired propeties of a OLAP System “FASMI”

1. Fast: regular queries within 5 sec, complex ones within 20 sec

2. Analysis: intuitive analysis and arbitrary calculations

3. Shared: effective access control in a multi-user environment

4. Multidimensional: on a conceptual level, data is presented as a

multidimensional view (irrespective of the actual implementation of

the underlying data structure)

5. Information: Scalability and acceptable response time for large data

In an OLAP cube, typical operations on pre-joined and often pre-aggregated tables include:

Roll-up: This operation aggregates data from a lower level of granularity to a higher level, such as from monthly sales data to annual sales data.
Drill-down: This operation breaks down data from a higher level of granularity to a lower level, such as from annual sales data to monthly sales data.
Slice and dice: This operation involves selecting and filtering data based on specific dimensions or measures, such as selecting only sales data for a specific product line or region.
Ranking: top performers, top risks etc.

R-OLAP:

M-OLAP:

H-OLAP:

R-OLAP: Relational OLAP

M-OLAP: Multi-Dimensional OLAP

H-OLAP: Hybrid OLAP -> Combination of R-OLAP and M-OLAP

R-OLAP = Relational OLAP difference to M-OLAP = Multi-dimensional OLAP

M-OLAP = Multi-dimensional OLAP

Proprietary DBMSs optimized for storage of multidimensional data

+ Response time (runtime efficiency)

+ Smaller storage footprint

(due to better compression)

+ Query performance may be better

+ More powerful specialized techniques for multi-dimensional data than general purpose DBMSs

+ More suitable for complex calculations that don’t translate well into SQL

R-OLAP = Relational OLAP

Star- and Snowflake-Schemas in classic, relational DBMSs

+ Compatibility → can be accessed by any SQL reporting tool

+ Flexibility, not limited to cube structure

(BUT: don’t reuse OLTP tables!)

+ No pre-computation of cubes, aggregations calculated upon request

+ More scalable for very large data volumes

+ Shorter load times

+ Costs may be lower

+ Better scalability for high cardinality dimensions

H-OLAP = Hybrid OLAP

Join Course

Preview

Author

nils K.

Information

Last changed
3 years ago

Report course