Typical use cases of DWH
Retail
Customer loyalty, market planning
Manufacturing
Cost reduction
Logistics management
Finance
Risk management, fraud detection
Utilities
Asset management
Resource management
Airlines
Route profitability
Yield management
Government
Manpower planning
Cost control
Recap: Data Warehouse Reference Architecture
What is a Source DB?
– DB of an application which supports one or more types of business transactions.
– Also: Operational Application, OLTP Application
Landing Area (LA)
– DB that is able to store a single data extract of a subset of one Source DB.
– Data are typically accessible via APIs to the Source DB and follow a pre-defined schema of a subset of data.
Staging Area (SA)
– DB that is able to store matching data extracts from various Landing Areas in an integrated format
– Upload to the DW from SA once data from all Landing Areas are available.
– Schema basically corresponds 1:1 with the DW schema
Data Warehouse (DW)
– DB containing the history of all complete Staging Areas
– Its integrated schema is frequently still more or less in Third Normal Form (3NF)
Note: 3NF is a slight contradiction to the criterion "subject-orientation".
Data Mart (DM) – also: OLAP Application
– DB containing data describing the (present and past) performance of one or more types of business transactions, taken from the DW
– Schema of a Data Mart often has the form of one or more denormalized “stars”.
What is the Master Data in DWH?
"Factored out" information required by many applications (OLTP and OLAP)
"Structured" master data:
customers, vendors, maybe products, categorization attributes / codes
Metadata component
Name the content & three types of metadata
Content:
– Using
– Administering: complexity and size → impossible to administer without metadata
– Building: source system, source-to-target mapping, transformation rules, etc.
Three types:
1. Operational metadata
2. Extraction and transformation metadata
3. End-user metadata
Conformed Constellation of Data Marts
(Bus Architecture)
Source DB -> Landing Area -> ??
Characteristics
§ No “explicit” DW, no common schema
§ DW bus: conformed facts & dimensions
§ Common mapping to dimensions through Master Data store(s)
Integrated Data Warehouse Architecture
Source DB -> Staging Area + DW, integrated
Master Data is just another Source DB
All Data Marts fed from the integrated Data Warehouse
What are the process steps and milestones in the ETL process?
Extract: Data is extracted from one or more sources, such as databases, file systems, or APIs.
Transform: Data is transformed into a format that is suitable for the target system, often through processes such as cleansing, standardization, and aggregation.
Load: Data is loaded into the target system, such as a data warehouse or data lake.
Milestones: Extract, Clean, Conform, Deliver
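The three steps above can be sketched as a tiny pipeline. This is an illustrative example only; the source data, the `sales` table, and all field names are invented for the sketch:

```python
# Minimal ETL sketch: extract from a CSV source, transform
# (cleanse + standardize), load into a target DB (in-memory SQLite).
import csv
import io
import sqlite3

RAW = "id,amount,region\n1, 100 ,EAST\n2,250,west\n"  # pretend source extract

def extract(raw):
    """Extract: read rows from a source (here an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform: cleanse and standardize into the target format."""
    out = []
    for r in rows:
        out.append((int(r["id"]),
                    float(r["amount"].strip()),    # cleansing: stray whitespace
                    r["region"].strip().upper()))  # standardization: casing
    return out

def load(rows):
    """Load: write into the target system (here an in-memory DB)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (id INTEGER, amount REAL, region TEXT)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return con

con = load(transform(extract(RAW)))
print(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall())
# → [('EAST', 100.0), ('WEST', 250.0)]
```

In a real warehouse each step would also record the Extract/Clean/Conform/Deliver milestones for restartability.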
ETL extraction component: what are the monitoring strategies?
Snapshot-based:
– periodically copy complete data sets into a file
– compare snapshots to detect changes
Timestamp-based:
– update timestamps of tuples upon change
– compare timestamps and transfer changed tuples
Log-based: analyze the DBMS's transaction log to detect changes
Trigger-based: active mechanisms lead to extraction upon change in source data
Replication-based: use a replication mechanism to transfer changed data
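The timestamp-based strategy can be sketched as follows; the `customers` table and its `last_updated` column are invented for the example, and real systems would persist the high-water mark between runs:

```python
# Timestamp-based change monitoring: each source tuple carries a
# last_updated timestamp; the extractor remembers the high-water mark
# of its previous run and transfers only tuples changed since then.
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER, name TEXT, last_updated INTEGER)")
src.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Ada", 100), (2, "Bob", 150), (3, "Cleo", 200)])

def extract_changed(con, high_water_mark):
    """Return tuples updated after the previous extraction run."""
    return con.execute(
        "SELECT id, name, last_updated FROM customers "
        "WHERE last_updated > ? ORDER BY id",
        (high_water_mark,)).fetchall()

delta = extract_changed(src, 120)   # last run saw everything up to t=120
print(delta)                        # → [(2, 'Bob', 150), (3, 'Cleo', 200)]
```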
Transformation component in ETL
Prepare the data
Structural preparation: schema integration
Content: Data Cleaning and Integration
Steps in the ETL Transformation
1. Transform data into a consistent format
2. Data Cleaning: handle
§ missing values
§ redundant values
§ inconsistent values
Data Scrubbing: use domain knowledge (business rules) to detect "dirty" data
Data Auditing
§ Judge the quality of the resulting data
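A minimal sketch of these cleaning and scrubbing steps, with invented records, an invented country-code mapping, and a made-up business rule ("price must be non-negative"):

```python
# Cleaning sketch: handle missing, redundant, and inconsistent values,
# plus one scrubbing rule; rejected rows are kept for auditing.
records = [
    {"id": 1, "country": "DE",      "price": 10.0},
    {"id": 1, "country": "DE",      "price": 10.0},   # redundant duplicate
    {"id": 2, "country": "germany", "price": None},   # missing + inconsistent
    {"id": 3, "country": "DE",      "price": -5.0},   # violates business rule
]

COUNTRY_MAP = {"germany": "DE"}   # harmonize inconsistent country codes
DEFAULT_PRICE = 0.0               # policy for missing values

clean, seen, rejected = [], set(), []
for r in records:
    if r["id"] in seen:                   # drop redundant values
        continue
    seen.add(r["id"])
    r["country"] = COUNTRY_MAP.get(r["country"], r["country"])
    if r["price"] is None:                # handle missing values
        r["price"] = DEFAULT_PRICE
    if r["price"] < 0:                    # scrubbing: domain rule price >= 0
        rejected.append(r)                # flag as dirty for auditing
        continue
    clean.append(r)

print(len(clean), len(rejected))          # → 2 1
```

The `rejected` list is what the auditing step would then use to judge data quality.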
Name the 3 classes of filtering in the four transformation layers: Filtering, Harmonization, Aggregation, Enrichment
Class 1 – automatic detection, automatic correction
Class 2 – automatic detection, manual correction
Class 3 – manual detection, manual correction
What are the Four Transformation Layers in ETL
Filtering, Harmonization, Aggregation, Enrichment
Data Marts
"stand-alone"
Analyze the performance of a single business process
less complex and easier to understand
faster queries
Does the fact table of a data mart have many rows or many columns?
Many rows: a fact table is deep and narrow. It is the dimension tables that are wide, with many columns of descriptive data that is easy to understand.
Difference between facts and dimensions in a Data Mart?
In general, facts form the core of the Data Mart and are the basis for analysis and reporting. Dimensions, in contrast, put facts into a larger context and segment and categorize them in various ways.
Facts are data that
contain quantitative information
can be measured, aggregated, or calculated
Examples of facts in a Data Mart are sales figures, revenues, and profits.
Dimensions are data that
provide contextualizing information
are frequently used to segment or classify facts
Examples of dimensions in a Data Mart are customers, products, time, and locations.
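The fact/dimension split can be shown as a minimal star schema in SQLite; all table and column names here (`fact_sales`, `dim_product`, `dim_time`, the measures `revenue` and `units`) are invented for illustration:

```python
# A minimal star schema: one fact table holding the quantitative,
# aggregatable measures, joined to dimension tables that provide context.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER REFERENCES dim_product,
                          time_id    INTEGER REFERENCES dim_time,
                          revenue REAL, units INTEGER);  -- the measures
""")
con.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(1, "Laptop", "IT"), (2, "Desk", "Furniture")])
con.executemany("INSERT INTO dim_time VALUES (?,?,?)",
                [(1, 2023, 1), (2, 2023, 2)])
con.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 1, 999.0, 1), (1, 2, 1998.0, 2), (2, 1, 150.0, 1)])

# Facts are aggregated; a dimension attribute segments them into context:
rows = con.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(rows)   # → [('Furniture', 150.0), ('IT', 2997.0)]
```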
Question: What is the snowflake schema? Pros and Cons also pls
Answer: The snowflake schema is a type of database schema that is used to organize large amounts of data in a data warehouse. It is called a "snowflake" schema because the diagram of the schema resembles a snowflake.
Pros:
+ Small savings in storage space
+ Normalized structures are easier to update and maintain
Cons:
- Schema becomes less intuitive and end-users are put off by the complexity
- Browsing through contents becomes more difficult
Name the 5 desired properties of an OLAP system ("FASMI")
1. Fast: regular queries within 5 sec, complex ones within 20 sec
2. Analysis: intuitive analysis and arbitrary calculations
3. Shared: effective access control in a multi-user environment
4. Multidimensional: on a conceptual level, data is presented as a
multidimensional view (irrespective of the actual implementation of
the underlying data structure)
5. Information: Scalability and acceptable response time for large data
In an OLAP cube, typical operations on pre-joined and often pre-aggregated tables include:
Roll-up: This operation aggregates data from a lower level of granularity to a higher level, such as from monthly sales data to annual sales data.
Drill-down: This operation breaks down data from a higher level of granularity to a lower level, such as from annual sales data to monthly sales data.
Slice and dice: This operation involves selecting and filtering data based on specific dimensions or measures, such as selecting only sales data for a specific product line or region.
Ranking: top performers, top risks etc.
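Roll-up and slice can be sketched on a toy cube; the dimensions (year, month, region) and all sales figures are invented:

```python
# Roll-up and slice on a tiny cube keyed by (year, month, region).
from collections import defaultdict

cube = {                              # finest granularity: monthly per region
    (2023, 1, "EU"): 100, (2023, 2, "EU"): 120,
    (2023, 1, "US"): 200, (2023, 2, "US"): 180,
}

# Roll-up: aggregate monthly data up to annual (the month level is
# summed away); drill-down is the inverse, returning to `cube`.
annual = defaultdict(int)
for (year, month, region), sales in cube.items():
    annual[(year, region)] += sales
print(dict(annual))                   # → {(2023, 'EU'): 220, (2023, 'US'): 380}

# Slice: fix one dimension value (region = "EU") and keep the rest.
eu_slice = {k: v for k, v in cube.items() if k[2] == "EU"}
print(eu_slice)                       # → {(2023, 1, 'EU'): 100, (2023, 2, 'EU'): 120}
```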
R-OLAP:
M-OLAP:
H-OLAP:
R-OLAP: Relational OLAP
M-OLAP: Multi-Dimensional OLAP
H-OLAP: Hybrid OLAP -> Combination of R-OLAP and M-OLAP
What is the difference between R-OLAP (Relational OLAP) and M-OLAP (Multi-dimensional OLAP)?
M-OLAP = Multi-dimensional OLAP
Proprietary DBMSs optimized for storage of multidimensional data
+ Response time (runtime efficiency)
+ Smaller storage footprint
(due to better compression)
+ Query performance may be better
+ More powerful specialized techniques for multi-dimensional data than general purpose DBMSs
+ More suitable for complex calculations that don’t translate well into SQL
R-OLAP = Relational OLAP
Star- and Snowflake-Schemas in classic, relational DBMSs
+ Compatibility → can be accessed by any SQL reporting tool
+ Flexibility, not limited to cube structure
(BUT: don’t reuse OLTP tables!)
+ No pre-computation of cubes, aggregations calculated upon request
+ More scalable for very large data volumes
+ Shorter load times
+ Costs may be lower
+ Better scalability for high cardinality dimensions
H-OLAP = Hybrid OLAP