Defintion Business Intelligence
Processes, technologies and tools that are need to turn data into information, information into knowledge and knowledge into wisdom or plans that drive profitable business action.
OLAP?
On-line Analytical Processing
Describe an analytics-based organization
Analytics as necessity for success and survival
Data-driven business decision rather than intuition
Entire business drivne by analytics
3 types of analytics
Descriptive Analytics: Display data that is relevant
Predictive Analytics: Discover insights in data
Prescriptive Analytics: Compute best decisions and present them as recommendations
Types of Information Integration
Distributed DB: Distributed queries across multiple databases -> Slow execution!
Centralized DB: 1 single centralized DB -> Slow in operative Use
=> Need enterprise wide data model / architecture
Challanges of Information Integration?
Technical heterogeneity: different systems, communication systems, archticture
Structural and syntactic heterogeneity: different encoding, units, synonyms
Semantic heterogeneity: different interpretation of data
Datasilos?
Data Silos contain data that is only available for a specific part of the organization
Approaches to address challenges of Information Integration?
Federation: Point-to-Point Integration
-> n application/ data sources -> n^2 connections
Warehouse: Sources are translated from their local schema to a global scheme and copied to a central DB
-> unidirectional flows
Mediator (Virtual Warehouse): Turns user query into a sequence of source queries and assembles the results of these queries into an aggregated result
-> Complex architecture
Definition Data Warehouse
A storage architecture designed to hold data extracted from transaction systems, operational data stores and external sources. The ware house then combines that data in an aggregated, summery form suitable for enterprise wide data analysis and reporting for predefined business needs.
Characteristics of Data Warehouses?
Subject-Orientation: Data is not stored by operational applications but by business subjects -> data is grouped around subjects
-> Simple data querying
Integration: Contains consolidated data from several application/ DB
Stored Data that is not updated or deleted
Data inconsistencies are removed -> Need for standardizatiojn
Non-volatile and time-variant:
No overwriting of “old” data
Contains historical data to
allow analysis of the past,
relate information to the presen
and enable forecasts for the future
OLTP vs. OLAP
OLTP = Online Transaction Processing:
Operational Systems, which support the execution of business processes
OLAP = Online Analytical Processing:
Analzying the data describing business transactions/ processes
Objective: Turn raw transactional data -> strategic/ tactical/ operative information
Requirements for OLTP?
Optimized for small and short transactions
Access to up-to-date, consistend DBs
Avoidance of redundancies -> normalized schemas
Requirements for OLAP?
Queries with large result stes
No immediate updates or inserts, but large periodic batch insets
(Controlled) redundanca is a neccessity for performance reasons
Goal: Response Time of seconds/ a few minutes
Why does the Water and Oil analogy fit for OLTP vs. OLAP?
Lock Conflicts: long-running OLAP reads may block OLTP writes
Feshness of data
OLTP: up-to-date data
OLAP: reproducibility of analysis
Precision:
OLTP: exct
OLAP: smaplind, statistical summaries, confidence intervals
DWH Reference Architecture
Source DB in DWH Reference Architecture?
DB of an application which supports one or more types of business transactions
Landing Area in the DWH Reference Architecture?
DB that is able to store a single data extract of a subset of one Source DB
-> Data of Source DB is typically accessible via APIs and follow a pre-defined schema of a subset of data
Staging Area in the DWH Reference Architecture?
DB that is able to store matching data extracts from various Landing Areas in an integrated form
-> Upload to DW from Staging Area happens once data from all Landing Areas are available
-> Scheme basically corresponds 1:1 with the DW schema
Data Warehouse in the DWH Reference Architecture?
DB containing the history of all complete Staging Areas
-> Its integrated schema is frequently more or less in 3NF
Data Mart in DWH Reference Architecture?
Objective: DB containing data describing the (present or past) performance of one or more types of business transactions
Master Data in DWH Reference Architecture?
“Factured out“ information required by many applications to establish the context of data
-> Typically ends up in the dimensional tables of star schema
“Structured“ master data: customers, vendors, products, categorization, attributes/ codes
Categorization of Master Data?
Owned externally
Owned internally
Meta Data in DWH Reference Architecture?
Data about Data in DWH
-> Necessary for
using,
building
and administering a DWH
Types of DWH
Operational metadata
Extraction and transformation metadata
End-user metadata
Name Use Cases of DWH
Retail: Customer Loyalty, Market Planning
Manufacturing: Cost Reduction, Logistics Management
Utilities: Asset Management, Resource Mangement
Airlines: Route profitability, Yield Management
Name typical Architectural Options of DWH Refenrence Architecture!
Single stand-alone data mart: May be use if there is no need for integration
Separation of OLAP solution from a single integrated OLTP system
Local Reporting / OLAP solution for (non-integrated) OLTP system
Several independent stand-alone data marts: “quick and dirty solution“ (-> soon to be replaced)
A Conformed constellation of data marts (sharing common dimensions)
-> DW Bus Architecture
No “explicit“ DW -> no common schema
Conformed facts and dimensions
Mapping through Master Data stores
A single Integrated Data Warehouse feeding all data marts
Master Data can be seen as just another Source DB
Typically 1 landing area for each Source DB + integrated Staging Area
Multiple Data Warehouses for different functions (problematic but not uncommon)
ETL?
Extraction, Transformation, Loading
Why have been Virtual DWHs not yet widely adopted?
Heterogeneity of legacy systems
Strain on operative systems from management queries
Lack of historization in operative systems
Requirement of complex transformation rules
General Process of ETL?
Extract information from OLTP sources
Transform it
Load into the DWH
Why are “intermediate results“ stored in the Landing Area and so on?
Checking of Data Quality
Serves as backup source if a process failes
How can the Extraction component or phase be monitored? What strategies are there for the timing of the Extraction?
Triger-based: change in source DB -> Extraction
Replication based
Log-based: analysis of logs to detect changes
Timestamp-based:
Snapshot-based: Periodicall copy complete data sets into a file and compare snapshots to detect changes
What is done in the Transformation phase?
Extracted data is prepared for the loading into the DW
Structural preparation: schema integration
Content preparation: Data Cleaning and Integration
What steps are involved in the Transformation Phase?
Transformation of extracted data into a consistent format (-> data type, units, encodings, etc.)
Data Cleaning: Handling of missing or redundant or inconstent or outdated values
Data Scrubbing: Usage of domain knowledge / business rules to detect “dirty” data
Data Auditing: Assessment of the data quality and usage of data mining techniques to detect rules and deviations
4 Transformation Layers?
Filtering
automatice detection, automatioc correction
automatic - manual
manual - automatic
Harmonization
Syntactic harmonization -> Consistent format
Business harmonization -> Business indicators
Aggregation
-> Using dimensional hierarchy tables and parallel hierarchies
Enrichment: Calcultion of business performance indicators and storage as attributes
What are implementation decisions for ETL?
Buying ETL tool that offers various services
Building your own solution based on file technology and/ or DBMS technology
Goals of Designing Analytic Databases/ Data Marts
In general Data Marts support analysis of 1 or more business processes/ transactions by e.g.
running queries that need to look at a substantial amout of data
=> Maximizing query read performance through caching and Denormalization
What is the typical design process of data marts?
Seperate analytical databases from operational databases
Determine how to measure the execution of a given business process (-> KPIs) and look at the context (-> dimensions) of this business process
Design corresponding Star or Snowflake Schema
Tune using indexes other physical design options
What ingredients for data marts can be extracted from the analysis of business processes?
Measure
Context (category)
Time
+ aggregation & selection
Components of Star Schema?
Dimension Tables: Provide “business context” and are used for categorization
Not normalized
Time Dimension: Special dimension (table) time
Fact Tables: has a grain defining most atomic unit of data being captured (e.g. transaction level: measures in certain context + daily / weekyl / monthly / quarterly summaries)
Attributes:
measures (basic or derived)
surrogates -> foreign key to a dimension table
Primary Key
Facts vs. Dimensions
Facts
Dimensions
Usually verbs
Usually nouns
Business Performance indicators
Represent criteria
Usually numeric and continious
Usually symbolic and discrete
Can be aggregated
Serve the selection, aggregation and navigation of facts
Data Volume: high
Data Volume: low
Star vs. Snowflake Schema
Star Schema: Database schema where a single central table (fact table) is connected to one or more denormalized dimension tables.
Snowflake Schema: Variation of Star Schema, where dimension tables are normalized. This means that the dimension tables are broken down into multiple related tables, rather than having all of the attributes in a single table
Pros and Cons of Snowflake Schema?
Pros:
Small saving in storage space
Normalized structures are easier to update and maintain
Cons:
Schema becomes less intuitive -> complexity
Browsing through contents becomes more difficult
Decreased query performance due to additional joins
Name the 5 desired properties of OLAP Systems
F ast:
A nalysis:
S hared:
M ultidimensional
I nformation: Scalability
-> FASMI
Operations in multidimensional data space?
Slicing: filter conditions in 1 or more dimensions
-> E.g. Filter after a certain year
Dicing: Adding / Removing / Exchaning dimensions
Drill-Down: Going from a broader level of aggregation to a finer more detailed level
-> add a column to group by clause
Roll-Up: Going from a finer level of level of aggregation to a broader level
-> remove column from group by clause
Ranking: e.g. top performers, top risks, etc.
Name 3 physical Implementations of OLAP
R-OLAP: Relational OLAP
M-OLAP: Multi-Dimensional OLAP
H-OLAP: Hybrid OLAP
R-OLAP vs. M-OLAP
R-OLAP
M-OLAP
Star- and Snowflake-Schemas in classic, relational DBMSs
Proprietary DBMSs optimized for storage of multidimensional data
+ Compatiblity (can be accessed by any SQL tool)
+ Flexibility (not limited to cube structure)
+ Better scalability for large data volumes and high number of dimensions
+ Shorter load times
+ Costs may be lower
+ Faster response time and better query performance
+ Smaller storage footprint
+ More powerful as specialized technology for the specific purpose
+ More suitable for complex calculations that do not translate well into SQL
H-OLAP?
Combination or R-OLAP and M-OLAP
Vertical Partinioning:
M-OLAP for aggregation (Roll-ups)
R-OLAP for Drill-Down
Horizontal Partioning:
Certain of data in M-OLAP
Older or less frequently used data in R-OLAP
Name 3 R-OLAP Optimizations
Horizontal Partitioning: Split table into n disjoint parts with the same schema -> Union of all parts equals original table
By range
By hash (application of hash-function)
Round Robin: Archive oldest partition and reuse it for the latest data
By reference
Optimization for scan/ selection operations: Bitmap Indexes
Optimization for memory hierarchy: Vertical Partinioning/ Column Stores: Divide a table into multiple tables that contain fewer columns
Unique Key colume is replicated for joining
Decomposed Storage Model (DSM)?
Extreme case of vertical partitioning: Division of each table into a set of 2-column tables (RID, attribute)
Column Stores?
Specialized stores that use DSM to layout data on a disk
DWH Waterfall Development Process?
Specification: Assessment of current state and requirements
Platform, Software and Tool Selection: Analysis and evaluation of potential alternatives
Realization
Conceptualization
Implementation
Execution
Operation
What are typical criterias for DWH DB selection?
Scalabiltiy: -> Data Volume, Complexity of Queries
OLAP capabilities: -> Cupe operation
Optimization for analytical queries
Cloud or on-premise?
Top-down approach of Inmon vs. Bottom-up approach of Kimball
Inmon
Kimbal
Top-down approach
Bottom-up approach
Build centralized DW first and then departmental Data Marts
1 Data Mart for each major business process rather then building centralized DW first
-> Enterprise wide cohesion through data bus
Deductive
Inductive
Generic -> Specific
Specific -> Generic
3 (dis-)advantage of Inmons Top-down Approach
Advantages
Disadvantages
Enterprise wide view of data
Longer Building Time
Inherently architectonic
High Risk of failure -> high effort without proof of concept
Single, central storage of data
Requirement of high levels of cross-functional skills
Centralized rules and controls
Quick results if implemented in iterations
3 (dis-)advantages of Kimballs Bottom-up Approach
Faster and easier implementation
No enterprise wide view as each Data Mart has its own narrow view of data
Favorable ROI and Proof of Concept
Redundant data in every Data Mart
Lower risk of failure
Inconsistent and irreconcilable data
Inherenty incremental
Proliferates unamangeable interfaces
Learning by doing
Name 3 fundamental concepts of Kimballs DW Lifecycle Process
Focus on business
Dimensionally structure the data that is delivered to the business
Iteratively develop overall data warehouse environment in manageable lifecycle increments
Kimballs DW Lifecycle Process
Project vs. Program in Kimballs DW Lifecycle Process?
Project = Single iteration of the lifecycle
Program = Ongoing coordination of resources, infrastructure, …
What is done in Technology Track of Kimballs DW Lifecylce Process?
Technical Architecture Design
Product Selection and Installation
What is done in Data Track of Kimballs DW Lifecylce Process?
Dimensional Modeling (-> Enterprise data warehous bus matrix)
Physical Design (Design and set up of database environment)
Data Integration (ETL)
What is done in Applications Track of Kimballs DW Lifecylce Process?
Reporting and analytics design
Reporting and analytics development
Process of Agile BI Approaches
Identify Knowledge Workers
Conduct Interviews
Identify Source Data
Design Data Repositories (Enterprise DW + Departmental Data Marts)
Design ETL
Motivation behind Agile BI Approaches?
Evolving user and business requirements and new business rules
Problems and Risks of Agile BI Approaches?
An architectual approach is still necessary
Management across interdependent BI projects is critical
Conflict with top-down, slow-moving data governance
Not necessarily cheaper
Recent Developments in Data Warehousing?
HTAP (Hybrid Transaction/ Analytical Processing) Systems: more informed and real time decision making
-> Combination of OLAP and OLTP system
Cloud DWH/ BI
Mobile BI
Zuletzt geändertvor 2 Jahren