03 DWH Part 2

von nils K.

R-OLAP abkürzung?

M-OLAP

= Relational OLAP Online Analytical Processing

= Multi-dimensional OLAP

Question: What are relational databases?

Answer: Relational databases are a type of database that stores data in a structured format using tables and relationships. Explanation:

Store data in tables with rows and columns
Tables can be related to each other through foreign keys
Use SQL for querying and modifying data
Offer transactional support and ACID properties (atomicity, consistency, isolation, durability)
Suitable for storing structured data
Examples include MySQL, Oracle, and SQL Server

R-OLAP vs. M-OLAP difference?

R-OLAP

Star- and Snowflake-Schemas in classic, relational DBMSs

M-OLAP = Multi-dimensional OLAP

Proprietary DBMSs optimized for storage of multidimensional data

Positves of R-OLAP

+ Compatibility → can be accessed by any SQL reporting tool

+ Flexibility, not limited to cube structure

(BUT: don’t reuse OLTP tables!)

+ No pre-computation of cubes, aggregations calculated upon request

+ More scalable for very large data volumes

+ Shorter load times

+ Costs may be lower

+ Better scalability for high cardinality dimensions

Positives of M-OLAP

+ Response time (runtime efficiency)

+ Smaller storage footprint

(due to better compression)

+ Query performance may be better

+ More powerful specialized techniques for multi-dimensional data than general purpose DBMSs

+ More suitable for complex calculations that don’t translate well into SQL

Question: What are some optimization techniques for ROLAP?

Answer: There are several techniques that can be used to optimize ROLAP performance, including indexing, materialized views, and partitioning. Explanation:

Materialized views: Pre-computing and storing the results of frequently used queries can improve query performance.
Bitmap indexes: These indexes can be used to improve the performance of queries that use filters, especially when working with large datasets.
Vertical partitioning/DSM/Column stores: Storing data in columns rather than rows can improve query performance, especially for queries that only access a few columns.
Compression: Compressing data can reduce the amount of disk space required and improve query performance.
Optimizing for CPU caches: Techniques such as CSS-/CSB-trees and collaborative scans can help to improve query performance by making better use of CPU caches.

Question: What is horizontal partitioning and how is it used in ROLAP?

Answer: Horizontal partitioning involves dividing a large table into smaller chunks called shards, which can improve query performance and make maintenance easier. Explanation:

Each shard contains a subset of the rows from the original table
Shards are stored on different servers or storage devices
Queries can be run on a smaller subset of data, improving performance
Different shards can be managed and backed up independently, making maintenance easier
Pruning: Queries can be optimized by only accessing the relevant shards, rather than all of them.
Parallelize certain query operations: Hash joins can be parallelized across multiple shards to improve performance.
Add/remove huge data sets: Large data sets can be added or removed in a single operation by modifying the partitioning scheme.
To use horizontal partitioning with ROLAP, the schema must be designed with partitioning in mind and queries must be written to take advantage of the partitioning.

What are ways for Horizontal partitioning schemes in ROLAP

Range-based partitioning: Data is partitioned based on a range of values for a particular column, such as dates or numeric values.
Hash-based partitioning: Data is partitioned based on the result of a hash function applied to a particular column.
List-based partitioning: Data is partitioned based on a specific list of values for a particular column.
Composite partitioning: A combination of multiple partitioning criteria is used, such as range-based partitioning on one column and hash-based partitioning on another column.

What is indexing?

Answer: Indexing is a technique for improving the performance of database queries by creating a separate data structure that maps the values in a particular column to the rows that contain those values.

Question: How is indexing used in OLAP and star schemas?

Answer: In OLAP and star schemas, indexing can be used to improve the performance of queries that filter on dimensions or measures.

Indices should be optimized for reads (not updates)

Star schemas consist of a central fact table surrounded by dimension tables
Queries often involve filtering on the dimensions or aggregating measures
Data is typically loaded in bulk into the star schema and then accessed read-mostly, with multidimensional value-based access rather than key-based access

Question: What are bitmap indexes and how do they work?

Bitmap indexes are a type of index that stores a bit-vector for each distinct value of a single attribute, allowing efficient support for multidimensional queries.

Explanation:

Bitmap indexes are used to improve the performance of queries that filter on a particular attribute
The size of the bit-vector is equal to the number of tuples in the table
A bit is set to 1 if the tuple with the corresponding RID has the value represented by the bit-vector
Bitmap indices can be combined using boolean operations (AND, OR, NOT) efficiently
Bitmap indices can be compressed if they contain many 0s, using techniques such as run length encoding
Bitmap indices need to be decompressed at query time.

Vertical Partitioning and Column Stores

Idea: Divide a table into multiple tables that contain fewer columns

(each stored on disk sequentially)

Unique key column is replicated in all tables for joining
When scanning a table, only relevant data must be read from disk into main memory and then into CPU caches

What is a extrem case of vertical partitioning?

Answer: Storing each column of a table in a separate table, also known as columnar storage or a column-oriented DBMS.

Columnar storage can be more space-efficient and faster for queries that only access a few columns
May be less efficient for queries that access many columns or for updates
Stores data in columns rather than rows.

Question: What is the waterfall model and what are the phases of the process?

Answer: The waterfall model is a linear software development process that involves moving sequentially through a series of phases.

The phases of the waterfall model are:

Specification: Define the requirements and constraints of the project.
Platform, software, and tools selection: Choose the hardware and software platform, as well as any tools or libraries that will be used.
Realization: Design the overall system and create a detailed specification.
Conceptualization: Create a high-level design of the system.
Implementation: Write the code and test the individual components.
Execution: Integrate and test the system as a whole.
Operation: Deploy and maintain the system.

What is done in the Specification process in Waterfall mdoell? Main Ciritcal Result

Main:

– Assessment of the current state

– Elicit requirements of end-users and managers

Critical:

– Choice of interview partners

– Assignment of roles in the DW team

– Management support

– Shared vocabulary

Result:

– Project objective and scope definition

– Business and technical requirements

– Internal and external data sources

– Team members responsible for implementation

– Who is accountable for what

What are you doing in the Platform, software and tools selection, Waterfall?

1. Analysis of potential alternatives

2. Evaluation of alternatives based on a list of criteria

3. Invite vendors to submit a proposal

4. Decision

3. Realization phases Waterfall

1. Conceptualization

– Logical data modeling

– Physical data modeling

– Dialog design

– Modularization

2. Implementation

– Declarative Implementation (Data, Queries)

– Procedural Implementation

– Module and System test

3. Execution

– Installation

– Maintenance

4. Operating the DWH

Continuously optimize:

Loading time

Query response time

Storage requirements

Availability

Hardware upgrades

etc.

Is thecnology importzant part in the success of a dwh project?`

Typical Criteria for DWH DB selection

Scalability
OLAP capabilties: cube generator support, management of cubes, in-memory OLAP?
Optimizations for analytical queries
Cloud or on-premise

Kimball vs Inmon models

Top-down approach – Inmon model

Centrralizted DW then depertmental DMs to accommodate the requirements of various user groups

deductive

Generic -> Specific

Bottom-up approach – Kimball model

One DM per major business process then Enterprise-wide cohesion through data bus

Specific -> generic

Inductive

departmental DMs

Answer: Departmental data marts are smaller, more focused data warehouses that serve the needs of a specific department or business unit.

Deductive and inductive

Deductive reasoning starts with a general principle and applies it to a specific case, while inductive reasoning starts with specific observations and draws a general conclusion.

Whats is deductive and inductive ?

Question: What is in-memory OLAP and how does it work?

In-memory OLAP is a technique for performing online analytical processing (OLAP) by keeping the data and indexes in memory, providing faster access to data.

Benefit of Top Down approach (deductive) DWH

Advantages

+ Truly corporate effort, enterprise view of data

+ Inherently architectonic – not a union of disparate data marts

+ Single, central storage of data about the content

Disadvantages

– Takes longer to build even with an iterative method

– High exposure/risk to failure

– Needs high levels of cross-functional skills

Benefit of Bottom-Up approach (inductive) DWH

Advantages

+ Faster and easier implementation of manageable pieces

+ Favorable return on investment and proof of concept

+ Lower risk of failure

Disadvantages

– Each data mart has its own narrow view of data

– Redundant data in every data mart

– Inconsistent and irreconcilable data

3 Fundamentel concepts of Kimball’s DW Lifecycle Process

Focus on the business and its key performance indicators (KPIs)
Structure the data dimensionally to support ad hoc queries and reports
Use an iterative and incremental approach to development, rather than trying to implement everything at once.

Program/Project Planning in Kimball’s DW Lifecycle Process

1 Deployment

Orchestration and extensive planning crucial to ensure that everything fits together
Deployment should be deferred if any of the pieces (including training, documentation, validated date!) are not ready for prime time
Education and support infrastructure [!!| (support need will typically not decline over time!)

2 Maintenance

3 Growth

Program/Project Planning in Kimball’s DW Lifecycle Process

- Requires an understanding of business requirements and project management

(bi-directional arrows, i.e. iterative planning)

Project = single iteration of the lifecycle

- Program = ongoing coordination of resources, infrastructure, timelines, and

communication across multiple projects

Project Management in Kimball’s DW Lifecycle Process

Ensures that the activities remain on track and in sync

Activities:

Status monitoring
Issue tracking
Change control to preserve scope boundaries
Development of a comprehensive communication plan that addresses both the business and IT constituencies

Business requirements in Kimball’s DW Lifecycle Process

What are the 3 colors grouped into?

1. Technology track

2. Data track

3. BI applications track

Agile BI name objectives and approach

Answer: Agile BI is a flexible and iterative approach to business intelligence (BI) that emphasizes collaboration, iterative process and feedback loops

Approach:

User stories (rather than technical specifications)

Adaptive planning

Time-boxed

iterativeevolutionary development approach

Obejctive:

Deliver value in days or hours rather than months or years

Rapid and flexible response to change

Deliver the right information to the right business users at the right time

Name a few Agile BI Problmes Pitfalls and Risks

Management across interdependent BI projects

(prioritization, coordination, synchronization..)?

Conflict with top-down, slow-moving data governance?

Deployment: can operations keep up the pace?

[DevOps not a well-established concept in the BI context]

What is JEDUF?

Answer: JEDUF stands for "just enough, delivered up front." It is a practice used in Agile software development methodologies that involves delivering the minimum necessary functionality at the start of a project, with the intention of adding additional features and functionality iteratively as the project progresses.

DBaaS (database-as-a-service)

DBaaS, or database-as-a-service, is a cloud-based service model in which a third-party provider hosts and manages a database on behalf of its customers.

Beitreten

Vorschau

Author

nils K.

Informationen

Zuletzt geändert
vor 3 Jahren

Kurs melden