Question: During which era was the goal of data processing primarily focused on reliability?
Answer: Age of Transactions (1970-1980s)
Explanation: The goal during this era was to ensure that no data was lost. Hierarchical systems such as IMS were used in the 1960s, while relational DBMSs such as DB2, Oracle, and Sybase dominated the 1970s and 1980s.
Question: During which era was the goal of business intelligence primarily focused on aggregating and analyzing data to make business decisions?
Answer: Age of Business Intelligence (1995-2010)
Explanation: The goal during this era was to aggregate and analyze data in order to make business decisions. Decision support systems, dashboards, and OLAP (Online Analytical Processing) were used. ROLAP (relational OLAP) was used by systems such as SAP BW and Business Objects, while MOLAP (multi-dimensional OLAP) was used by systems such as Oracle Hyperion and Essbase.
Question: During which era has the primary focus been on processing large volumes of polystructured data?
Answer: Age of "Big Data" (since 2010)
Explanation: In this era, efficient technologies for processing large, polystructured data such as the Hadoop Stack and Spark are a primary focus. The goal is to perform exploration, predictive forecasting, prescriptive models, optimization, and real-time decision making.
What are the drawbacks of using both distributed and centralized databases and what are some common solutions to overcome those drawbacks?
Answer: Both distributed and centralized databases have their drawbacks.
Explanation: Using a distributed database allows for distributed queries across databases by creating a view with UNION, but the execution of distributed queries can be slow. On the other hand, a centralized database can be used by everyone but may be slow in operational use and neglects heterogeneous requirements. Both approaches require a single enterprise-wide data model.
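The UNION-view idea mentioned above can be sketched as follows; two tables in one SQLite database stand in for two separate regional databases, and the table/column names (orders_eu, orders_us) are purely illustrative:

```python
import sqlite3

# Sketch of a "distributed query" via a UNION view. Two tables in one
# in-memory SQLite DB stand in for two separate source databases.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders_eu (id INTEGER, amount REAL);
    CREATE TABLE orders_us (id INTEGER, amount REAL);
    INSERT INTO orders_eu VALUES (1, 100.0), (2, 250.0);
    INSERT INTO orders_us VALUES (3, 80.0);

    -- The UNION ALL view presents both sources as one logical table.
    CREATE VIEW all_orders AS
        SELECT id, amount, 'EU' AS source FROM orders_eu
        UNION ALL
        SELECT id, amount, 'US' AS source FROM orders_us;
""")
rows = con.execute("SELECT COUNT(*), SUM(amount) FROM all_orders").fetchone()
print(rows)  # (3, 430.0)
```

With real distributed sources, every query against the view fans out to every participating database, which is exactly why execution can be slow.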
Technical heterogeneity
A: Technical heterogeneity covers different systems and communication protocols:
– File-based applications (including Excel spreadsheets)
– “Legacy” mainframe applications using e.g. IMS and CICS
– 2-tier client/server applications using relational DBMSs
– Multi-tier applications using e.g. CORBA, EJB, Web application servers etc.
– Enterprise service bus (ESB)
Structural and Syntactic heterogeneity
A: Structural and syntactic heterogeneity refers to differences in how data is encoded and structured.
– “home-grown” codes vs. ISO encodings for countries and currencies
– colors specified as enumerations vs. RGB values
– different units, e.g. km/h vs mph, € vs. $, etc.
– synonyms, homonyms; implicit contexts
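A minimal sketch of resolving this kind of heterogeneity; the code table and record layout are illustrative assumptions, not part of the lecture material:

```python
# Normalize structurally/syntactically heterogeneous records into a common
# representation: home-grown country codes -> ISO 3166, km/h -> mph.
ISO_COUNTRY = {"Germany": "DE", "GER": "DE", "USA": "US"}  # illustrative table

def kmh_to_mph(kmh: float) -> float:
    return kmh / 1.609344

def normalize(record: dict) -> dict:
    country = ISO_COUNTRY.get(record["country"], record["country"])
    speed = record["speed"] if record["unit"] == "mph" else kmh_to_mph(record["speed"])
    return {"country": country, "speed_mph": round(speed, 1)}

a = normalize({"country": "GER", "speed": 100.0, "unit": "km/h"})
print(a)  # {'country': 'DE', 'speed_mph': 62.1}
```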
Semantic heterogeneity
A: Semantic heterogeneity refers to differences in the interpretation of data.
Explanation: The same data may be interpreted differently by different people or systems, so semantic heterogeneity must be taken into account when integrating or analyzing data from multiple sources.
Federation
Everybody talks directly to everyone else (Point-to-Point Integration)
Warehouse
Sources are translated from their local schema to a global schema
and copied to a central DB
Mediator
Virtual warehouse – turns a user query into a sequence of source
queries and assembles the results of these queries into an “aggregate”
result
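The mediator's decompose-and-assemble step can be sketched like this; the "sources" are plain functions standing in for wrappers around real databases, and all names are illustrative:

```python
# Minimal mediator sketch: a user query is decomposed into per-source
# queries and the partial results are assembled into one aggregate result.

def source_a(min_amount):           # hypothetical wrapper around source 1
    data = [("a1", 120), ("a2", 40)]
    return [r for r in data if r[1] >= min_amount]

def source_b(min_amount):           # hypothetical wrapper around source 2
    data = [("b1", 300)]
    return [r for r in data if r[1] >= min_amount]

def mediator(min_amount):
    # 1. translate the user query into source queries, 2. merge the results
    partial = source_a(min_amount) + source_b(min_amount)
    return sorted(partial, key=lambda r: r[1], reverse=True)

print(mediator(100))  # [('b1', 300), ('a1', 120)]
```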
What's the issue with Integration Approach 1: Federation?
n applications / data stores → up to n² connections,
i.e. too many connections
What's the issue with Integration Approach 2: Warehouse?
usually only unidirectional data flows supported
What's the issue with Integration Approach 3: Mediator?
complex architecture, potentially slow, difficult to maintain
What is a Data Warehouse?
Answer: A data warehouse is a centralized repository for storing and managing large amounts of data from various sources.
Stores data from various sources such as transactional databases, log files, and social media feeds
Designed for fast querying and analysis of data
Contains historical data as well as current data
Helps support decision making through the use of business intelligence tools
May use dimensional modeling to organize data for efficient querying
Can be updated periodically through the process of ETL (extract, transform, load)
Is the data in a DWH organized by operational applications?
No, by business subjects
Data is grouped around subjects, and its structure is designed to make querying
the data simple, especially for business analysts.
What happens with data in a DWH when new data comes in?
DWH: When new current data becomes available, the “old” data is not overwritten.
A data warehouse has to contain historical data to
– Allow for analysis of the past
– Relate information to the present
– Enable forecasts for the future
OLTP? With examples
Online Transaction Processing
– Take an order
– Process a claim
– Make a shipment
– Generate an invoice
– Receive cash
– Reserve an airline seat
OLAP? With examples
Online Analytical Processing
– Management Information Systems (MIS)
– Decision Support Systems (DSS)
– Statistical Databases
Difference between OLTP and OLAP? in requirements
OLTP:
– Optimize for many short and “small” transactions:
point queries, single-row updates and/or inserts
– Access to up-to-date, consistent DB
– Avoid (uncontrolled) redundancies → use normalized schemas
OLAP:
– Queries with large result sets (all the data, joins)
– No immediate updates and/or inserts,
but large periodic (daily, weekly) batch inserts
– (Controlled) redundancy a necessity for performance reasons: denormalized schemas, materialized views; indexes
– Goal: Response Time of seconds / a few minutes
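The two access patterns above can be contrasted on one small (illustrative) sales table; the schema and values are assumptions for the sketch:

```python
import sqlite3

# OLTP vs. OLAP access patterns on the same table:
# a point update by primary key vs. a full-table aggregation.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO sales VALUES (1,'EU',100),(2,'EU',200),(3,'US',300),(4,'US',50);
""")

# OLTP style: short transaction touching a single row by key
con.execute("UPDATE sales SET amount = 120 WHERE id = 1")
con.commit()

# OLAP style: scan and aggregate the whole table
olap = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(olap)  # [('EU', 320.0), ('US', 350.0)]
```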
Difference between OLTP and OLAP? As list
Data Content
Data Structure
Access Frequency
Access Type
Usage
Response Time
Users
What does serializability mean?
Answer: Serializability guarantees that the outcome of concurrently executed transactions is equivalent to some serial execution. In other words, if multiple transactions are executed concurrently, the final state of the database will be the same as if the transactions had been executed one after the other in some serial order.
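A deterministic toy illustration (the transactions and balances are invented for the sketch): two read-modify-write transactions on the same balance, once run serially and once in an interleaved schedule that is not serializable:

```python
# Why serializability matters: a lost-update anomaly, simulated explicitly.
# T1 adds 10, T2 adds 20. Any serial order yields 130.

def serial(balance):
    balance = balance + 10   # T1 runs to completion
    balance = balance + 20   # then T2
    return balance

def interleaved(balance):
    t1_read = balance        # T1 reads 100
    t2_read = balance        # T2 also reads 100, before T1 writes
    balance = t1_read + 10   # T1 writes 110
    balance = t2_read + 20   # T2 overwrites with 120 -- T1's update is lost
    return balance

print(serial(100))       # 130 (same in either serial order)
print(interleaved(100))  # 120 -- matches no serial execution
```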
What do we need for a good Data Warehouse Reference Architecture? What are the steps?
Source DB → Landing Area → Staging Area → DWH → Data Mart
What's the Staging / ETL Area?
Extraction, Transformation, Loading (ETL)
Extraction – extract data from numerous data sources
▪ Transformation – perform a number of individual tasks:
– Cleaning
– Standardization
– Combining pieces of data from different sources
▪ Loading – two distinct groups of tasks
– The initial load moves large volumes of data into the data warehouse storage
– Feed the incremental data revisions on an ongoing basis
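The three ETL steps above can be sketched end to end; the source records, the standardization table, and the list-based "warehouse" are all illustrative stand-ins:

```python
# Minimal ETL sketch: extract from sources, transform (clean, standardize,
# convert types), load into the warehouse.

def extract():
    # e.g. rows pulled from several heterogeneous source systems
    return [
        {"name": " Alice ", "country": "GER", "amount": "100"},
        {"name": "Bob", "country": "DE", "amount": "250"},
    ]

def transform(rows):
    iso = {"GER": "DE"}                      # standardization table
    out = []
    for r in rows:
        out.append({
            "name": r["name"].strip(),       # cleaning
            "country": iso.get(r["country"], r["country"]),
            "amount": float(r["amount"]),    # type conversion
        })
    return out

def load(rows, warehouse):
    warehouse.extend(rows)                   # initial load / incremental feed

warehouse = []
load(transform(extract()), warehouse)
print(warehouse[0])  # {'name': 'Alice', 'country': 'DE', 'amount': 100.0}
```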
Data Lake =?
Answer: A data lake is a central repository that allows you to store all your structured and unstructured data at any scale.
Stores data in its raw and unmodified form
Can store structured, semi-structured, and unstructured data
Scalable and flexible storage solution
Allows for the storage of large volumes of data from various sources
Enables the use of various tools and technologies for data processing and analysis
Supports both batch and real-time data processing
Helps facilitate the creation of a single source of truth for an organization's data
Question: How does a data warehouse differ from a data lake?
Answer: A data warehouse is designed for structured data and fast querying, while a data lake is meant for storing and processing large amounts of structured and unstructured data.
Data warehouse stores structured data and is optimized for fast querying and analysis
Data lake stores structured, semi-structured, and unstructured data and is more flexible in terms of the types of data that can be stored
Data warehouse typically uses a structured schema on write, while data lake uses a schema-on-read approach
Data warehouse is typically used for reporting and business intelligence, while data lake is used for a wider range of use cases including data integration, analytics, machine learning, and more
Data warehouse is typically more expensive to set up and maintain than a data lake
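The schema-on-read point can be made concrete with a small sketch; the raw records and field names are illustrative:

```python
import json

# Schema-on-read: the data lake stores raw strings; a schema is applied
# only when the data is read for a particular analysis.
raw_lake = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 7, "referrer": "ads"}',  # extra field is fine
]

def read_with_schema(raw_records, fields):
    # Interpret each raw record against the schema chosen at read time
    return [{f: json.loads(r).get(f) for f in fields} for r in raw_records]

rows = read_with_schema(raw_lake, ["user", "clicks"])
print(rows)  # [{'user': 'alice', 'clicks': 3}, {'user': 'bob', 'clicks': 7}]
```

A schema-on-write warehouse would instead reject or reshape the second record at load time, before it ever reaches storage.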
Question: What are some potential caveats of using data lakes?
Answer: Data lakes can present challenges in terms of data governance, security, and management.
Data governance: Data lakes can contain large amounts of data from various sources, making it difficult to manage and ensure the quality and consistency of the data.
Security: Data lakes can store sensitive data, so proper security measures must be put in place to protect this data.
Management: Data lakes can require significant effort to set up and maintain, including the development of processes for data ingestion, transformation, and organization.
Data silos: Without proper planning and organization, data lakes can easily become a collection of data silos, making it difficult to find and access specific data.
Skills gap: Using a data lake requires a range of technical skills, including expertise in data ingestion, transformation, and analysis tools. It may be necessary to invest in training or hire specialized staff to effectively use a data lake.
Question: What is polyglot persistence?
Answer: Polyglot persistence is a software architecture pattern that uses multiple data storage technologies, each optimized for specific requirements.
Explanation:
Allows the use of the most appropriate data storage technology for each piece of data
Can improve scalability, performance, and flexibility
Can add complexity and require specialized skills to implement and maintain
– Combination of traditional DWH, Hadoop data lakes, MPPs, NewSQL, NoSQL, Search..
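The routing idea behind polyglot persistence can be sketched as follows; the dict-backed "stores" stand in for real systems (RDBMS, document store, search engine), and the routing rules are illustrative:

```python
# Polyglot persistence sketch: route each kind of data to the store best
# suited for it. Lists stand in for real storage technologies.
stores = {
    "relational": [],   # e.g. orders -> RDBMS (transactions, joins)
    "document": [],     # e.g. product catalog -> document store (flexible schema)
    "search": [],       # e.g. full-text content -> search engine
}

ROUTING = {"order": "relational", "product": "document", "article": "search"}

def save(kind, payload):
    stores[ROUTING[kind]].append(payload)

save("order", {"id": 1, "total": 99.0})
save("product", {"sku": "X1", "attrs": {"color": "red"}})
print(len(stores["relational"]), len(stores["document"]))  # 1 1
```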