Definition Big Data
information assets with high volume, velocity and/or variety that require innovative forms of information processing for enhanced insight discovery, decision-making and process automation.
4 V’s of Big Data
Velocity sources of data are growing rapidly
Variety heterogeneous sources of data
Value value after applying data analysis
Artificial Intelligence (AI)
how smart, intelligent machines are designed
Machine Learning (ML)
Optimization of objective/target functions
methods and algorithms
ability to learn without being explicitly programmed
teach themselves to grow and change when exposed to new data
Business Intelligence (BI)
integration of strategies, processes and technologies
in order to deliver strategic knowledge about status, potentials and perspectives
from distributed and heterogeneous enterprise, market and competitive data
Data Science (DS)
interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured
generic process model
describes commonly used approaches that data mining and machine learning experts use to tackle problems
What makes Data Science Projects special?
wrong datasets (biased, noisy, wrong, …)
company has no data-driven decision culture
lack of expertise in DA/ML/DL/DE
Explain Machine Learning Model
maps data to output
learning algorithms can be used to acquire this model from learning (training) data
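The difference between the learning algorithm and the model it produces can be sketched as follows. This is a minimal illustration, assuming a one-feature linear model fitted by ordinary least squares; the names `learn_linear_model` and `model` and the toy learning data are hypothetical and not taken from the notes.

```python
from typing import Callable, Sequence


def learn_linear_model(xs: Sequence[float], ys: Sequence[float]) -> Callable[[float], float]:
    """Learning algorithm: derives a model (a function) from learning data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares for y = slope * x + intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def model(x: float) -> float:
        """The learned model: maps an input to an output."""
        return slope * x + intercept

    return model


# Toy learning data (made up for illustration): y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]
model = learn_linear_model(train_x, train_y)
prediction = model(4.0)  # model applied to an input it has not seen
```

The learning algorithm runs once on the learning data; the returned `model` is then only a mapping from input to output.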
Name all Data Mistakes
Danger of Summary
Explain Sampling Bias
Drawing conclusions from a set of data that isn’t representative of the population you’re trying to understand.
Explain Cherry Picking
The practice of selecting results that fit your claim and excluding those that don’t. The worst and most harmful example of being dishonest with data.
Explain False Causality
To falsely assume when two events occur together that one must have caused the other
Explain McNamara Fallacy
Relying solely on metrics in complex situations can cause you to lose sight of the bigger picture.
Explain Simpson's Paradox
A phenomenon in which a trend appears in different groups of data, but disappears or reverses when the groups are combined.
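A small worked example may make Simpson's Paradox concrete. The counts below are made up for illustration: treatment A has the higher success rate within each group, yet B looks better once the groups are pooled, because A was mostly applied in the harder group.

```python
# Made-up counts: (successes, trials) per group and treatment.
counts = {
    "group 1": {"A": (9, 10), "B": (80, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes: int, trials: int) -> float:
    return successes / trials


# Within every group, A has the higher success rate.
a_wins_in_each_group = all(
    rate(*by_treatment["A"]) > rate(*by_treatment["B"])
    for by_treatment in counts.values()
)

# Pool the groups by summing successes and trials per treatment.
pooled = {}
for treatment in ("A", "B"):
    successes = sum(counts[g][treatment][0] for g in counts)
    trials = sum(counts[g][treatment][1] for g in counts)
    pooled[treatment] = rate(successes, trials)

# After pooling, the direction reverses: B now looks better than A.
b_wins_when_pooled = pooled["B"] > pooled["A"]
```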
Explain Overfitting
A more complex explanation will often describe your data better than a simple one. However, a simpler explanation is usually more representative of the underlying relationship.
Best Approaches for Data Science
use models based on real-life data (instead of playground data)
Feature Scaling (standardisation)
Feature Selection (preprocessing)
cross-validation (use a held-out test set)
aim for a just-right fit (model interpretation)
Look at data before results
Naive Baseline Model
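Two items from this list, feature scaling by standardisation and the naive baseline model, can be sketched in a few lines. This is only an illustration under assumptions: z-score scaling is taken as the meaning of standardisation, the baseline is assumed to be "always predict the mean of the training targets", and all data and function names are made up.

```python
from statistics import mean, pstdev


def standardise(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]


def mean_baseline(train_targets):
    """Naive baseline: always predict the mean of the training targets."""
    constant = mean(train_targets)
    return lambda _features: constant


def mean_absolute_error(predictions, targets):
    return mean(abs(p - t) for p, t in zip(predictions, targets))


# Made-up feature column and targets.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardise(heights_cm)  # now has mean 0 and standard deviation 1

train_y = [10.0, 20.0, 30.0]
test_y = [15.0, 25.0]
baseline = mean_baseline(train_y)
baseline_mae = mean_absolute_error([baseline(None) for _ in test_y], test_y)
```

Any real model should beat `baseline_mae` on the same test set; if it does not, the added complexity is not paying off.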
Define Data Integration
correct, complete and (as far as possible) redundancy-free data combined from different data sources and stored in a (semi-)structured data store
Define Information Retrieval
process of obtaining information resources that are relevant to an information need (e.g. full-text searches)
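The full-text search mentioned as an example can be sketched with an inverted index. This is a minimal illustration, assuming whitespace tokenisation and that a document is relevant when it contains every query word; the documents and function names are made up.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = {
    1: "data warehouse systems store integrated data",
    2: "search engines retrieve relevant documents",
    3: "federated database systems integrate data sources",
}
index = build_inverted_index(documents)
hits = search(index, "data systems")
```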
Give Examples of Data Storage Elements for Data Integration
Give examples of system frameworks for data integration
Data Warehouse Systems
Federated database system
Portals, Integration of News
(Meta-) Search engines
Define Data Warehouse (DW)
subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions
Describe Data Analysis
over long period of time
may have various evaluation perspectives / goals
requires cooperation and collaboration of different actors (e.g., stakeholders, IT systems, etc.)
What are the goals of the Data Analysis Process?
Exploration of Data
Improvement of the existing models
Prediction and classification
Comparison of different groups
What approaches can be used for Data Analysis?
Exploratory Data Analysis (EDA)
Knowledge Discovery in Databases (KDD)
the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data
is an interactive, multi-step and iterative process
data mining is a central step
What steps are in the generic approach to Data Integration?
Integration & Restructuring
How can Data Integration Conflicts be resolved?
Explicit Value Mapping
original/reference data source
Usage of general knowledge or domain-specific knowledge
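Two of these strategies, explicit value mapping and falling back to a reference data source, can be sketched as follows. The attribute, the source encodings and the function names are hypothetical and serve only as an illustration.

```python
# Hypothetical encodings used by two local sources for the same attribute.
SOURCE_A_TO_TARGET = {"m": "male", "f": "female"}
SOURCE_B_TO_TARGET = {1: "male", 2: "female"}


def map_value(raw, mapping):
    """Explicit value mapping: translate a source code to the integrated code."""
    return mapping.get(raw)  # None marks a value the mapping does not cover


def resolve(value_a, value_b, reference="a"):
    """On conflict, trust the source declared as the reference."""
    if value_a == value_b or value_b is None:
        return value_a
    if value_a is None:
        return value_b
    return value_a if reference == "a" else value_b


record_a = {"id": 7, "gender": "f"}
record_b = {"id": 7, "gender": 1}  # contradicts source A

mapped_a = map_value(record_a["gender"], SOURCE_A_TO_TARGET)
mapped_b = map_value(record_b["gender"], SOURCE_B_TO_TARGET)
integrated_gender = resolve(mapped_a, mapped_b, reference="a")
```

The mapping removes representation conflicts (different codes for the same value); the reference source decides the remaining genuine contradictions.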
Explain the Core Requirement for Data Integration Completeness
The integrated schema must include all concepts contained in any local schema; no information contained in a local schema should be lost or ignored
Explain the Core Requirement for Data Integration Correctness
all information contained in the integrated schema must be semantically equivalent to at least one local schema
inter-schema relationships newly added during integration must not contradict information from the local schemas, i.e. only consistent extensions to the existing schemas are allowed
Explain the Core Requirement for Data Integration Sufficiency
Real-world concepts modeled in multiple local schemas should be represented only once in the integrated schema
Explain the Core Requirement for Data Integration Complexity
The integrated schema should be easy to understand
Which univariate graphical representations are used in EDA?
Stem-and-Leaf Display
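A stem-and-leaf display can be produced with a few lines of code. This sketch assumes non-empty, non-negative integer data, with the units digit as leaf and the remaining digits as stem; the function name and the observations are made up.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines; stem = tens and above, leaf = units digit."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    lines = []
    # Empty stems are kept so that gaps in the distribution stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


# Made-up non-negative integer observations.
observations = [12, 15, 21, 23, 23, 38, 41]
display = stem_and_leaf(observations)
```

Printing `"\n".join(display)` shows one row per stem, so the row lengths give a rough picture of the distribution while the individual values stay readable.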
Name the steps of the KDD Process (Core Phases)
What are the most time-consuming phases of the KDD core model?
What is the goal of the association analysis?
analysis of relationships in transactions (e.g. cross-selling)
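Relationships in transactions are commonly quantified with support and confidence. These two measures are not named in the notes and are assumed here as the usual choice; the transactions are made up.

```python
# Made-up market-basket transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
```

A rule such as bread -> butter with high support and confidence would be a candidate for cross-selling.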
What is a cluster?
set of objects, which have
high degree of similarity to each other
lowest possible degree of similarity to other objects outside the cluster
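The two properties can be checked numerically. This sketch assumes that similarity is measured by Euclidean distance (smaller distance means more similar); the points and function names are made up.

```python
from itertools import combinations, product
from math import dist
from statistics import mean

# Made-up 2-D points, already assigned to two clusters.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]


def mean_intra_distance(cluster):
    """Average distance between pairs of objects inside one cluster."""
    return mean(dist(p, q) for p, q in combinations(cluster, 2))


def mean_inter_distance(cluster, others):
    """Average distance from objects in the cluster to objects outside it."""
    return mean(dist(p, q) for p, q in product(cluster, others))


intra_a = mean_intra_distance(cluster_a)
inter_ab = mean_inter_distance(cluster_a, cluster_b)
```

For a good clustering, `intra_a` is small compared with `inter_ab`.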
Name different classes of clustering methods
What is an incremental decision tree algorithm?
a particular subset of the training data may be processed multiple times by the algorithm during decision tree construction