Correlation Information
Data
Information
Knowledge
Insights
Wisdom
What pruning strategies can be used?
Subtree-Replacement
Subtree-Raising
Definition Big Data
information assets with high volume, velocity and/or variety that require innovative forms of information processing for enhanced insight discovery, decision-making and process automation.
4 V’s of Big Data
Volume
Velocity: sources of data are growing rapidly
Variety: heterogeneous sources of data
Value: value after applying data analysis
Artificial Intelligence (AI)
How smart and intelligent machines are designed
mathematical modelling
Machine Learning (ML)
Optimization of objective/target functions
methods and algorithms
ability to learn without being explicitly programmed
teach themselves to grow and change when exposed to new data
Data Mining (DM)
Discovering hidden knowledge and patterns
data, finding patterns
Business Intelligence (BI)
integration of strategies, processes and technologies
in order to deliver strategic knowledge about status, potentials and perspectives
from distributed and heterogeneous enterprise, market and competitive data
Data Science (DS)
interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured
Machine Learning Classes
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Abbreviation KDD
Knowledge Discovery in Databases
Abbreviation CRISP-DM
Cross Industry Standard Process for Data Mining
Meaning of CRISP-DM
generic process model
describes commonly used approaches that data mining and machine learning experts use to tackle problems
Core Components of a Data Science Project
People/Team
Soft-/Hardware
What makes Data Science Projects special?
Many stakeholders
High/wrong expectations
wrong datasets (biased, noisy, wrong, …)
company has no data driven decision culture
lack of expertise in DA/ML/DL/DE
Explain Machine Learning Model
maps data to output
Learning algorithms can be used to acquire this model from learning (training) data
Name all Data Mistakes
Bias
Cherry Picking
False Causality
McNamara Fallacy
Danger of Summary
Simpson’s Paradox
Overfitting
Explain Bias
Drawing conclusions from a set of data that isn’t representative of the population you’re trying to understand.
Explain Cherry Picking
The practice of selecting results that fit your claim and excluding those that don’t. The worst and most harmful example of being dishonest with data.
Explain False Causality
To falsely assume when two events occur together that one must have caused the other
Explain McNamara Fallacy
Relying solely on metrics in complex situations can cause you to lose sight of the bigger picture.
Explain Danger of Summary
It can be misleading to only look at the summary metrics of data sets.
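A minimal sketch of this card, using made-up numbers: two data sets share the same mean but have very different spreads, so the mean alone hides the difference.

```python
from statistics import mean, pstdev

# Two made-up data sets for illustration only.
constant_scores = [50, 50, 50, 50, 50]
spread_scores = [0, 25, 50, 75, 100]

# The summary metric (mean) is identical for both data sets ...
same_mean = mean(constant_scores) == mean(spread_scores)

# ... but the spread (population standard deviation) differs strongly.
spread_constant = pstdev(constant_scores)
spread_wide = pstdev(spread_scores)

print(same_mean, spread_constant, round(spread_wide, 2))
```

Looking at a second statistic, or at a plot of the raw values, reveals what the single summary number conceals.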
Explain Simpson's Paradox
A phenomenon in which a trend appears in different groups of data, but disappears or reverses when the groups are combined.
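A minimal sketch of the paradox with made-up counts (not real data): option A has the higher success rate within each group, yet option B has the higher rate once the groups are pooled. The group and option names are hypothetical.

```python
# (successes, trials) per option and group; made-up numbers for illustration.
data = {
    "group_1": {"A": (8, 10), "B": (70, 100)},
    "group_2": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, trials):
    return successes / trials


# Within every single group, option A has the higher success rate.
a_wins_in_each_group = all(
    rate(*group["A"]) > rate(*group["B"]) for group in data.values()
)


def combined_rate(option):
    """Success rate of an option after pooling all groups."""
    successes = sum(group[option][0] for group in data.values())
    trials = sum(group[option][1] for group in data.values())
    return successes / trials


# After pooling, the trend reverses: option B now looks better.
a_wins_combined = combined_rate("A") > combined_rate("B")

print(a_wins_in_each_group, a_wins_combined)
```

The reversal arises because the two options are distributed very unevenly across the groups.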
Explain Overfitting
A more complex explanation will often describe your data better than a simple one. However, a simpler explanation is usually more representative of the underlying relationship.
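A minimal sketch of this effect, assuming made-up data that follow y = 2x with alternating noise. The "complex" model here is a memoriser (1-nearest-neighbour) and the "simple" model is a least-squares line; both helper names are hypothetical and chosen only for illustration.

```python
def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys):
    """Simple model: least-squares line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_memoriser(xs, ys):
    """Complex model: returns the target of the nearest training point."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]        # y = 2x with noise of +1 / -1
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [1, 3, 5, 7, 9]            # noise-free y = 2x

simple = fit_line(train_x, train_y)
memoriser = fit_memoriser(train_x, train_y)

print("train:", mse(simple, train_x, train_y), mse(memoriser, train_x, train_y))
print("test: ", mse(simple, test_x, test_y), mse(memoriser, test_x, test_y))
```

The memoriser describes the training data perfectly, but the line is closer to the underlying relationship and therefore does better on the unseen points.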
Best Approaches for Data Science
use real-life based models (instead of playground data)
Feature Scaling (standardisation)
Feature Selection (preprocessing)
cross-validation (use separate test and learning sets)
aim for a just-right fit (model interpretation)
Look at data before results
Naive Baseline Model
Understand the cost function
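Feature scaling from the list above can be sketched as follows. This is a minimal example assuming z-score standardisation (subtract the mean, divide by the standard deviation); the function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


heights_cm = [150, 160, 170, 180, 190]  # made-up sample feature
scaled = standardise(heights_cm)
print([round(v, 3) for v in scaled])
```

After scaling, features measured in different units become comparable, which matters for distance-based and gradient-based methods.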
Define Data Integration
correct, complete and (as far as possible) redundancy-free data combined from different data sources and stored in a (semi-)structured data store
Define Information Retrieval
process of obtaining information resources that are relevant to an information need (e.g. full-text searches)
Give Examples for Data Storage Elements for Data Integration
File System
File
Markup File
Database
Web Forms
Web Services
Applications
Give examples of system frameworks for data integration
Data Warehouse Systems
Federated database system
Portals, Integration of News
(Meta-) Search engines
Define Data Warehouse (DW)
subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions
Describe Data Analysis
iterative process
over long period of time
may have various evaluation perspectives / goals
requires cooperation and collaboration of different actors (e.g., stakeholders, IT systems, etc.)
What is Data Analysis heavily driven by?
Methodology
Expertise
Domain knowledge and intuition
What are the goals of the Data Analysis Process?
Data Description
Exploration of Data
Dimension reduction
Model Development
Improvement of the existing models
Prediction and classification
Comparison of different groups
Hypothesis Testing
What does EDA stand for?
Exploratory Data Analysis
What approaches can be used for Data Analysis?
Exploratory Data Analysis (EDA)
Knowledge Discovery in Databases (KDD)
Statistical Analysis
Describe KDD
the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data
is an interactive, multi-step and iterative process
data mining is a central step
What steps are in the generic approach to Data Integration?
Pre-Processing
Schema Comparison
Schema Adaptation
Integration & Restructuring
Which Strategies for Data Integration can be used?
Bottom-Up
Top-Down
How can Data Integration Conflicts be resolved?
Explicit Value Mapping
similarity measures
original/reference data source
Usage of general knowledge or domain-specific knowledge
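Two of the strategies above, explicit value mapping and similarity measures, can be sketched as follows. This is a minimal example under stated assumptions: the mapping table, the threshold of 0.8 and the sample names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever similarity measure a real project would choose.

```python
from difflib import SequenceMatcher

# Explicit value mapping: different source encodings -> one target value.
# The table is a made-up example.
GENDER_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}


def map_value(raw):
    """Return the unified value, or None if the raw value is unknown."""
    return GENDER_MAP.get(raw.strip().lower())


def similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_entity(a, b, threshold=0.8):
    """Treat two names as the same entity if they are similar enough."""
    return similarity(a, b) >= threshold


print(map_value(" M "))
print(same_entity("Mueller GmbH", "Müller GmbH"))
print(same_entity("Mueller GmbH", "Schmidt AG"))
```

Values that cannot be mapped or matched automatically are the cases where a reference data source or domain knowledge is needed.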
What are the Core Requirements for Data Integration?
Completeness
Correctness
Sufficiency
Complexity
Explain the Core Requirement for Data Integration Completeness
The integrated schema must include all concepts contained in any local schema; no information contained in a local schema should be lost or ignored
Explain the Core Requirement for Data Integration Correctness
all information contained in the integrated schema must be semantically equivalent to at least one local schema
inter-schema relationships newly added during integration must not contradict information from the local schemas i.e. only consistent extensions to the existing schemas allowed
Explain the Core Requirement for Data Integration Sufficiency
Real-world concepts modeled in multiple local schemas should be represented only once in the integrated schema
Explain the Core Requirement for Data Integration Complexity
The integrated schema should be easy to understand
Which univariate graphical representations are used in EDA?
Stem-and-Leaf Display
Histogram
Box-Plot
Scatter
Curves
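As an illustration of the first item in the list above, a stem-and-leaf display can be sketched in a few lines. This is a minimal example assuming non-negative integers, with the tens as stem and the units as leaf; the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the display as a list of lines, one per non-empty stem."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # tens digit(s) and units digit
        groups[stem].append(leaf)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(groups.items())
    ]


lines = stem_and_leaf([12, 15, 21, 23, 23, 30, 47])
print("\n".join(lines))
```

Unlike a histogram, the display keeps the individual values readable while still showing the shape of the distribution.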
How is time graphically represented in EDA?
Time-Series Data
How is spatiality graphically represented in EDA?
Spatially structured data - Special Maps
What are the Phases of CRISP-DM?
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment
Name the steps of the KDD Process (Core Phases)
task analysis
preprocessing
data mining
postprocessing
deployment
What are the most time-consuming phases of the KDD core model?
What is the goal of the association analysis?
analysis of relationships in transactions (e.g. cross-selling)
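A minimal sketch of such an analysis, assuming the common support and confidence measures for a rule "antecedent -> consequent" (these measures are not named on the card); the transactions are made up.

```python
# Made-up shopping baskets for illustration only.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(itemset <= basket for basket in transactions)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"bread", "butter"}))
print(confidence({"bread"}, {"butter"}))
```

A rule with high support and confidence, such as "bread -> butter" here, is a candidate for cross-selling.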
What is the goal of segmentation?
to form a cluster from similar objects
What is a cluster?
set of objects, which have
high degree of similarity to each other
lowest possible degree of similarity to other objects outside the cluster
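The definition above can be checked numerically. This is a minimal sketch assuming Euclidean distance as the (dis)similarity measure and made-up 2-D points; small distances inside a cluster and large distances between clusters indicate a good clustering.

```python
from itertools import combinations, product
from math import dist

# Made-up 2-D points for illustration only.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]


def mean_intra_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(first, second):
    """Average distance between points of two different clusters."""
    pairs = list(product(first, second))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


print(mean_intra_distance(cluster_a))
print(mean_inter_distance(cluster_a, cluster_b))
```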
Name different classes of clustering methods
partitioning clustering
hierarchical clustering
distribution-based clustering
density-based clustering
grid-based clustering
model-based clustering
What is the goal of classification?
assign an object to (predefined) classes
What are the Phases of Classification?
Training
Application
What is the goal of prediction?
extrapolation of a given time series into the future
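A minimal sketch of such an extrapolation, assuming the simple drift method (continue the average change between the first and last observation); this is one of many possible forecasting techniques, and the series is made up.

```python
def drift_forecast(series, horizon):
    """Extend the series by `horizon` steps along its average slope."""
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + slope * step for step in range(1, horizon + 1)]


observed = [10, 12, 13, 16]  # made-up time series
forecast = drift_forecast(observed, 2)
print(forecast)
```

Such a naive forecast also serves as a baseline against which more elaborate prediction models can be compared.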
What is the drawback of Neural Networks?
reduced interpretability (black box)
What differentiation can be made for decision tree algorithms?
incremental
non-incremental
What is an incremental decision tree algorithm?
particular subsets of the training data may be processed multiple times by the algorithm during decision tree construction
What is a non-incremental decision tree algorithm?
operate on the whole dataset at once
What are the core processes of the C4.5 decision tree algorithm?
Form Tree Process
Growth Tree Process
Prune Tree Process
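While growing the tree, C4.5 selects split attributes by gain ratio (information gain divided by split information). The card does not cover this criterion, so the following is only a minimal sketch of it for one categorical attribute; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(items)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(items).values()
    )


def gain_ratio(feature_values, labels):
    """Information gain of the split divided by its split information."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for feat, lab in zip(feature_values, labels) if feat == value]
        remainder += len(subset) / total * entropy(subset)
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]      # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]    # tells nothing about the class

print(gain_ratio(informative, labels), gain_ratio(uninformative, labels))
```

The attribute with the highest gain ratio would be chosen for the next split in this sketch.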