Fayyad's KDD process:
Knowledge Discovery in Databases process (KDD)
Iterative process with 6 steps
Steps:
Selection: Selecting relevant data from the database
Pre-processing: Cleaning and transforming the data
Data reduction: Reducing the dimensionality of the data
Data visualization: Creating visual representations of the data
Modeling: Building models of the data
Evaluation: Evaluating the models and the process
Focus on creating an end-to-end process
SEMMA
Sample: Select a subset of data for analysis
Explore: Investigate the data by creating descriptive statistics, visualizations and identifying relationships
Modify: Transform the data to make it more suitable for modeling
Model: Apply various modeling (data mining) techniques to create models that may provide the desired outcome
Assess: Evaluate the performance of the models
(SEMMA itself has five steps and ends with Assess; deployment, i.e. implementing the chosen model in a production environment, is not part of it)
CRISP-DM
Cross Industry Standard Process for Data Mining
A process model with 6 phases:
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
ASUM-DM
Analytics Solutions Unified Method for Data Mining/Predictive Analytics
A process model released by IBM in 2015 as a refinement and extension of CRISP-DM
Keeps the CRISP-DM phases and adds project management, infrastructure and operations aspects
(no longer active)
What do I need to do when determining the business objectives?
- Informally describe the problem to be solved
- Specify all business questions as precisely as possible
- Specify expected benefits in business terms
- Beware of setting unattainable goals—make them as realistic as possible
Name the different types of bias
- User-to-Data
- Data-to-Algorithm
- Algorithm-to-User
What is Data-to-Algorithm bias?
- Measurement Bias: how we choose, utilize and measure particular features
- Omitted Variable Bias: one or more important variables are left out of the model
- Representation Bias: how we sample from a population during data collection process
- Aggregation Bias: false conclusions are drawn about individuals from observing the entire population
(Simpson’s Paradox: e.g. UC Berkeley student admission study)
- Longitudinal Data Fallacy: applying longitudinal analysis across categories
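The aggregation effect behind Simpson's Paradox can be reproduced with simple arithmetic. The sketch below uses invented counts (they are not the UC Berkeley figures), and the department and group names are hypothetical. It shows a group that is ahead in every department but behind once the departments are pooled.

```python
# Hypothetical admission counts: (applicants, admitted) per department and group.
# These numbers are invented for illustration; they are NOT the Berkeley data.
data = {
    "dept_easy": {"group_a": (80, 48), "group_b": (20, 14)},
    "dept_hard": {"group_a": (20, 2), "group_b": (80, 16)},
}

# Admission rate per department and group.
per_dept = {
    dept: {group: admitted / applicants
           for group, (applicants, admitted) in groups.items()}
    for dept, groups in data.items()
}

def overall(group):
    """Admission rate of a group after pooling all departments."""
    applicants = sum(data[dept][group][0] for dept in data)
    admitted = sum(data[dept][group][1] for dept in data)
    return admitted / applicants

for dept, rates in per_dept.items():
    print(dept, rates)
print("overall", {g: overall(g) for g in ("group_a", "group_b")})
```

group_b has the higher rate within each department, yet group_a has the higher pooled rate, because group_b applies mostly to the department with the low admission rate.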
What is the User-to-Data Bias?
- Historical Bias: already existing bias and socio-technical issues in the world manifest via the data generation process, even given perfect sampling and feature selection.
- Population Bias: demographics, representativeness and user characteristics of the user population differ from those of the target population
- Social Bias: actions by others affect our judgement.
- Self-Selection Bias: sub-type of sampling bias, where research question influences the data selection (self-motivating; neg
Is the goal of ML / DM to discriminate?
Yes, goal of ML / DM *is* to discriminate
What is oversampling?
What is Algorithm-to-User bias?
- Algorithmic Bias: algorithmic design choices, e.g. application on groups/subgroups, parameterization
- User Interaction Bias:
• Presentation Bias: influenced by the way information is presented, i.e. visual display, space allotted, …
• Ranking Bias: typical behavior, i.e. clicking on top-ranked results
- Popularity Bias: popular items receive more attention, not indicator of quality
- Emergent Bias: user behavior, cultural values or societal knowledge change
- Evaluation Bias: selection of wrong benchmarks, wrong metrics
What are the human cognitive biases?
Automation bias, group attribution bias, implicit bias, confirmation bias, in-group bias, out-group homogeneity bias, societal bias
- Automation bias: trusting the machine without questioning it
- Group attribution bias: assuming that what is true for an individual is also true for the group; may be due to non-representative sampling (statistical bias)
- Implicit bias: making unconscious assumptions in design
- Confirmation bias: the hypothesis is more likely to be confirmed through intentional or unintentional interpretation of the data, or by continuing training until the hypothesis is confirmed (experimenter bias); a "form of implicit bias" that can cause selection bias or data label bias
"What You See Is All There Is" (WYSIATI) bias.
- In-group bias: showing partiality to one's own group (friends, colleagues)
- Out-group homogeneity bias: seeing out-group members as more alike (less variation) than members of one's own group ("they are alike; we are diverse")
- Societal bias: shared by group, amplified: cultural assumptions, historical bias, “a form of data bias”; systemic bias, institutional bias
What are data biases?
Statistical bias
Selection bias
Sampling bias: data not collected randomly
Coverage bias: samples do not represent deployment setting
- Confounding variables: variable that influences both dependent and independent variable
- Non-representative sampling: biased training data selection
- Missing features and labels: systematic errors
- Data processing: missing value imputation, outlier removal, …
- Data aggregation bias: groups of data with different distributions
Engineering decision biases:
Algorithm selection bias: using linear model for non-linear problem; definition of ensemble methods
Hyperparameter tuning bias: architecture of DNN, activation function
Bias and fairness: what are the important points on individual fairness, group fairness and subgroup fairness?
Individual Fairness:
Similar predictions for similar individuals
Issue: Definition of "similar"
Group Fairness:
Treat different groups equally when group membership is not causal to treatment
Subgroup Fairness:
Check whether fairness criteria hold over several subgroups
What do you have to do in Data understanding?
Task 2.2 Describe data
- Attribute types and values checking
- Volumetric analysis of data
• Identify data and method of capture
• Perform basic statistical analyses
• Report tables and their relations
• Check data volume, number of multiples, complexity
• Check specifically for free text entries
What are concepts, instances and attributes in a dataset?
Concepts: things that can be learned
- E.g. list of topics for texts, spam/non-spam for email
Instance: example of a concept, data point
- E.g. individual text documents; animals; social network nodes; individual persons
Attribute: measurement/description of an instance
- E.g. text described by a bag-of-words (BOW) representation using TF-IDF
Name the 4 main attribute types
4 main types
- Nominal
- Ordinal
- Interval
- Ratio
What is nominal attribute type?
Nominal (aka categorical)
Classification: class labels are nominal values
- Music: genres (jazz, pop, rock, …)
- Text: spam/non-spam; sports, politics, weather; report, interview
Attributes can be nominal too
- Persons: eye color, hair color, city of birth
- Nominal attributes can be numeric
(e.g. Zip-code, numeric encodings of categories)
What are ordinal attribute types?
Impose an order on discrete categories
But: no distance defined!
Distinct labels from a defined vocabulary, numeric or strings
- Temperature: cold < cool < mild < hot < very hot
- Grades: A > B > C > D > E > F; 1 > 2 > 3 > 4 > 5
What is the interval attribute type?
Ordered elements with fixed distance in-between
Discrete or continuous values
- Time: year -> can calculate the difference between 2011 and 2018
What is the ratio attribute type?
Continuous values, zero-point defined
Usually represented as real numbers
Cannot be used as class labels! (-> binning or regression)
What is Anscombe's quartet?
Anscombe's quartet is a set of four datasets, each consisting of eleven (x,y) points, created by Francis Anscombe in 1973 to demonstrate the importance of graphing data before analyzing it. All four datasets have the same summary statistics (mean, variance, correlation, and linear regression line) but when plotted, they look very different. The purpose of Anscombe's quartet is to show that summary statistics can be misleading and that it's important to visualize the data before making any conclusions.
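A short sketch can make the point concrete by computing the summary statistics for all four datasets. The values below are the commonly published quartet data. The helper `pearson` and the dictionary names are illustrative choices, not part of the original notes.

```python
from math import sqrt
from statistics import mean

# Commonly published values of Anscombe's quartet (datasets I-III share x).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# (mean of x, mean of y, correlation) per dataset.
summary = {
    name: (mean(xs), mean(ys), pearson(xs, ys))
    for name, (xs, ys) in QUARTET.items()
}

for name, (mx, my, r) in summary.items():
    print(f"{name}: mean_x={mx:.2f} mean_y={my:.2f} r={r:.3f}")
```

The four rows print almost the same numbers, which is exactly why the statistics alone cannot distinguish the datasets; only a plot reveals the curve, the outlier and the single high-leverage point.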
What is done in the data exploration report?
Analyze (visualize!) properties of interesting attributes in detail
Identify characteristics of sub-populations
Form hypotheses and identify actions
Transform the hypothesis into a data mining goal, if possible
Perform basic analysis to verify the hypotheses
What is in the data quality report?
Identify special values and catalog their meaning
Check coverage (e.g., are all possible values represented?)
Identify missing attributes and blank fields
Check spelling and format of values
What are the 5 Tasks of Data Preparation?
Select Data
Clean data
Construct data
Integrate Data
Format Data
What do you do in the Data Preparation process when selecting data? -> What is the rationale for including / excluding data?
Perform significance and correlation tests
Reconsider Data Selection Criteria (Task 2.1) in light of experiences of data quality and data exploration
Reconsider Data Selection Criteria (See Task 2.1) in light of experience of modeling (iterations)
Consider the use of sampling techniques
Identify attribute importance and consider options for weighting
Data preparation, clean data -> What do I do when cleaning data?
Reconsider how to deal with any observed type of noise
Correct, remove, or ignore noise
Add noise (!), data augmentation / synthetic data
Decide how to deal with special values and their meaning
3 Data Preparation
Task 3.3 Construct data
Output 3.3.1 Derived attributes
Activities
- Transform to different attribute types (Binning, 1-to-n coding, …)
- Decide if any attribute should be normalized
(e.g., k-means clustering algorithm with age and income)
- How can missing attributes be constructed or imputed?
Decide type of construction (e.g., aggregate, average, induction)
- Add new attributes to the accessed data
Preprocessing: Coding
Nominal/ordinal data -> some algorithms can only handle numeric values -> 1-to-N coding
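A minimal sketch of 1-to-N coding (often called one-hot encoding), assuming the categories are simply the distinct values seen in the data. The function name `one_hot` and the genre values are illustrative.

```python
def one_hot(values):
    """Encode nominal values as 0/1 vectors with one position per category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded

categories, encoded = one_hot(["rock", "jazz", "pop", "jazz"])
print(categories)  # column order of the encoding
for row in encoded:
    print(row)
```

Each instance gets exactly one 1, so no artificial order or distance between categories is introduced, unlike a plain integer encoding.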
Binning (aka "bucketing"): what is it?
It is part of the preprocessing step "Coding"
Grouping continuous or numerical data into a smaller number of discrete "bins" or "buckets"
Dividing a range of continuous values into fixed number of intervals or bins
Assigning each data point to the bin that corresponds to its value
Make it easier to visualize and analyze the data
Can be used to group data into meaningful categories, create histograms, handle outliers, or group the data by certain range.
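The description above can be sketched as equal-width binning, which is only one of several possible binning strategies. The function name `equal_width_bins` and the sample ages are assumptions for illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index (0 .. n_bins-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls into the first bin.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((value - lo) / width), n_bins - 1) for value in values]

ages = [3, 17, 25, 40, 64, 90]
bins = equal_width_bins(ages, 3)
print(list(zip(ages, bins)))
```

With three bins over the range 3 to 90, each interval is 29 wide, so the first three ages share bin 0. Equal-frequency binning would instead place the same number of instances in every bin.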
How can you deal with missing values in preprocessing?
- Delete instance
- Ignore in calculation
- Data Imputation: substitute value
How can missing values be substituted?
- Mean value of the attribute (computed from other samples)
- Random selection of a value from another (similar?) sample
- Regression: using other attributes to predict the value
- Clustering: value of the cluster centroid
- Nearest neighbour: value of the closest sample
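As a sketch of the simplest option in the list, the following replaces missing entries with the attribute mean. It assumes missing values are represented as `None`. The function name and the sample heights are illustrative.

```python
from statistics import mean

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    fill = mean(observed)  # computed from the other samples only
    return [fill if value is None else value for value in column]

heights = [170.0, None, 180.0, 175.0, None]
imputed = impute_mean(heights)
print(imputed)
```

Mean imputation keeps the attribute mean unchanged but shrinks its variance, which is one reason to consider the regression or nearest-neighbour variants when many values are missing.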
Why do I need to scale data, and when is it necessary?
Different variables may exhibit vastly different value ranges
- E.g. a length variable measured in cm, inches, or meters
- Different types of measurements: length, speed, temperature, …
Example: animal dataset with height in centimeters and weight in kg -> rescale to height in meters and weight in kg
Scaling reduces the distortion caused by differing value ranges
What are forms of scaling?
Min-Max Scaling: Also known as normalization, this method scales the data to a fixed range, usually between 0 and 1. It is calculated as Xscaled = (X-Xmin) / (Xmax-Xmin)
Standardization (Z-Score Scaling): This method scales the data to have a mean of 0 and a standard deviation of 1. It is calculated as Xscaled = (X-mean(X)) / std(X)
Unit Length Scaling (or L2 normalization) : This method scales the data so that the magnitude of the feature vector is 1. It is calculated as Xscaled = X / ||X||
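The three formulas can be sketched in plain Python as follows. Two assumptions are made here: the z-score uses the population standard deviation (`pstdev`), and the inputs are not constant, since a constant attribute would cause a division by zero. Function names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev

def min_max(xs):
    """Xscaled = (X - Xmin) / (Xmax - Xmin); result lies in [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Xscaled = (X - mean(X)) / std(X); result has mean 0 and std 1."""
    mu, sigma = mean(xs), pstdev(xs)  # assumption: population std
    return [(x - mu) / sigma for x in xs]

def unit_length(vector):
    """Xscaled = X / ||X||; the scaled vector has Euclidean length 1."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

values = [2.0, 4.0, 6.0, 8.0]
print(min_max(values))
print(z_score(values))
print(unit_length([3.0, 4.0]))
```

Min-max and z-score scaling are applied per attribute (column), whereas unit-length scaling is applied per instance (row vector).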
Should I exclude input variables that depend directly on each other (measured by correlation)?
Yes. They might carry disproportionate weight in the output prediction
What are the 4 tasks of modeling and their outputs?
Select modeling technique
-> Modeling technique
-> modeling assumptions
Generate test design
-> Test design
Build model
-> Parameter settings
-> models
-> model description
Assess model
-> model assessment
-> revised parameter settings
What is one major goal of data preparation?
Eliminate “wrong influence” of variables
What are possible constraints when selecting a modeling technique?
– Political, management, understandability
– Technical support, HW platform, SW stack
– Performance, scalability
– Staff training/knowledge
What does regression do?
tries to predict a continuous variable (Supervised learning)
Supervised learning: Classification -> are the output variables labeled?
Yes, Classification: discrete output variable (pre-defined set of values) referred to as “class”
What are other learning models?
Semi-supervised learning (labeled and unlabeled data)
Positive-unlabeled learning (PU)
Reinforcement learning (input/output pairs are not presented explicitly)
Zero-shot learning
- Learning a class for which there is no training data
- Tries to identify intermediary concepts
Explain reinforcement learning
Reinforcement learning (RL) is a form of machine learning in which an agent is trained to make decisions by maximizing a reward signal. The agent interacts with an environment and learns to take the actions that lead to the highest reward over time.
Explain positive-unlabeled (PU) learning
Positive-unlabeled (PU) learning is a machine learning technique for settings in which only some positive examples are labeled and the rest of the data is unlabeled. Unlike conventional approaches, which use both positive and negative labeled examples, PU learning has no labeled negatives: it tries to recognize and generalize the rules or structure behind the positive examples, i.e. the examples that show the desired behavior or property. It is useful when only a limited number of positive examples is available and negative examples are hard to obtain or generate. It is also a difficult setting, because nothing can be learned from labeled negative examples, so less information is available.
Explain zero-shot learning
Zero-shot learning is a machine learning technique that enables a model to recognize and classify new objects or categories without having seen examples of them during training. Based on previously learned relationships between categories and certain features, referred to as "semantic embeddings", the model learns to generalize to these new categories. It therefore requires given semantic knowledge about the categories it is supposed to classify.
It is useful when the training dataset does not contain enough examples for all categories, and it allows the model to generalize to new, unknown categories without additional training.
Task 4.2 Generate test design, Goal
Define a procedure to test the model’s quality and validity prior to modeling
- Define train / test / validation data sets
Output 4.2.1 Test design
Describe plan for training, testing, and evaluating the models
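One common way to define the three data sets is a seeded random split. The sketch below assumes a 60/20/20 ratio and a fixed seed. Both are illustrative choices rather than part of the CRISP-DM task description, and the function name `split_dataset` is hypothetical.

```python
import random

def split_dataset(instances, train=0.6, validation=0.2, seed=42):
    """Shuffle reproducibly and cut into train / validation / test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # remainder is held out for testing
    )

train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

The validation set is used for parameter tuning, while the test set is touched only once for the final assessment. Fixing the split before modeling is what makes the later comparison of models fair.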
Task 4.3 Build model, Goal
Run the modeling tool on the prepared dataset to create one or more models.
Outputs
- 4.3.1 Parameter settings
- 4.3.2 Model(s)
- 4.3.3 Model description
Task 4.4 Assess model, Goal
- Assess the model to determine how far it meets the data mining success criteria
- Purely technical assessment
Output 4.4.1 Model assessment
Cross-check with Data Mining Success Criteria
Test result according to a test strategy
Interpret results in business terms
Check effect on data mining goal
Analyze potential for deployment of each result
Output of Task 4.4.2 Revised parameter settings
- Model task. Iterate model building and assessment until you find the best model
According to the model assessment, revise parameter settings and tune them for the next run
What steps are involved in “Evaluation” in CRISP-DM?
Evaluate results
Review process
Determine next steps
What is the overall goal of evaluation?
RESULTS = MODELS + FINDINGS
Findings need not be related to any questions or objectives
This step assesses the degree to which the model meets the business objectives
Determine if there is a business reason why this model is deficient
5.1 Evaluate results -> Goal?
- Assess the degree to which the model meets the business objectives
- Check if there is some business reason why this model is deficient
- Ideally test the model(s) on test applications in the real application or on test data (no return to parameter tuning!)
Output: 5.1.1. Assessment of data mining results: Activities?
- Understand the data mining results & interpret in terms of application
- Evaluate and assess results w.r.t. business success criteria
- Compare evaluation results and interpretation
Output: 5.1.2 Approved models: Activities?
Select and approve the generated models that meet the selected criteria
- Aim for formal approval by project initiator / all stakeholders
- May need to include revised deployment plan, cost estimate
- Provide risk analysis of deployment (error rate impact)
Subgroup fairness?
checks whether fairness criteria (e.g. equal false positive rate) hold over several subgroups
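A sketch of such a check for the false positive rate criterion. The record layout `(subgroup, y_true, y_pred)`, the subgroup names and the toy predictions are assumptions for illustration only.

```python
from collections import defaultdict

def false_positive_rates(records):
    """Return FP / (FP + TN) per subgroup.

    records: iterable of (subgroup, y_true, y_pred) with 0/1 labels.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for subgroup, y_true, y_pred in records:
        if y_true == 0:  # only true negatives can become false positives
            negatives[subgroup] += 1
            if y_pred == 1:
                false_positives[subgroup] += 1
    return {g: false_positives[g] / n for g, n in negatives.items()}

# Toy predictions for two hypothetical subgroups.
records = [
    ("a", 0, 1), ("a", 0, 0), ("a", 1, 1), ("a", 0, 0),
    ("b", 0, 1), ("b", 0, 1), ("b", 0, 0), ("b", 1, 0),
]
rates = false_positive_rates(records)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A large gap between subgroups signals that the criterion does not hold. How large a gap is acceptable is a project decision, not something the code can answer.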
5.3 Determine next steps: 2 outputs
Output 5.3.1 List of possible actions
Output 5.3.2 Decision
What does Deployment consist of?
Plan Deployment
Plan Monitoring and Maintenance
Produce Final report
Review Project
Task 6.2 Plan monitoring and maintenance, Goal?
- Monitoring and maintenance are essential in continuous use
- Monitoring for data drift, bias, …
- Needs to be (semi-)automated!
- Maintenance strategy
The output of 6.2 Plan monitoring and maintenance is the monitoring and maintenance plan. What are the activities in there?
- Check for dynamic aspects
- Decide how accuracy/errors/… will be monitored
- Determine when result or model should not be used any more
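A very simple automated drift check could compare the mean of newly arriving values of one attribute with a reference sample from training time. The threshold of two reference standard deviations, the function name and the sample values are assumptions; real monitoring setups usually use more robust statistical tests.

```python
from statistics import mean, stdev

def mean_shift_alert(reference, current, threshold=2.0):
    """Flag drift when the current mean moves more than `threshold`
    reference standard deviations away from the reference mean."""
    shift = abs(mean(current) - mean(reference)) / stdev(reference)
    return shift > threshold

reference = [10.0, 11.0, 9.0, 10.5, 9.5]  # attribute values seen at training time
stable = [10.2, 9.8, 10.1]                # new data, similar distribution
drifted = [14.0, 15.0, 13.5]              # new data, clearly shifted

print(mean_shift_alert(reference, stable))
print(mean_shift_alert(reference, drifted))
```

An alert like this can feed the decision of when a result or model should no longer be used, for example by triggering retraining or a manual review.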
Task 6.3 Produce final report -> what is the output?
Output 6.3.2 Final presentation
Output 6.3.2 Final presentation, what are activities?
- Decide on target group for the presentation
- Select which items from the final report to be included in presentation
- Communicate clearly, addressing the target groups!
CRISP-DM ALL
Last modified 2 years ago