Fayyad's KDD process:
Knowledge Discovery in Databases process (KDD)
Iterative process with 6 steps
Steps:
Selection: Selecting relevant data from the database
Pre-processing: Cleaning and transforming the data
Data reduction: Reducing the dimensionality of the data
Data visualization: Creating visual representations of the data
Modeling: Building models of the data
Evaluation: Evaluating the models and the process
Focus on creating an end-to-end process
SEMMA (SAS Institute)
Sample: Select a subset of data for analysis
Explore: Investigate the data by creating descriptive statistics and visualizations and by identifying relationships
Modify: Transform the data to make it more suitable for modeling
Model: Apply various modeling (data mining) techniques to create models that may provide the desired outcome
Assess: Evaluate the performance of the models
(Note: unlike KDD and CRISP-DM, SEMMA ends with Assess; deployment is not one of its phases)
CRISP-DM
Cross Industry Standard Process for Data Mining
A process model with 6 phases:
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
ASUM-DM
Analytics Solutions Unified Method for Data Mining/Predictive Analytics
IBM's refinement and extension of CRISP-DM
Adds templates and activities for infrastructure, operations, deployment, and project management
Combines traditional and agile implementation principles
Can be used in a wide range of industries and applications
(no longer actively maintained)
What do I need to do in "Determine the business objectives"?
- Informally describe the problem to be solved
- Specify all business questions as precisely as possible
- Specify expected benefits in business terms
- Beware of setting unattainable goals—make them as realistic as possible
Name different Types of Biases
- User-to-Data
- Data-to-Algorithm
- Algorithm-to-User
What is the Data-to-Algorithm Bias?
- Measurement Bias: arises from how we choose, utilize, and measure particular features
- Omitted Variable Bias: one or more important variables are left out of the model
- Representation Bias: how we sample from a population during data collection process
- Aggregation Bias: false conclusions are drawn about individuals from observing the entire population
(Simpson’s Paradox: e.g. UC Berkeley student admission study)
- Longitudinal Data Fallacy: treating cross-sectional data as if it were longitudinal (mixing heterogeneous cohorts), which can lead to false conclusions
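The Simpson's Paradox mentioned under Aggregation Bias can be reproduced in a few lines. The admission numbers below are invented for illustration (they are not the real UC Berkeley figures):

```python
# Hypothetical admission counts per department: (admitted, applied).
# Within each department women are admitted at a higher rate,
# yet the aggregate rate favors men -- Simpson's paradox.
depts = {
    "A": {"men": (80, 100), "women": (18, 20)},
    "B": {"men": (2, 20), "women": (20, 100)},
}

def rate(admitted, applied):
    return admitted / applied

for name, d in depts.items():
    # women higher in both departments
    print(f"Dept {name}: men {rate(*d['men']):.0%}, women {rate(*d['women']):.0%}")

total_m = sum(d["men"][0] for d in depts.values()) / sum(d["men"][1] for d in depts.values())
total_w = sum(d["women"][0] for d in depts.values()) / sum(d["women"][1] for d in depts.values())
print(f"Overall: men {total_m:.0%}, women {total_w:.0%}")  # men higher overall
```

The reversal happens because the groups apply to the two departments in very different proportions, which is exactly why aggregation bias requires per-group analysis.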
What is the User-to-Data Bias?
- Historical Bias: already existing bias and socio-technical issues in the world manifest via the data generation process, even given perfect sampling and feature selection
- Population Bias: demographics and user characteristics of the user population differ from those of the target population
- Social Bias: actions by others affect our judgement
- Self-Selection Bias: sub-type of sampling bias, where the research question influences the data selection (e.g. self-motivated respondents are over-represented)
Is the goal of ML / DM to discriminate?
Yes, the goal of ML / DM *is* to discriminate, in the sense of distinguishing between classes or groups; the problem is unjustified discrimination based on protected attributes.
What is oversampling?
Replicating (or synthetically generating) minority-class instances so that the class distribution becomes more balanced.
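A minimal sketch of random oversampling on a toy dataset (the feature/label pairs are invented for illustration):

```python
import random

random.seed(0)

# Toy imbalanced dataset of (feature, label) pairs; label 1 is the minority class.
data = [("a", 0), ("b", 0), ("c", 0), ("d", 0), ("e", 1)]

majority = [x for x in data if x[1] == 0]
minority = [x for x in data if x[1] == 1]

# Random oversampling: draw minority instances with replacement
# until both classes have the same size.
balanced = majority + random.choices(minority, k=len(majority))
print(len(balanced))  # 8 instances, 4 per class
```

Duplicating instances is the simplest variant; synthetic approaches (e.g. SMOTE) instead interpolate new minority samples.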
What is Algorithm-to-User bias?
- Algorithmic Bias: algorithmic design choices, e.g. application on groups/subgroups, parameterization
- User Interaction Bias:
• Presentation Bias: influenced by the way information is presented, i.e. visual display, space allotted, …
• Ranking Bias: typical behavior, i.e. clicking on top-ranked results
- Popularity Bias: popular items receive more attention, not indicator of quality
- Emergent Bias: user behavior, cultural values or societal knowledge change
- Evaluation Bias: selection of wrong benchmarks, wrong metrics
What are the human cognitive biases?
- Automation bias
- Group attribution bias
- Implicit bias
- Confirmation bias
- In-group bias
- Out-group homogeneity bias
- Societal bias
- Automation bias: trusting the machine's output without questioning it
- Group attribution bias: what is true for individual is also true for group, may be due to non-representative sampling (statistical b.)
- Implicit bias: making unconscious assumptions in design
- Confirmation bias: hypothesis more likely to be confirmed by intentional or unintentional interpretation of data; or
continuing training until hypothesis is confirmed (Experimenter b.) “form of implicit bias”, can cause selection bias or data label bias
“What You See Is All There Is” (WYSIATI) bias.
- In-group bias: showing favoritism toward one's own group (friends, colleagues)
- Out-group homogeneity bias: seeing out-group members as more alike (less variation) than in one’s group ("they are alike; we are diverse“)
- Societal bias: shared by group, amplified: cultural assumptions, historical bias, “a form of data bias”; systemic bias, institutional bias
What are data biases?
Statistical bias
Selection bias
Sampling bias: data not collected randomly
Coverage bias: samples do not represent deployment setting
- Confounding variables: variable that influences both dependent and independent variable
- Non-representative sampling: biased training data selection
- Missing features and labels: systematic errors
- Data processing: missing value imputation, outlier removal, …
- Data aggregation bias: groups of data with diff. distributions
Engineering decision biases:
Algorithm selection bias: using linear model for non-linear problem; definition of ensemble methods
Hyperparameter tuning bias: architecture of DNN, activation function
Bias and fairness: what are important points on individual fairness, group fairness, and subgroup fairness?
Individual Fairness:
Similar predictions for similar individuals
Issue: Definition of "similar"
Group Fairness:
Treat different groups equally when group membership is not causal to treatment
Subgroup Fairness:
Check whether fairness criteria hold over several subgroups
What do you have to do in Data understanding?
Task 2.2 Describe data
- Attribute types and values checking
- Volumetric analysis of data
• Identify data and method of capture
• Perform basic statistical analyses
• Report tables and their relations
• Check data volume, number of multiples, complexity
• Check specifically for free text entries
What are concept, instance, and attribute in a dataset?
Concepts: things that can be learned
- E.g. list of topics for texts, spam/non-spam for email,
Instance: example of a concept, data point
- E.g. individual text documents; animals; social network nodes; individual persons
Attribute: measurement/description of an instance
- E.g. text described by BOW using tfidf;
Name the 4 main attribute types
4 main types
- Nominal
- Ordinal
- Interval
- Ratio
What is nominal attribute type?
Nominal (aka categorical)
Classification: class labels are nominal values
- Music: genres (jazz, pop, rock, …)
- Text: spam/non-spam; sports, politics, weather; report, interview
Attributes can be nominal too
- Persons: eye color, hair color, city of birth
- Nominal attributes can be numeric
(e.g. Zip-code, numeric encodings of categories)
What are ordinal attribute types
Impose an order on discrete categories
But: no distance defined!
Distinct labels from a defined vocabulary, numeric or strings
- Temperature: cold < cool < mild < hot < very hot
- Grades: A > B > C > D > E > F; 1 > 2 > 3 > 4 > 5
What is the interval attribute type?
Ordered elements with fixed distance in-between
Discrete or continuous values
- Time: year -> can calculate the difference between 2011 and 2018
Ratio as attribute type?
Continuous values, zero-point defined
Usually represented as real numbers
Cannot be used as class labels! (-> binning or regression)
What is Anscombe's Quartet?
Anscombe's quartet is a set of four datasets, each consisting of eleven (x,y) points, created by Francis Anscombe in 1973 to demonstrate the importance of graphing data before analyzing it. All four datasets have the same summary statistics (mean, variance, correlation, and linear regression line) but when plotted, they look very different. The purpose of Anscombe's quartet is to show that summary statistics can be misleading and that it's important to visualize the data before making any conclusions.
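A quick check on the first two of Anscombe's datasets (values as published in the 1973 paper) confirms that the summary statistics match even though dataset I is roughly linear and dataset II is a clean parabola:

```python
import statistics

# x is shared by datasets I-III; y1 and y2 are the first two response columns.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Nearly identical summary statistics...
print(round(statistics.mean(y1), 2), round(statistics.mean(y2), 2))  # 7.5 7.5
print(round(statistics.variance(y1), 1), round(statistics.variance(y2), 1))  # 4.1 4.1
# ...yet only a plot reveals how differently the points are arranged.
```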
What is done in the Data exploration report
Analyze (visualize!) properties of interesting attributes in detail
Identify characteristics of sub-populations
Form hypotheses and identify actions
Transform the hypothesis into a data mining goal, if possible
Perform basic analysis to verify the hypotheses
Data quality report what is in there?
Identify special values and catalog their meaning
Check coverage (e.g., are all possible values represented?)
Identify missing attributes and blank fields
Check spelling and format of values
What are the 5 Tasks of Data Preparation?
Select Data
Clean data
Construct data
Integrate Data
Format Data
What do you do in the Data Preparation process when selecting data? -> What is the rationale for including / excluding data?
Perform significance and correlation tests
Reconsider Data Selection Criteria (Task 2.1) in light of experiences of data quality and data exploration
Reconsider Data Selection Criteria (See Task 2.1) in light of experience of modeling (iterations)
Consider the use of sampling techniques
Identify attribute importance and consider options for weighting
Data preparation, clean data -> What do I do when cleaning data?
Reconsider how to deal with any observed type of noise
Correct, remove, or ignore noise
Add noise (!), data augmentation / synthetic data
Decide how to deal with special values and their meaning
3 Data Preparation
Task 3.3 Construct data
Output 3.3.1 Derived attributes
Activities
- Transform to different attribute types (Binning, 1-to-n coding, …)
- Decide if any attribute should be normalized
(e.g., k-means clustering algorithm with age and income)
- How can missing attributes be constructed or imputed?
Decide type of construction (e.g., aggregate, average, induction)
- Add new attributes to the accessed data
Preprocessing of Coding
Nominal/ordinal data -> some algorithms can only handle numeric values -> 1-to-N (one-hot) coding
Binning (aka „Bucketing“) – What is it?
It is part of the preprocessing: Coding
Grouping continuous or numerical data into a smaller number of discrete "bins" or "buckets"
Dividing a range of continuous values into fixed number of intervals or bins
Assigning each data point to the bin that corresponds to its value
Make it easier to visualize and analyze the data
Can be used to group data into meaningful categories, create histograms, handle outliers, or group the data by certain range.
How can you deal with missing values in preprocessing?
- Delete instance
- Ignore in calculation
- Data Imputation: substitute value
How can missing data be substituted (imputed)?
- Mean value of the attribute (computed from other samples)
- Random selection of a value from another (similar?) sample
- Regression: using other attributes to predict the value
- Clustering: value of the cluster centroid
- Nearest Neighbour: value of the closest sample
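The simplest of these, mean imputation, can be sketched as follows (toy values, with `None` marking missing entries):

```python
# Mean imputation: replace missing values (None) with the mean
# of the observed values of the same attribute.
def impute_mean(column):
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

ages = [25, None, 31, 40, None]
print(impute_mean(ages))  # [25, 32.0, 31, 40, 32.0]
```

Mean imputation preserves the attribute's mean but shrinks its variance, which is why model-based methods (regression, nearest neighbour) are often preferred.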
Why do I need to scale some data? Or when is it necessary?
Different variables may exhibit vastly different value ranges
- E.g. a length variable measured in cm, inch, or meters
- Different types of measurements: length, speed, temperature, …
Example: animal dataset with height in centimeters and weight in kg -> scale to height in meters and weight in kg
Reduction of distortion through scaling
What are forms of scaling?
Min-Max Scaling: Also known as normalization, this method scales the data to a fixed range, usually between 0 and 1. It is calculated as Xscaled = (X-Xmin) / (Xmax-Xmin)
Standardization (Z-Score Scaling): This method scales the data to have a mean of 0 and a standard deviation of 1. It is calculated as Xscaled = (X-mean(X)) / std(X)
Unit Length Scaling (or L2 normalization) : This method scales the data so that the magnitude of the feature vector is 1. It is calculated as Xscaled = X / ||X||
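The three formulas above can be sketched in plain Python on a small invented sample (note this uses the population standard deviation for the z-score):

```python
import math

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Min-max scaling to [0, 1]: (X - Xmin) / (Xmax - Xmin)
lo, hi = min(x), max(x)
minmax = [(v - lo) / (hi - lo) for v in x]

# Standardization: (X - mean(X)) / std(X), population std here
mean = sum(x) / len(x)
std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
zscore = [(v - mean) / std for v in x]

# Unit-length (L2) scaling: X / ||X||
norm = math.sqrt(sum(v * v for v in x))
unit = [v / norm for v in x]

print(minmax[0], round(zscore[0], 1), round(sum(v * v for v in unit), 6))
# 0.0 -1.5 1.0
```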
Should I exclude input variables that depend directly on each other (as measured by correlation)?
Yes. They might have a disproportionate weight on the output prediction.
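A sketch of such a dependence test using Pearson's r; the two height columns are invented, the second being just the first converted to inches:

```python
# Pearson correlation between two input variables; strongly correlated
# pairs are candidates for exclusion (or dimensionality reduction).
def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]  # same quantity, other unit

r = pearson(height_cm, height_in)
print(round(r, 3))  # ~1.0 -> keep only one of the two
```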
What are the 4 tasks of modeling, and their outputs?
Select modeling technique
-> Modeling technique
-> modeling assumptions
Generate test design
-> Test design
Build model
-> Parameter settings
-> models
-> model description
Assess model
-> model assessment
-> revised parameter settings
What is one major goal of data preparation?
Eliminate “wrong influence” of variables
What can be possible constraints while selecting a modeling technique?
– Political, management, understandability
– Technical support, HW platform, SW stack
– Performance, scalability
– Staff training/knowledge
What does regression do?
tries to predict a continuous variable (Supervised learning)
Supervised learning: Classification -> are the output variables labeled?
Yes. Classification: discrete output variable (pre-defined set of values), referred to as "class"
What are other Learning models?
Semi-supervised learning (labeled and unlabeled data)
Positive unary learning (PU)
Reinforcement Learning (Not explicitly presenting input/output pair)
Zero-shot Learning
Learning a class for which there is no training data
Tries to identify intermediary concepts
Explain reinforcement learning
Reinforcement Learning (RL) is a form of machine learning in which an agent is trained to make decisions by maximizing a reward signal. The agent interacts with an environment and learns to take actions that lead to the highest reward over time.
Explain positive unary learning
Positive Unary Learning (PU) is a machine learning technique that focuses on identifying and learning from positive examples. Unlike conventional approaches that use both positive and negative examples, PU concentrates only on positive examples, i.e. examples that exhibit the desired behavior or property. Learning is based on recognizing and generalizing the rules or structure behind the positive examples. PU has proven advantageous when only a limited number of positive examples is available and negative examples are hard to obtain or generate. It is, however, also a difficult method, since it cannot learn from negative examples and thus has less information available.
Explain zero-shot learning
Zero-shot learning is a machine learning technique that enables a model to recognize and classify new objects or categories without having seen any examples of them during training. The model learns to generalize to these new categories based on previously learned relationships between categories and certain features, known as "semantic embeddings". It therefore requires predefined semantic knowledge about the categories it is supposed to classify in order to recognize them.
It has proven useful when the training dataset lacks sufficient examples for all categories, and it allows the model to generalize to new, unknown categories without additional training.
Goal: Generate test design
Define a procedure to test the model's quality and validity prior to modeling
- Define train / test / validation data sets
Output 4.2.1 Test design
Describe plan for training, testing, and evaluating the models
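Defining train / validation / test sets can be sketched by randomly splitting instance indices; the 60/20/20 proportions are an illustrative choice, not prescribed by CRISP-DM:

```python
import random

random.seed(42)

# 100 instance indices, shuffled, then cut 60/20/20 into
# train / validation / test sets.
indices = list(range(100))
random.shuffle(indices)

train = indices[:60]
val = indices[60:80]
test = indices[80:]

print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before slicing ensures the three sets are disjoint random samples; for imbalanced classes a stratified split is usually preferred.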
Task 4.3 Build model, Goal
Run the modeling tool on the prepared dataset to create one or more models.
Outputs
- 4.3.1 Parameter settings
- 4.3.2 Model(s)
- 4.3.3 Model description
Goal, Task 4.4 Assess model
Goal
- Assess the model to determine to what extent it meets the data mining success criteria
- Purely technical assessment
Output of Task 4.4.1 Model assessment
Cross-check with Data Mining Success Criteria
Test result according to a test strategy
Interpret results in business terms
Check effect on data mining goal
Analyze potential for deployment of each result
Output of Task 4.4.2 Revised parameter settings
- Model task. Iterate model building and assessment until you find the best model
According to the model assessment, revise parameter settings and tune them for the next run
What steps are involved in "Evaluation" in CRISP-DM?
Evaluate results
Review process
Determine next steps
What's the overall goal of evaluation?
RESULTS = MODELS + FINDINGS
Findings need not be related to any questions or objectives.
This step assesses the degree to which the model meets the business objectives
Determine if there is a business reason why this model is deficient
5.1 Evaluate results -> Goal?
- Assesses degree to which model meets the business objectives
- Check if there is some business reason why this model is deficient
- Ideally, test the model(s) in the real application or on test data (no return to parameter tuning!)
Output: 5.1.1. Assessment of data mining results: Activities?
- Understand the data mining results & interpret in terms of application
- Evaluate and assess results w.r.t. business success criteria
- Compare evaluation results and interpretation
Output: 5.1.2 Approved models, Activities
Select and approve the generated models that meet the selected criteria
- Aim for formal approval by project initiator / all stakeholders
- May need to include revised deployment plan, cost estimate
- Provide risk analysis of deployment (error rate impact)
Subgroup fairness?
checks whether fairness criteria (e.g. equal false positive rate) hold over several subgroups
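A sketch of such a check on invented predictions, computing the false positive rate per subgroup:

```python
# Check an equal false positive rate criterion over subgroups.
# Records are made up for illustration: (group, y_true, y_pred).
records = [
    ("A", 0, 0), ("A", 0, 1), ("A", 1, 1), ("A", 0, 0),
    ("B", 0, 1), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),
]

def fpr(rows):
    """False positive rate: fraction of true negatives predicted positive."""
    negatives = [r for r in rows if r[1] == 0]
    return sum(1 for r in negatives if r[2] == 1) / len(negatives)

for g in ("A", "B"):
    group_rows = [r for r in records if r[0] == g]
    print(g, round(fpr(group_rows), 2))
# A 0.33, B 0.67: the equal-FPR criterion is violated between subgroups
```

The same loop can be repeated for other criteria (false negative rate, positive rate) and for intersections of attributes, which is what subgroup fairness adds over plain group fairness.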
5.3 Determine next steps -> 2 outputs
Output 5.3.1 List of possible actions
Output 5.3.2 Decision
Deployment consists of?
Plan Deployment
Plan Monitoring and Maintenance
Produce Final report
Review Project
Task 6.2 Plan monitoring and maintenance, Goal?
- Monitoring and maintenance are essential in continuous use
- Monitoring for data drift, bias, …
- Needs to be (semi-)automated!
- Maintenance strategy
The output of 6.2 Plan monitoring and maintenance is the Monitoring and maintenance plan; what are the activities in there?
- Check for dynamic aspects
- Decide how accuracy/errors/… will be monitored
- Determine when result or model should not be used any more
Task 6.3 Produce final report -> what is the output?
Output 6.3.1 Final report
Output 6.3.2 Final presentation
Output 6.3.2 Final presentation, what are activities?
- Decide on target group for the presentation
- Select which items from the final report to be included in presentation
- Communicate clearly, addressing the target groups!