Overview of KDD - Process
Data Cleaning and Integration
Transformation, Selection, Projection
Data Mining
Visualization, Evaluation
1st step in KDD Process
up to 60 % of effort needed
integration and harmonization
removal of inconsistencies and noise
imputation of missing values
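The missing-value step can be sketched with simple mean imputation, one common strategy (the `impute_missing` helper and the toy ages list are illustrative, not from the source):

```python
from statistics import mean

def impute_missing(values):
    """Replace None entries with the mean of the observed values (mean imputation)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 28, None, 36]
print(impute_missing(ages))  # → [25, 30, 31, 28, 30, 36]
```

More robust variants use the median (less sensitive to outliers) or model-based imputation.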
2nd step of KDD
selection of important tuples/attributes/data
3rd step of KDD
pattern recognition in data
Frequent Pattern Mining: correlation and causalities
Clustering: (Dis-) Similarity
Classification: prediction of class membership
Regression: prediction of numerical output values for new objects
Outlier Detection
Trend-/Evolution analysis
Process Mining
Spatial Data Mining
Graph Mining
4th step of KDD
visual presentation of data/knowledge
transformation of data
removal of redundant patterns
Overview of Data Types
simple
composed
complex
special
numerical
sequence / vector
multimedia data
ordinal
categorical
sets
spatial data
metrical data
relations
structures
What types of searches/queries are most commonly used for the data types?
Similarity Approaches
Filter-Refine Architecture
k-means for categorical, ordinal, complex, and numerical data types
everything that provides a distance function or metric
SVMs by transformation into vector space
used in similarity approaches
decision trees for simple data types
What categories of visualization techniques are there?
geometrical
polygonal
icon based
pixel-oriented
Geometrical Visualization techniques
Scatterplots
Scatterplot Matrix
expanded Scatterplots
2D data in coordinate system
2D, relationship btw. multiple variables
3+ dimensions via shape and colour
correlation between variables
cluster and outlier detection
pairwise correlation
<-> only for pairwise relationships -> may not discover complex connections
the more dims, the more complex interpretation
Polygonal Plots
Spider-web model
parallel coordinates
polygonal lines per single obj
vertices = values per dim
intersection of line and axis = value of data point in this dim
polygonal lines cutting multiple parallel axis for high dimensional data
single multidimensional obj
<-> overview is lost with many multidimensional objs
recognition of clusters and correlations
<-> correlating variables on distant axes are hard to spot -> use colours
icon based visualization
Chernoff faces
facial characteristics for values per dimension
intuitive (dis-) similarity detection for humans
<-> unintended interpretations, since facial features are perceived with different weight
<-> max. 18 dims
pixel-oriented technique
recursive patterns
every data value = 1 pixel
every dim = own window
order of pixels by recursive pattern
visualization of big datasets
pattern recognition of data distribution
<-> complex interpretation
<-> additional information about order of pixels required
Overview of Data Reduction Types
Numerosity Reduction
Dimensionality Reduction
Quantisation & Discretisation
number of obj
number of attributes
number of values per domain
Sampling
Aggregation
random selection of subset of data
summarisation of data points to a new representative obj
for an overview of very large datasets
<-> information loss
mostly statistical measurements (mean, std)
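Both numerosity-reduction ideas can be sketched in a few lines (the toy dataset and the fixed seed are illustrative assumptions):

```python
import random
from statistics import mean, stdev

data = list(range(1, 1001))  # toy dataset of 1000 values

# Sampling: random selection of a subset of the data
random.seed(42)  # fixed seed only to make the sketch reproducible
sample = random.sample(data, 50)

# Aggregation: summarise the data points by statistical measures (mean, std)
summary = {"mean": mean(data), "std": stdev(data)}
print(len(sample), summary["mean"])
```

The sample trades accuracy for speed; the aggregates lose individual values entirely (information loss).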
Linear Method
Non linear method
feature sub-selection:
subset of relevant attributes
multidimensional scaling:
arrangement of data points in a lower-dimensional space
distance btw. points reflects (dis-)similarity in original space
Principal Component Analysis (PCA):
transformation of data into a new coordinate space whose axes capture maximal variance
neural embedding:
usage of NNs to embed data points into lower-dimensional vector spaces
Random Projections
into lower dimensional space via random matrices
Fourier-Transformation (FT) and Wavelet Transformation (WT)
separation of (ir-)relevant data by transformation into the frequency domain
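For 2-D data, PCA can be sketched without any library: centre the data, build the covariance matrix, and read off its eigenvalues/eigenvectors in closed form (the toy points on the line y = x are an illustrative assumption):

```python
from math import sqrt, hypot

def pca_2d(points):
    """PCA for 2-D data: centre, build the 2x2 covariance matrix,
    and return its eigenvalues and the dominant eigenvector."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centred = [(x - mx, y - my) for x, y in points]
    # covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centred) / (n - 1)
    b = sum(x * y for x, y in centred) / (n - 1)
    c = sum(y * y for _, y in centred) / (n - 1)
    # eigenvalues of a symmetric 2x2 matrix via the quadratic formula
    tr, det = a + c, a * c - b * b
    disc = sqrt(tr * tr / 4 - det)
    l1, l2 = tr / 2 + disc, tr / 2 - disc  # l1 >= l2: variance along each PC
    # eigenvector for the dominant eigenvalue l1
    if abs(b) > 1e-12:
        vx, vy = b, l1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = hypot(vx, vy)
    return (l1, l2), (vx / norm, vy / norm)

pts = [(1, 1), (2, 2), (3, 3), (4, 4)]
eigvals, pc1 = pca_2d(pts)
```

For these perfectly correlated points the second eigenvalue is 0: one dimension suffices, which is exactly the reduction PCA exploits.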
Quantisation, Discretisation
Binning
Generalisation through hierarchies
separation of data into intervals (bins)
renaming of original values with interval number or representative value
summarisation of data with the same characteristics into one new tuple with aggregated attributes
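Equal-width binning, the simplest discretisation scheme, can be sketched as follows (the `equal_width_bins` helper and the toy prices are illustrative assumptions):

```python
def equal_width_bins(values, k):
    """Discretise numeric values into k equal-width intervals (bins);
    each value is renamed with its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp so the maximum value falls into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

prices = [4, 8, 15, 16, 23, 42]
print(equal_width_bins(prices, 3))  # → [0, 0, 0, 0, 1, 2]
```

Equal-frequency binning (same number of objects per bin) is the usual alternative when the value distribution is skewed.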
Concept of OLAP
Online Analytical Processing
conceptual hierarchies for different abstraction levels
Roll-up
Drill-down
Slice & Dice
Pivot
summarisation of data via dimensional reduction or transition to higher hierarchical levels
expansion of data via addition of new dimensions or transition to lower hierarchical levels
selection of data along one or more dimensions
change of data view via rotation of data cube
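The roll-up and slice operations above can be imitated on a toy data cube with plain dictionaries (the cube contents are made up for illustration):

```python
from collections import defaultdict

# toy data cube: (city, product, quarter) -> sales
cube = {
    ("Berlin", "laptop", "Q1"): 10, ("Berlin", "phone", "Q1"): 20,
    ("Munich", "laptop", "Q1"): 5,  ("Munich", "phone", "Q2"): 15,
}

# Roll-up: aggregate away the 'quarter' dimension
rollup = defaultdict(int)
for (city, product, quarter), sales in cube.items():
    rollup[(city, product)] += sales

# Slice: fix one dimension (here quarter == "Q1")
slice_q1 = {k: v for k, v in cube.items() if k[2] == "Q1"}
print(dict(rollup), sum(slice_q1.values()))
```

Drill-down is the inverse of roll-up; a real OLAP engine precomputes such aggregates instead of scanning the raw tuples.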
Concept of AOI
Attribute-Orientated Induction
automatic data reduction
analysis of data via algorithm -> decision on whether to keep, remove, or generalise attributes
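One AOI-style generalisation step can be sketched with a concept hierarchy: an attribute with too many distinct values is lifted one abstraction level up and duplicates are merged (the city → country hierarchy and the tuples are made-up examples):

```python
from collections import Counter

# concept hierarchy: city -> country (one abstraction level up)
hierarchy = {"Berlin": "Germany", "Munich": "Germany", "Paris": "France"}

tuples = ["Berlin", "Munich", "Paris", "Berlin"]

# generalise each value to its parent concept, then merge identical tuples
generalised = Counter(hierarchy[city] for city in tuples)
print(generalised)  # counts per generalised value
```

The count per generalised value is the aggregated attribute of the merged tuples.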
Different Aggregate Measures
distributive
algebraic
holistic
can be calculated by combining partial aggregates from data partitions
algebraic function with a fixed number of arguments
not bounded by constant storage for calculation and visualisation; needs the whole dataset
combination of partial aggregations into one result
uses distributive aggregate measures as arguments
range
IQR
count
sum
min / max
mean
mid-range
variance
std
median
mode
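The three categories can be illustrated on partitioned data (the partitions are a made-up example): sum and count are distributive, the mean is algebraic (a function of two distributive measures), and the median is holistic (partial results do not suffice):

```python
partitions = [[2, 4, 6], [1, 3], [5, 7, 8, 9]]

# distributive: per-partition sums/counts combine into the global values
part_sums = [sum(p) for p in partitions]
part_counts = [len(p) for p in partitions]
total_sum, total_count = sum(part_sums), sum(part_counts)

# algebraic: the mean is a function of two distributive measures
mean = total_sum / total_count

# holistic: the median needs all values in one place
all_values = sorted(v for p in partitions for v in p)
median = all_values[len(all_values) // 2]
print(mean, median)  # → 5.0 5
```

This distinction matters for data cubes: distributive and algebraic measures can be pre-aggregated per partition, holistic ones cannot.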
Overview of Classification
supervised learning
systematic categorisation of new observations into known categories according to specific criteria
-> learned by training data
Model construction/Training phase
Prediction phase/usage
repeat
important quality criteria
accuracy
compactness
interpretability
efficiency
scalability
robustness
Bayes Classification
based on probability theory and Bayes' theorem
object assignment to class with highest probability
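A minimal categorical naive Bayes sketch shows the idea: estimate the prior and the per-feature likelihoods from training data, then assign the class with the highest posterior (the toy weather data is invented; no smoothing, so unseen feature values zero out a class):

```python
from collections import Counter, defaultdict

def train(samples):
    """Estimate P(class) and P(feature value | class) from (features, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    cond = defaultdict(Counter)  # (feature index, class) -> value counts
    for features, label in samples:
        for i, value in enumerate(features):
            cond[(i, label)][value] += 1
    return class_counts, cond

def predict(features, class_counts, cond):
    """Assign the class with the highest posterior probability (Bayes' theorem)."""
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for label, n in class_counts.items():
        p = n / total  # prior P(class)
        for i, value in enumerate(features):
            p *= cond[(i, label)][value] / n  # naive per-feature likelihood
        if p > best_p:
            best, best_p = label, p
    return best

samples = [(("sunny", "warm"), "play"), (("sunny", "cold"), "play"),
           (("rainy", "cold"), "stay"), (("rainy", "warm"), "stay")]
model = train(samples)
print(predict(("sunny", "warm"), *model))  # → play
```

The "naive" part is the independence assumption: likelihoods are multiplied per feature instead of modelling their joint distribution.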
Linear Discrimination Function
separates data points of different classes via a hyperplane
Support Vector Machine (SVM)
searches for the hyperplane that maximises the margin between classes
Kernel method
extends linear classifiers to non-linear data spaces via kernel functions
Decision Tree
construction of hierarchical tree structure using attributes to make decisions
Nearest Neighbour Classification
assigns an object to the class of its nearest neighbour(s)
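The k-nearest-neighbour rule fits in a few lines (the toy 2-D points and the choice k = 3 are illustrative assumptions):

```python
from math import dist
from collections import Counter

def knn_predict(training, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    neighbours = sorted(training, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

points = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
          ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(points, (0.5, 0.5)))  # → A
```

There is no training phase: the data itself is the model, which makes prediction expensive on large datasets (a typical use case for filter-refine similarity search).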
Ensemble Classification
combines multiple classifiers to improve performance