Overview of KDD - Process
Data Cleaning and Integration
Transformation, Selection, Projection
Data Mining
Visualization, Evaluation
1st step in KDD Process
up to 60 % of effort needed
integration and harmonization
removal of inconsistencies and noise
imputation of missing values
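The missing-value step can be sketched with simple mean imputation, one common strategy (the `impute_missing` helper and the toy ages list are illustrative, not from the source):

```python
from statistics import mean

def impute_missing(values):
    """Replace None entries with the mean of the observed values (mean imputation)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 28, None, 36]
print(impute_missing(ages))  # → [25, 30, 31, 28, 30, 36]
```

More robust variants use the median (less sensitive to outliers) or model-based imputation.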
2nd step of KDD
selection of important tuples/attributes/data
3rd step of KDD
pattern recognition in data
Frequent Pattern Mining: correlation and causalities
Clustering: (Dis-) Similarity
Classification: prediction of class membership
Regression: prediction of numerical output values for new objects
Outlier Detection
Trend-/Evolution analysis
Process Mining
Spatial Data Mining
Graph Mining
4th step of KDD
visual presentation of data/knowledge
transformation of data
removal of redundant patterns
Overview of Data Types
simple
composed
complex
special
numerical
sequence / vector
multimedia data
ordinal
categorical
sets
spatial data
metrical data
relations
structures
What types of searches/queries are most commonly used for the data types?
Similarity Approaches
Filter-Refine Architecture
k-means for categorical, ordinal, complex, and numerical data types
everything that provides a distance function or metric
SVMs by transformation into vector space
used in similarity approaches
decision trees for simple data types
What categories of visualization techniques are there?
geometrical
polygonal
icon based
pixel-oriented
Geometrical Visualization techniques
Scatterplots
Scatterplot Matrix
expanded Scatterplots
2D data in coordinate system
2D, relationship btw. multiple variables
3+ dimensions via shape and colour
correlation between variables
cluster and outlier detection
pairwise correlation
<-> only for pairwise relationships -> may not discover complex connections
the more dims, the more complex interpretation
Polygonal Plots
Spider-web model
parallel coordinates
polygonal lines per single obj
vertices = values per dim
intersection of line and axis = value of data point in this dim
polygonal lines cutting multiple parallel axis for high dimensional data
single multidimensional obj
<-> overview is lost with many multidimensional objs
recognition of clusters and correlations
<-> correlating variables on distant axes are hard to spot -> use colours
icon based visualization
Chernoff faces
facial characteristics for values per dimension
intuitive (dis-) similarity detection for humans
<-> unintended interpretations, since facial features are perceived with different weight
<-> max. 18 dims
pixel-oriented technique
recursive patterns
every data value = 1 pixel
every dim = own window
order of pixels by recursive pattern
visualization of big datasets
pattern recognition of data distribution
<-> complex interpretation
<-> additional information about order of pixels required
Overview of Data Reduction Types
Numerosity Reduction
Dimensionality Reduction
Quantisation & Discretisation
number of obj
number of attributes
number of values per domain
Sampling
Aggregation
random selection of subset of data
summarisation of data points to a new representative obj
for an overview of very large datasets
<-> information loss
mostly statistical measurements (mean, std)
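Both numerosity-reduction ideas can be sketched in a few lines (the toy dataset and the fixed seed are illustrative assumptions):

```python
import random
from statistics import mean, stdev

data = list(range(1, 1001))  # toy dataset of 1000 values

# Sampling: random selection of a subset of the data
random.seed(42)  # fixed seed only to make the sketch reproducible
sample = random.sample(data, 50)

# Aggregation: summarise the data points by statistical measures (mean, std)
summary = {"mean": mean(data), "std": stdev(data)}
print(len(sample), summary["mean"])
```

The sample trades accuracy for speed; the aggregates lose individual values entirely (information loss).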
Linear Method
Non linear method
feature sub-selection:
subset of relevant attributes
multidimensional scaling:
arrangement of data points in a lower-dimensional space
distance btw. points reflects (dis-)similarity in original space
Principal Component Analysis (PCA):
transformation of data into a new coordinate space whose axes capture maximal variance
neural embedding:
usage of NNs to embed data points into lower-dimensional vector spaces
Random Projections
into lower dimensional space via random matrices
Fourier-Transformation (FT) and Wavelet Transformation (WT)
separation of (ir-)relevant data by transformation into the frequency domain
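For 2-D data, PCA can be sketched without any library: centre the data, build the covariance matrix, and read off its eigenvalues/eigenvectors in closed form (the toy points on the line y = x are an illustrative assumption):

```python
from math import sqrt, hypot

def pca_2d(points):
    """PCA for 2-D data: centre, build the 2x2 covariance matrix,
    and return its eigenvalues and the dominant eigenvector."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centred = [(x - mx, y - my) for x, y in points]
    # covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centred) / (n - 1)
    b = sum(x * y for x, y in centred) / (n - 1)
    c = sum(y * y for _, y in centred) / (n - 1)
    # eigenvalues of a symmetric 2x2 matrix via the quadratic formula
    tr, det = a + c, a * c - b * b
    disc = sqrt(tr * tr / 4 - det)
    l1, l2 = tr / 2 + disc, tr / 2 - disc  # l1 >= l2: variance along each PC
    # eigenvector for the dominant eigenvalue l1
    if abs(b) > 1e-12:
        vx, vy = b, l1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = hypot(vx, vy)
    return (l1, l2), (vx / norm, vy / norm)

pts = [(1, 1), (2, 2), (3, 3), (4, 4)]
eigvals, pc1 = pca_2d(pts)
```

For these perfectly correlated points the second eigenvalue is 0: one dimension suffices, which is exactly the reduction PCA exploits.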
Quantisation, Discretisation
Binning
Generalisation through hierarchies
separation of data into intervals (bins)
renaming of original values with interval number or representative value
summarisation of data with the same characteristics into one new tuple with aggregated attributes
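Equal-width binning, the simplest discretisation scheme, can be sketched as follows (the `equal_width_bins` helper and the toy prices are illustrative assumptions):

```python
def equal_width_bins(values, k):
    """Discretise numeric values into k equal-width intervals (bins);
    each value is renamed with its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp so the maximum value falls into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

prices = [4, 8, 15, 16, 23, 42]
print(equal_width_bins(prices, 3))  # → [0, 0, 0, 0, 1, 2]
```

Equal-frequency binning (same number of objects per bin) is the usual alternative when the value distribution is skewed.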
Concept of OLAP
Online Analytical Processing
conceptual hierarchies for different abstraction levels
Roll-up
Drill-down
Slice & Dice
Pivot
summarisation of data via dimensional reduction or transition to higher hierarchical levels
expansion of data via addition of new dimensions or transition to lower hierarchical levels
selection of data along one or more dimensions
change of data view via rotation of data cube
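The roll-up and slice operations above can be imitated on a toy data cube with plain dictionaries (the cube contents are made up for illustration):

```python
from collections import defaultdict

# toy data cube: (city, product, quarter) -> sales
cube = {
    ("Berlin", "laptop", "Q1"): 10, ("Berlin", "phone", "Q1"): 20,
    ("Munich", "laptop", "Q1"): 5,  ("Munich", "phone", "Q2"): 15,
}

# Roll-up: aggregate away the 'quarter' dimension
rollup = defaultdict(int)
for (city, product, quarter), sales in cube.items():
    rollup[(city, product)] += sales

# Slice: fix one dimension (here quarter == "Q1")
slice_q1 = {k: v for k, v in cube.items() if k[2] == "Q1"}
print(dict(rollup), sum(slice_q1.values()))
```

Drill-down is the inverse of roll-up; a real OLAP engine precomputes such aggregates instead of scanning the raw tuples.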
Concept of AOI
Attribute-Orientated Induction
automatic data reduction
analysis of data via algorithm -> decision on whether to keep, remove, or generalise attributes
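One AOI-style generalisation step can be sketched with a concept hierarchy: an attribute with too many distinct values is lifted one abstraction level up and duplicates are merged (the city → country hierarchy and the tuples are made-up examples):

```python
from collections import Counter

# concept hierarchy: city -> country (one abstraction level up)
hierarchy = {"Berlin": "Germany", "Munich": "Germany", "Paris": "France"}

tuples = ["Berlin", "Munich", "Paris", "Berlin"]

# generalise each value to its parent concept, then merge identical tuples
generalised = Counter(hierarchy[city] for city in tuples)
print(generalised)  # counts per generalised value
```

The count per generalised value is the aggregated attribute of the merged tuples.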
Different Aggregate Measures
distributive
algebraic
holistic
can be calculated by combining partial aggregates from data partitions
algebraic function with a fixed number of arguments
not bounded by constant storage for calculation and visualisation; needs the whole dataset
combination of partial aggregations into one result
uses distributive aggregate measures as arguments
range
IQR
count
sum
min / max
mean
mid-range
variance
std
median
mode
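The three categories can be illustrated on partitioned data (the partitions are a made-up example): sum and count are distributive, the mean is algebraic (a function of two distributive measures), and the median is holistic (partial results do not suffice):

```python
partitions = [[2, 4, 6], [1, 3], [5, 7, 8, 9]]

# distributive: per-partition sums/counts combine into the global values
part_sums = [sum(p) for p in partitions]
part_counts = [len(p) for p in partitions]
total_sum, total_count = sum(part_sums), sum(part_counts)

# algebraic: the mean is a function of two distributive measures
mean = total_sum / total_count

# holistic: the median needs all values in one place
all_values = sorted(v for p in partitions for v in p)
median = all_values[len(all_values) // 2]
print(mean, median)  # → 5.0 5
```

This distinction matters for data cubes: distributive and algebraic measures can be pre-aggregated per partition, holistic ones cannot.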
Overview of Classification
supervised learning
systematic categorisation of new observations into known categories according to specific criteria
-> learned by training data
Model construction/Training phase
Prediction phase/usage
repeat
important quality criteria
accuracy
compactness
interpretability
efficiency
scalability
robustness
Bayes Classification
based on probability theory and Bayes' theorem
object assignment to class with highest probability
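A minimal categorical naive Bayes sketch shows the idea: estimate the prior and the per-feature likelihoods from training data, then assign the class with the highest posterior (the toy weather data is invented; no smoothing, so unseen feature values zero out a class):

```python
from collections import Counter, defaultdict

def train(samples):
    """Estimate P(class) and P(feature value | class) from (features, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    cond = defaultdict(Counter)  # (feature index, class) -> value counts
    for features, label in samples:
        for i, value in enumerate(features):
            cond[(i, label)][value] += 1
    return class_counts, cond

def predict(features, class_counts, cond):
    """Assign the class with the highest posterior probability (Bayes' theorem)."""
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for label, n in class_counts.items():
        p = n / total  # prior P(class)
        for i, value in enumerate(features):
            p *= cond[(i, label)][value] / n  # naive per-feature likelihood
        if p > best_p:
            best, best_p = label, p
    return best

samples = [(("sunny", "warm"), "play"), (("sunny", "cold"), "play"),
           (("rainy", "cold"), "stay"), (("rainy", "warm"), "stay")]
model = train(samples)
print(predict(("sunny", "warm"), *model))  # → play
```

The "naive" part is the independence assumption: likelihoods are multiplied per feature instead of modelling their joint distribution.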
Linear Discrimination Function
separates data points of different classes via a hyperplane
Support Vector Machine (SVM)
searches for the hyperplane that maximises the margin between classes
Kernel method
extends linear classifiers to non-linear data spaces via kernel functions
Decision Tree
construction of hierarchical tree structure using attributes to make decisions
Nearest Neighbour Classification
assigns an object to the class of its nearest neighbour(s)
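The k-nearest-neighbour rule fits in a few lines (the toy 2-D points and the choice k = 3 are illustrative assumptions):

```python
from math import dist
from collections import Counter

def knn_predict(training, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    neighbours = sorted(training, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

points = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
          ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(points, (0.5, 0.5)))  # → A
```

There is no training phase: the data itself is the model, which makes prediction expensive on large datasets (a typical use case for filter-refine similarity search).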
Ensemble Classification
combines multiple classifiers to improve performance