What is Data Mining?
Discovery/Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases/data
Data Mining:
Frequent Pattern Mining
Clustering
Classification (predict the class for “new” objects)
Regression (predict the numerical value for “new” objects)
Process Mining
outlier and trend analysis
characterization, discrimination
association rule mining (market basket analysis)
Data Mining and Machine Learning
Descriptive Learning
Better understanding – data mining
examples: pattern recognition, clustering, outlier detection
Predictive Learning
Better forecasts – regression
examples: traffic prediction, labeling, fraud detection
Prescriptive Learning
Better actions – artificial intelligence
examples: predictive maintenance, autonomous driving, medical therapies
The KDD-Process
Data Cleaning & Integration
may take 60% of effort
Integration of data from different sources
Elimination of noise, Computation of missing values
Transformation
Discretization of numerical attributes (from numbers to ordinal)
Computation of derived/new tuples/rows and derived/new attributes
Selection
Select the relevant tuples/rows from the database tables
Projection
Select the relevant attributes/columns from the database tables, e.g., (id, name, date, location, amount) -> (id, date, amount)
Data Mining
find patterns
Visualization
Different stages of visualization
visualization of data
visualization of data mining results
visualization of data mining processes
interactive visual data mining
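The transformation, selection and projection steps of the KDD process can be sketched on a toy table. This is only an illustration: the rows, the thresholds in discretize and the selection condition are assumptions, only the column names come from the projection example.

```python
# Hypothetical toy table; column names follow the projection example in the notes.
rows = [
    {"id": 1, "name": "Ann", "date": "2024-01-03", "location": "Berlin", "amount": 12.5},
    {"id": 2, "name": "Bob", "date": "2024-01-04", "location": "Munich", "amount": 250.0},
    {"id": 3, "name": "Cem", "date": "2024-01-04", "location": "Berlin", "amount": 61.0},
]

def discretize(amount):
    """Transformation: map a numerical amount to an ordinal label (thresholds are assumed)."""
    if amount < 50:
        return "low"
    if amount < 200:
        return "medium"
    return "high"

# Transformation: compute a derived attribute for every row.
transformed = [{**r, "amount_level": discretize(r["amount"])} for r in rows]

# Selection: keep only the relevant rows (assumed condition: location is Berlin).
selected = [r for r in transformed if r["location"] == "Berlin"]

# Projection: keep only the relevant columns.
projected = [{k: r[k] for k in ("id", "date", "amount")} for r in selected]
print(projected)
```

Selection reduces the number of rows, projection reduces the number of columns; both happen before the actual mining step.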
Frequent Itemset Mining
Find frequently co-occurring items in the data -> may indicate correlations or causalities
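A minimal sketch of counting co-occurring items, assuming a tiny hypothetical set of market-basket transactions and an assumed support threshold. It uses brute-force counting of all itemsets of a fixed size, not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (market baskets).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]
min_support = 2  # assumed threshold: minimum number of supporting transactions

def frequent_itemsets(transactions, size, min_support):
    """Count every itemset of the given size and keep the frequent ones."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

frequent_items = frequent_itemsets(transactions, 1, min_support)
frequent_pairs = frequent_itemsets(transactions, 2, min_support)
print(frequent_pairs)
```

Here only bread and butter co-occur often enough to count as a frequent pair, although all four single items are frequent.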
Ingredients of data types
operations for comparison, as needed for data mining
Data storage space
Single values need 1 Byte = 8 Bit, 4 Byte = 32 Bit, 8 Byte = 64 Bit
Kilobytes 10^3, Megabytes 10^6, Gigabytes 10^9, Terabytes 10^12, Petabytes 10^15, Exabytes 10^18, Zettabytes 10^21
Operations need calculation time
axioms - metric distance
Simple Data Types:
numerical
ordinal (There is a (total) order <= on the set of possible data values)
categorical = nominal
Composed Data Types
Sequences, Vectors
Order matters
comparison
Sets
Unordered collection of individual values
Comparison: Jaccard-Distance, Symmetric Set Difference
Bitvector Representation
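The set comparisons and the bitvector representation can be sketched as follows. The function names and the example universe are assumptions for illustration. The Jaccard distance is computed as 1 - |A ∩ B| / |A ∪ B|, and the symmetric set difference counts the elements that are in exactly one of the two sets.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def symmetric_difference_size(a, b):
    """Number of elements contained in exactly one of the two sets."""
    return len(a ^ b)

def to_bitvector(s, universe):
    """Encode a set as an integer: bit i is set iff universe[i] is in the set."""
    bits = 0
    for i, item in enumerate(universe):
        if item in s:
            bits |= 1 << i
    return bits

universe = ["a", "b", "c", "d"]  # assumed fixed, ordered universe of items
x = {"a", "b", "c"}
y = {"b", "c", "d"}

bx = to_bitvector(x, universe)
by = to_bitvector(y, universe)

# On bitvectors, XOR yields the symmetric difference and AND the intersection.
sym_diff_bits = bin(bx ^ by).count("1")
intersection_bits = bin(bx & by).count("1")
print(jaccard_distance(x, y), symmetric_difference_size(x, y), sym_diff_bits)
```

With a fixed universe, set operations become cheap bitwise operations, which is the main appeal of the bitvector representation.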
Complex Data Types
Structure: graphs, networks, trees
Geometry: shapes, contours, routes, trajectories
Multimedia: images, audio, text, etc.
Similarity models: Approaches
Similarity Queries
similarity queries are basic operations in (multimedia) databases
Filter-Refine Architecture
ICES criteria for filter quality
Indexable – Index enabled
Complete – No false dismissals
Efficient – Fast individual calculation
Selective – Small candidate set
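A minimal sketch of a filter-refine range query, assuming 2D points, the Euclidean distance as the exact distance, and the difference in the first coordinate as the filter distance. The filter is complete because it never exceeds the exact distance (lower bound), so no true result is dismissed; how selective it is depends on the data.

```python
import math

def exact_distance(p, q):
    """Exact (expensive) distance, here simply the Euclidean distance."""
    return math.dist(p, q)

def filter_distance(p, q):
    """Cheap lower bound: |p0 - q0| never exceeds the Euclidean distance."""
    return abs(p[0] - q[0])

def range_query(db, query, eps):
    # Filter step: discard objects whose lower bound already exceeds eps.
    candidates = [o for o in db if filter_distance(o, query) <= eps]
    # Refine step: evaluate the exact distance on the candidates only.
    results = [o for o in candidates if exact_distance(o, query) <= eps]
    return candidates, results

db = [(0.0, 0.0), (1.0, 5.0), (4.0, 0.0), (0.5, 0.5)]  # hypothetical data
candidates, results = range_query(db, (0.0, 0.0), 1.0)
print(candidates, results)
```

The point (1.0, 5.0) passes the filter but is dropped in the refinement step; it is a false hit, which is allowed, whereas a false dismissal would not be.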
Data Visualization Techniques
Data Reduction
Better perception of patterns
Reduced computational complexity
approaches:
Data aggregation (basic statistics)
Data generalization (abstraction to higher levels)
Data Reduction Strategies: Three Directions
Data Aggregation
-> fewer tuples
Basic, Distributive, Algebraic, Holistic Aggregates
Basic:
mean, median, mode, variance
Distributive:
count, sum, min, max
Algebraic
average + variance
Holistic
median, mode, rank
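The difference between the aggregate classes shows up when the data is split into partitions. The sketch below assumes three hypothetical partitions: distributive aggregates are merged directly from partial results, the average (algebraic) is derived from a fixed number of distributive ones, and the median (holistic) needs the full data.

```python
import statistics

partitions = [[1, 2, 3], [10, 20], [4]]  # hypothetical data partitions

# Distributive: compute per partition, then combine the partial results.
partials = [(sum(p), len(p), min(p), max(p)) for p in partitions]
total_sum = sum(s for s, _, _, _ in partials)
total_count = sum(c for _, c, _, _ in partials)
total_min = min(lo for _, _, lo, _ in partials)
total_max = max(hi for _, _, _, hi in partials)

# Algebraic: the average follows from two distributive aggregates.
average = total_sum / total_count

# Holistic: the median cannot be derived from per-partition medians.
all_values = [v for p in partitions for v in p]
true_median = statistics.median(all_values)
median_of_medians = statistics.median([statistics.median(p) for p in partitions])
print(average, true_median, median_of_medians)
```

The median of the partition medians differs from the true median here, which is why holistic aggregates are expensive on large or distributed data.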
Central Tendency (algebraic measures)
Applicable to numerical data only (sum, scalar multiplication)
Mean – (weighted) arithmetic mean (average)
mid-range
Average of the largest and the smallest values in a data set: (max + min)/2
for ordinal data (a total order is required) -> Median
for unordered (nominal) data -> Mode
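A short sketch of these measures using Python's statistics module. All values, weights and category labels are made up for illustration. The median needs an order on the values, so the example uses an ordinal attribute with an explicitly assumed order; the mode works on any values.

```python
import statistics

values = [2, 3, 3, 5, 12]   # hypothetical numerical data
weights = [1, 1, 1, 1, 2]   # hypothetical weights

mean = statistics.mean(values)
weighted_mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
mid_range = (max(values) + min(values)) / 2

# Ordinal data: only the order is used, so take the middle element by rank.
order = ["low", "medium", "high"]  # assumed total order
grades = ["low", "medium", "medium", "high", "high", "high"]
ranks = sorted(order.index(g) for g in grades)
ordinal_median = order[statistics.median_low(ranks)]

# Unordered (nominal) data: only the most frequent value is meaningful.
colors = ["red", "blue", "red", "green"]
color_mode = statistics.mode(colors)
print(mean, weighted_mean, mid_range, ordinal_median, color_mode)
```

median_low is used so that the result is always an actual category rather than an average of two ranks.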
Boxplot Analysis
Histograms
use binning to approximate data distributions
Divide data into bins and store a representative (sum, average, median) for each bin
Equi-width Histograms
Divide the range into N intervals of equal size (uniform grid)
If A and B are the lowest and highest values of the attribute, the width of intervals will be (B - A)/N
Equi-height Histograms
Divide the range into N intervals, each containing approx. the same number of samples (quantile-based approach)
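Both binning strategies can be sketched on a small skewed sample. The data and function names are assumptions; the equi-width version assumes that not all values are equal (otherwise the interval width would be zero).

```python
def equi_width_bins(values, n):
    """Split [A, B] into n intervals of width (B - A)/n and count values per bin."""
    a, b = min(values), max(values)
    width = (b - a) / n  # assumes b > a
    counts = [0] * n
    for v in values:
        i = min(int((v - a) / width), n - 1)  # the maximum goes into the last bin
        counts[i] += 1
    return width, counts

def equi_height_bins(values, n):
    """Split the sorted values into n bins of approximately equal size."""
    ordered = sorted(values)
    size = len(ordered) / n
    return [ordered[round(i * size):round((i + 1) * size)] for i in range(n)]

values = [1, 2, 2, 3, 4, 5, 20, 40, 41]  # hypothetical skewed data
width, width_counts = equi_width_bins(values, 3)
height_bins = equi_height_bins(values, 3)
print(width_counts, height_bins)
```

On skewed data the equi-width bins are very unevenly filled, while the equi-height bins adapt their boundaries to the distribution.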
Data Generalization
A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
OLAP + operations
on-line analytical processing (OLAP), Data cube approach
OLAP operations:
Roll-up
Summarize data by climbing up hierarchy or by dimension reduction.
drill down
Reverse of roll-up. From higher level summary to lower level summary or detailed data, or introducing new dimensions.
slice and dice
Selection on one (slice) or more (dice) dimensions.
pivot (rotate)
Reorient the cube, visualization, 3D to series of 2D planes.
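The OLAP operations can be illustrated on a small fact table in plain Python. The table, its dimensions and the city-to-country hierarchy are hypothetical; real OLAP systems work on precomputed data cubes rather than on row lists.

```python
from collections import defaultdict

# Hypothetical fact table: (city, country, quarter, product, sales).
facts = [
    ("Munich", "Germany", "Q1", "TV", 10),
    ("Berlin", "Germany", "Q1", "TV", 5),
    ("Berlin", "Germany", "Q2", "PC", 7),
    ("Paris", "France", "Q1", "PC", 3),
    ("Paris", "France", "Q2", "TV", 4),
]
CITY, COUNTRY, QUARTER, PRODUCT, SALES = range(5)

def aggregate(rows, dims):
    """Sum the sales for every combination of the given dimensions."""
    cube = defaultdict(int)
    for r in rows:
        cube[tuple(r[d] for d in dims)] += r[SALES]
    return dict(cube)

# Detailed level; drilling down from by_country leads back to this view.
by_city = aggregate(facts, (CITY, QUARTER))
# Roll-up by climbing the hierarchy city -> country.
by_country = aggregate(facts, (COUNTRY, QUARTER))
# Roll-up by dimension reduction: drop the location dimension.
by_quarter = aggregate(facts, (QUARTER,))
# Slice: selection on one dimension.
slice_q1 = [r for r in facts if r[QUARTER] == "Q1"]
# Dice: selection on two or more dimensions.
dice_q1_tv = [r for r in facts if r[QUARTER] == "Q1" and r[PRODUCT] == "TV"]
# Pivot: swap the axes of the (country, quarter) view.
pivot = {(q, c): v for (c, q), v in by_country.items()}
print(by_country, by_quarter)
```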
Roll up and Drill-Down in a Data Cube
SLICE
Dice
Specifying Generalizations by Star-Nets
granularity = grain size (level of detail)
Discussion of OLAP-based Generalization
Attribute-Oriented Induction (AOI)
Three choices for each attribute: keep it, remove it, or generalize it
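A minimal sketch of the three choices, assuming a hypothetical student table and a hypothetical concept hierarchy from city to country. The attribute name is removed, city is generalized, major is kept; identical generalized tuples are then merged and counted. How to decide between the three choices is not covered here.

```python
from collections import Counter

# Hypothetical concept hierarchy for the attribute "city".
city_to_country = {"Munich": "Germany", "Berlin": "Germany", "Paris": "France"}

# Hypothetical task-relevant data.
students = [
    {"name": "Ann", "city": "Munich", "major": "CS"},
    {"name": "Bob", "city": "Berlin", "major": "CS"},
    {"name": "Cem", "city": "Paris", "major": "Math"},
    {"name": "Dee", "city": "Berlin", "major": "CS"},
]

def induce(rows):
    """Apply remove / generalize / keep per attribute and merge equal tuples."""
    generalized = []
    for r in rows:
        generalized.append((
            # "name": removed (assumed to have no useful higher-level concept)
            city_to_country[r["city"]],  # "city": generalized to country
            r["major"],                  # "major": kept as is
        ))
    return Counter(generalized)  # maps each generalized tuple to its count

summary = induce(students)
print(summary)
```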
Attribute Generalization Control
Strategies for Next Attribute Selection