by Leonie P.

What is Data Mining?

Discovery/Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases/data

Data Mining:

Frequent Pattern Mining
Clustering
Classification (predict the class for “new” objects)
Regression (predict the numerical value for “new” objects)
Process Mining
Outlier and trend analysis
Characterization, discrimination
Association rule mining (market basket analysis)

Data Mining and Machine Learning

Descriptive Learning
- Better understanding – data mining
- examples: pattern recognition, clustering, outlier detection
Predictive Learning
- Better forecasts – regression
- examples: traffic prediction, labeling, fraud detection
Prescriptive Learning
- Better actions – artificial intelligence
- examples: predictive maintenance, autonomous driving, medical therapies

The KDD-Process

Data Cleaning & Integration

may take 60% of effort
Integration of data from different sources
Elimination of noise, Computation of missing values

Transformation

Discretization of numerical attributes (from numbers to ordinal)
Computation of derived/new tuples/rows and derived/new attributes
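Both transformation steps can be sketched in a few lines. This is only an illustration: the cut points, the labels and the attribute names (age, revenue, cost, profit) are assumptions and do not come from the notes.

```python
import bisect

def discretize(value, boundaries, labels):
    """Map a number to an ordinal label; a value equal to a cut point
    falls into the higher interval."""
    return labels[bisect.bisect_right(boundaries, value)]

# Hypothetical cut points and ordinal labels (child < adult < senior).
boundaries = [18, 65]
labels = ["child", "adult", "senior"]

# Hypothetical rows with made-up attribute names.
rows = [
    {"age": 12, "revenue": 10.0, "cost": 4.0},
    {"age": 70, "revenue": 8.0, "cost": 9.5},
]
for row in rows:
    row["age_group"] = discretize(row["age"], boundaries, labels)  # discretized attribute
    row["profit"] = row["revenue"] - row["cost"]                   # derived attribute
```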

Selection

Select the relevant tuples/rows from the database tables

Projection

Select the relevant attributes/columns from the database tables, e.g., (id, name, date, location, amount) -> (id, date, amount)
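A minimal sketch of selection and projection on an in-memory table. The column names follow the example above; the rows, the predicate and the helper names select and project are made up for illustration.

```python
# Hypothetical example table; only the column names come from the notes.
rows = [
    {"id": 1, "name": "Ann", "date": "2024-01-03", "location": "Berlin", "amount": 20.0},
    {"id": 2, "name": "Bob", "date": "2024-01-04", "location": "Munich", "amount": 75.5},
    {"id": 3, "name": "Cem", "date": "2024-01-04", "location": "Berlin", "amount": 12.0},
]

def select(table, predicate):
    """Selection: keep only the tuples/rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]

def project(table, attributes):
    """Projection: keep only the listed attributes/columns of every row."""
    return [{a: row[a] for a in attributes} for row in table]

# Selection first (amount >= 20), then projection to (id, date, amount).
relevant = project(select(rows, lambda r: r["amount"] >= 20.0), ["id", "date", "amount"])
print(relevant)
```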

Data Mining

find patterns

Visualization

Different stages of visualization

visualization of data
visualization of data mining results
visualization of data mining processes
interactive visual data mining

Frequent Itemset Mining

Find items that frequently co-occur in the data -> these may indicate correlations or causalities
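As an illustration, co-occurring item pairs can be counted by brute force. This is a sketch only, not an efficient mining algorithm such as Apriori; the baskets and the min_support threshold are made-up assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (transactions); item names are made up.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

def frequent_pairs(baskets, min_support):
    """Count every unordered item pair and keep those that occur
    in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair one canonical order, e.g. ("bread", "butter").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

result = frequent_pairs(transactions, min_support=2)
print(result)
```

With these baskets only the pair (bread, butter) reaches the threshold.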

Ingredients of data types

operations for comparison, as needed for data mining

Data storage space

Single values need 1 Byte = 8 Bit, 4 Byte = 32 Bit, 8 Byte = 64 Bit
Kilobytes 10^3, Megabytes 10^6, Gigabytes 10^9, Terabytes 10^12,
Petabytes 10^15, Exabytes 10^18, Zettabytes 10^21

Operations need calculation time

axioms - metric distance

Simple Data Types:

numerical
ordinal (There is a (total) order <= on the set of possible data values)
categorical = nominal

Composed Data Types

Sequences, Vectors
- Order matters
- comparison
Sets
- Unordered collection of individual values
- Comparison: Jaccard distance, symmetric set difference
- Bitvector Representation
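The set comparisons can be sketched as follows. The universe of possible values and the two example sets are assumptions for illustration; the bitvector is stored as a plain Python integer.

```python
def jaccard_distance(a, b):
    """1 - (intersection size / union size); defined as 0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def to_bitvector(s, universe):
    """Encode a set as an int whose i-th bit is set iff universe[i] is in s."""
    return sum(1 << i for i, item in enumerate(universe) if item in s)

universe = ["a", "b", "c", "d"]  # hypothetical fixed domain of possible values
A, B = {"a", "b", "c"}, {"b", "c", "d"}

distance = jaccard_distance(A, B)
sym_diff_size = len(A ^ B)  # size of the symmetric set difference

bits_a, bits_b = to_bitvector(A, universe), to_bitvector(B, universe)
# On bitvectors, XOR yields the symmetric difference; counting 1-bits gives its size.
sym_diff_bits = bin(bits_a ^ bits_b).count("1")
```

The bitvector form turns set operations into bitwise operations (AND for intersection, OR for union, XOR for symmetric difference), which is why it is attractive when the domain is small and fixed.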

Complex Data Types

Structure: graphs, networks, trees
Geometry: shapes, contours, routes, trajectories
Multimedia: images, audio, text, etc.

Similarity models: Approaches

Similarity Queries

similarity queries are basic operations in (multimedia) databases

Filter-Refine Architecture

ICES criteria for filter quality

Indexable – Index enabled

Complete – No false dismissals

Efficient – Fast individual calculation

Selective – Small candidate set
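A minimal filter-refine sketch for a range query. The choice of distances is an assumption for illustration: the refine step uses the Euclidean distance, and the filter uses the difference in the first dimension only. The filter is complete because that difference can never exceed the Euclidean distance (it is a lower bound).

```python
import math

def exact_distance(p, q):
    """'Refine' distance: Euclidean distance over all dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def filter_distance(p, q):
    """Cheap 'filter' distance: first dimension only.
    It never exceeds exact_distance, so no true hit is dismissed."""
    return abs(p[0] - q[0])

def range_query(data, query, eps):
    # Filter step: discard objects whose lower bound already exceeds eps.
    candidates = [p for p in data if filter_distance(p, query) <= eps]
    # Refine step: evaluate the exact distance on the candidates only.
    hits = [p for p in candidates if exact_distance(p, query) <= eps]
    return candidates, hits

# Hypothetical 2D points and query.
data = [(0.0, 0.0), (1.0, 5.0), (4.0, 0.5), (0.5, 0.5)]
candidates, hits = range_query(data, query=(0.0, 0.0), eps=1.5)
```

Here the filter removes one object without any exact computation, and the refine step removes the remaining false hit. A more selective filter would leave fewer candidates for the refine step.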

Data Visualization Techniques

Data Reduction

Better perception of patterns
Reduce computational complexity

approaches:

Data aggregation (basic statistics)
Data generalization (abstraction to higher levels)

Data Reduction Strategies: Three Directions

Data Aggregation

-> fewer tuples

Basic, Distributive, Algebraic, Holistic Aggregates

Basic:

mean, median, mode, variance

Distributive:

count, sum, min, max

Algebraic:

average, variance

Holistic:

median, mode, rank
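The difference between the aggregate classes shows up when the data is split into partitions. The sketch below uses made-up partitions: distributive aggregates combine directly, an algebraic aggregate is computed from a few distributive ones, and a holistic aggregate such as the median needs all values.

```python
import statistics

# Hypothetical partitions of one data set (e.g. stored in separate chunks).
partitions = [[2, 4, 4], [5, 7], [9, 1, 8, 5]]

# Distributive: partial results per partition are combined directly.
total_count = sum(len(p) for p in partitions)
total_sum = sum(sum(p) for p in partitions)
overall_max = max(max(p) for p in partitions)

# Algebraic: computed from a fixed number of distributive aggregates.
average = total_sum / total_count

# Holistic: the median needs the full data; per-partition medians are not enough.
all_values = [v for p in partitions for v in p]
true_median = statistics.median(all_values)
median_of_medians = statistics.median(statistics.median(p) for p in partitions)
```

In this example the median of the partition medians differs from the true median, which is why holistic aggregates cannot be assembled from small partial summaries.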

Central Tendency (algebraic measures)

Applicable to numerical data only (sum, scalar multiplication)

Mean – (weighted) arithmetic mean (average)
mid-range
- Average of the largest and the smallest values in a data set: (max + min)/2

for ordinal data -> Median

for unordered (categorical) data -> Mode
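A short sketch of these measures using Python's statistics module. All values, weights and category names are made up; the ordinal median is taken over rank positions, which is one possible way to do it.

```python
import statistics

values = [3, 7, 7, 2, 9, 4]    # hypothetical numerical data
weights = [1, 1, 2, 1, 1, 2]   # hypothetical weights

mean = statistics.mean(values)
weighted_mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
mid_range = (max(values) + min(values)) / 2

# Ordinal data: only the order is needed, so the median works on rank positions.
levels = ["low", "medium", "high"]                     # assumed total order
ratings = ["medium", "low", "high", "medium", "low"]
median_rating = levels[statistics.median_low(levels.index(r) for r in ratings)]

# Unordered (categorical) data: only counting is possible, so use the mode.
colors = ["red", "blue", "red", "green"]
most_common_color = statistics.mode(colors)
```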

Boxplot Analysis

Histograms

use binning to approximate data distributions
Divide data into bins and store a representative (sum, average, median) for each bin

Equi-width Histograms

Divide the range into N intervals of equal size (uniform grid)
If A and B are the lowest and highest values of the attribute, the width of the intervals will be (B - A)/N
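A minimal equi-width sketch that stores a count per bin. The data is made up, and the function assumes at least two distinct values so that the width is not zero.

```python
def equi_width_histogram(values, n_bins):
    """Count values per bin; every bin has width (B - A) / n_bins.
    Assumes the values are not all equal (otherwise the width is 0)."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum B would land in bin n_bins; clamp it into the last bin.
        index = min(int((v - a) / width), n_bins - 1)
        counts[index] += 1
    return width, counts

values = [1, 2, 2, 3, 5, 8, 9, 13, 20, 21]  # hypothetical attribute values
width, counts = equi_width_histogram(values, n_bins=4)
```

With skewed data like this, most values end up in the first bin, which is the typical weakness of equal-width bins.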

Equi-height Histograms

Divide the range into N intervals, each containing approx. the same number of samples (quantile-based approach)
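A minimal equi-height sketch, assuming the bins are formed by cutting the sorted values at evenly spaced positions. The data is made up.

```python
def equi_height_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (almost) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i covers the sorted positions [i*n // n_bins, (i+1)*n // n_bins).
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

values = [1, 2, 2, 3, 5, 8, 9, 13, 20, 21]  # hypothetical attribute values
bins = equi_height_bins(values, n_bins=4)
```

Each bin now holds two or three values, while the value ranges covered by the bins differ in width.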

Data Generalization

A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.

OLAP + operations

on-line analytical processing (OLAP), Data cube approach

OLAP operations:

Roll-up
- Summarize data by climbing up hierarchy or by dimension reduction.
Drill-down
- Reverse of roll-up. From higher level summary to lower level summary or detailed data, or introducing new dimensions.
Slice and dice
- Selection on one (slice) or more (dice) dimensions.
Pivot (rotate)
- Reorient the cube for visualization, e.g. turn a 3D cube into a series of 2D planes.
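The operations can be illustrated on a tiny fact table instead of a real data cube. Everything here is a made-up assumption: the dimensions (city/country, month/quarter, product), the amounts and the helper name roll_up.

```python
from collections import defaultdict

# Hypothetical facts: (city, country, month, quarter, product, amount).
facts = [
    ("Munich", "Germany", "Jan", "Q1", "TV", 100),
    ("Berlin", "Germany", "Feb", "Q1", "TV", 150),
    ("Berlin", "Germany", "Apr", "Q2", "PC", 200),
    ("Paris", "France", "Jan", "Q1", "PC", 120),
]

def roll_up(rows, key):
    """Sum the amounts per group; key(row) picks the aggregation level."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[-1]
    return dict(totals)

by_city_month = roll_up(facts, lambda r: (r[0], r[2]))       # detailed level
by_country_quarter = roll_up(facts, lambda r: (r[1], r[3]))  # roll-up: climb the hierarchies
by_country = roll_up(facts, lambda r: r[1])                  # roll-up: drop the time dimension

q1_slice = [r for r in facts if r[3] == "Q1"]                # slice: one dimension
diced = [r for r in facts if r[3] == "Q1" and r[4] == "TV"]  # dice: several dimensions
```

Drill-down is the reverse direction: going from by_country_quarter back to by_city_month.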