What is data mining/knowledge discovery in databases?

Extraction of interesting information or patterns from data in large databases

What are the roots of data mining?

Statistics

Machine Learning

Database Systems

Information Visualization

What are the 3 different types of learning in Machine Learning?

Descriptive Learning

Better understanding - data mining

e.g. pattern recognition, clustering, outlier detection

Predictive Learning

Better forecasts - regression

e.g. traffic prediction, labeling, fraud detection

Prescriptive Learning

Better actions - artificial intelligence

e.g. predictive maintenance, autonomous driving, medical therapies

What is the data explosion problem? What is the solution?

Tremendous amounts of data caused by

Automated data collection

Mature database technology

Solution:

Data Warehousing and on-line analytical processing (OLAP)

Data Mining (give definition)

What are potential applications of data mining?

Database analysis and decision support

Market analysis and management

Risk analysis and management

Fraud detection and management

Other applications:

Text mining

Intelligent query answering

What is the KDD (Knowledge Discovery in Databases) process? (Graphic)

What are the processes of data mining?

Frequent Pattern Mining

Clustering

Classification

Regression

Process Mining

What happens in the Data Cleaning & Integration step of the KDD Process?

may take 60% of the effort

Integration of data from different sources

Mapping of attribute names

Joining different tables

Elimination of inconsistencies

Elimination of noise

Computation of missing values, e.g. default value, average value

In the KDD process, what tasks are involved in focusing on task-relevant data?

Task:

Find useful features

Dimensionality/variable reduction

Invariant Representation

Creating a target data set

In the KDD process what are the different methods of focusing on task-relevant data?

Selections

Selecting the relevant tuples/rows from the database table

Projections

Selecting the relevant attributes/columns from the database tables

Transformations

Discretization of numerical attributes

Computation of derived tuples/rows

e.g. aggregation (two or more rows -> one), derivation of new attributes (one or more attributes -> one or more new ones)

What is the goal and task of basic data mining tasks?

Goal:

Find patterns of interest

Are there labels (in the training data)?

Many -> Supervised learning

A few -> Semi-supervised learning

None -> Unsupervised learning

Choose fitting mining algorithms

What is the setting, motivation, and applications of frequent itemset mining?

Setting:

Given a database of transactions e.g.

Motivation:

Find the frequently co-occurring items in the set of transactions which indicate correlations or causalities

Applications:

Market-based analysis

Cross-marketing

Catalogue design

As a basis for clustering, classification

Association rule mining: determining correlations between different itemsets

What is the setting, motivation, and applications of clustering?

Database of objects O

Unknown class labels

Similarity model for objects, often as a function sim: O x O -> R

Group objects into clusters while maximising intra-cluster similarity (cohesion) and minimising inter-clustering similarity (separation)

Customer profiling/segmentation

Document or image collections

Web access patterns

What is the setting, motivation, and applications of classification?

Class labels are known for a set of training data

Find models/functions/rules that

describe and distinguish the classes

predict class membership for new objects

Classify disease type for samples based on gene expression values

Automatic assignment of categories

Predict unknown or missing values

What is the setting, motivation, and applications of regression?

Numerical output values are known for a training set

describe the numerical output values of the training data

predict the numerical value for new objects

Building a model of the housing values, which can be used to predict the price of a house in a certain area

Building a model of an engineering process as a basis to control a technical system

What is done in making generalisation levels as a basic mining task?

Generalise, summarise, and contrast data characteristics

Based on attribute aggregation along concept hierarchies

Data cube approach (OLAP)

Attribute-oriented induction approach

What is outlier detection?

Finding objects which do not comply with the general behaviour of the data

e.g. fraud detection, rare events analysis

What is trends and evolution analysis?

Sequential patterns (finding reoccurring sequences of events)

What are methods for special data types, and applications?

Process mining

Spatial data mining

Graphs

…

What is involved in the Evaluation & Visualization part of the KDD Process?

Pattern evaluation and knowledge presentation

What is data?

Representation of real objects, situations, processes

Measured by physical sensors

Recorded from digital systems

Generated by simulations

Stored and provided by computers

How to represent data?

Numerical and categorical data types

Similarity models -> allow for pattern mining

Data reduction -> to increase efficiency

What are the ingredients of data types?

identifiers (int, long, float)

domains ({1,2,…},{true,false},{‘hello’,’hi’})

operations (+,and,concat)

How can data be stored?

Single values need a few bytes (1 Byte = 8 bits, 4 Bytes = 32 bits)

Big data sets occupy kilobytes, …, zettabytes

Data compression helps (partially)

What is the operational calculation time for

calculating the Euclidean distance of two objects from R^d

calculating the mean of n objects from R^d

calculating the covariance matrix of n objects x_i in R^d

O(d)

O(nd)

O(nd^2)
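These costs can be illustrated with a plain-Python sketch (the helper names are my own, not from the lecture); each comment states where the stated complexity comes from:

```python
import math

def euclidean(p, q):
    # O(d): one pass over the d coordinates
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def mean(objects):
    # O(n*d): sum n vectors coordinate-wise
    n, d = len(objects), len(objects[0])
    return [sum(x[j] for x in objects) / n for j in range(d)]

def covariance(objects):
    # O(n*d^2): d*d matrix entries, each an O(n) sum (population covariance)
    n, d = len(objects), len(objects[0])
    mu = mean(objects)
    return [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in objects) / n
             for j in range(d)]
            for i in range(d)]
```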

What are the different data types? (with examples)

Simple, basic data types (numbers, symbols)

Composed data types (vectors, sets)

Complex data types (multimedia, spatial, structural data)

What is a metric space, and which properties must it fulfil?

A metric space (O,d) consists of object set O and metric distance function d: O x O -> R which fulfills:

Symmetry: ∀p,q ∈ O : d(p,q) = d(q,p)

Identity of Indiscernibles: ∀p,q ∈ O : d(p,q) = 0 <=> p = q

Triangle Inequality: ∀p,q,o∈ O : d(p,q) <= d(p,o) + d(o,q)

e.g. points in 2D space with Euclidean distance

What are counter examples to metric space?

non-symmetric: one-way roads on the way from p to q

non-definite: indiscernible (equivalent) objects are not necessarily identical

triangles: sometimes detours are preferred

What is categorical data and how can one compare values?

Symbols/identifiers

e.g. subjects= {“maths”,”physics”}, occupation = {“butcher”}

e.g. trivial metric

d(p,q) = 0: if p==q, else 1

e.g. explicit similarity matrix

e.g. path length in generalization hierarchy
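Two of these options can be sketched directly (the subject values come from the example above; the distance values in the similarity matrix are made up):

```python
def trivial_metric(p, q):
    # d(p, q) = 0 if p == q, else 1 -- satisfies all metric axioms
    return 0 if p == q else 1

# Explicit similarity matrix: expert-defined pairwise distances;
# symmetry must be ensured by construction.
subject_dist = {
    ("maths", "maths"): 0.0,
    ("physics", "physics"): 0.0,
    ("maths", "physics"): 0.3,  # related subjects: small distance (made-up value)
    ("physics", "maths"): 0.3,
}
```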

What are the characteristic properties of ordinal data?

There is a total order ≤ on the set of possible data values O:

Transitivity: ∀p,q,o∈O : p≤q ∧ q≤o => p≤o

Antisymmetry: ∀p,q∈O : p≤q ∧ q≤p => p = q

Totality: ∀p,q∈O : p≤q ∨ q≤p

Examples:

words, lexicographic ordering: high <= highschool

(vague) sizes: tiny <= small <= medium <= big

frequencies: never <= seldom <= rarely <= sometimes

numbers: 0<=1, 1.78 <= 2.35

What are the characteristics of composed data types?

Given a domain D, a sequence s of length n is a mapping I_n -> D of the index set I_n = {1,…,n} into D, s ∈ D^n for short.

The sequence s lists n values from D in a fixed order; unlike for sets, the order matters

What are the characteristics of set data types and how can one compare values?

Unordered collection of individual values

e.g. skills = {Java, C, Python}, hobbies = {skiing, diving, hiking}

comparison

e.g. Hamming distance on bitvector representations (see separate card)

How can one use bitvector representations to measure hamming distances between sets?

Given an ordered base set B = (b1,…,bn), for any set S create a binary vector r ∈ {0,1}^n with r_i = 1 <=> b_i ∈ S.

Hamming distance: number of differing entries (equals the cardinality of the symmetric set difference)
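A minimal sketch of this (function names are illustrative):

```python
def to_bitvector(s, base):
    # r_i = 1 iff the i-th base element is contained in the set
    return [1 if b in s else 0 for b in base]

def hamming(r1, r2):
    # number of positions where the two bitvectors differ
    return sum(a != b for a, b in zip(r1, r2))

base = ["Java", "C", "Python", "Go"]
s1 = {"Java", "C", "Python"}
s2 = {"Java", "Go"}
d = hamming(to_bitvector(s1, base), to_bitvector(s2, base))
# d equals len(s1 ^ s2), the cardinality of the symmetric set difference
```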

What are the complex data types and what are examples of them?

Structure: graphs, networks, trees

Geometry: shapes, contours, routes, trajectories

Multimedia: images, audio, text, etc.

What is the approach for similarity models between complex data types?

Direct measures: highly data type dependent

Feature extraction/feature engineering: explicit vector space embedding with hand-crafted features

Kernel trick: implicit vector space embeddings

Feature learning: vector space embeddings learned by deep learning methods e.g. neural networks

How can objects and attributes be conceptually modelled?

What occurs in feature extraction?

Objects from database DB are mapped to feature vectors

Feature vector space

Points represent objects

Distance corresponds to (dis)similarity

What are similarity queries and what are the different queries?

Similarity queries are basic operations in databases

Given: universe O, database DB ⊆ O, distance function d and query object q ∈ O

Range query

(k-) Nearest neighbor query

Ranking query

How is a range query calculated?

Range query for range parameter ε ∈ R^+_0:

range(DB, q, d, ε) = {o ∈ DB | d(o,q) ≤ ε}
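A naive sequential-scan sketch of the range query (illustrative names, 1D objects for brevity):

```python
def range_query(db, q, d, eps):
    # all objects o in db with d(o, q) <= eps
    return [o for o in db if d(o, q) <= eps]

dist = lambda a, b: abs(a - b)  # 1D Euclidean distance
db = [1.0, 2.5, 4.0, 9.0]
result = range_query(db, 3.0, dist, 1.5)  # keeps 2.5 and 4.0
```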

How is a nearest neighbor query calculated?

Nearest neighbor query:

NN(DB,q,d) = {o ∈ DB | ∀o′ ∈ DB : d(o,q) ≤ d(o′,q)}

How is a k-nearest neighbour query calculated?

NN(DB,q,d,k) ⊆ DB with |NN(DB,q,d,k)| = k and ∀o ∈ NN(DB,q,d,k), o′ ∈ DB − NN(DB,q,d,k) : d(o,q) ≤ d(o′,q)


The query first calculates the distance between q and each object in DB, then sorts the objects by their distance to q; the k objects with the smallest distances are returned as the nearest neighbors of q.

The result is guaranteed to contain exactly k objects, each at least as close to q as any object in DB outside the result set.
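The naive k-NN computation described above can be sketched as follows (illustrative; ties are broken arbitrarily by sort order):

```python
def knn(db, q, d, k):
    # sort all objects by distance to q and keep the k closest
    return sorted(db, key=lambda o: d(o, q))[:k]

dist = lambda a, b: abs(a - b)  # 1D Euclidean distance
db = [1.0, 2.5, 4.0, 9.0]
neighbors = knn(db, 3.0, dist, 2)  # the 2 objects closest to q = 3.0
```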

How is a ranking query calculated?

“get next” functionality for picking database objects in increasing order w.r.t. their distance to q:

∀i ≤ j : d(q, rank_{DB,q,d}(i)) ≤ d(q, rank_{DB,q,d}(j))
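The "get next" functionality can be sketched with a generator over a heap (an illustrative sketch, not the lecture's implementation):

```python
import heapq

def ranking(db, q, d):
    # yield database objects in increasing order of distance to q
    heap = [(d(o, q), i, o) for i, o in enumerate(db)]  # index i breaks ties
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)[2]

dist = lambda a, b: abs(a - b)
# each next() call on ranking(...) returns the object with the next-larger distance to q
```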

How might a similarity search be executed naively?

Using a range query:

Naive search by sequential scan

Fetch database objects from secondary storage (e.g. disk): O(n) time

Check distances individually: O(n) time

How might a similarity search be executed quickly (faster than naive)?

Fast search by applying database techniques

Filter-refine architecture

Filter: Boil DB down to (small) candidate set C ⊆ DB

Refine: Apply exact distance calculations to candidates from C only

Indexing structures

Avoid sequential scans by (hierarchical or other) index techniques

Data access in time O(n), O(log n) or even O(1)

In filter-refine architecture, what is the principle of multi-step search? What is an example?

Fast filter step produces candidate set C ⊆ DB (using an approximating distance function d′)

Exact distance function d is calculated on candidate set C only

Example: Dimensionality Reduction
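A sketch of multi-step search where the filter distance d′ lower-bounds the exact distance d, here via dimensionality reduction to the first coordinate (illustrative names; d′ ≤ d guarantees completeness, i.e. no false dismissals):

```python
import math

def d_exact(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def d_filter(p, q):
    # cheap lower bound: |p_1 - q_1| <= d_exact(p, q)
    return abs(p[0] - q[0])

def range_query_multistep(db, q, eps):
    candidates = [o for o in db if d_filter(o, q) <= eps]   # filter: complete
    return [o for o in candidates if d_exact(o, q) <= eps]  # refine: exact
```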

What is the ICES criteria for filter quality?

Indexable - Index enabled

Complete - No false dismissals

Efficient - Fast individual calculation

Selective - Small candidate set

What is the principle of indexing?

To organize data in a way for fast access to relevant objects, e.g. by heavy pruning.

R-Tree as an example for spatial index structure:

Hierarchy of minimum bounding rectangles

Disregard subtrees which are not relevant for the current query region

What does data visualization do?

Data visualization transforms data in visually perceivable representations

Combines the capabilities:

Computers are good in number crunching

Humans are good in visual pattern recognition

What are the 4 types of data visualization techniques?

Geometric

Icon-based

Pixel-oriented

Other

What is in geometric data visualization techniques?

Visualization of geometric transformations and projections of the data

e.g. Scatterplots, Parallel Coordinates

What is in icon-based data visualization techniques?

Visualization of data as icons

e.g. Chernoff Faces, Stick Figures

What is in pixel-oriented data visualization techniques?

Visualize each attribute value of each data object by one coloured pixel

e.g. recursive patterns

What are other examples of data visualization techniques?

Hierarchical techniques, graph-based techniques, hybrid techniques

What are characteristics of 2D scatterplots?

Designed for 2D data, two axes span the drawing area

2D objects are depicted as points in the diagram

Orthogonal projections of points to axes indicate coordinate values

Provides a first look at bivariate data to see clusters of points, outliers, etc.

What are the characteristics of a scatterplot matrix?

Matrix of scatterplots for pairs of dimensions

Ordering of dimensions is important:

reordering improves understanding of structures and reduces clutter

Interestingness of orderings can be evaluated with quality metrics

What are limitations of scatterplots?

Scatterplots work well for 2 dimensions only

Only depict pairwise correlations

Projections get ambiguous for 3 or more dimensions

What are polygonal plots?

Depict objects by curves rather than by points

Coordinates are not indicated by orthogonal projections but by explicit intersections of polygons with axes

e.g. spiderweb - axes remain connected at origin

e.g. parallel coordinates - axes are aligned in parallel

What are characteristics of the spiderweb model?

Illustrate any single object by a polygonal line

Contract origins of all axes to a global origin point

Works well for few objects only

e.g. monthly average temperature

What are characteristics of parallel coordinates?

d-dimensional data space is visualised by d parallel axes

each axis is scaled to min-max range

object = polygonal line intersecting axis at value in this dimension

ordering important to reveal correlations

What are characteristics of pixel-oriented techniques?

Each data value is mapped onto a colored pixel

Each dimension is shown in a separate window

Arrangement strategies: recursive patterns, iterated line- and column-based arrangements

What are characteristics of chernoff faces?

Map d-dimensional space to facial expression, e.g. length of nose = dim 6; curvature of mouth = dim 8

What is the advantage with using chernoff faces?

Humans can evaluate similarity between faces much more intuitively than between high-dimensional vectors

What are the disadvantages with using chernoff faces?

Without dimensionality reduction only applicable to data spaces with up to 18 dimensions

Which dimension represents what part?

Why data reduction?

Better perception of patterns

Raw data is hard to understand

Visualization is limited to thousands of objects

Reduction of data may help to identify patterns

Computational complexity

Big data sets cause prohibitively long runtime for data mining algorithms

Reduced data sets are the more useful, the more closely the algorithms produce the same analytical results as on the full data

How to approach data reduction?

Numerosity reduction

Data aggregation (basic statistics)

Data generalization (abstraction to higher levels)

Dimensionality reduction (reduction of attributes):

Linear methods: (PCA)

Non-linear methods: (Multidimensional scaling)

Quantization, discretization (reduce number of values per domain):

Binning (histograms)

Generalization along hierarchies (OLAP)

What occurs in data aggregation?

Aggregation is numerosity reduction (less tuples/rows)

Generalisation yields duplicates -> add a counting attribute and merge duplicate tuples

What are basic aggregates?

Central tendency: Where is the data located/ where is it centred?

Variation, spread: How much do the data deviate from the centre?

How can the value of a distributive measure d be calculated and what are examples?

The value of a distributive measure d on D can be calculated by combining the results of distributed calculations on partitions D_i ⊂ D , D = D1 ∪ D2 ∪ . . . Dn

e.g.

count(D1 ∪ D2) = count(D1) + count(D2)

sum(D1 ∪ D2) = sum(D1) + sum(D2)

min(D1 ∪ D2) = min(min(D1), min(D2))

max(D1 ∪ D2) = max(max(D1), max(D2))
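These combination rules can be checked directly (the partition values are made up); the last line previews that the mean is an algebraic measure built from the two distributive aggregates sum and count:

```python
D1, D2 = [1, 2, 3], [4, 5]
D = D1 + D2  # D = D1 ∪ D2 (disjoint partitions)

count_D = len(D1) + len(D2)    # distributive
sum_D = sum(D1) + sum(D2)      # distributive
min_D = min(min(D1), min(D2))  # distributive
max_D = max(max(D1), max(D2))  # distributive

mean_D = sum_D / count_D       # algebraic: function of distributive aggregates
```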

What is an algebraic measure?

An algebraic measure on D can be computed by an algebraic function with M arguments, each of which is obtained by applying a distributive aggregate function to the partitions D_i ⊂ D , D = D1 ∪ D2 ∪ . . . Dn

What are holistic measures?

There is no constant bound on the storage size that is needed to calculate and represent sub-aggregates.

What are the different ways of measuring the central tendency?

Mean - (weighted) arithmetic mean

Mid-range

Median

Mode

How are (weighted) arithmetic mean and mid-range calculated and to what data is it applicable to?

(Weighted) arithmetic mean is calculated by x̄ = (Σ_i w_i x_i) / (Σ_i w_i), with all w_i = 1 for the unweighted mean

Mid-range is calculated by the average of the largest and the smallest values in a data set

Both are algebraic measures

Applicable to numerical data only (sum, scalar multiplication)

To what data is calculating the median applicable to?

Applicable to ordinal data only (data that can be ordered), as total order is required

Median is a holistic measure (there is no constant bound on the storage needed for sub-aggregates)

To what data is calculating the mode best applicable to?

To categorical (non-numerical) data

What does variance do and how is it measured?

The variance measures the spread around the mean of numerical data

The single pass calculation (sum of squares and square of sum in parallel) is faster than the two-pass method

Variance is zero iff. all the values are equal

Standard deviation is equal to the square root of the variance

These are both algebraic measures
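The two calculation strategies can be sketched as follows (an illustrative sketch; the single-pass formula Var(X) = E[X²] − E[X]² needs only one pass over the data but is numerically less stable):

```python
def variance_single_pass(data):
    # one pass: accumulate count, sum, and sum of squares in parallel
    n = s = sq = 0
    for x in data:
        n += 1
        s += x
        sq += x * x
    m = s / n
    return sq / n - m * m  # E[X^2] - E[X]^2

def variance_two_pass(data):
    # reference: first compute the mean, then sum the squared deviations
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / len(data)
```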

What are features of boxplot analysis?

What is quantization?

Mapping a continuous range of values into a discrete set of values

It may use generalization

e.g. group age (7 bits -> 128 values) to age_range (4 bits -> 16 values only)

How is dimensionality reduction a borderline case of quantization?

e.g. dropping age corresponds to a reduction to zero bits, whereas quantization keeps some bits (10-14 -> teen, 25-29 -> twen)

corresponds to generalization of age to ‘all’ = ‘any age’ = no information

In data generalization how to group the values of an attribute into partitions?

Use overall data (= single partition)

(Overall mean and overall variance are too coarse)

Different techniques to form groups for aggregation

binning - histograms

generalization - abstraction based on hierarchies

clustering - based on object similarity

What binning techniques are there in histograms?

Histograms use binning to approximate data distributions

Divide data into bins and store a representative for each bin

To make equi-width histograms:

divide the range into N intervals of equal size (uniform grid)

If A and B are the lowest and highest values of the attribute, the width of the intervals will be (B-A)/N
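A minimal sketch of equi-width binning (illustrative names; the maximum value B is clamped into the last bin):

```python
def equi_width_bins(values, n_bins):
    a, b = min(values), max(values)
    width = (b - a) / n_bins                       # interval width (B - A) / N
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - a) / width), n_bins - 1)  # clamp v == b into last bin
        counts[i] += 1
    return counts

# equi_width_bins([1, 2, 2, 3, 9, 10], 3) counts over bins [1,4), [4,7), [7,10]
```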

What are the benefits and limitations of equi-width histograms?

Benefits:

Most straightforward

Easy to understand

Limitations:

Outliers may dominate presentation

Skewed data is not handled well

What are the benefits and limitations of equi-height histograms?

Good data scaling

Less prone to outliers

Four-bin example is strongly related to boxplot

Key information is in the bin boundaries (somewhat subtle)

If any value occurs often, the equal frequency criterion might not be met

What are examples of concept hierarchies?

Flat groups (no serious hierarchies e.g. name, gender, phone no)

Hierarchies by subgrouping (age -> 15-19, 20-24, 25-30)

Hierarchies from schema information (major -> science/business/engineering)

How can a concept hierarchy for categorical data be established?

Concept hierarchies can be specified by (expert) users

Heuristically generate a hierarchy for a set of (related) attributes

based on the number of distinct values in the attribute set

The attribute with the most distinct values is placed at the lowest level of the hierarchy (e.g. in an address hierarchy, street is the lowest)

Counter-example: 20+ distinct years, 12 months, 7 days of the week; yet year, not day of the week, belongs at the top of the hierarchy

What is data generalization?

A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones

What are approaches to data generalization?

Data-cube approach (OLAP / Roll-up) - manual

Attribute-oriented induction (AOI) - automated

What are the basic OLAP operations and what do they do?

Roll up

Summarize data by climbing up hierarchy or by dimension reduction

Drill down

Reverse of roll up. From higher level summary to lower level summary or detailed data, or introducing new dimensions

Slice and dice

Selection on one (slice) or more (dice) dimensions.

Pivot (rotate)

Reorient the cube, visualization, 3D to series of 2D planes

Drill across

involving more than one fact table

Drill through

through the bottom level of the cube to its back-end relational tables (SQL)

Roll up vs Drill Down Example

Slice Operation Example

Dice Operation Example

Pivot Operation Example

How can generalizations be specified by star-nets?

Each circle is called a footprint

Footprints represent the granularities available for OLAP operations

What are the strengths of OLAP-based generalizations?

Efficient implementation of data generalizations

Computation of various kinds of measures e.g. count, sum

Generalization can be performed on a data cube by roll-up (or drill down)

What are limitations of OLAP-based generalisations?

Handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values

Lack of intelligent analysis, does not suggest which dimensions should be used and what levels the generalization should aim at

How is AOI different to OLAP?

More automated approach than OLAP

Applies aggregation by merging identical, generalized tuples and accumulating their respective counts

Builds on data focusing: task-relevant data, including dimensions, resulting in their initial relation

Generates a generalization plan: decide for either attribute removal or attribute generalization

In AOI, what are the 3 choices for each attribute?

Keep it

Attribute Removal, for the following cases

There is a large set of distinct values for A but no generalization operator (concept hierarchy) on A

A’s higher level concepts are expressed in terms of other attributes (street is covered by city,state, country)

Attribute Generalization

If there is a large set of distinct values for A, and there exists a set of generalization operators (concept hierarchy) on A, then select an operator and generalize A

Example of AOI Application

What are the two common approaches to attribute generalization control?

Attribute-threshold control: default or user-specified, typically 2-8 values

Generalized relation threshold control: control the size of the final relation/rule, e.g. 10-30 tuples

In the tradeoff of assigning how many distinct values are for an attribute, what are the possible problems?

Overgeneralization: values are too high-level

Undergeneralization: level not sufficiently high

Both yield tuples of poor usefulness

What are strategies for next attribute selection?

Aiming at minimal degree of generalization

choose the attribute that reduces the number of tuples the most

useful heuristic: choose attributes with highest number of distinct values

Aiming at similar degree of generalization for all attributes

choose the attribute currently with least degree of generalization

User-controlled

domain experts may specify priorities for selection
