What is data mining/knowledge discovery in databases?
Extraction of interesting information or patterns from data in large databases
What are the roots of data mining?
Statistics
Machine Learning
Database Systems
Information Visualization
What are the 3 different types of learning in Machine Learning?
Descriptive Learning
Better understanding - data mining
e.g. pattern recognition, clustering, outlier detection
Predictive Learning
Better forecasts - regression
e.g. traffic prediction, labeling, fraud detection
Prescriptive Learning
Better actions - artificial intelligence
e.g. predictive maintenance, autonomous driving, medical therapies
What is the data explosion problem? What is the solution?
Tremendous amounts of data caused by
Automated data collection
Mature database technology
Solution:
Data Warehousing and on-line analytical processing (OLAP)
Data Mining (give definition)
What are potential applications of data mining?
Database analysis and decision support
Market analysis and management
Risk analysis and management
Fraud detection and management
Other applications:
Text mining
Intelligent query answering
What is the KDD (Knowledge Discovery in Databases) process? (Graphic)
What are the processes of data mining?
Frequent Pattern Mining
Clustering
Classification
Regression
Process Mining
What happens in the Data Cleaning & Integration step of the KDD Process?
may take 60% of the effort
Integration of data from different sources
Mapping of attribute names
Joining different tables
Elimination of inconsistencies
Elimination of noise
Computation of missing values, e.g. default value, average value (see the sketch below)
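A minimal pandas sketch (hypothetical table and column names) of both imputation strategies:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 47], "city": ["Munich", None, "Berlin"]})

# Default value for a categorical attribute
df["city"] = df["city"].fillna("unknown")
# Average value for a numerical attribute
df["age"] = df["age"].fillna(df["age"].mean())
```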
In the KDD process what is meant by the task which is focused on in task-relevant data?
Task:
Find useful features
Dimensionality/variable reduction
Invariant Representation
Creating a target data set
In the KDD process what are the different methods of focusing on task-relevant data?
Selections
Selecting the relevant tuples/rows from the database table
Projections
Selecting the relevant attributes/columns from the database tables
Transformations
Discretization of numerical attributes
Computation of derived tuples/rows
e.g. aggregation (2+ tuples -> 1), derivation of new attributes (1+ attributes -> 1 new); see the sketch below
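A small pandas sketch (hypothetical data) of these transformations:

```python
import pandas as pd

sales = pd.DataFrame({"region": ["north", "north", "south"],
                      "price":  [10.0, 20.0, 15.0],
                      "tax":    [1.0, 2.0, 1.5]})

# Discretization of a numerical attribute
sales["price_class"] = pd.cut(sales["price"], bins=[0, 15, 100],
                              labels=["low", "high"])
# Aggregation (2+ rows -> 1): one summary tuple per region
per_region = sales.groupby("region")["price"].sum()
# Derived attribute (1+ columns -> 1 new): total = price + tax
sales["total"] = sales["price"] + sales["tax"]
```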
What is the goal and task of basic data mining tasks?
Goal:
Find patterns of interest
Are there labels (in the training data)?
Many -> Supervised learning
A few -> Semi-supervised learning
None -> Unsupervised learning
Choose fitting mining algorithms
What is the setting, motivation, and applications of frequent itemset mining?
Setting:
Given a database of transactions e.g.
Motivation:
Find the frequently co-occurring items in the set of transactions which indicate correlations or causalities
Applications:
Market-based analysis
Cross-marketing
Catalogue design
As a basis for clustering, classification
Association rule mining: determining correlations between different itemsets
What is the setting, motivation, and applications of clustering?
Database of objects O
Unknown class labels
Similarity model for objects, often given as a function sim: O x O -> R
Group objects into clusters while maximising intra-cluster similarity (cohesion) and minimising inter-cluster similarity (separation)
Customer profiling/segmentation
Document or image collections
Web access patterns
What is the setting, motivation, and applications of classification?
Class labels are known for a set of training data
Find models/functions/rules that
describe and distinguish the classes
predict class membership for new objects
Classify disease type for samples based on gene expression values
Automatic assignment of categories
Predict unknown or missing values
What is the setting, motivation, and applications of regression?
Numerical output values are known for a training set
describe the numerical output values of the training data
predict the numerical value for new objects
Building a model of the housing values, which can be used to predict the price of a house in a certain area
Building a model of an engineering process as a basis to control a technical system
What is done in making generalisation levels as a basic mining task?
Generalise, summarise, and contrast data characteristics
Based on attribute aggregation along concept hierarchies
Data cube approach (OLAP)
Attribute-oriented induction approach
What is outlier detection?
Finding objects which do not comply with the general behaviour of the data
e.g. fraud detection, rare events analysis
What is trends and evolution analysis?
Sequential patterns (finding reoccurring sequences of events)
What are methods for special data types, and applications?
Process mining
Spatial data mining
Graphs
…
What is involved in the Evaluation & Visualization part of the KDD Process?
Pattern evaluation and knowledge presentation
What is data?
Representation of real objects, situations, processes
Measured by physical sensors
Recorded from digital systems
Generated by simulations
Stored and provided by computers
How to represent data?
Numerical and categorical data types
Similarity models -> allow for pattern mining
Data reduction -> to increase efficiency
What are the ingredients of data types?
identifiers (int, long, float)
domains ({1,2,…},{true,false},{‘hello’,’hi’})
operations (+,and,concat)
How can data be stored?
Single values need e.g. 1 byte (8 bits) or 4 bytes (32 bits)
Big data sets occupy kilobytes, …, zettabytes
Data compression helps (partially)
What is the operational calculation time for
calculating the Euclidean distance of two objects from R^d
calculating the mean of n objects from R^d
calculating the covariance matrix of n objects x_i in R^d
O(d)
O(nd)
O(nd^2)
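A numpy sketch of the three calculations; the comments state the asymptotic costs from above:

```python
import numpy as np

n, d = 1000, 16
X = np.random.rand(n, d)   # n objects from R^d
p, q = X[0], X[1]

dist = np.sqrt(((p - q) ** 2).sum())   # Euclidean distance: O(d)
mean = X.mean(axis=0)                  # mean vector: O(nd)
cov = np.cov(X, rowvar=False)          # d x d covariance matrix: O(nd^2)
```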
What are the different data types? (with examples)
Simple, basic data types (numbers, symbols)
Composed data types (vectors, sets)
Complex data types (multimedia, spatial, structural data)
What is a metric space, and how is it fulfilled?
A metric space (O,d) consists of object set O and metric distance function d: O x O -> R which fulfills:
Symmetry: ∀p,q ∈ O : d(p,q) = d(q,p)
Identity of Indiscernibles: ∀p,q ∈ O : d(p,q) = 0 <=> p = q
Triangle Inequality: ∀p,q,o∈ O : d(p,q) <= d(p,o) + d(o,q)
e.g. points in 2D space with Euclidean distance
What are counter examples to metric space?
non-symmetric: one-way roads on the way from p to q
non-definite: indiscernible equivalent objects are not necessarily identical
triangles: sometimes detours are preferred
What is categorical data and how can one compare values?
Symbols/identifiers
e.g. subjects= {“maths”,”physics”}, occupation = {“butcher”}
e.g. trivial metric
d(p,q) = 0: if p==q, else 1
e.g. explicit similarity matrix
e.g. path length in generalization hierarchy
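A minimal Python sketch of the trivial metric and an explicit similarity matrix (hypothetical similarity values):

```python
def trivial_metric(p, q):
    """Trivial metric on categorical values: 0 iff equal, else 1."""
    return 0 if p == q else 1

# Explicit similarity matrix as a nested dict
sim = {"maths":   {"maths": 1.0, "physics": 0.6},
       "physics": {"maths": 0.6, "physics": 1.0}}
print(trivial_metric("maths", "physics"), sim["maths"]["physics"])  # 1 0.6
```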
What are the characteristic fulfillments of ordinal data?
There is a total order ≤ on the set of possible data values O:
Transitivity: ∀p,q,o∈O : p≤q ∧ q≤o => p≤o
Antisymmetry: ∀p,q ∈ O : p≤q ∧ q≤p => p = q
Totality: ∀p,q∈O : p≤q ∨ q≤p
Examples:
words, lexicographic ordering: high <= highschool
(vague) sizes: tiny <= small <= medium <= big
frequencies: never <= seldom <= rarely <= sometimes
numbers: 0<=1, 1.78 <= 2.35
What are the characteristics of composed data types?
Given a domain D, a sequence s of length n is a mapping I_n -> D of the index set I_n = {1,…,n} into D, s ∈ D^n for short.
The sequence s arranges n values from D; in contrast to sets, the order matters
What are the characteristics of set data types and how can one compare values?
Unordered collection of individual values
e.g. skills = {Java, C, Python}, hobbies = {skiing, diving, hiking}
comparison
+ Hamming distance (see separate card)
How can one use bitvector representations to measure hamming distances between sets?
Given an ordered base set B = (b1,…,bn), for any set S create a binary vector r ∈ {0,1}^n with r_i = 1 <=> b_i ∈ S.
Hamming distance: Sum of different entries (equal cardinality of symmetric set difference)
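A small Python sketch of this bitvector construction (hypothetical base set):

```python
def hamming(set_a, set_b, base):
    """Hamming distance of two sets via bitvectors over an ordered base set.

    Equals the cardinality of the symmetric set difference.
    """
    ra = [1 if b in set_a else 0 for b in base]
    rb = [1 if b in set_b else 0 for b in base]
    return sum(x != y for x, y in zip(ra, rb))

base = ["Java", "C", "Python", "Go"]
print(hamming({"Java", "C"}, {"C", "Python"}, base))  # 2
```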
What are the complex data types and what are examples of them?
Structure: graphs, networks, trees
Geometry: shapes, contours, routes, trajectories
Multimedia: images, audio, text, etc.
What is the approach for similarity models between complex data types?
Direct measures: highly data type dependent
Feature extraction/feature engineering: explicit vector space embedding with hand-crafted features
Kernel trick: implicit vector space embeddings
Feature learning: vector space embeddings learned by deep learning methods e.g. neural networks
How can objects and attributes be conceptually modelled?
What occurs in feature extraction?
Objects from database DB are mapped to feature vectors
Feature vector space
Points represent objects
Distance corresponds to (dis)similarity
What are similarity queries and what are the different queries?
Similarity queries are basic operations in databases
Given: universe O, database DB ⊆ O, distance function d and query object q ∈ O
Range query
(k-) Nearest neighbor query
Ranking query
How is a range query calculated?
Range query for range parameter ε ∈ R^+_0:
range(DB, q, d, ε) = {o ∈ DB | d(o,q) ≤ ε}
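A direct Python transcription of this definition (d is any distance function):

```python
def range_query(db, q, d, eps):
    """range(DB, q, d, eps) = {o in DB | d(o, q) <= eps}"""
    return [o for o in db if d(o, q) <= eps]
```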
How is a nearest neighbor query calculated?
Nearest neighbor query:
NN(DB,q,d) = {o ∈ DB | ∀o′ ∈ DB : d(o,q) ≤ d(o′,q)}
How is a k-nearest neighbour query calculated?
NN(DB,q,d,k) ⊆ DB with |NN(DB,q,d,k)| = k and ∀o ∈ NN(DB,q,d,k), o′ ∈ DB − NN(DB,q,d,k) : d(o,q) ≤ d(o′,q)
The query works by first calculating the distance between q and each object in DB, sorting the objects by their distance to q, and returning the k objects with the smallest distances as the nearest neighbours of q.
It guarantees that exactly k objects are returned and that every returned object is at least as close to q as any object in DB that is not returned.
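A minimal Python sketch of exactly this procedure (ties are broken arbitrarily by the sort):

```python
def knn_query(db, q, d, k):
    """Return the k objects of db with the smallest distance to q."""
    return sorted(db, key=lambda o: d(o, q))[:k]

# e.g. 1D toy data with absolute difference as distance
print(knn_query([1, 5, 9, 12], 6, lambda o, q: abs(o - q), 2))  # [5, 9]
```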
How is a ranking query calculated?
“get next” functionality for picking database objects in increasing order w.r.t. their distance to q:
∀i ≤ j : d(q, rank_{DB,q,d}(i)) ≤ d(q, rank_{DB,q,d}(j))
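A sketch of the “get next” behaviour as a Python generator (here by precomputing the full order; a real system would fetch incrementally via an index):

```python
def ranking_query(db, q, d):
    """Yield database objects in increasing order of distance to q."""
    yield from sorted(db, key=lambda o: d(o, q))

ranking = ranking_query([1, 5, 9, 12], 6, lambda o, q: abs(o - q))
print(next(ranking), next(ranking))  # 5 9
```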
How might a similarity search be executed naively?
Using a range query:
Naive search by sequential scan
Fetch database objects from secondary storage (e.g. disk): O(n) time
Check distances individually: O(n) time
How might a similarity search be executed quickly (faster than naive)?
Fast search by applying database techniques
Filter-refine architecture
Filter: Boil DB down to (small) candidate set C ⊆ DB
Refine: Apply exact distance calculations to candidates from C only
Indexing structures
Avoid sequential scans by (hierarchical or other) index techniques
Data access in time O(n), O(log n) or even O(1)
In filter-refine architecture, what is the principle of multi-step search? What is an example?
Fast filter step produces candidate set C ⊆ DB (using a cheap, approximate distance function d′)
Exact distance function d is calculated on candidate set C only
Example: Dimensionality Reduction
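A sketch of a multi-step range query, assuming the filter distance d_filter lower-bounds the exact distance d (which guarantees completeness, i.e. no false dismissals):

```python
def multi_step_range_query(db, q, d, d_filter, eps):
    # Filter: cheap approximate distance yields candidate set C ⊆ DB
    candidates = [o for o in db if d_filter(o, q) <= eps]
    # Refine: exact (expensive) distance calculated on candidates only
    return [o for o in candidates if d(o, q) <= eps]
```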
What is the ICES criteria for filter quality?
Indexable - Index enabled
Complete - No false dismissals
Efficient - Fast individual calculation
Selective - Small candidate set
What is the principle of indexing?
To organize data in a way for fast access to relevant objects, e.g. by heavy pruning.
R-Tree as an example for spatial index structure:
Hierarchy of minimum bounding rectangles
Disregard subtrees which are not relevant for the current query region
What does data visualization do?
Data visualization transforms data into visually perceivable representations
Combines the capabilities:
Computers are good at number crunching
Humans are good at visual pattern recognition
What are the 4 types of data visualization techniques?
Geometric
Icon-based
Pixel-oriented
Other
What is in geometric data visualization techniques?
Visualization of geometric transformations and projections of the data
e.g. Scatterplots, Parallel Coordinates
What is in icon-based data visualization techniques?
Visualization of data as icons
e.g. Chernoff Faces, Stick Figures
What is in pixel-oriented data visualization techniques?
Visualize each attribute value of each data object by one coloured pixel
e.g. recursive patterns
What are other examples of data visualization techniques?
Hierarchical techniques, graph-based techniques, hybrid techniques
What are characteristics of 2D scatterplots?
Designed for 2D data, two axes span the drawing area
2D objects are depicted as points in the diagram
Orthogonal projections of points to axes indicate coordinate values
Provides a first look at bivariate data to see clusters of points, outliers, etc.
What are the characteristics of a scatterplot matrix?
Matrix of scatterplots for pairs of dimensions
Ordering of dimensions is important:
reordering improves understanding of structures and reduces clutter
Interestingness of orderings can be evaluated with quality metrics
What are limitations of scatterplots?
Scatterplots work well for 2 dimensions only
Only depict pairwise correlations
Projections get ambiguous for 3 or more dimensions
What are polygonal plots?
Depict objects by curves rather than by points
Coordinates are not indicated by orthogonal projections but by explicit intersections of polygons with axes
e.g. spiderweb - axes remain connected at origin
e.g. parallel coordinates - axes are aligned in parallel
What are characteristics of the spiderweb model?
Illustrate any single object by a polygonal line
Contract origins of all axes to a global origin point
Works well for few objects only
e.g. monthly average temperature
What are characteristics of parallel coordinates?
d-dimensional data space is visualised by d parallel axes
each axis is scaled to min-max range
object = polygonal line intersecting axis at value in this dimension
ordering important to reveal correlations
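A sketch using pandas' parallel-coordinates helper (hypothetical data; the min-max scaling per axis is done manually, since pandas does not rescale by itself):

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Hypothetical 3-dimensional objects with a class label
df = pd.DataFrame({
    "d1": [1.0, 2.0, 1.5], "d2": [5.0, 3.0, 4.0], "d3": [0.2, 0.8, 0.5],
    "label": ["a", "b", "a"],
})

# Scale each axis to its min-max range
dims = ["d1", "d2", "d3"]
df[dims] = (df[dims] - df[dims].min()) / (df[dims].max() - df[dims].min())

# Each object becomes a polygonal line intersecting the parallel axes
parallel_coordinates(df, "label")
plt.show()
```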
What are characteristics of pixel-oriented techniques?
Each data value is mapped onto a colored pixel
Each dimension is shown in a separate window
Arrangement strategies: recursive patterns (iterated line- and column-based arrangements)
What are characteristics of chernoff faces?
Map d-dimensional space to facial expression, e.g. length of nose = dim 6; curvature of mouth = dim 8
What is the advantage with using chernoff faces?
Humans can evaluate similarity between faces much more intuitively than between high-dimensional vectors
What are the disadvantages with using chernoff faces?
Without dimensionality reduction only applicable to data spaces with up to 18 dimensions
Which dimension represents what part?
Why data reduction?
Better perception of patterns
Raw data is hard to understand
Visualization is limited to thousands of objects
Reduction of data may help to identify patterns
Computational complexity
Big data sets cause prohibitively long runtime for data mining algorithms
Reduced data sets are the more useful, the more closely the algorithms reproduce the analytical results of the full data
How to approach data reduction?
Numerosity reduction
Data aggregation (basic statistics)
Data generalization (abstraction to higher levels)
Dimensionality reduction (reduction of attributes):
Linear methods: (PCA)
Non-linear methods: (Multidimensional scaling)
Quantization, discretization (reduce number of values per domain):
Binning (histograms)
Generalization along hierarchies (OLAP)
What occurs in data aggregation?
Aggregation is numerosity reduction (less tuples/rows)
Generalisation yields duplicates -> add a counting attribute and merge duplicate tuples
What are basic aggregates?
Central tendency: Where is the data located/ where is it centred?
Variation, spread: How much do the data deviate from the centre?
How can the value of a distributive measure d be calculated and what are examples?
The value of a distributive measure d on D can be calculated by combining the results of distributed calculations on partitions D_i ⊂ D , D = D1 ∪ D2 ∪ . . . Dn
e.g.
count(D1 ∪ D2) = count(D1) + count(D2)
sum(D1 ∪ D2) = sum(D1) + sum(D2)
min(D1 ∪ D2) = min(min(D1), min(D2))
max(D1 ∪ D2) = max(max(D1), max(D2))
What is an algebraic measure?
An algebraic measure on D can be computed by an algebraic function with M arguments, each of which is obtained by applying a distributive aggregate function to the partitions D_i ⊂ D , D = D1 ∪ D2 ∪ . . . Dn
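A small Python check of the distributive examples above, plus avg as an algebraic example (computable from the two distributive aggregates sum and count; hypothetical partitions):

```python
D1, D2 = [2, 5, 7], [1, 9]
D = D1 + D2  # D = D1 ∪ D2 (disjoint partitions)

# Distributive: combine the results of the partition-wise calculations
assert len(D) == len(D1) + len(D2)          # count
assert sum(D) == sum(D1) + sum(D2)          # sum
assert min(D) == min(min(D1), min(D2))      # min
assert max(D) == max(max(D1), max(D2))      # max

# Algebraic: avg is a function of two distributive aggregates (sum, count)
avg = (sum(D1) + sum(D2)) / (len(D1) + len(D2))
```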
What are holistic measures?
There is no constant bound on the storage size that is needed to calculate and represent sub-aggregates.
What are the different ways of measuring the central tendency?
Mean - (weighted) arithmetic mean
Mid-range
Median
Mode
How are (weighted) arithmetic mean and mid-range calculated and to what data is it applicable to?
(Weighted) arithmetic mean: mean = (1/n) Σ_i x_i ; weighted mean = (Σ_i w_i x_i) / (Σ_i w_i)
Mid-range is calculated by the average of the largest and the smallest values in a data set
Both are algebraic measures
Applicable to numerical data only (sum, scalar multiplication)
To what data is calculating the median applicable to?
Applicable to ordinal data only (data that can be ordered), as total order is required
Median is a holistic measure (no constant-size sub-aggregates suffice for its exact computation)
To what data is calculating the mode best applicable to?
To categorical (non-numerical) data
What does variance do and how is it measured?
The variance measures the spread around the mean of numerical data
The single pass calculation (sum of squares and square of sum in parallel) is faster than the two-pass method
Variance is zero iff all the values are equal
Standard deviation is equal to the square root of the variance
These are both algebraic measures
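A sketch of the single-pass method (sum of values and sum of squares accumulated in parallel; numerically less stable than two passes, but only one scan):

```python
def variance_single_pass(values):
    """Population variance via Var(X) = E[X^2] - E[X]^2 in one pass."""
    n = s = sq = 0
    for x in values:
        n += 1
        s += x         # running sum
        sq += x * x    # running sum of squares
    return sq / n - (s / n) ** 2

print(variance_single_pass([2.0, 2.0, 2.0]))  # 0.0 -- zero iff all equal
```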
What are features of boxplot analysis?
What is quantization?
Mapping a continuous range of values into a discrete set of values
It may use generalization
e.g. group age (7 bits -> 128 values) to age_range (4 bits -> 16 values only)
How is dimensionality reduction a borderline case of quantization?
e.g. dropping age corresponds to reducing it to zero bits (cf. coarser groupings such as 10-14 -> teen, 25-29 -> twen)
corresponds to generalization of age to ‘all’ = ‘any age’ = no information
In data generalization how to group the values of an attribute into partitions?
Use overall data (= single partition)
Overall mean and overall variance are too coarse
Different techniques to form groups for aggregation
binning - histograms
generalization - abstraction based on hierarchies
clustering - based on object similarity
What binning techniques are there in histograms?
Histograms use binning to approximate data distributions
Divide data into bins and store a representative for each bin
To make equi-width histograms:
divide the range into N intervals of equal size (uniform grid)
If A and B are the lowest and highest values of the attribute, the width of the intervals will be (B-A)/N
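A numpy sketch of equi-width binning (hypothetical values):

```python
import numpy as np

data = np.array([3, 7, 8, 12, 14, 21, 22, 25])
N = 4
A, B = data.min(), data.max()

# N intervals of equal width (B - A) / N over [A, B]
counts, edges = np.histogram(data, bins=N, range=(A, B))
print(edges)   # uniform grid: [ 3.   8.5 14.  19.5 25. ]
print(counts)  # objects per bin: [3 1 1 3]
```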
What are the benefits and limitations of equi-width histograms?
Benefits:
Most straightforward
Easy to understand
Limitations:
Outliers may dominate presentation
Skewed data is not handled well
What are the benefits and limitations of equi-height histograms?
Benefits:
Good data scaling
Less prone to outliers
The four-bin case is strongly related to the boxplot
Limitations:
Key information is in the bin boundaries (somewhat subtle)
If any value occurs often, the equal-frequency criterion might not be met
What are examples of concept hierarchies?
Flat groups (no serious hierarchies e.g. name, gender, phone no)
Hierarchies by subgrouping (age -> 15-19, 20-24, 25-30)
Hierarchies from schema information (major -> science/business/engineering)
How can a concept hierarchy for categorical data be established?
Concept hierarchies can be specified by (expert) users
Heuristically generate a hierarchy for a set of (related) attributes
based on the number of distinct values in the attribute set
The attribute with the most distinct values is placed at the lowest level of the hierarchy (e.g. in an address hierarchy, street is the lowest)
Counter-example: 20+ years, 12 months, 7 days of week
What is data generalization?
A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones
What are approaches to data generalization?
Data-cube approach (OLAP / Roll-up) - manual
Attribute-oriented induction (AOI) - automated
What are the basic OLAP operations and what do they do?
Roll up
Summarize data by climbing up hierarchy or by dimension reduction
Drill down
Reverse of roll up. From higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice
Selection on one (slice) or more (dice) dimensions.
Pivot (rotate)
Reorient the cube, visualization, 3D to series of 2D planes
Drill across
involving more than one fact table
Drill through
through the bottom level of the cube to its back-end relational tables (SQL)
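A pandas sketch of some of these operations on a flat, hypothetical sales table (a stand-in for a real data cube):

```python
import pandas as pd

cube = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "city":    ["Munich", "Munich", "Berlin", "Berlin"],
    "sales":   [10, 20, 30, 40],
})

# Roll up: climb the time hierarchy quarter -> year
rollup = cube.groupby("year")["sales"].sum()

# Slice: selection on one dimension
slice_ = cube[cube["year"] == 2023]

# Dice: selection on two or more dimensions
dice = cube[(cube["year"] == 2023) & (cube["city"] == "Munich")]

# Pivot: reorient into a 2D view
pivot = cube.pivot_table(values="sales", index="city",
                         columns="quarter", aggfunc="sum")
```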
Roll up vs Drill Down Example
Slice Operation Example
Dice Operation Example
Pivot Operation Example
How can generalizations be specified by star-nets?
Each circle is called a footprint
Footprints represent the granularities available for OLAP operations
What are the strengths of OLAP-based generalizations?
Efficient implementation of data generalizations
Computation of various kinds of measures e.g. count, sum
Generalization can be performed on a data cube by roll-up (or drill down)
What are limitations of OLAP-based generalisations?
Handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values
Lack of intelligent analysis, does not suggest which dimensions should be used and what levels the generalization should aim at
How is AOI different to OLAP?
More automated approach than OLAP
Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
Builds on data focusing: collect task-relevant data, including dimensions, resulting in the initial (working) relation
Generates a generalization plan: decide for either attribute removal or attribute generalization
In AOI, what are the 3 choices for each attribute?
Keep it
Attribute Removal, for the following cases
There is a large set of distinct values for A but no generalization operator (concept hierarchy) on A
A’s higher level concepts are expressed in terms of other attributes (street is covered by city,state, country)
Attribute Generalization
If there is a large set of distinct values for A, and there exists a set of generalization operators (concept hierarchy) on A, then select an operator and generalize A
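A toy pandas sketch of one AOI step (hypothetical concept hierarchy and data): generalize an attribute, then merge identical tuples and accumulate counts:

```python
import pandas as pd

# Hypothetical concept hierarchy for the attribute 'city'
hierarchy = {"Munich": "Germany", "Berlin": "Germany", "Paris": "France"}

df = pd.DataFrame({"city": ["Munich", "Berlin", "Paris"],
                   "major": ["science", "science", "arts"],
                   "count": [1, 1, 1]})

# Attribute generalization: climb one level in the concept hierarchy
df["city"] = df["city"].map(hierarchy)

# Merge identical generalized tuples and accumulate their counts
generalized = df.groupby(["city", "major"], as_index=False)["count"].sum()
```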
Example of AOI Application
What are the two common approaches to attribute generalization control?
Attribute-threshold control: default or user-specified, typically 2-8 values
Generalized relation threshold control: control the size of the final relation/rule, e.g. 10-30 tuples
In the tradeoff of assigning how many distinct values are for an attribute, what are the possible problems?
Overgeneralization: values are too high-level
Undergeneralization: level not sufficiently high
Both yield tuples of poor usefulness
What are strategies for next attribute selection?
Aiming at minimal degree of generalization
choose the attribute that reduces the number of tuples the most
useful heuristic: choose attributes with highest number of distinct values
Aiming at similar degree of generalization for all attributes
choose the attribute currently with least degree of generalization
User-controlled
domain experts may specify priorities for selection