What is data mining/knowledge discovery in databases?
Extraction of interesting information or patterns from data in large databases
What are the roots of data mining?
Statistics
Machine Learning
Database Systems
Information Visualization
What are the 3 different types of learning from Machine Learning?
Descriptive Learning
Better understanding - data mining
e.g. pattern recognition, clustering, outlier detection
Predictive Learning
Better forecasts - regression
e.g. traffic prediction, labeling, fraud detection
Prescriptive Learning
Better actions - artificial intelligence
e.g. predictive maintenance, autonomous driving, medical therapies
What is the data explosion problem? What is the solution?
Tremendous amounts of data caused by
Automated data collection
Mature database technology
Solution:
Data Warehousing and on-line analytical processing (OLAP)
Data Mining (give definition)
What are potential applications of data mining?
Database analysis and decision support
Market analysis and management
Risk analysis and management
Fraud detection and management
Other applications:
Text mining
Intelligent query answering
What is the Knowledge Discovery in Databases (KDD) process? (Graphic)
What are the processes of data mining?
Frequent Pattern Mining
Clustering
Classification
Regression
Process Mining
What happens in the Data Cleaning & Integration step of the KDD Process?
may take 60% of the effort
Integration of data from different sources
Mapping of attribute names
Joining different tables
Elimination of inconsistencies
Elimination of noise
Computation of missing values, e.g. default value, average value
In the KDD process, what is the task of the step that focuses on task-relevant data?
Task:
Find useful features
Dimensionality/variable reduction
Invariant Representation
Creating a target data set
In the KDD process what are the different methods of focusing on task-relevant data?
Selections
Selecting the relevant tuples/rows from the database table
Projections
Selecting the relevant attributes/columns from the database tables
Transformations
Discretization of numerical attributes
Computation of derived tuples/rows
e.g. aggregation (several tuples -> one), derivation of new attributes from existing ones
What is the goal and task of basic data mining tasks?
Goal:
Find patterns of interest
Are there labels (in the training data)?
Many -> Supervised learning
A few -> Semi-supervised learning
None -> Unsupervised learning
Choose fitting mining algorithms
What is the setting, motivation, and applications of frequent itemset mining?
Setting:
Given a database of transactions (e.g. market baskets)
Motivation:
Find the frequently co-occurring items in the set of transactions which indicate correlations or causalities
Applications:
Market basket analysis
Cross-marketing
Catalogue design
As a basis for clustering, classification
Association rule mining: determining correlations between different itemsets
What is the setting, motivation, and applications of clustering?
Database of objects O
Unknown class labels
Similarity model for objects, often as a function sim: O x O -> R
Group objects into clusters while maximising intra-cluster similarity (cohesion) and minimising inter-cluster similarity (separation)
Customer profiling/segmentation
Document or image collections
Web access patterns
What is the setting, motivation, and applications of classification?
Class labels are known for a set of training data
Find models/functions/rules that
describe and distinguish the classes
predict class membership for new objects
Classify disease type for samples based on gene expression values
Automatic assignment of categories
Predict unknown or missing values
What is the setting, motivation, and applications of regression?
Numerical output values are known for a training set
describe the numerical output values of the training data
predict the numerical value for new objects
Building a model of the housing values, which can be used to predict the price of a house in a certain area
Building a model of an engineering process as a basis to control a technical system
What is done when creating generalization levels as a basic mining task?
Generalise, summarise, and contrast data characteristics
Based on attribute aggregation along concept hierarchies
Data cube approach (OLAP)
Attribute-oriented induction approach
What is outlier detection?
Finding objects which do not comply with the general behaviour of the data
e.g. fraud detection, rare events analysis
What is trends and evolution analysis?
Sequential patterns (finding reoccurring sequences of events)
What are methods for special data types, and applications?
Process mining
Spatial data mining
Graphs
…
What is involved in the Evaluation & Visualization part of the KDD Process?
Pattern evaluation and knowledge presentation
What is data?
Representation of real objects, situations, processes
Measured by physical sensors
Recorded from digital systems
Generated by simulations
Stored and provided by computers
How to represent data?
Numerical and categorical data types
Similarity models -> allow for pattern mining
Data reduction -> to increase efficiency
What are the ingredients of data types?
identifiers (int, long, float)
domains ({1,2,…},{true,false},{‘hello’,’hi’})
operations (+,and,concat)
How can data be stored?
Single values need only a few bytes: 1 byte = 8 bits, 4 bytes = 32 bits
Big data sets occupy kilobytes, …, zettabytes
Data compression helps (partially)
What is the operational calculation time for
calculating the Euclidean distance of two objects from R^d
calculating the mean of n objects from R^d
calculating the covariance matrix of n objects x_i in R^d
O(d) for the Euclidean distance of two objects
O(nd) for the mean of n objects
O(nd^2) for the covariance matrix
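As a sketch of these costs, a minimal numpy version of the three computations (numpy and the random test data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 16))  # n = 1000 objects from R^d with d = 16
p, q = X[0], X[1]

dist = np.sqrt(np.sum((p - q) ** 2))            # Euclidean distance: O(d)
mean = X.mean(axis=0)                           # mean vector: O(nd)
cov = (X - mean).T @ (X - mean) / (len(X) - 1)  # covariance matrix: O(nd^2)
```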
What are the different data types? (with examples)
Simple, basic data types (numbers, symbols)
Composed data types (vectors, sets)
Complex data types (multimedia, spatial, structural data)
What is a metric space, and how is it fulfilled?
A metric space (O,d) consists of object set O and metric distance function d: O x O -> R which fulfills:
Symmetry: ∀p,q ∈ O : d(p,q) = d(q,p)
Identity of Indiscernibles: ∀p,q ∈ O : d(p,q) = 0 <=> p = q
Triangle Inequality: ∀p,q,o∈ O : d(p,q) <= d(p,o) + d(o,q)
e.g. points in 2D space with Euclidean distance
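A minimal sketch that brute-force checks the three axioms on a finite sample of objects (the helper name is hypothetical; math.dist is the Euclidean distance):

```python
import itertools, math

def is_metric(objects, d, tol=1e-9):
    """Brute-force check of the three metric axioms on a finite sample."""
    for p, q in itertools.product(objects, repeat=2):
        if abs(d(p, q) - d(q, p)) > tol:          # symmetry
            return False
        if (d(p, q) <= tol) != (p == q):          # identity of indiscernibles
            return False
    for p, q, o in itertools.product(objects, repeat=3):
        if d(p, q) > d(p, o) + d(o, q) + tol:     # triangle inequality
            return False
    return True

assert is_metric([(0, 0), (1, 0), (0, 2)], math.dist)  # 2D points, Euclidean
```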
What are counter examples to metric space?
non-symmetric: one-way roads on the way from p to q
non-definite: indiscernible equivalent objects are not necessarily identical
triangles: sometimes detours are preferred
What is categorical data and how can one compare values?
Symbols/identifiers
e.g. subjects= {“maths”,”physics”}, occupation = {“butcher”}
e.g. trivial metric
d(p,q) = 0 if p = q, else 1
e.g. explicit similarity matrix
e.g. path length in generalization hierarchy
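A minimal sketch of the trivial metric, with a made-up explicit similarity matrix for illustration:

```python
def trivial_distance(p, q):
    """Trivial metric on categorical values: 0 iff equal, else 1."""
    return 0 if p == q else 1

# Hypothetical explicit similarity matrix as a lookup table
sim = {("maths", "maths"): 1.0, ("maths", "physics"): 0.6, ("physics", "physics"): 1.0}

assert trivial_distance("butcher", "butcher") == 0
assert trivial_distance("maths", "physics") == 1
```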
What are the characteristic fulfillments of ordinal data?
There is a total order ≤ on the set of possible data values O:
Transitivity: ∀p,q,o∈O : p≤q ∧ q≤o => p≤o
Antisymmetry: ∀p,q∈O : p≤q ∧ q≤p => p = q
Totality: ∀p,q∈O : p≤q ∨ q≤p
Examples:
words, lexicographic ordering: high <= highschool
(vague) sizes: tiny <= small <= medium <= big
frequencies: never <= seldom <= rarely <= sometimes
numbers: 0<=1, 1.78 <= 2.35
What are the characteristics of composed data types?
Given a domain D, a sequence s of length n is a mapping I_n -> D of the index set I_n = {1,…,n} into D, s ∈ D^n for short.
The sequence s concatenates n values from D; the order matters (unlike for sets)
What are the characteristics of set data types and how can one compare values?
Unordered collection of individual values
e.g. skills = {Java, C, Python}, hobbies = {skiing, diving, hiking}
comparison
e.g. Hamming distance on bit-vector representations (see separate card)
How can one use bitvector representations to measure hamming distances between sets?
Given an ordered base set B = (b1,…,bn), for any set S create a binary vector r ∈ {0,1}^n with r_i = 1 <=> b_i ∈ S.
Hamming distance: number of differing entries (equals the cardinality of the symmetric set difference)
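A minimal sketch of both steps (helper names and the base set are made up):

```python
def to_bitvector(s, base):
    """Binary vector r with r_i = 1 <=> base[i] in s."""
    return [1 if b in s else 0 for b in base]

def hamming(r1, r2):
    """Number of differing entries = |symmetric difference| of the sets."""
    return sum(a != b for a, b in zip(r1, r2))

base = ["Java", "C", "Python", "skiing"]   # ordered base set B
r1 = to_bitvector({"Java", "C"}, base)     # [1, 1, 0, 0]
r2 = to_bitvector({"C", "Python"}, base)   # [0, 1, 1, 0]
assert hamming(r1, r2) == 2                # symmetric difference {Java, Python}
```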
What are the complex data types and what are examples of them?
Structure: graphs, networks, trees
Geometry: shapes, contours, routes, trajectories
Multimedia: images, audio, text, etc.
What is the approach for similarity models between complex data types?
Direct measures: highly data type dependent
Feature extraction/feature engineering: explicit vector space embedding with hand-crafted features
Kernel trick: implicit vector space embeddings
Feature learning: vector space embeddings learned by deep learning methods e.g. neural networks
How can objects and attributes be conceptually modelled?
What occurs in feature extraction?
Objects from database DB are mapped to feature vectors
Feature vector space
Points represent objects
Distance corresponds to (dis)similarity
What are similarity queries and what are the different queries?
Similarity queries are basic operations in databases
Given: universe O, database DB ⊆ O, distance function d and query object q ∈ O
Range query
(k-) Nearest neighbor query
Ranking query
How is a range query calculated?
Range query for range parameter ε ∈ R^+_0:
range(DB, q, d, ε) = {o ∈ DB | d(o,q) ≤ ε}
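A direct translation into Python as a sketch (names are illustrative; d is any distance function):

```python
def range_query(db, q, d, eps):
    """range(DB, q, d, eps): all objects within distance eps of q."""
    return [o for o in db if d(o, q) <= eps]
```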
How is a nearest neighbor query calculated?
Nearest neighbor query:
NN(DB,q,d) = {o ∈ DB | ∀o′ ∈ DB : d(o,q) ≤ d(o′,q)}
How is a k-nearest neighbour query calculated?
NN(DB,q,d,k) ⊆ DB with |NN(DB,q,d,k)| = k and
∀o ∈ NN(DB,q,d,k), o′ ∈ DB − NN(DB,q,d,k) : d(o,q) ≤ d(o′,q)
The function works by first calculating the distance between q and each object in DB. Then, it sorts the objects in DB by their distance to q. The k objects with the smallest distances are then returned as the nearest neighbors of q.
The function guarantees that the returned set of nearest neighbors will contain exactly k objects, and that all of the objects in the returned set will be closer to q than any other object in DB that is not in the returned set.
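The sort-based computation described above as a minimal sketch (ties at the k-th distance are broken arbitrarily here, so the result is one valid k-NN set):

```python
def knn_query(db, q, d, k):
    """Naive k-NN: sort DB by distance to q, return the k closest objects."""
    return sorted(db, key=lambda o: d(o, q))[:k]
```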
How is a ranking query calculated?
“get next” functionality for picking database objects in an increasing order w.r.t. their distance to q:
∀i ≤ j : d(q, rank_{DB,q,d}(i)) ≤ d(q, rank_{DB,q,d}(j))
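A heap-based sketch of the "get next" functionality (hypothetical helper; the index i only breaks ties between equidistant objects):

```python
import heapq

def ranking_query(db, q, d):
    """Yields database objects in increasing order of distance to q."""
    heap = [(d(o, q), i, o) for i, o in enumerate(db)]
    heapq.heapify(heap)
    while heap:
        _, _, o = heapq.heappop(heap)
        yield o   # caller pulls results one by one via next()
```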
How might a similarity search be executed naively?
Using a range query:
Naive search by sequential scan
Fetch database objects from secondary storage (e.g. disk): O(n) time
Check distances individually: O(n) time
How might a similarity search be executed quickly (faster than naive)?
Fast search by applying database techniques
Filter-refine architecture
Filter: Boil DB down to (small) candidate set C ⊆ DB
Refine: Apply exact distance calculations to candidates from C only
Indexing structures
Avoid sequential scans by (hierarchical or other) index techniques
Data access in O(log n) or even O(1) time instead of O(n)
In filter-refine architecture, what is the principle of multi-step search? What is an example?
Fast filter step produces candidate set C ⊆ DB (by approximating distance function d’)
Exact distance function d is calculated on candidate set C only
Example: Dimensionality Reduction
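A sketch of a multi-step range query where the filter distance uses only the first m coordinates, i.e. a simple dimensionality reduction; since this lower-bounds the full Euclidean distance, the filter is complete (no false dismissals). Names are illustrative:

```python
import math

def d_filter(p, q, m=2):
    """Distance on the first m coordinates: lower-bounds the full distance."""
    return math.dist(p[:m], q[:m])

def multistep_range_query(db, q, eps):
    candidates = [o for o in db if d_filter(o, q) <= eps]      # cheap filter step
    return [o for o in candidates if math.dist(o, q) <= eps]   # exact refinement
```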
What is the ICES criteria for filter quality?
Indexable - Index enabled
Complete - No false dismissals
Efficient - Fast individual calculation
Selective - Small candidate set
What is the principle of indexing?
To organize data in a way for fast access to relevant objects, e.g. by heavy pruning.
R-Tree as an example for spatial index structure:
Hierarchy of minimum bounding rectangles
Disregard subtrees which are not relevant for the current query region
What does data visualization do?
Data visualization transforms data into visually perceivable representations
Combines the capabilities:
Computers are good at number crunching
Humans are good at visual pattern recognition
What are the 4 types of data visualization techniques?
Geometric
Icon-based
Pixel-oriented
Other
What is in geometric data visualization techniques?
Visualization of geometric transformations and projections of the data
e.g. Scatterplots, Parallel Coordinates
What is in icon-based data visualization techniques?
Visualization of data as icons
e.g. Chernoff Faces, Stick Figures
What is in pixel-oriented data visualization techniques?
Visualize each attribute value of each data object by one coloured pixel
e.g. recursive patterns
What are other examples of data visualization techniques?
Hierarchical techniques, graph-based techniques, hybrid techniques
What are characteristics of 2D scatterplots?
Designed for 2D data, two axes span the drawing area
2D objects are depicted as points in the diagram
Orthogonal projections of points to axes indicate coordinate values
Provides a first look at bivariate data to see clusters of points, outliers, etc.
What are the characteristics of a scatterplot matrix?
Matrix of scatterplots for pairs of dimensions
Ordering of dimensions is important:
reordering improves understanding of structures and reduces clutter
Interestingness of orderings can be evaluated with quality metrics
What are limitations of scatterplots?
Scatterplots work well for 2 dimensions only
Only depict pairwise correlations
Projections get ambiguous for 3 or more dimensions
What are polygonal plots?
Depict objects by curves rather than by points
Coordinates are not indicated by orthogonal projections but by explicit intersections of polygons with axes
e.g. spiderweb - axes remain connected at origin
e.g. parallel coordinates - axes are aligned in parallel
What are characteristics of the spiderweb model?
Illustrate any single object by a polygonal line
Contract origins of all axes to a global origin point
Works well for few objects only
e.g. monthly average temperature
What are characteristics of parallel coordinates?
d-dimensional data space is visualised by d parallel axes
each axis is scaled to min-max range
object = polygonal line intersecting axis at value in this dimension
ordering important to reveal correlations
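A sketch using pandas' parallel_coordinates helper (the data and column names are made up; the axes are min-max scaled beforehand, since the helper itself plots raw values):

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Hypothetical 3-dimensional data; "label" only colors the lines
df = pd.DataFrame({
    "d1": [1.0, 2.0, 1.5, 3.0],
    "d2": [30.0, 10.0, 40.0, 20.0],
    "d3": [5.0, 4.0, 6.0, 5.5],
    "label": ["a", "b", "a", "b"],
})

dims = ["d1", "d2", "d3"]
df[dims] = (df[dims] - df[dims].min()) / (df[dims].max() - df[dims].min())  # min-max scale each axis
parallel_coordinates(df, "label")  # one polygonal line per object
plt.show()
```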
What are characteristics of pixel-oriented techniques?
Each data value is mapped onto a colored pixel
Each dimension is shown in a separate window
Arrangement strategies: recursive patterns, i.e. iterated line- and column-based arrangements
What are characteristics of chernoff faces?
Map d-dimensional space to facial expression, e.g. length of nose = dim 6; curvature of mouth = dim 8
What is the advantage with using chernoff faces?
Humans can evaluate similarity between faces much more intuitively than between high-dimensional vectors
What are the disadvantages with using chernoff faces?
Without dimensionality reduction only applicable to data spaces with up to 18 dimensions
Which dimension represents what part?
Why data reduction?
Better perception of patterns
Raw data is hard to understand
Visualization is limited to thousands of objects
Reduction of data may help to identify patterns
Computational complexity
Big data sets cause prohibitively long runtime for data mining algorithms
Reduced data sets are the more useful, the more the algorithms produce the same analytical results as on the full data
How to approach data reduction?
Numerosity reduction
Data aggregation (basic statistics)
Data generalization (abstraction to higher levels)
Dimensionality reduction (reduction of attributes):
Linear methods: (PCA)
Non-linear methods: (Multidimensional scaling)
Quantization, discretization (reduce number of values per domain):
Binning (histograms)
Generalization along hierarchies (OLAP)
What occurs in data aggregation?
Aggregation is numerosity reduction (less tuples/rows)
Generalisation yields duplicates -> add a counting attribute and merge duplicate tuples
What are basic aggregates?
Central tendency: Where is the data located/ where is it centred?
Variation, spread: How much do the data deviate from the centre?
How can the value of a distributive measure d be calculated and what are examples?
The value of a distributive measure d on D can be calculated by combining the results of distributed calculations on partitions D_i ⊂ D, D = D_1 ∪ D_2 ∪ … ∪ D_n
e.g.
count(D1 ∪ D2) = count(D1) + count(D2)
sum(D1 ∪ D2) = sum(D1) + sum(D2)
min(D1 ∪ D2) = min(min(D1), min(D2))
max(D1 ∪ D2) = max(max(D1), max(D2))
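These combination rules can be checked directly on two example partitions:

```python
D1, D2 = [3, 1, 4], [1, 5, 9]
D = D1 + D2  # union of the two partitions (as multisets)

assert len(D) == len(D1) + len(D2)        # count
assert sum(D) == sum(D1) + sum(D2)        # sum
assert min(D) == min(min(D1), min(D2))    # min
assert max(D) == max(max(D1), max(D2))    # max
```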
What is an algebraic measure?
An algebraic measure on D can be computed by an algebraic function with M arguments, each of which is obtained by applying a distributive aggregate function to the partitions D_i ⊂ D, D = D_1 ∪ D_2 ∪ … ∪ D_n
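For example, the average is algebraic: it is a function of the two distributive aggregates sum and count, so it can be combined from per-partition results (a sketch):

```python
D1, D2 = [3, 1, 4], [1, 5, 9]

# avg(D1 ∪ D2) from the distributive sub-aggregates of each partition
avg = (sum(D1) + sum(D2)) / (len(D1) + len(D2))
assert avg == sum(D1 + D2) / len(D1 + D2)
```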
What are holistic measures?
There is no constant bound on the storage size that is needed to calculate and represent sub-aggregates.
What are the different ways of measuring the central tendency?
Mean - (weighted) arithmetic mean
Mid-range
Median
Mode
How are (weighted) arithmetic mean and mid-range calculated and to what data is it applicable to?
(Weighted) arithmetic mean is calculated by x̄ = (1/n) Σ_i x_i; weighted: x̄ = (Σ_i w_i·x_i) / (Σ_i w_i)
Mid-range is calculated by the average of the largest and the smallest values in a data set
Both are algebraic measures
Applicable to numerical data only (sum, scalar multiplication)
To what data is calculating the median applicable to?
Applicable to ordinal data only (data that can be ordered), as a total order is required
Median is a holistic measure (no constant-size sub-aggregates suffice to combine partial results)
To what data is calculating the mode best applicable to?
To categorical (non-numerical) data
What does variance do and how is it measured?
The variance measures the spread around the mean of numerical data
The single pass calculation (sum of squares and square of sum in parallel) is faster than the two-pass method
Variance is zero iff all values are equal
Standard deviation is equal to the square root of the variance
These are both algebraic measures
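A sketch of the single-pass method via Var(X) = E[X²] − E[X]² (accumulating sum and sum of squares in one scan; note this form can be numerically unstable for values far from zero):

```python
def single_pass_variance(xs):
    """Accumulate count, sum, and sum of squares in a single scan."""
    n = s = sq = 0
    for x in xs:
        n += 1
        s += x
        sq += x * x
    mean = s / n
    return sq / n - mean * mean   # E[X^2] - E[X]^2

assert single_pass_variance([2, 2, 2]) == 0   # zero iff all values are equal
```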
What are features of boxplot analysis?
What is quantization?
Mapping a continuous range of values into a discrete set of values
It may use generalization
e.g. group age (7 bits -> 128 values) to age_range (4 bits -> 16 values only)
How is dimensionality reduction a borderline case of quantization?
e.g. dropping age corresponds to reducing it to zero bits (cf. quantization: 10, 14 -> teen; 25, 29 -> twen)
this corresponds to generalization of age to ‘all’ = ‘any age’ = no information
In data generalization how to group the values of an attribute into partitions?
Use overall data (= single partition)
(Overall mean, overall variance are too coarse)
Different techniques to form groups for aggregation
binning - histograms
generalization - abstraction based on hierarchies
clustering - based on object similarity
What binning techniques are there in histograms?
Histograms use binning to approximate data distributions
Divide data into bins and store a representative for each bin
To make equi-width histograms:
divide the range into N intervals of equal size (uniform grid)
If A and B are the lowest and highest values of the attribute, the width of the intervals will be (B-A)/N
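A minimal equi-width binning sketch (assumes at least two distinct values, so the width is nonzero):

```python
def equi_width_counts(values, N):
    """Count values per bin for N intervals of width (B - A) / N."""
    A, B = min(values), max(values)
    width = (B - A) / N
    counts = [0] * N
    for v in values:
        i = min(int((v - A) / width), N - 1)  # clamp v == B into the last bin
        counts[i] += 1
    return counts

assert equi_width_counts([0, 1, 2, 3, 4, 5, 9], 3) == [3, 3, 1]
```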
What are the benefits and limitations of equi-width histograms?
Benefits:
Most straightforward
Easy to understand
Limitations:
Outliers may dominate presentation
Skewed data is not handled well
What are the benefits and limitations of equi-height histograms?
Benefits:
Good data scaling
Less prone to outliers
Four-bin example is strongly related to the boxplot
Limitations:
Key information is in the bin boundaries (somewhat subtle)
If any value occurs often, the equal-frequency criterion might not be met
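A sketch of picking equi-height (equal-frequency) bin boundaries from sorted data; placing boundaries when values repeat is exactly the subtle part mentioned above:

```python
def equi_height_boundaries(values, N):
    """Boundaries such that each of the N bins gets ~len(values)/N items."""
    xs = sorted(values)
    n = len(xs)
    # boundary k sits at the (k * n / N)-th sorted value, k = 1..N-1
    return [xs[k * n // N] for k in range(1, N)]

assert equi_height_boundaries([1, 2, 3, 4, 5, 6, 7, 8], 4) == [3, 5, 7]
```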
What are examples of concept hierarchies?
Flat groups (no serious hierarchies e.g. name, gender, phone no)
Hierarchies by subgrouping (age -> 15-19, 20-24, 25-30)
Hierarchies from schema information (major -> science/business/engineering)
How can a concept hierarchy for categorical data be established?
Concept hierarchies can be specified by (expert) users
Heuristically generate a hierarchy for a set of (related) attributes
based on the number of distinct values in the attribute set
The attribute with the most distinct values is placed at the lowest level of the hierarchy (e.g. in an address hierarchy, street is the lowest)
Counter-example: 20+ distinct years, 12 months, 7 days of the week (here the heuristic would wrongly place year at the lowest level)
What is data generalization?
A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones
What are approaches to data generalization?
Data-cube approach (OLAP / Roll-up) - manual
Attribute-oriented induction (AOI) - automated
What are the basic OLAP operations and what do they do?
Roll up
Summarize data by climbing up hierarchy or by dimension reduction
Drill down
Reverse of roll up. From higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice
Selection on one (slice) or more (dice) dimensions.
Pivot (rotate)
Reorient the cube, visualization, 3D to series of 2D planes
Drill across
Involving more than one fact table
Drill through
Through the bottom level of the cube to its back-end relational tables (SQL)
Roll up vs Drill Down Example
Slice Operation Example
Dice Operation Example
Pivot Operation Example
How can generalizations be specified by star-nets?
Each circle is called a footprint
Footprints represent the granularities available for OLAP operations
What are the strengths of OLAP-based generalizations?
Efficient implementation of data generalizations
Computation of various kinds of measures e.g. count, sum
Generalization can be performed on a data cube by roll-up (or drill down)
What are limitations of OLAP-based generalisations?
Handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values
Lack of intelligent analysis, does not suggest which dimensions should be used and what levels the generalization should aim at
How is AOI different to OLAP?
More automated approach than OLAP
Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
Builds on data focusing: task-relevant data, including dimensions, resulting in the initial relation
Generates a generalization plan: decide for either attribute removal or attribute generalization
In AOI, what are the 3 choices for each attribute?
Keep it
Attribute Removal, for the following cases
There is a large set of distinct values for A but no generalization operator (concept hierarchy) on A
A’s higher level concepts are expressed in terms of other attributes (street is covered by city,state, country)
Attribute Generalization
If there is a large set of distinct values for A, and there exists a set of generalization operators (concept hierarchy) on A, then select an operator and generalize A
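A toy sketch of attribute generalization followed by merging identical tuples with accumulated counts (the concept hierarchy and tuples are made up):

```python
from collections import Counter

# Hypothetical concept hierarchy: major -> science/business
hierarchy = {"physics": "science", "biology": "science", "finance": "business"}

tuples = [("physics", "M"), ("biology", "F"), ("finance", "F"), ("physics", "F")]

# Generalize the first attribute, then merge duplicates and accumulate counts
generalized = Counter((hierarchy[major], gender) for major, gender in tuples)
assert generalized[("science", "F")] == 2
```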
Example of AOI Application
What are the two common approaches to attribute generalization control?
Attribute-threshold control: default or user-specified, typically 2-8 values
Generalized relational threshold control: control the size of the final relation/rule e.g. 10-30 tuples
In the tradeoff of assigning how many distinct values are for an attribute, what are the possible problems?
Overgeneralization: values are too high-level
Undergeneralization: level not sufficiently high
Both yield tuples of poor usefulness
What are strategies for next attribute selection?
Aiming at minimal degree of generalization
choose the attribute that reduces the number of tuples the most
useful heuristic: choose attributes with highest number of distinct values
Aiming at similar degree of generalization for all attributes
choose the attribute with the currently lowest degree of generalization
User-controlled
domain experts may specify priorities for selection