

Data Science

by Felix S.

Which are the unsupervised methods?

explain the difference between supervised and unsupervised learning

supervised

the goal is to predict the outcome for new data. in supervised learning the model is given the input data together with the known output, so you know up front what type of result to expect. the supervised learning model then predicts the output for new inputs

unsupervised

the goal is to gain insights from large volumes of new data. the algorithm itself determines what is different or interesting in the dataset. hidden patterns in the data can be found using an unsupervised learning model

Decision Tree is supervised

Neural Network is supervised

K-Means is unsupervised

SVM is supervised

Clustering is unsupervised

Clustering Definition

unsupervised
collect elements into segments with similar characteristics

explain the 3 approaches of clustering algorithms

partitional algorithms
hierarchical algorithms
density based algorithms

partitional algorithms
- k-means clustering, model-based clustering
- need a fixed k chosen beforehand
- start with a random partitioning
- sensitive to initialization
- fast and efficient
- problems when clusters differ in size and when there are outliers
hierarchical algorithms
- bottom up, top down
- no particular number of clusters needed up front
- computationally complex in time and space
- pros and cons depend on the method: sensitivity to outliers, handling of cluster sizes
density based algorithms
- can handle clusters of different shapes and sizes
- resistant to noise
- do not work well if the density varies a lot or the data is high dimensional
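
A minimal sketch of the partitional approach, assuming plain k-means on numeric tuples with squared Euclidean distance. The function name `kmeans` and its parameters are illustrative and not taken from the course material. This variant starts from k randomly chosen points as centroids rather than from a random partitioning, which is one common way to initialize and shows why the result depends on the starting point.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Initialization: pick k of the points at random as starting centroids
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the loop has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns one centroid per group and a cluster index for every point. Trying different `seed` values on less clearly separated data is a quick way to see the sensitivity to initialization.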

K-Means

Steps

types of clustering

model based
hierarchical

model based
- based on a probabilistic model (like K-means)
- clustering by EM (Expectation Maximization)
- needs the number of clusters before starting
hierarchical
- building a dendrogram (tree)
  - agglomerative -> bottom up -> each obs. starts in its own cluster -> clusters are merged as one moves up the hierarchy
  - divisive -> top down -> all obs. start in one cluster -> clusters are split as one moves down the hierarchy
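
The agglomerative (bottom up) variant can be sketched as follows. This is a minimal illustration that assumes single linkage (distance between the two closest members) and Euclidean distance; the function name `agglomerative` and the stopping parameter `num_clusters` are hypothetical. The recorded list of merges plays the role of the dendrogram.

```python
def agglomerative(points, num_clusters):
    """Bottom-up clustering with single linkage on numeric tuples (illustrative sketch)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    # Every observation starts in its own cluster
    clusters = [[p] for p in points]
    merges = []  # (members of first cluster, members of second cluster, distance)
    while len(clusters) > num_clusters:
        # Find the pair of clusters whose closest members are nearest to each other
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), d))
        # Merge the pair and move one level up the hierarchy
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges
```

The nested loops over all cluster pairs in every round illustrate why hierarchical methods are expensive in time on larger datasets.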

Clustering can be used for..

detect outliers
segment customers
find like-minded users
analyze social networks
medicine and biology

Explain the bias variance trade off

high bias
- oversimplifying
- underfitting
high variance
- overly complex
- overfitting

when the model suffers from

high bias
- the average response of the model is far from the true value; this is called underfitting
high variance
- this is usually the result of the model's inability to generalize well beyond the training data; this is called overfitting
Build a model that achieves a balance between bias and variance -> combined error at its minimum
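
The "combined error" can be made explicit with the standard decomposition of the expected squared error. The notation is an assumption for illustration and does not come from the slides: the data follow y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the fitted model, with the expectation taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making a model more flexible typically lowers the bias term and raises the variance term, so the sum is smallest somewhere in between.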

Watch the Video on slide 26

what is support
what is confidence
what is lift

Support
- tells us which proportion of the transactions in a dataset include the items from both LHS and RHS
- Support(X) = (number of transactions that contain X) / (total number of transactions)
Confidence
- expresses which proportion of the transactions that include the items from LHS also include the items from RHS
- Confidence(X -> Y) = Support(X∪Y) / Support(X)
Lift
- measures the quality of a given rule
- Lift(X -> Y) = Support(X∪Y) / (Support(X) * Support(Y))
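
A minimal sketch of how the three measures can be computed from a list of transactions. The function names and the basket data are made up for illustration; each transaction is assumed to be a set of item names, and X∪Y is read as "all items of X and Y together".

```python
def support(transactions, itemset):
    """Share of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Support of LHS and RHS together, divided by the support of LHS."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Support of LHS and RHS together, divided by the product of both supports."""
    both = support(transactions, set(lhs) | set(rhs))
    return both / (support(transactions, lhs) * support(transactions, rhs))

# Hypothetical example data, not from the course material
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]
rule_support = support(baskets, {"bread", "butter"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 2 of the 3 bread baskets
rule_lift = lift(baskets, {"bread"}, {"butter"})
```

A lift above 1 means LHS and RHS occur together more often than would be expected if they were independent; a lift below 1 means less often.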

what are association rules?

Identify user behaviour by finding associations and correlations between different items in a basket
LHS (left-hand side) -> RHS (right-hand side) [support, confidence]

Definition of classification

supervised
analyze historical data
generate model to predict future
e.g. decision trees, neural network, svm

Basic Idea of Classification (Hunt’s Algorithm)

select the most discriminatory feature
split the entire set into subsets using the feature
recursively find the most significant feature for each subset
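
The recursive idea can be sketched as follows. This is a minimal illustration, not the exact algorithm from the slides: it assumes categorical features stored in dicts and uses weighted Gini impurity as the measure of how discriminatory a feature is (other measures, such as information gain, are equally possible). All function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_feature(rows, labels, features):
    """Pick the feature whose split gives the lowest weighted Gini impurity."""
    def weighted_gini(f):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[f], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(features, key=weighted_gini)

def build_tree(rows, labels, features):
    # Stop when the subset is pure or no features are left: leaf with majority label
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(rows, labels, features)
    remaining = [x for x in features if x != f]
    node = {"feature": f, "children": {}}
    # Split the set into one subset per value of the chosen feature and recurse
    for value in set(row[f] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[f] == value]
        sub_rows, sub_labels = zip(*subset)
        node["children"][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return node

def predict(tree, row):
    """Follow the branches until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["feature"]]]
    return tree
```

`build_tree` returns nested dicts with a class label at each leaf. `predict` raises a `KeyError` for a feature value that never appeared in the training rows, which a fuller implementation would need to handle.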

What types of classification approaches do you know?

neural networks
bayesian networks
decision trees
support vector machines
genetic algorithm

Decision Trees

steps of the algorithm

create root node and select splitting attribute
add a branch to the root node for each split candidate value and label it
take the following iterative steps
1. classify the data by applying the split value
2. if a stopping point is reached, create a leaf node and label it -> otherwise build another subtree

based on attributes of instances
branch for each value
e.g.: Will the customer buy a tablet?

Decision Tree

what are the Data Requirements

Attribute value description
- the same attributes must describe each example, and each attribute must have a fixed number of values
predefined classes
- the class of each sample must already be defined
discrete classes
- classes must be sharply delineated
sufficient examples
- enough training cases are needed to distinguish valid patterns from chance occurrences

Information

Last changed
2 years ago


LE5 - Modeling I - Clustering / Association Rules
