Cluster analysis - Idea/Goal:
Cluster observations into homogeneous groups (i.e., variation within a group is smaller than the variation between observations of different groups).
K-Means Clustering (centroid based), strengths and limitations:
Strengths:
simple and efficient for large datasets
works well when clusters are spherical and evenly sized
Limitations:
requires specifying the number of clusters in advance
struggles with non-spherical or overlapping clusters
Hierarchical Clustering, strengths and limitations:
Strengths:
no need to specify the number of clusters in advance
produces a dendrogram that can be cut at any level to obtain different numbers of clusters
Limitations:
computationally expensive for large datasets
sensitive to noise and outliers
K-means clustering - procedure:
1. Define the number of clusters (k)
2. Randomly initialize k points as centroids
3. Assign each data point to the nearest centroid (calculate the distance between points and centroids)
4. Recompute each centroid as the mean of all data points assigned to it
5. Repeat assignment and centroid update (steps 3 & 4)
6. Check convergence and report the final clusters
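A minimal sketch of this procedure with scikit-learn (the synthetic data and k = 3 are purely illustrative):

```python
# Minimal K-means sketch (assumes scikit-learn; data and k = 3 are illustrative)
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),  # group around (0, 0)
    rng.normal(loc=(3, 3), scale=0.5, size=(50, 2)),  # group around (3, 3)
    rng.normal(loc=(0, 4), scale=0.5, size=(50, 2)),  # group around (0, 4)
])

# Steps 1-2: choose k and initialize centroids; steps 3-6 run inside .fit()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])      # cluster assignment of the first 10 observations
print(kmeans.cluster_centers_)  # final centroids (mean of the points assigned to each)
print(kmeans.inertia_)          # within-cluster sum of squares (WCSS)
```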
Choosing the number of clusters -> elbow criterion
The optimal number of clusters is where the within-cluster sum of squares (WCSS) starts to decrease at a slower rate.
Small K (too few clusters):
Oversimplifies the data, merges different groups.
High bias, low variance → underfitting.
Large K (too many clusters):
Fits small variations/noise, each point may almost become its own cluster.
Low bias, high variance → overfitting.
Bias = how far off your model is, on average, from the true underlying structure.
Variance = how sensitive your model is to small changes in the data.
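A short sketch of the elbow criterion in code (reusing the toy data X from the K-means sketch above; the range of k values is arbitrary):

```python
# Elbow plot sketch: compute the WCSS (inertia) for several k values and look for the "kink"
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow criterion: pick k where the curve starts to flatten")
plt.show()
```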
Hierarchical cluster analysis - Idea:
Hierarchical clustering is a method that builds a hierarchy of clusters either in a bottom-up (agglomerative) or top-down (divisive) manner.
Hierarchical cluster analysis - procedure:
1. Select a distance metric (Euclidean, Manhattan, cosine similarity)
2. Create the distance matrix
3. Initialize clusters (every observation starts as its own cluster)
4. Merge the closest clusters following a linkage method → influences the shape of the final clusters
5. Update the distance matrix by replacing the merged clusters with the new cluster
6. Repeat steps 4 and 5
7. Stop when only one cluster is left
8. Choose the number of clusters (dendrogram or elbow criterion)
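A minimal sketch of these steps with SciPy (Euclidean distance, Ward linkage, and the toy data are illustrative choices):

```python
# Hierarchical clustering sketch (assumes SciPy; metric, linkage method and data are illustrative)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

dist = pdist(X, metric="euclidean")              # steps 1-2: distance metric and distance matrix
Z = linkage(dist, method="ward")                 # steps 3-7: agglomerative merging ("single", "complete", "average" also possible)
labels = fcluster(Z, t=2, criterion="maxclust")  # step 8: cut the hierarchy into 2 clusters
print(labels)
```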
Step 4, Merging:
Ward’s method (minimizes total within-cluster variance: merges the pair of clusters that results in the smallest increase in the total within-cluster sum of squares)
single linkage (distance between the closest points in two clusters)
complete linkage (distance between the farthest points in two clusters)
average linkage (average distance between all points in the two clusters)
Step 8, choosing the number of clusters:
The dendrogram measures the average cluster distance and groups observations into clusters
More similar observations are grouped first (i.e., at a lower average cluster distance)
Reading it from bottom to top, initially every observation is its own cluster
We can “cut” the dendrogram at different heights to get different numbers of clusters
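A short sketch for plotting and cutting the dendrogram (reusing the linkage matrix Z from the SciPy sketch above; the cut height of 5.0 is arbitrary):

```python
# Dendrogram sketch (reuses Z from the hierarchical clustering example)
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster

dendrogram(Z)                                   # bottom: every observation is its own cluster
plt.ylabel("cluster distance (linkage height)")
plt.title("Cut at a chosen height to obtain the desired number of clusters")
plt.show()

labels = fcluster(Z, t=5.0, criterion="distance")  # cut at an (arbitrary) height instead of a fixed k
```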
Factor Analysis:
Motivation/Goal -> reduce the dimensionality of variables/features
Factor analysis tries to identify a set of common underlying dimensions, known as factors, in a group of variables.
A factor analysis allows us to reduce the number of variables and uncover latent constructs (i.e., features that we did not or cannot measure directly).
Note: Fewer factors than variables; never more factors than variables.
Purpose of a factor analysis
Gives a smaller number of variables to work with
Reveals interesting patterns (e.g., used in recommendation engines)
Solves problems of multicollinearity when two (or more) highly correlated variables are combined into one theoretically meaningful factor
Factor analysis can test the validity of a scale (e.g., measuring charisma)
Exploratory factor Analysis:
Uncover the underlying structure of a relatively large set of variables
A priori assumption that any indicator may be associated with any factor
Used to show (uni)dimensionality of a scale
Used to assess the reliability of a scale
Always needs to be done if you use a scale
Requirements Exploratory factor analysis:
Usually, variables need to be continuous, but it also works with ordinal data if its distribution is not skewed
Check if variables are normally distributed (e.g., assess histograms or QQ-plot)
A factor analysis works only if the variables are correlated
Examine which variables are correlated to get a sense of the data and potential factors (e.g., correlation matrix)
Check the Kaiser-Meyer-Olkin (KMO) criterion to see if there is sufficient correlation
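A sketch of these checks in Python (the item data is hypothetical; calculate_kmo comes from the factor_analyzer package, assuming it is installed):

```python
# Pre-check sketch: correlation matrix and KMO (hypothetical item data q1-q6)
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_kmo

rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))                     # one latent construct driving the items
items = latent + rng.normal(scale=0.8, size=(200, 6))  # correlated observed items
df = pd.DataFrame(items, columns=[f"q{i}" for i in range(1, 7)])

print(df.corr().round(2))        # correlation matrix: look for groups of correlated items
kmo_per_item, kmo_overall = calculate_kmo(df)
print(round(kmo_overall, 2))     # rule of thumb: overall KMO above ~0.6 suggests sufficient correlation
```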
Methods to extract factors (exploratory factor analysis):
Principal component factor extraction: maximizes the explained variance among the underlying variables; tries to derive a small number of linear combinations (principal components) that retain as much information from the original variables as possible.
Common factor model: models observable variables as linear combinations of latent factors and uses maximum likelihood estimation to estimate factor loadings.
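A sketch contrasting the two extraction approaches (scikit-learn's PCA for principal components and its FactorAnalysis as one common-factor implementation; it reuses the hypothetical df from the pre-check sketch, and two factors is an arbitrary choice):

```python
# Extraction sketch (reuses df; the number of components/factors is illustrative)
from sklearn.decomposition import PCA, FactorAnalysis

pca = PCA(n_components=2).fit(df)                # principal components: retain maximal variance
print(pca.explained_variance_ratio_)

common = FactorAnalysis(n_components=2).fit(df)  # common factor model: items as linear combinations of latent factors
print(common.components_)                        # unrotated loadings of the factors on the variables
```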
Methods to determine number of factors:
Factors having eigenvalues > 1 (Kaiser criterion):
eigenvalue > 1 means factor explains more than its own variance
eigenvalue: sum of squared factor loadings of one factor across all variables
factor loading: correlation of factor and variable
Percentage of variance criterion:
(factors should explain more than 60% of the total variance)
Plot eigenvalues and use the “elbow criterion”
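A sketch of these three criteria on the item correlation matrix (reusing df; the eigenvalues are those of the correlation matrix):

```python
# Number-of-factors sketch: Kaiser criterion, % of variance, and eigenvalue ("elbow") plot
import numpy as np
import matplotlib.pyplot as plt

eigenvalues = np.linalg.eigvalsh(df.corr().to_numpy())[::-1]       # sorted largest first
print("eigenvalues > 1:", int((eigenvalues > 1).sum()))            # Kaiser criterion
print("cumulative % of variance:", np.cumsum(eigenvalues) / eigenvalues.sum() * 100)

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")  # look for the elbow
plt.axhline(1, linestyle="--")                                     # Kaiser cutoff
plt.xlabel("factor")
plt.ylabel("eigenvalue")
plt.show()
```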
Optimize factors and assess factor loadings
Factor rotation:
Our goal is to make factors more distinct
A variable should mainly correlate with only one factor
Reference axes of the factors can be turned about the origin
Common types of rotation:
orthogonal rotation means the axes are kept at 90° (→ factor scores are uncorrelated!)
popular rotations: varimax, quartimax, equimax
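A sketch of an orthogonal (varimax) rotation with the factor_analyzer package, assuming it is available (two factors and the hypothetical df are illustrative):

```python
# Varimax rotation sketch (assumes factor_analyzer; reuses df; n_factors=2 is illustrative)
from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer(n_factors=2, rotation="varimax")
fa.fit(df)
print(fa.loadings_.round(2))   # after rotation, each variable should load mainly on one factor
```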
Factor scores
For each observation, a factor analysis will produce a factor score for each factor
These factor scores are added as new variables to our data set
We can easily use these new variables in further analyses
To compute factor scores/indices, we can either use:
regression → factor scores (used in psychology)
sum/average of the variables we assigned to each factor → called indices (frequently used in economics)
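A sketch of both options (reusing the fitted fa and df from the rotation sketch; which items belong to which factor is an illustrative assumption):

```python
# Factor scores vs. indices (reuses fa and df from the rotation example)
scores = fa.transform(df)           # regression-based factor scores, one column per factor
df["factor1_score"] = scores[:, 0]

df["factor1_index"] = df[["q1", "q2", "q3"]].mean(axis=1)  # index: average of the items assigned to factor 1 (illustrative)
```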
Factor reliability
To assess the reliability of a factor (i.e., how consistently does the factor measure the items associated with it), we can compute Cronbach’s Alpha.
Cronbach’s Alpha ranges from 0 to 1, where higher values indicate better internal consistency/reliability
Typically, an alpha of 0.7 or above is considered acceptable for reliability
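A sketch of Cronbach’s Alpha computed directly from its standard formula (items q1-q3 are assumed, for illustration, to form one factor):

```python
# Cronbach's Alpha: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
import numpy as np

items = df[["q1", "q2", "q3"]].to_numpy()
k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 2))             # >= 0.7 is usually considered acceptable
```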