Principal Component Analysis (PCA) – Basics
Statistical technique for simplifying datasets.
Goal: reduce dimensionality while preserving maximum variance.
Creates principal components (PCs):
Linear combinations of the original variables.
Uncorrelated with each other.
Why PCA?
Input data often has many correlated/redundant variables → complexity.
PCA constructs new variables (PCs) → most info contained in first few PCs.
Helps simplify data for:
Machine learning.
Regression models.
Main Uses of PCA
Dimensionality reduction → fewer variables.
Feature extraction → uncorrelated features.
Data visualization → higher dimensions reduced to 2D/3D.
Applied in image & signal processing.
Often used in ML preprocessing → reduces noise, improves performance.
When to Use PCA?
Do we want to reduce number of variables (without selecting manually)?
Do we want to ensure independence between variables?
If yes → PCA is a good method.
PCA Algorithm – Steps
Standardize data → subtract the mean from each variable (and typically divide by its standard deviation).
Covariance matrix → measure variance & covariance.
Eigenvalues & eigenvectors →
Eigenvectors = directions of maximum variance in data
Eigenvalues = amount of variance explained by each eigenvector
Sort eigenvalues → eigenvector with the largest eigenvalue = 1st principal component (PC1).
Reduce dimension → ignore less significant PCs.
Recast the data → project it onto the retained PCs (and, if needed, transform back into the original feature space).
PCA Example (x1 & x2 dataset)
PC1 = diagonal axis (captures most variance).
PC2 = perpendicular axis (captures second-highest variance).
If reducing variables → keep PC1, drop PC2.
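A minimal NumPy sketch of the steps above on a small x1/x2 dataset (the data values are made up for illustration):

```python
import numpy as np

# small illustrative x1/x2 dataset (values made up for the example)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# 1. center the data (subtract the mean of each variable)
X_centered = X - X.mean(axis=0)

# 2. covariance matrix
cov = np.cov(X_centered, rowvar=False)

# 3. eigenvectors (directions of maximum variance) and eigenvalues (variance per direction)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. sort by decreasing eigenvalue -> PC1 first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. keep only PC1 and project the data onto it (dimensionality reduction)
X_reduced = X_centered @ eigvecs[:, :1]

print("explained variance ratio:", eigvals / eigvals.sum())
print("data projected onto PC1:\n", X_reduced)
```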
Clustering Analysis
unsupervised learning technique that groups the input data into unlabeled, meaningful clusters
each cluster contains data records that share a certain level of similarity and are at the same time dissimilar to the data records in other clusters
number of clusters is dependent on context, data characteristics, purpose of clustering and evaluation metrics
main applications of clustering
data reduction
outlier detection
developing hypotheses
steps of K-means clustering
decide on the number of clusters: make an assumption about K, e.g. based on the elbow method or the silhouette method
select random data records to represent the centers (centroids) of these clusters
calculate the distances between each data record and the defined centroids; assign each data record to the cluster whose centroid is closest (Euclidean distance)
recalculate the new centroid for each cluster by averaging the included data records
repeat steps 3 and 4 until there are no further changes in the calculated centroids
the final clusters consist of the data records assigned to them (see the sketch below)
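A minimal sketch of these steps with scikit-learn's KMeans (the synthetic data and K = 3 are assumptions for the example; the initialize/assign/update loop from steps 2–5 runs internally):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic 2-D data around three made-up centers (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# step 1: assume K = 3; steps 2-5 (centroid initialization, assignment,
# centroid update, repeat until convergence) are handled inside KMeans
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("centroids:\n", kmeans.cluster_centers_)
print("cluster label of the first 5 records:", kmeans.labels_[:5])
```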
elbow method
within-cluster sum of squares (WCSS) value
calculates the sum of squared errors for different values of K
as the number of clusters increases, the sum of squared distances between data points and their cluster centroids decreases
K is chosen where the decrease in WCSS starts to level off —> elbow point (see the sketch below)
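A sketch of the elbow method (WCSS is exposed as inertia_ in scikit-learn; the synthetic data is the same kind of made-up example as above):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic 2-D data around three made-up centers (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# WCSS for a range of candidate K values
wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# print K vs. WCSS; the "elbow" is where the decrease levels off
for k, w in zip(range(1, 9), wcss):
    print(f"K={k}: WCSS={w:.1f}")
```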
silhouette method
how similar a point is to its cluster in comparison with other clusters
ranges from -1 to 1
high value: point matches well to its own cluster and poorly matches to neighboring clusters
low value: point could be assigned to a different cluster
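A sketch of choosing K with the mean silhouette score (synthetic data as above; the K with the highest score is preferred):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic 2-D data around three made-up centers (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# mean silhouette score (-1 .. 1) for several candidate K values
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"K={k}: silhouette={silhouette_score(X, labels):.3f}")
```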
main benefits of Agglomerative Clustering
organization of clusters and subclusters: hierarchy visible
visualization of the clustering process
facilitates classification of new items
Steps of agglomerative clustering
assign each data record to a unique cluster
merge the two data records (or clusters) with the minimum Euclidean distance between them into a single cluster
repeat this process until there is only one cluster remaining —> forming a hierarchy of clusters
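A minimal sketch of agglomerative clustering with SciPy's hierarchical-clustering utilities (the small dataset is made up; "single" linkage merges the pair with the minimum Euclidean distance):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# small illustrative 2-D dataset (values made up)
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0],
              [5.2, 4.8], [9.0, 1.0], [9.1, 0.9]])

# bottom-up clustering: every record starts as its own cluster, then the
# closest pair (minimum Euclidean distance) is merged repeatedly
Z = linkage(X, method="single", metric="euclidean")

# Z encodes the full merge hierarchy; cut it to obtain e.g. 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster labels:", labels)
# scipy.cluster.hierarchy.dendrogram(Z) visualizes the hierarchy of merges
```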
Linear Regression
predict the value of a dependent variable (target variable) in a new situation
based on the behavior of the target variable observed in previous situations as well as other data variables
iterative process
Simple Linear Regression Model formula
y = w0 + w1·x + ε
w0 and w1 are the regression coefficients, ε is the error term
best model (linear regression)
the one that minimizes the error term value
to be found by the least-squares method
Multiple Linear Regression Model
more than one independent variable
use least-squares method
the more independent variables, the weaker the assumption of a linear relationship —> use nonlinear regression models
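A minimal sketch of a multiple linear regression fitted by least squares with scikit-learn (the data, coefficients, and intercept are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: two independent variables and a noisy target
# (true intercept 3.0 and coefficients 2.0, -1.5 are made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# LinearRegression fits the coefficients with the least-squares criterion
model = LinearRegression().fit(X, y)
print("intercept (w0):", model.intercept_)
print("coefficients (w1, w2):", model.coef_)

# prediction of the target in a new situation
print("prediction for x1=1, x2=2:", model.predict([[1.0, 2.0]]))
```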
Time-Series Forecasting
used to predict future values based on data observed over time
depends on previous values
concepts for using a forecasting technique
trends: general upward or downward movement over time
seasonality: regular repeating patterns, such as holiday peaks
unexpected events: shocks like economic changes or pandemics
noise: random variation in the data
stationary data
statistical properties stay consistent over time
if the data is not stationary, it can be transformed by differencing, i.e. subtracting the previous value from each value (see the sketch below)
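A minimal sketch of first-order differencing to make a trending, made-up series closer to stationary:

```python
import numpy as np

# illustrative series with an upward trend (not stationary)
rng = np.random.default_rng(0)
t = np.arange(20)
series = 2.0 * t + rng.normal(scale=1.0, size=20)

# first-order differencing: subtract the previous value from each value;
# this removes the linear trend so the mean stays roughly constant over time
diff = np.diff(series)

print("mean of first/second half (original):   ", series[:10].mean(), series[10:].mean())
print("mean of first/second half (differenced):", diff[:10].mean(), diff[10:].mean())
```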
Forecasting Models
Autoregressive Model (AR)
predicts future values based on past values of the same variable
Moving Average Model (MA)
predicts values based on past prediction (forecast) errors
ARMA model
combines AR and MA to capture both past values and past errors
ARIMA model
adds a step that makes the data stationary before applying AR and MA (3 parts: AR, differencing to remove trends, MA)
SARIMA model
extension of ARIMA that also handles seasonal patterns
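A minimal sketch of fitting an ARIMA model with statsmodels (the made-up series and the order (1, 1, 1) are assumptions for the example; passing a seasonal_order in addition turns it into a SARIMA model):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# illustrative trending series (values made up)
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=200))

# ARIMA(p, d, q): p past values (AR part), d differencing steps to remove
# the trend, q past errors (MA part)
model = ARIMA(y, order=(1, 1, 1)).fit()

# forecast the next 5 values
print(model.forecast(steps=5))
```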
autocorrelation
how strongly current values relate to past values
helps identify which past points are useful for prediction and guides how many past values to include in models
Partial Autocorrelation
helps determine how many past values directly influence the present without interference from the points in between
helps decide the structure of ARIMA models
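A minimal sketch of computing the autocorrelation and partial autocorrelation with statsmodels (the AR(1)-style series and its coefficient 0.7 are made up):

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# illustrative AR(1)-like series: each value depends on the previous one
rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# autocorrelation: how strongly the series relates to its own past values
print("ACF  (lags 0-5):", np.round(acf(y, nlags=5), 2))
# partial autocorrelation: the direct influence of each lag with the
# intermediate lags held fixed; helps choose the AR order of an ARIMA model
print("PACF (lags 0-5):", np.round(pacf(y, nlags=5), 2))
```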
Least-Squares method
find the optimal values of the parameters w0 and w1 so that the predicted values are as close as possible to the actual data points (by minimizing the sum of squared errors)
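A minimal sketch of the closed-form least-squares estimates for the simple model y = w0 + w1·x (the data is made up; the true relation is roughly y = 1 + 2x):

```python
import numpy as np

# illustrative x/y data (values made up)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)

# least-squares estimates:
# w1 = cov(x, y) / var(x), w0 = mean(y) - w1 * mean(x)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print("w0 =", w0, " w1 =", w1)  # should be close to 1 and 2
```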