Principal Component Analysis (PCA) – Basics
Statistical technique for simplifying datasets.
Goal: reduce dimensionality while preserving maximum variance.
Creates principal components (PCs):
Linear combinations of the original variables.
Uncorrelated with each other.
Why PCA?
Input data often has many correlated/redundant variables → complexity.
PCA constructs new variables (PCs) → most info contained in first few PCs.
Helps simplify data for:
Machine learning.
Regression models.
Main Uses of PCA
Dimensionality reduction → fewer variables.
Feature extraction → uncorrelated features.
Data visualization → higher dimensions reduced to 2D/3D.
Applied in image & signal processing.
Often used in ML preprocessing → reduces noise, improves performance.
When to Use PCA?
Do we want to reduce number of variables (without selecting manually)?
Do we want to ensure independence between variables?
If yes → PCA is a good method.
PCA Algorithm – Steps
Standardize data → subtract the mean from each variable (and typically divide by its standard deviation).
Covariance matrix → measure variance & covariance.
Eigenvalues & eigenvectors →
Eigenvectors = directions of maximum variance in data
Eigenvalues = amount of variance explained by each eigenvector
Sort eigenvalues → eigenvector with the largest eigenvalue = 1st principal component (PC1).
Reduce dimension → ignore less significant PCs.
Recast the data → project it onto the retained PCs (and, if needed, transform back into the original feature space).
PCA Example (x1 & x2 dataset)
PC1 = diagonal axis (captures most variance).
PC2 = perpendicular axis (captures second-highest variance).
If reducing variables → keep PC1, drop PC2.
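A minimal NumPy sketch of the steps above on a small x1/x2 dataset (the data values are made up for illustration):

```python
import numpy as np

# small illustrative x1/x2 dataset (values made up for the example)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# 1. center the data (subtract the mean of each variable)
X_centered = X - X.mean(axis=0)

# 2. covariance matrix
cov = np.cov(X_centered, rowvar=False)

# 3. eigenvectors (directions of maximum variance) and eigenvalues (variance per direction)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. sort by decreasing eigenvalue -> PC1 first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. keep only PC1 and project the data onto it (dimensionality reduction)
X_reduced = X_centered @ eigvecs[:, :1]

print("explained variance ratio:", eigvals / eigvals.sum())
print("data projected onto PC1:\n", X_reduced)
```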
Clustering Analysis
unsupervised learning technique that groups the input data into unlabeled, meaningful clusters
each cluster contains data records that share a certain level of similarity and are at the same time dissimilar to the data records in other clusters
number of clusters is dependent on context, data characteristics, purpose of clustering and evaluation metrics
main applications of clustering
data reduction
outlier detection
developing hypotheses
steps of K-means clustering
decide on the number of clusters: make an assumption about K, e.g. based on the elbow method or the silhouette method
select random data records to represent the centers (centroids) of these clusters
calculate the distances between each data record and the defined centroids; assign each data record to the cluster whose centroid is closest (Euclidean distance)
recalculate the new centroid for each cluster by averaging the included data records
repeat steps 3 and 4 until there are no further changes in the calculated centroids
the final clusters consist of the data records assigned to them (see the sketch below)
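A minimal sketch of these steps with scikit-learn's KMeans (the synthetic data and K = 3 are assumptions for the example; the initialize/assign/update loop from steps 2–5 runs internally):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic 2-D data around three made-up centers (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# step 1: assume K = 3; steps 2-5 (centroid initialization, assignment,
# centroid update, repeat until convergence) are handled inside KMeans
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("centroids:\n", kmeans.cluster_centers_)
print("cluster label of the first 5 records:", kmeans.labels_[:5])
```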
elbow method
within-cluster sum of squares (WCSS) value
calculates the sum of squared errors for different values of K
as the number of clusters increases, the sum of squared distances between data points and their cluster centroids decreases
K is chosen where the decrease in WCSS starts to level off —> elbow point (see the sketch below)
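A sketch of the elbow method (WCSS is exposed as inertia_ in scikit-learn; the synthetic data is the same kind of made-up example as above):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic 2-D data around three made-up centers (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# WCSS for a range of candidate K values
wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# print K vs. WCSS; the "elbow" is where the decrease levels off
for k, w in zip(range(1, 9), wcss):
    print(f"K={k}: WCSS={w:.1f}")
```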
silhouette method
how similar a point is to its cluster in comparison with other clusters
ranges from -1 to 1
high value: point matches well to its own cluster and poorly matches to neighboring clusters
low value: point could be assigned to a different cluster
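A sketch of choosing K with the mean silhouette score (synthetic data as above; the K with the highest score is preferred):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic 2-D data around three made-up centers (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# mean silhouette score (-1 .. 1) for several candidate K values
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"K={k}: silhouette={silhouette_score(X, labels):.3f}")
```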
main benefits of Agglomerative Clustering
organization of clusters and subclusters: hierarchy visible
visualization of the clustering process
facilitates classification of new items
Steps of agglomerative clustering
assign each data record to a unique cluster
merge the two data records (or clusters) with the minimum Euclidean distance between them into a single cluster
repeat this process until there is only one cluster remaining —> forming a hierarchy of clusters
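A minimal sketch of agglomerative clustering with SciPy's hierarchical-clustering utilities (the small dataset is made up; "single" linkage merges the pair with the minimum Euclidean distance):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# small illustrative 2-D dataset (values made up)
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0],
              [5.2, 4.8], [9.0, 1.0], [9.1, 0.9]])

# bottom-up clustering: every record starts as its own cluster, then the
# closest pair (minimum Euclidean distance) is merged repeatedly
Z = linkage(X, method="single", metric="euclidean")

# Z encodes the full merge hierarchy; cut it to obtain e.g. 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster labels:", labels)
# scipy.cluster.hierarchy.dendrogram(Z) visualizes the hierarchy of merges
```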
Linear Regression
predict the value of a dependent variable (target variable) in a new situation
based on the behavior of the target variable observed in previous situations as well as other data variables
iterative process
Simple Linear Regression Model formula
y = w0 + w1·x + ε
w0 and w1 are the regression coefficients, ε is the error term
best model (linear regression)
the one that minimizes the error term value
to be found by the least-squares method
Multiple Linear Regression Model
more than one independent variable
use least-squares method
the more independent variables, the weaker the assumption of a linear relationship —> use nonlinear regression models
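A minimal sketch of a multiple linear regression fitted by least squares with scikit-learn (the data, coefficients, and intercept are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: two independent variables and a noisy target
# (true intercept 3.0 and coefficients 2.0, -1.5 are made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# LinearRegression fits the coefficients with the least-squares criterion
model = LinearRegression().fit(X, y)
print("intercept (w0):", model.intercept_)
print("coefficients (w1, w2):", model.coef_)

# prediction of the target in a new situation
print("prediction for x1=1, x2=2:", model.predict([[1.0, 2.0]]))
```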
Time-Series Forecasting
used to predict future values based on data observed over time
depends on previous values
concepts for using a forecasting technique
trends: general upward or downward movement over time
seasonality: regular repeating patterns, such as holiday peaks
unexpected events: shocks like economic changes or pandemics
noise: random variation in the data
stationary data
statistical properties stay consistent over time
if the data is not stationary, it can be transformed by differencing, i.e. subtracting the previous value from each value (see the sketch below)
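A minimal sketch of first-order differencing to make a trending, made-up series closer to stationary:

```python
import numpy as np

# illustrative series with an upward trend (not stationary)
rng = np.random.default_rng(0)
t = np.arange(20)
series = 2.0 * t + rng.normal(scale=1.0, size=20)

# first-order differencing: subtract the previous value from each value;
# this removes the linear trend so the mean stays roughly constant over time
diff = np.diff(series)

print("mean of first/second half (original):   ", series[:10].mean(), series[10:].mean())
print("mean of first/second half (differenced):", diff[:10].mean(), diff[10:].mean())
```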
Forecasting Models
Autoregressive Model (AR)
predicts future values based on past values of the same variable
Moving Average Model (MA)
predicts values based on past prediction (forecast) errors
ARMA model
combines AR and MA to capture both past values and past errors
ARIMA model
adds a step that makes the data stationary before applying AR and MA (3 parts: AR, differencing to remove trends, MA)
SARIMA model
extension of ARIMA that also handles seasonal patterns
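A minimal sketch of fitting an ARIMA model with statsmodels (the made-up series and the order (1, 1, 1) are assumptions for the example; passing a seasonal_order in addition turns it into a SARIMA model):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# illustrative trending series (values made up)
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=200))

# ARIMA(p, d, q): p past values (AR part), d differencing steps to remove
# the trend, q past errors (MA part)
model = ARIMA(y, order=(1, 1, 1)).fit()

# forecast the next 5 values
print(model.forecast(steps=5))
```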
autocorrelation
how strongly current values relate to past values
helps identify which past points are useful for prediction and guides how many past values to include in models
Partial Autocorrelation
helps determine how many past values directly influence the present without interference from the points in between
helps decide the structure of ARIMA models
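A minimal sketch of computing the autocorrelation and partial autocorrelation with statsmodels (the AR(1)-style series and its coefficient 0.7 are made up):

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# illustrative AR(1)-like series: each value depends on the previous one
rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# autocorrelation: how strongly the series relates to its own past values
print("ACF  (lags 0-5):", np.round(acf(y, nlags=5), 2))
# partial autocorrelation: the direct influence of each lag with the
# intermediate lags held fixed; helps choose the AR order of an ARIMA model
print("PACF (lags 0-5):", np.round(pacf(y, nlags=5), 2))
```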
Least-Squares method
find the optimal values of the parameters w0 and w1 so that the predicted values are as close as possible to the actual data points (by minimizing the sum of squared errors)
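A minimal sketch of the closed-form least-squares estimates for the simple model y = w0 + w1·x (the data is made up; the true relation is roughly y = 1 + 2x):

```python
import numpy as np

# illustrative x/y data (values made up)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)

# least-squares estimates:
# w1 = cov(x, y) / var(x), w0 = mean(y) - w1 * mean(x)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print("w0 =", w0, " w1 =", w1)  # should be close to 1 and 2
```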