What are the types of Machine Learning Algorithms? Explain them shortly.
supervised learning: usage of training data which is labelled to train the algorithm which creates a model (hypothesis). This model can then be used to predict the labels of unlabelled data.
unsupervised learning: The algorithm puts unlabelled data into clusters. It can’t predict the labels without training data, but can cluster the data by similarity. To label the data, it has to be interpreted manually.
reinforcement learning: With reinforcement learning, the algorithm learns from rewards it receives for correct decisions (and penalties for wrong ones). That way, the algorithm develops a strategy to fulfill a goal.
What are the types of Machine Learning Problems? Explain them shortly. Which machine learning algorithms can be used to solve them?
Regression: Regression is a supervised learning problem. It uses the input data to fit a hyperplane in space (2D case: a line) which predicts values based on these inputs. The answer is therefore a continuous value (e.g. y = 2x + 3). It estimates the relationship between two or more variables. Example: house price vs. size. When, in the 2D case, the fitted line is straight, this is called Linear Regression.
Classification: Classification is a supervised learning problem. It uses input data to fit a hyperplane in space which separates two or more classes. With this model, the class of new, unlabelled data can be predicted, like benign vs. malignant tumors, soil type from data etc. The answer is discrete (x is either in class 1 or 2).
Segmentation: Segmentation is an unsupervised learning problem which provides a set of clusters. It groups similar data into clusters (the number of clusters can be set in advance or the algorithm determines it itself). It predicts properties of an instance by its closeness to the centroids of the clusters.
What is Exploratory Data Analysis and which steps does it include?
Exploratory Data Analysis is the first step towards building a machine learning model. It is understanding the data.
It includes:
Data manipulation: normalization of data when dealing with huge datasets, data filtering/cleaning, replace no data values, export to common formats
Data plotting: visualization of data
Descriptive statistics: summarize data in any way (variances, mean values…), quantify dependency of two data sets, for example latitude and temperature
What are the three ways to analyze data in terms of analyzing variables separately or together?
univariate: Each variable is analyzed separately
bivariate: Two variables are analyzed together to look for correlation/separation
multivariate: > 2 variables are analyzed together; here it’s usually difficult to visualize the data and the results
What are the 4 types of variables and what are the possible math operations that can be done to them? How can these 4 variables be generalized?
qualitative variables
nominal (equal or not equal): Examples: Is an area a desert or a forest? -> no ranking applied
ordinal (<, >): Has info on rank/hierarchy, for example: cost of living is high, medium or low; seismic scales
quantitative variables
interval (+,-): includes numerical values and information can be arranged along a scale, e.g. temperature in °C. There are only a few examples of this. The difference to ratio data is that interval data has no natural 0 (20°C is not 2x warmer than 10°C, 0°C doesn’t mean “no temperature”.)
ratio (*,/): Like interval data but with natural 0 (either physical or convention like for elevation). Examples: precipitation, elevation, load capacity of roads, temp in Kelvin (because 0 K = particles have no thermal energy, therefore no “temperature”, 200K is twice as hot as 100K in terms of particle energy). This is the most informative scale.
Explain the Linear Regression model.
The Linear Regression model is a linear model that predicts a target (continuous) value by computing a weighted sum of the input parameters (called features) plus the intercept term (also called bias term). The weights are the model parameters theta, and the intercept term is theta0.
This structure comes from the general formula of a straight line, which is y = ax + b, where b is the bias term and a is the model parameter which determines the slope of the line. These are the values to be determined and fit best to the training data.
The function y_hat is then called the hypothesis function: y_hat = theta_0 + theta_1*x_1 + … + theta_n*x_n = theta^T x.
With this, there would be one more theta value than x values due to the intercept term, which is why the feature vector must get one extra feature x_0 = 1.
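A minimal numpy sketch of this hypothesis function (the data values are purely illustrative):

```python
import numpy as np

# Minimal sketch of the linear hypothesis function y_hat = theta^T x.
# A constant feature x0 = 1 is prepended so the intercept theta0 fits into the same dot product.
X = np.array([[1.0], [2.0], [3.0]])        # one feature, three instances (illustrative values)
X_b = np.c_[np.ones((X.shape[0], 1)), X]   # add x0 = 1 to each instance
theta = np.array([3.0, 2.0])               # [theta0 (intercept), theta1 (slope)]

y_hat = X_b @ theta                        # weighted sum per instance
print(y_hat)                               # [5. 7. 9.]  -> y = 2x + 3
```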
How do we know which model fits the training data the best? Explain the function.
With help of the cost function. There are several cost functions; the most common one is the Root Mean Square Error (RMSE), but the Mean Square Error (MSE) is actually better in this case because it takes less computation time due to the absence of the root. Otherwise, minimizing either function leads to the same result. They look like this: MSE(theta) = (1/m) * sum_i (theta^T x^(i) - y^(i))^2 and RMSE(theta) = sqrt(MSE(theta)).
In words:
The Mean Square Error is a cost function which calculates the mean of the squared difference between the hypothesis function and the target values of the input.
The result is a scalar. The bigger the difference (=error) between the hypothesis function and the target (=actual) values, the bigger is the MSE. So, if the MSE decreases, the model fits better.
The best fitting model is therefore a model which produces the smallest MSE. This provides the best set of theta (theta_hat) which minimizes the cost function.
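A minimal sketch of the MSE as a cost function, reusing the X_b / theta notation from the hypothesis sketch above (values are illustrative):

```python
import numpy as np

# Minimal sketch: Mean Square Error between predictions and target values.
def mse(theta, X_b, y):
    # X_b already contains the constant feature x0 = 1 (see the sketch above)
    errors = X_b @ theta - y               # difference hypothesis - target, per instance
    return np.mean(errors ** 2)            # average of the squared errors -> scalar

# Example (illustrative values): a perfect fit gives MSE = 0
X_b = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([5.0, 7.0, 9.0])
print(mse(np.array([3.0, 2.0]), X_b, y))   # 0.0
print(mse(np.array([0.0, 1.0]), X_b, y))   # larger error -> worse fit
```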
Why do we square the error of the MSE/RMSE?
So that every error is positive.
Errors smaller than 1 shrink when squared, while larger errors have a much bigger impact on the MSE, which determines how well the linear regression model fits. This is important because a model which makes a lot of small errors is preferred over a model that fits well most of the time but sometimes makes big errors.
What do we need to watch out for when computing a Normal Equation?
The Normal Equation requires inverting an n x n matrix (where n is the number of features), which has a time complexity of about O(n^2.4) to O(n³). This means that if we double the number of features (variables) n in our model, the computation takes roughly 2^2.4 ≈ 5 to 2³ = 8 times longer. This is something we need to keep in mind.
In contrast, the Normal Equation is linear in the number of training data instances m.
This means that in a dataset of, say, trees, it's no problem to have lots of trees in the dataset, but if we account not only for height and trunk width but additionally for leaf size and bark thickness, then the computation time for the Normal Equation increases a lot.
If the Normal Equation can't be computed easily or the dataset is very large, a method like Gradient Descent can be used.
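A minimal numpy sketch of the closed-form Normal Equation theta_hat = (X^T X)^(-1) X^T y (the data is illustrative, generated around y = 2x + 3):

```python
import numpy as np

# Minimal sketch of the closed-form Normal Equation: theta_hat = (X^T X)^(-1) X^T y.
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))                     # one feature, 100 instances
y = 3 + 2 * X[:, 0] + rng.normal(0, 0.1, 100)    # noisy data around y = 2x + 3

X_b = np.c_[np.ones((100, 1)), X]                # add the constant feature x0 = 1
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)                                 # approximately [3, 2]

# np.linalg.lstsq or the pseudoinverse (np.linalg.pinv) are the numerically safer choices.
```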
Explain the principle of Gradient Descent.
Gradient Descent is an optimization algorithm for finding optimal solutions (best parameters) for a wide range of problems by minimizing a function (often a cost function).
It starts by initializing theta with (small) random values and gradually improving theta by decreasing the cost function, similar to Least Squares Adjustment. The function stops when theta converges to a minimum.
For that, the Gradient Vector has to be computed which is essentially just a vector of the partial derivatives of the cost function in regard to theta. This is done in order to determine how much the cost function changes (goal is for the derivative to be 0! -> find minimum).
When updating theta, the goal is to update it in a direction that reduces the cost function. This is done with the following equation: theta_(next step) = theta - eta * ∇_theta J(theta), where eta is the learning rate and ∇_theta J(theta) is the gradient vector of the cost function.
The learning rate determines the step size, so it shouldn't be too small to avoid too many iterations, but it also shouldn't be too high because then the algorithm might diverge and a good solution may never be found. To find a good learning rate, grid search can be used (systematically trying out several candidate values).
When using the Gradient Descent, what could happen when the cost function is not convex?
A function is convex if a line segment between any two points on the curve never crosses the curve. Therefore, a convex function has no local minima or plateaus.
A non-convex function might have local minima or plateaus. The solution might get stuck in a local minimum or stop on a plateau.
To avoid that, one can use convex functions. Other solutions can be to choose different initial values or increase the learning rate so small ridges can be “overjumped”.
What happens in Linear Regression when the cost function has two or more parameters? How does it look like, what problems can arise?
The cost function is then no longer a curve but a surface (or a higher-dimensional shape).
This cost function can be elongated in one direction if the features have different scales (e.g. a grade ranging from 1 to 5, but height ranging from 1.50 m to 2.50 m). The features should have similar scales or convergence will take longer!
Therefore, feature scaling can be used to, for example, scale every feature from 0-1.
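A minimal sketch of min-max feature scaling (the two features and their values are illustrative):

```python
import numpy as np

# Minimal sketch of min-max feature scaling: every feature is mapped to the range 0-1,
# so features with very different scales (e.g. grade 1-5 vs. height 1.50-2.50 m)
# no longer elongate the cost function.
X = np.array([[1.0, 1.50],
              [3.0, 1.80],
              [5.0, 2.50]])                 # illustrative values: grade, height in m

X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)    # each column now runs from 0 to 1
print(X_scaled)
```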
Explain the Gradient Descent step equation.
The Gradient Descent step equation looks like this: theta_(next step) = theta - eta * ∇_theta MSE(theta), where the gradient vector ∇_theta MSE(theta) = (2/m) * X^T (X theta - y) contains all the partial derivatives of the cost function.
It's based on the Mean Square Error cost function, which is used to estimate how well a model fits the training data.
With this, the equation determines where the next point on the cost function is. The learning rate eta decides the step size; a step size that's too small needs too many iterations, a step size that's too big could cause the algorithm to diverge instead of converge, never finding a minimum.
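A minimal numpy sketch of Batch Gradient Descent using exactly this step equation (learning rate, iteration count and data are illustrative):

```python
import numpy as np

# Minimal sketch of Batch Gradient Descent for Linear Regression (MSE cost).
# Gradient of the MSE: (2/m) * X^T (X theta - y); step: theta <- theta - eta * gradient.
def batch_gradient_descent(X_b, y, eta=0.1, n_iterations=1000):
    m = len(y)
    theta = np.random.randn(X_b.shape[1])          # random initialization
    for _ in range(n_iterations):
        gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
        theta = theta - eta * gradients            # step in the direction that reduces the cost
    return theta

# Illustrative data around y = 2x + 3 (same setup as above)
X = np.linspace(0, 2, 50)
X_b = np.c_[np.ones(50), X]
y = 3 + 2 * X
print(batch_gradient_descent(X_b, y))              # approximately [3, 2]
```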
What are the different types of Gradient Descent? Explain the differences between them.
There are three types of Gradient Descent. The difference between them is how much data they use to compute the gradient at each step.
Batch Gradient Descent:
Uses all the training data to compute the gradient before updating -> slow on large training sets, but scales well with the number of features.
Each step is one epoch (=one complete pass through the entire dataset)
Stochastic Gradient Descent:
picks one random training data instance at every step and computes the gradient on only that instance -> is faster, but might have problems with blunders
jumps around a lot -> might help escape local minima!, but ends up very close to the minimum without reaching it
is good for huge datasets that don't fit into memory, as only one instance is needed for every step. One says: if there are m training instances, Stochastic Gradient Descent has seen one epoch after m steps (even though that may not actually be true, because it picks instances at random!)
to lower the drawbacks of randomness: start training with a large learning rate to escape local minima, then gradually reduce it to settle at the global minimum. This is called simulated annealing (see the sketch after this list).
Mini-Batch Gradient Descent:
gradient is computed on small sets of random instances, called mini-batches; the algorithm doesn't see all the data at once, therefore fewer problems with huge datasets
will be closer to minimum than Stochastic Gradient Descent
jumps around less -> harder to escape from local minima
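A minimal numpy sketch of Stochastic Gradient Descent with the simple learning schedule ("simulated annealing") mentioned above; the schedule parameters t0/t1 and the data are illustrative:

```python
import numpy as np

# Minimal sketch of Stochastic Gradient Descent with a shrinking learning rate:
# large steps at first to escape local minima, gradually smaller steps to settle down.
def sgd(X_b, y, n_epochs=50, t0=5, t1=50):
    m = len(y)
    theta = np.random.randn(X_b.shape[1])
    for epoch in range(n_epochs):
        for i in range(m):                         # one "epoch" = m random picks
            idx = np.random.randint(m)             # pick one instance at random
            xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
            gradients = 2 * xi.T @ (xi @ theta - yi)
            eta = t0 / (epoch * m + i + t1)        # learning rate shrinks over time
            theta = theta - eta * gradients
    return theta

# Illustrative data around y = 2x + 3
X = np.linspace(0, 2, 100)
X_b = np.c_[np.ones(100), X]
y = 3 + 2 * X
print(sgd(X_b, y))                                 # approximately [3, 2]
```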
What are hyperparameters? Which hyperparameters are needed in the different Gradient Descent types?
Hyperparameters are settings to choose before training and therefore are not learned from the data.
For the Batch Gradient Descent, 2 are needed:
Learning rate
Number of iterations/convergence tolerance (how long to train)
For the other two Gradient Descent types:
Batch size (mini Batch)
Number of epochs
and possibly some others
What is the difference between Linear and Polynomial regression, and are they linear or non-linear?
In Linear Regression, a straight line is fitted to the training data, while for Polynomial Regression it's a polynomial. However, both are still linear models, because even for Polynomial Regression the equation is a weighted sum of input parameters (the model is linear in the weights).
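A minimal sketch of Polynomial Regression: x^2 is added as an extra feature and the weights are then solved like in a plain linear model (the quadratic example data is illustrative):

```python
import numpy as np

# Minimal sketch of Polynomial Regression: add x^2 as an extra feature, then solve a
# plain linear model - the prediction is still a weighted sum of (now polynomial) features.
rng = np.random.default_rng(1)
X = 6 * rng.random(100) - 3
y = 0.5 * X ** 2 + X + 2 + rng.normal(0, 0.1, 100)   # illustrative quadratic data

X_poly = np.c_[np.ones(100), X, X ** 2]              # features: 1, x, x^2
theta_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(theta_hat)                                      # approximately [2, 1, 0.5]
```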
What happens when the regression model fits the data too well and why?
When the model fits the input data too well, this is called overfitting, which means that the model might be very good on the training data but not so much on new data.
Here is an example:
This can be tested with validation.
Explain how Validation works. How can you, from a graph, estimate whether a model is under- or overfitting?
Validation is a technique for checking whether the model is under- or overfitting the data by adjusting hyperparameters.
Validation is performed with data that's not used for training and is instead reserved for validation. Because the model is trained on the training data, it will always perform worse on validation data - meaning the errors are larger. However, by looking at a graph which displays the error vs. the training set size, it can be estimated whether the model is underfitting, overfitting or fitting well.
A big gap between the validation & training data means overfitting, a large error on both means underfitting.
An example here:
Note: In some of the graphs above, the training data curve doesn't go up at 0. This is because for the 2nd graph a 10th-degree model is used, meaning it can fit the first few training instances perfectly (a 10th-degree polynomial can fit up to 11 points exactly) before producing an error. Same thing for the third graph.
What is Cross-Validation? Explain it shortly.
Cross-Validation is a technique used to assess how well a model performs with unseen data. The difference to “normal” validation is that for Cross-Validation, the data isn’t split into train and validation data once, but instead several times in different ways.
The model gets trained on each and the results are averaged.
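A minimal sketch of k-fold cross-validation, assuming scikit-learn is available (the model, scoring choice and data are illustrative):

```python
# Minimal sketch of k-fold Cross-Validation: the data is split into k folds, the model is
# trained k times, each time validated on a different fold, and the scores are averaged.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.random((100, 1))
y = 3 + 2 * X[:, 0] + rng.normal(0, 0.1, 100)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)  # 5 different splits
print(-scores.mean())                                             # average MSE over the folds
```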
To estimate the best possible model, into which categories is the data of a dataset split?
training data: used to fit the model
validation data: used to adjust hyperparameters like degree for polynomial regression
test data: used to test the model in the end. The test error is different from the error the model gets on training data, as that is data it has already seen. Instead, test data checks how well the model generalizes to unseen data.
What can be done to minimize overfitting for polynomial regression?
Apply regularization, for example Ridge Regression or Lasso Regression.
Explain Regularization models for linear models. Why do we use them, which types did we learn about, how do they work?
Regularization models are used to reduce overfitting by constraining the model. In the case of polynomial regression, a high degree is often the cause for overfitting, but because it’s a hyperparameter, the algorithm can’t reduce the degree while training the model. Therefore, Regularization methods can be applied.
Ridge Regression, sometimes called L2 regularization; a good default. It introduces a new hyperparameter (often called alpha) which controls how much the model is regularized. The regularization term, which contains the hyperparameter and the sum of the squared weights (without the bias!), is added to the cost function: J(theta) = MSE(theta) + alpha * sum_i theta_i^2. It keeps the model weights as small as possible, reducing big curves like here:
Lasso Regression: has the same kind of hyperparameter as Ridge Regression, but adds the l1 norm of the weight vector (the sum of the absolute weights, alpha * sum_i |theta_i|) to the cost function as the regularization term instead. It tends to eliminate the weights of the least important features by setting them to 0 and therefore performs feature selection. Useful when we know that only a few features are actually useful. This is seen here (and in the sketch after this list):
Elastic Net: Regularization term is a mix of both Ridge and Lasso regularization; preferred over Lasso Regularization because Lasso reg. might behave erratically in some cases
another method: Early Stopping - stop training as soon as the validation error reaches its minimum so that the model can't overfit at all. See here:
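A minimal sketch of Ridge and Lasso regression, assuming scikit-learn is available; the alpha values and the data (where only the first feature actually matters) are illustrative:

```python
# Minimal sketch of regularized linear models.
# alpha is the regularization hyperparameter: the larger it is, the smaller the weights are kept.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
X = rng.random((100, 3))
y = 4 * X[:, 0] + rng.normal(0, 0.1, 100)  # only feature 0 matters, features 1 and 2 are noise

print(Ridge(alpha=1.0).fit(X, y).coef_)    # all weights shrunk, none exactly 0
print(Lasso(alpha=0.1).fit(X, y).coef_)    # unimportant weights driven to 0 (feature selection)
```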
What are the three regression models we learned about in this class?
Regression models = Fit a hyperplane to input data, output: continuous value
The models we learned about were:
Linear Regression (fitting a straight line)
Polynomial Regression (fitting a polynomial)
Regularized Linear models (constrained models)
Explain the Logistic Regression model.
The Logistic Regression model (also called Logit Regression) is a linear model that solves the classification problem by estimating the probability p (p_hat) that a data instance belongs to a particular class.
It’s most commonly used for binary classification (two possible classes: p >= 0.5 -> instance belongs to class, labelled 1; < 0.5: instance doesn’t belong to class, labelled 0). Example: Is a tumor benign or malignant?
To do this, the model computes a weighted sum of all input features including the bias term (theta^T x, where theta is the parameter vector and x is the feature vector; equivalently theta·x + b, where b is the bias term), and calculates the logistic of this weighted sum: p_hat = sigma(theta^T x).
The logistic sigma is a sigmoid function which takes t (the weighted sum!) as input and outputs a number between 0 and 1 (the probability!): sigma(t) = 1 / (1 + e^(-t)). The steepness of the logistic depends on theta.
[sigmoid function: a smooth, s-shaped curve that takes any real number as input and outputs a value between 0 and 1]
The p_hat function estimates if an instance belongs to class 0 or 1. What the class represents, we have to set for ourselves.
With this function, y_hat, which predicts the class, is: y_hat = 0 if p_hat < 0.5, and y_hat = 1 if p_hat >= 0.5.
Of course, the parameters theta must be learned by the model, which is why it’s a machine learning problem.
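A minimal numpy sketch of the full Logistic Regression prediction (weighted sum -> sigmoid -> threshold); the theta values and the single feature are illustrative:

```python
import numpy as np

# Minimal sketch of the Logistic Regression prediction: weighted sum -> sigmoid -> threshold.
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))        # the logistic: maps any real number to (0, 1)

def predict(theta, X_b):
    p_hat = sigmoid(X_b @ theta)           # estimated probability of belonging to class 1
    return (p_hat >= 0.5).astype(int)      # class 1 if p_hat >= 0.5, otherwise class 0

# Illustrative values: one feature (e.g. tumor size), theta chosen by hand
X_b = np.c_[np.ones(4), np.array([0.5, 1.0, 2.0, 3.0])]
theta = np.array([-4.0, 2.5])              # decision boundary at x = 1.6
print(predict(theta, X_b))                 # [0 0 1 1]
```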
How is a Logistic Regression model trained? Give the equation of the cost function.
A Logistic Regression model, which solves classification problems, estimates a function y_hat which is either 0 or 1 depending on the probability function p_hat; when p_hat < 0.5, y_hat = 0 and otherwise y_hat = 1. The y_hat function estimates to which class a data instance belongs, either class 0 or class 1.
The phat function depends on the input features x and the parameter vector theta. Theta must be learned so that the model outputs either very high probabilities close to 1 or very low probabilities close to 0, so that a classification is more clear (should assign high probabilities to the correct class!).
This is done by using 2 functions:
-log(p_hat) if y = 1 -> the cost grows large for probabilities close to 0 (here, the probability should be close to 1)
-log(1 - p_hat) if y = 0 -> the cost grows large for probabilities close to 1 when it should be 0
These two functions are then combined in a total cost function (the log loss), which computes the average loss (hence the 1/m!) over all training instances: J(theta) = -(1/m) * sum_i [ y^(i) * log(p_hat^(i)) + (1 - y^(i)) * log(1 - p_hat^(i)) ].
This cost function is convex, so the global minimum (global minimum: best theta parameters!) can be found e.g. by computing all partial derivatives and using Batch Gradient Descent.
The found theta values then influence the steepness of the curve of the logistic; a flat transition means the model is uncertain near the decision boundary, a steep transition indicates a clear separation.
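A minimal sketch of this log loss (binary cross-entropy), averaged over all training instances; the labels and probabilities are illustrative:

```python
import numpy as np

# Minimal sketch of the log loss (binary cross-entropy) averaged over all training instances.
def log_loss(y, p_hat, eps=1e-15):
    p_hat = np.clip(p_hat, eps, 1 - eps)                 # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 1, 0, 0])
print(log_loss(y, np.array([0.9, 0.8, 0.1, 0.2])))       # small: high probability for correct class
print(log_loss(y, np.array([0.1, 0.2, 0.9, 0.8])))       # large: confident but wrong
```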
Explain roughly how a machine learning model is trained.
A machine learning model uses theta as a parameter vector to estimate the weight of the input features x.
To know which theta produces the best outcome, a cost function is applied which is high for big mistakes and small for tiny mistakes. To find the best theta, the cost function should be as small as possible.
Due to cost functions being curves (often convex!), the minimum is found by taking the partial derivatives of the cost function which are then used an algorithms like Gradient Descent which decrease the function step by step until it converges to a minimum (goal would be that the partial derivatives are 0!).
What are the differences of (Binary) Logistic Regression and Multinomial Logistic Regression?
(Binary) Logistic Regression puts the data instances into two classes, Multinomial Logistic Regression (also called Softmax Regression) classifies into multiple classes.
Explain Multinomial Logistic Regression.
Multinomial Logistic Regression (also Softmax Regression) is a generalization of Logistic Regression into multiple classes.
For this, a score is computed for each class k: s_k(x) = theta^(k)T x (one parameter vector theta^(k) per class).
Then, the softmax function (also called the normalized exponential) is applied to the scores to estimate the probability that a data instance belongs to each class: p_hat_k = exp(s_k(x)) / sum_j exp(s_j(x)).
The class is predicted by picking the class with the highest estimated probability: y_hat = argmax_k p_hat_k.
The vector of estimated probabilities then looks like [p_hat(class1), p_hat(class2), p_hat(class3), …]. To train the model, the cross entropy cost function is used, which compares these estimated probabilities with the true class of each training instance: J(Theta) = -(1/m) * sum_i sum_k y_k^(i) * log(p_hat_k^(i)).
The cross entropy cost function penalizes the model when it estimates a low probability for the correct class. To train the model, like in Logistic Regression, the gradient vector is computed (here for each class) and Gradient Descent can be used to minimize the cost function.
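A minimal numpy sketch of the softmax step: one score per class, softmax over the scores, then pick the class with the highest probability (the score values are illustrative):

```python
import numpy as np

# Minimal sketch of Softmax Regression prediction: one score per class, softmax over the
# scores, then pick the class with the highest estimated probability.
def softmax(scores):
    exp = np.exp(scores - scores.max())    # subtract the max for numerical stability
    return exp / exp.sum()

# Illustrative scores s_k(x) = theta_k^T x for three classes
scores = np.array([2.0, 1.0, 0.1])
p_hat = softmax(scores)
print(p_hat)                               # e.g. [0.66, 0.24, 0.10] - sums to 1
print(np.argmax(p_hat))                    # predicted class: 0
```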
What cost functions are used for the following problems?
Linear Regression
Logistic Regression
Multinomial Regression
Linear Regression: MSE (mean squared error) or RMSE (root mean squared error)
Logistic Regression: Binary Cross-Entropy (Log loss)
Multinomial Regression: Cross-Entropy
Explain the principle of Support Vector Machine and the two classifications.
Support Vector Machine (SVM) is a model mainly used for binary classification, which can handle non-linear problems with the help of kernels, but it can also be used for linear regression and multi-class classification.
The idea is to create the largest margin possible between the class boundaries, and therefore the best possible line/polynomial (= decision boundary) to separate the classes:
This decision boundary is determined by the data instances closest to it, which are called support vectors. The fitted line is equally influenced by the support vectors on both sides (the margin should be the same size on either side).
There is hard margin classification and soft margin classification:
Hard Margin Classification determines that all training instances must be outside the margin and on the correct side, like on the left picture below. This makes this method sensitive to outliers:
Soft Margin Classification is more flexible and tries to find a good balance between a large margin and a limited number of margin violations. SMC uses a hyperparameter C which determines a penalty in the cost function for each misclassified training sample (small C -> more misclassified samples allowed -> more regularization, which can reduce overfitting):
Therefore, in the (linear) cost function, when a point is outside or on the margin, it doesn’t contribute to the cost; but when it’s in the margin, the cost increases with the distance of the point to the margin (the closer to decision boundary, the higher the cost is).
For non-linear problems, kernel functions need to be used which allow for a non-linear separation but without adding polynomial features. Examples are polynomial and Gaussian Radial Basis Function (RBF).
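A minimal sketch of soft margin classification, assuming scikit-learn is available; the two clusters and the C and gamma values are illustrative:

```python
# Minimal sketch of soft margin classification.
# C controls the trade-off: small C -> wider margin, more margin violations allowed.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))]    # two illustrative clusters
y = np.r_[np.zeros(50), np.ones(50)]

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)                 # linear decision boundary
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)            # non-linear, via the RBF kernel
print(linear_svm.support_vectors_.shape)                           # the instances that define the margin
print(rbf_svm.score(X, y))                                         # training accuracy of the RBF model
```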
What are kernels for SVM? Explain them.
SVM = Support Vector Machine model which can be used for several problems including classification & linear regression.
The kernel functions are needed for non-linear regression & classification when datasets are not linearly separable.
In that case a polynomial combination of the input features would be needed, but explicitly adding those features would be computationally expensive!
Therefore, kernels can be used, which achieve the same effect without explicitly adding the features.
Examples are:
polynomial (adds polynomial features)
RBF (Gaussian Radial Basis Function Kernel)
For RBF, a similarity feature is added which measures how similar each data instance is to a landmark l. This is done with the Gaussian Radial Basis Function (RBF), which is a bell-shaped function ranging in height from 0 (far away from l) to 1 (at l): phi(x, l) = exp(-gamma * ||x - l||^2).
The hyperparameter gamma determines the width of the bell function (narrower with large gamma) and acts like regularization - if the model is overfitting, reduce gamma, if the model is underfitting, increase gamma.
At every instance of the dataset, a landmark is chosen.
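A minimal numpy sketch of this RBF similarity feature (the landmark, instances and gamma values are illustrative):

```python
import numpy as np

# Minimal sketch of the Gaussian RBF similarity feature: how similar is an instance x
# to a landmark l? 1 at the landmark, approaching 0 far away; gamma controls the width.
def rbf(x, landmark, gamma=1.0):
    return np.exp(-gamma * np.sum((x - landmark) ** 2))

landmark = np.array([1.0, 1.0])
print(rbf(np.array([1.0, 1.0]), landmark))          # 1.0  (at the landmark)
print(rbf(np.array([3.0, 3.0]), landmark))          # close to 0 (far away)
print(rbf(np.array([3.0, 3.0]), landmark, gamma=5)) # larger gamma -> narrower bell
```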
What is the difference between parametric and non-parametric machine learning models?
Parametric models like Linear & Logistic Regression summarize the data using a fixed set of parameters, which often is a simplification of the actual data, while non-parametric models like decision trees assume that the data distribution can't be described with a fixed set of parameters, allowing for more complex relationships.
A parametric model doesn't change its number of parameters regardless of how much data is provided; non-parametric models do! They adapt well to the training data but are prone to overfitting if not regularized (in contrast, parametric models are more prone to underfitting!)
Explain the principle of Decision Trees.
Decision trees are non-parametric, supervised machine learning algorithms which check a condition on a feature at each vertex and move on to the next level with every decision - which child comes next depends on whether the condition is true or false (true: left child, false: right child). Doing this creates decision boundaries to differentiate between classes:
The conditions are not chosen by hand; in machine learning, the conditions are learned from training data (supervised learning!). A condition consists of the choice of a feature and a threshold value (e.g. petal length > 3 cm).
To measure the quality of each split (of every decision!), the Gini impurity function is used - it measures the probability of incorrectly classifying a randomly chosen element in the dataset. If it is 0, there is no impurity -> perfect split.
Decision trees are of two types:
Classification
Regression
The classification and regression tree (CART) is an umbrella term referred to both.
How does the Classification and Regression Tree (CART) algorithm work?
The CART algorithm is a decision tree algorithm referring to both classification trees and regression trees. It is not guaranteed to find the optimal solution (which is probably not possible in an efficient way -> NP-complete problem), but produces a reasonably good solution.
It recursively splits the training data set into two subsets based on a feature k and a threshold t. Therefore, it creates binary trees (two children for each node).
Therefore, a pair (k, t) must be found which produces splits with the lowest impurity value (classification) -> the impurity value is calculated for all pairs. This might take a while: for every feature k, there are as many candidate thresholds t as there are data instances in the training set! For regression, the CART algorithm splits the training set in a way which minimizes the MSE.
It stops once it reaches the maximum depth (-> hyperparameter), can't find a split that reduces impurity, or other hyperparameter conditions can't be fulfilled.
The CART cost function for classification calculates the weighted impurities of both sides of a split and therefore determines the condition with the lowest impurity.
For this impurity, either gini or entropy impurity can be used.
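A minimal sketch of training a CART classification tree, assuming scikit-learn is available; the iris dataset, the chosen features and the hyperparameter values are illustrative:

```python
# Minimal sketch of a CART decision tree.
# max_depth and min_samples_leaf are regularization hyperparameters (see the next questions).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, min_samples_leaf=5)
tree.fit(iris.data[:, 2:], iris.target)          # petal length & width as features
print(tree.predict([[5.0, 1.5]]))                # predicted class for one new instance
```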
What is the difference between Gini Impurity and Entropy Impurity?
Gini Impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset.
Entropy, originating from thermodynamics, measures the disorder of molecules. It approaches 0 when molecules are still and well ordered (like ice vs. water; entropy = 0 at 0 K!). In machine learning, a set's entropy is 0 when it contains instances of only one class.
Most of the time, it makes no difference which one is chosen, but:
Gini impurity is slightly faster to compute but tends to isolate the most frequent class, while entropy impurity tends to produce slightly more balanced trees.
List some regularization hyperparameters for Decision Trees. Why are they needed?
Regularization is necessary for Decision Trees because they are non-parametric machine learning models and therefore prone to overfitting if not regularized.
Hyperparameters include:
min. number of samples required to split a vertex
min. number of samples required for a leaf vertex
sum of total weights required for a leaf vertex (here, weights must be provided for each training sample)
max. number of leaf vertices
max. number of features evaluated for splitting a vertex
What is Pruning? Explain.
Pruning is a method of Regularization for Decision trees.
With Pruning, the Decision Tree is trained without restrictions first, and then unnecessary vertices are deleted (pruned).
Unnecessary vertices are internal vertices whose children are all leaf vertices (the children have no children themselves); such a vertex is pruned if the purity improvement it provides is not statistically significant, tested e.g. with a chi²-test!
How is Gini Impurity calculated?
The Gini impurity of a vertex i is G_i = 1 - sum over all classes k of (p_i,k)^2, where p_i,k is the ratio of class-k instances among the training instances in vertex i.
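A minimal sketch of this calculation (the class counts are illustrative):

```python
import numpy as np

# Minimal sketch of the Gini impurity of a vertex: 1 minus the sum of the squared class ratios.
def gini(class_counts):
    p = np.array(class_counts) / np.sum(class_counts)
    return 1.0 - np.sum(p ** 2)

print(gini([50, 0, 0]))    # 0.0   -> pure vertex, perfect split
print(gini([25, 25]))      # 0.5   -> maximally impure for two classes
print(gini([45, 5]))       # 0.18  -> mostly one class
```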
Which of the trees is better and why?
In this case, the 2-layer decision tree is better, because for the last split the impurity value increases (in some children) and the sample size gets very low, which is not ideal because the model then becomes more susceptible to outliers.