What are the types of Machine Learning Algorithms? Explain them shortly.
supervised learning: usage of training data which is labelled to train the algorithm which creates a model (hypothesis). This model can then be used to predict the labels of unlabelled data.
unsupervised learning: The algorithm puts unlabelled data into clusters. It can’t predict the labels without training data, but can cluster the data by similarity. To label the data, it has to be interpreted manually.
reinforcement learning: With reinforcement learning, the algorithm learns from rewards it receives for correct decisions (and penalties for wrong ones). That way, the algorithm develops a strategy to fulfill a goal.
What are the types of Machine Learning Problems? Explain them shortly. Which machine learning algorithms can be used to solve them?
Regression: Regression is a supervised learning problem. It uses the input data to fit a hyperplane in space (2D case: a line) which predicts values based on these inputs. The answer is therefore a continuous value (e.g. y = 2x + 3). It estimates the relationship between two or more variables. Example: house price vs. size. When, in the 2D case, the fitted line is straight, this is called Linear Regression.
Classification: Classification is a supervised learning problem. It uses input data to fit a hyperplane in space which separates two or more classes. With this model, the class of new, unlabelled data can be predicted, like benign vs. malignant tumors, soil type from data etc. The answer is discrete (x is either in class 1 or 2).
Segmentation: Segmentation is an unsupervised learning problem which provides a set of clusters. It groups similar data into clusters (the number of clusters can be set in advance or the algorithm determines it itself). It predicts properties of an instance by its closeness to the centroids of the clusters.
What is Exploratory Data Analysis and which steps does it include?
Exploratory Data Analysis is the first step towards building a machine learning model. It is understanding the data.
It includes:
Data manipulation: normalization of data when dealing with huge datasets, data filtering/cleaning, replace no data values, export to common formats
Data plotting: visualization of data
Descriptive statistics: summarize data in any way (variances, mean values…), quantify dependency of two data sets, for example latitude and temperature
What are the three ways to analyze data in terms of analyzing variables separately or together?
univariate: Each variable is analyzed separately
bivariate: Two variables are analyzed together to look for correlation/separation
multivariate: > 2 variables are analyzed together; here it’s usually difficult to visualize the data and the results
What are the 4 types of variables and what are the possible math operations that can be done to them? How can these 4 variables be generalized?
qualitative variables
nominal (equal or not equal): Examples: Is an area a desert or a forest? -> no ranking applied
ordinal (<, >): Has info on rank/hierarchy, for example: cost of living is high, medium or low; seismic scales
quantitative variables
interval (+,-): includes numerical values and information can be arranged along a scale, e.g. temperature in °C. There are only a few examples of this. The difference to ratio data is that interval data has no natural 0 (20°C is not 2x warmer than 10°C, 0°C doesn’t mean “no temperature”.)
ratio (*,/): Like interval data but with natural 0 (either physical or convention like for elevation). Examples: precipitation, elevation, load capacity of roads, temp in Kelvin (because 0 K = particles have no thermal energy, therefore no “temperature”, 200K is twice as hot as 100K in terms of particle energy). This is the most informative scale.
Explain the Linear Regression model.
The Linear Regression model is a linear model that predicts a target (continuous) value by computing a weighted sum of the input parameters (called features) plus the intercept term (also called bias term). The weights are the model parameters theta, and the intercept term is theta0.
This structure comes from the general formula of a straight line, which is y = ax + b, where b is the bias term and a is the model parameter which determines the slope of the line. These are the values to be determined and fit best to the training data.
The function y_hat is then called the hypothesis function: y_hat = theta_0 + theta_1*x_1 + … + theta_n*x_n = theta^T x.
With this, there would be one more theta value than x values due to the intercept term, which is why the feature vector must get one extra feature x_0 = 1.
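A minimal numpy sketch of this hypothesis function (the data values are purely illustrative):

```python
import numpy as np

# Minimal sketch of the linear hypothesis function y_hat = theta^T x.
# A constant feature x0 = 1 is prepended so the intercept theta0 fits into the same dot product.
X = np.array([[1.0], [2.0], [3.0]])        # one feature, three instances (illustrative values)
X_b = np.c_[np.ones((X.shape[0], 1)), X]   # add x0 = 1 to each instance
theta = np.array([3.0, 2.0])               # [theta0 (intercept), theta1 (slope)]

y_hat = X_b @ theta                        # weighted sum per instance
print(y_hat)                               # [5. 7. 9.]  -> y = 2x + 3
```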
How do we know which model fits the training data the best? Explain the function.
With help of the cost function. There are several cost functions; the most common one is the Root Mean Square Error (RMSE), but the Mean Square Error (MSE) is actually better in this case because it takes less computation time due to the absence of the root. Otherwise, minimizing either function leads to the same result. They look like this: MSE(theta) = (1/m) * sum_i (theta^T x^(i) - y^(i))^2 and RMSE(theta) = sqrt(MSE(theta)).
In words:
The Mean Square Error is a cost function which calculates the mean of the squared difference between the hypothesis function and the target values of the input.
The result is a scalar. The bigger the difference (=error) between the hypothesis function and the target (=actual) values, the bigger is the MSE. So, if the MSE decreases, the model fits better.
The best fitting model is therefore a model which produces the smallest MSE. This provides the best set of theta (theta_hat) which minimizes the cost function.
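A minimal sketch of the MSE as a cost function, reusing the X_b / theta notation from the hypothesis sketch above (values are illustrative):

```python
import numpy as np

# Minimal sketch: Mean Square Error between predictions and target values.
def mse(theta, X_b, y):
    # X_b already contains the constant feature x0 = 1 (see the sketch above)
    errors = X_b @ theta - y               # difference hypothesis - target, per instance
    return np.mean(errors ** 2)            # average of the squared errors -> scalar

# Example (illustrative values): a perfect fit gives MSE = 0
X_b = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([5.0, 7.0, 9.0])
print(mse(np.array([3.0, 2.0]), X_b, y))   # 0.0
print(mse(np.array([0.0, 1.0]), X_b, y))   # larger error -> worse fit
```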
Why do we square the error of the MSE/RMSE?
So that every error is positive.
Errors smaller than 1 shrink when squared, while larger errors have a much bigger impact on the MSE, which determines how well the linear regression model fits. This is important because a model which makes a lot of small errors is preferred over a model that fits well most of the time but sometimes makes big errors.
What do we need to watch out for when computing a Normal Equation?
The Normal Equation requires inverting an n x n matrix (where n is the number of features), which has a time complexity of about O(n^2.4) to O(n³). This means that if we double the number of features (variables) n in our model, the computation takes roughly 2^2.4 ≈ 5 to 2³ = 8 times longer. This is something we need to keep in mind.
In contrast, the Normal Equation is linear in the number of training data instances m.
This means that in a dataset of, say, trees, it's no problem to have lots of trees in the dataset, but if we account not only for height and trunk width but additionally for leaf size and bark thickness, then the computation time for the Normal Equation increases a lot.
If the Normal Equation can't be computed easily or the dataset is very large, a method like Gradient Descent can be used.
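A minimal numpy sketch of the closed-form Normal Equation theta_hat = (X^T X)^(-1) X^T y (the data is illustrative, generated around y = 2x + 3):

```python
import numpy as np

# Minimal sketch of the closed-form Normal Equation: theta_hat = (X^T X)^(-1) X^T y.
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))                     # one feature, 100 instances
y = 3 + 2 * X[:, 0] + rng.normal(0, 0.1, 100)    # noisy data around y = 2x + 3

X_b = np.c_[np.ones((100, 1)), X]                # add the constant feature x0 = 1
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)                                 # approximately [3, 2]

# np.linalg.lstsq or the pseudoinverse (np.linalg.pinv) are the numerically safer choices.
```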
Explain the principle of Gradient Descent.
Gradient Descent is an optimization algorithm for finding optimal solutions (best parameters) for a wide range of problems by minimizing a function (often a cost function).
It starts by initializing theta with (small) random values and gradually improving theta by decreasing the cost function, similar to Least Squares Adjustment. The function stops when theta converges to a minimum.
For that, the Gradient Vector has to be computed which is essentially just a vector of the partial derivatives of the cost function in regard to theta. This is done in order to determine how much the cost function changes (goal is for the derivative to be 0! -> find minimum).
When updating theta, the goal is to update it in a direction that reduces the cost function. This is done with the following equation: theta_(next step) = theta - eta * ∇_theta J(theta), where eta is the learning rate and ∇_theta J(theta) is the gradient vector of the cost function.
The learning rate determines the step size, so it shouldn't be too small to avoid too many iterations, but it also shouldn't be too high because then the algorithm might diverge and a good solution may never be found. To find a good learning rate, grid search can be used (systematically trying out several candidate values).
When using the Gradient Descent, what could happen when the cost function is not convex?
A function is convex if a line segment between any two points on the curve never crosses the curve. Therefore, a convex function has no local minima or plateaus.
A non-convex function might have local minima or plateaus. The solution might get stuck in a local minimum or stop on a plateau.
To avoid that, one can use convex functions. Other solutions can be to choose different initial values or increase the learning rate so small ridges can be “overjumped”.
What happens in Linear Regression when the cost function has two or more parameters? How does it look like, what problems can arise?
The cost function is then no longer a curve but a surface (or a higher-dimensional shape).
This cost function can be elongated in one direction if the features have different scales (e.g. a grade ranging from 1 to 5, but height ranging from 1.50 m to 2.50 m). The features should have similar scales or convergence will take longer!
Therefore, feature scaling can be used to, for example, scale every feature from 0-1.
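A minimal sketch of min-max feature scaling (the two features and their values are illustrative):

```python
import numpy as np

# Minimal sketch of min-max feature scaling: every feature is mapped to the range 0-1,
# so features with very different scales (e.g. grade 1-5 vs. height 1.50-2.50 m)
# no longer elongate the cost function.
X = np.array([[1.0, 1.50],
              [3.0, 1.80],
              [5.0, 2.50]])                 # illustrative values: grade, height in m

X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)    # each column now runs from 0 to 1
print(X_scaled)
```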
Explain the Gradient Descent step equation.
The Gradient Descent step equation looks like this: theta_(next step) = theta - eta * ∇_theta MSE(theta), where the gradient vector ∇_theta MSE(theta) = (2/m) * X^T (X theta - y) contains all the partial derivatives of the cost function.
It's based on the Mean Square Error cost function, which is used to estimate how well a model fits the training data.
With this, the equation determines where the next point on the cost function is. The learning rate eta decides the step size; a step size that's too small needs too many iterations, a step size that's too big could cause the algorithm to diverge instead of converge, never finding a minimum.
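A minimal numpy sketch of Batch Gradient Descent using exactly this step equation (learning rate, iteration count and data are illustrative):

```python
import numpy as np

# Minimal sketch of Batch Gradient Descent for Linear Regression (MSE cost).
# Gradient of the MSE: (2/m) * X^T (X theta - y); step: theta <- theta - eta * gradient.
def batch_gradient_descent(X_b, y, eta=0.1, n_iterations=1000):
    m = len(y)
    theta = np.random.randn(X_b.shape[1])          # random initialization
    for _ in range(n_iterations):
        gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
        theta = theta - eta * gradients            # step in the direction that reduces the cost
    return theta

# Illustrative data around y = 2x + 3 (same setup as above)
X = np.linspace(0, 2, 50)
X_b = np.c_[np.ones(50), X]
y = 3 + 2 * X
print(batch_gradient_descent(X_b, y))              # approximately [3, 2]
```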
What are the different types of Gradient Descent? Explain the differences between them.
There are three types of Gradient Descent. The difference between them is how much data they use to compute the gradient at each step.
Batch Gradient Descent:
Uses all the training data to compute the gradient before updating -> slow on large training sets, but scales well with the number of features.
Each step is one epoch (=one complete pass through the entire dataset)
Stochastic Gradient Descent:
picks one random training data instance at every step and computes the gradient on only that instance -> is faster, but might have problems with blunders
jumps around a lot -> might help escape local minima!, but ends up very close to the minimum without reaching it
is good for huge datasets that don't fit into memory, as only one instance is needed for every step. One says: if there are m training instances, Stochastic Gradient Descent has seen one epoch after m steps (even though that may not actually be true, because it picks instances at random!)
to lower the drawbacks of randomness: start training with a large learning rate to escape local minima, then gradually reduce it to settle at the global minimum. This is called simulated annealing (see the sketch after this list).
Mini-Batch Gradient Descent:
gradient is computed on small sets of random instances, called mini-batches; the algorithm doesn't see all the data at once, therefore fewer problems with huge datasets
will be closer to minimum than Stochastic Gradient Descent
jumps around less -> harder to escape from local minima
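A minimal numpy sketch of Stochastic Gradient Descent with the simple learning schedule ("simulated annealing") mentioned above; the schedule parameters t0/t1 and the data are illustrative:

```python
import numpy as np

# Minimal sketch of Stochastic Gradient Descent with a shrinking learning rate:
# large steps at first to escape local minima, gradually smaller steps to settle down.
def sgd(X_b, y, n_epochs=50, t0=5, t1=50):
    m = len(y)
    theta = np.random.randn(X_b.shape[1])
    for epoch in range(n_epochs):
        for i in range(m):                         # one "epoch" = m random picks
            idx = np.random.randint(m)             # pick one instance at random
            xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
            gradients = 2 * xi.T @ (xi @ theta - yi)
            eta = t0 / (epoch * m + i + t1)        # learning rate shrinks over time
            theta = theta - eta * gradients
    return theta

# Illustrative data around y = 2x + 3
X = np.linspace(0, 2, 100)
X_b = np.c_[np.ones(100), X]
y = 3 + 2 * X
print(sgd(X_b, y))                                 # approximately [3, 2]
```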
What are hyperparameters? Which hyperparameters are needed in the different Gradient Descent types?
Hyperparameters are settings to choose before training and therefore are not learned from the data.
For the Batch Gradient Descent, 2 are needed:
Learning rate
Number of iterations/convergence tolerance (how long to train)
For the other two Gradient Descent types:
Batch size (mini Batch)
Number of epochs
and possibly some others
What is the difference between Linear and Polynomial regression, and are they linear or non-linear?
In Linear Regression, a straight line is fitted to the training data, while for Polynomial Regression it's a polynomial. However, both are still linear models, because even for Polynomial Regression the equation is a weighted sum of input parameters (the model is linear in the weights).
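A minimal sketch of Polynomial Regression: x^2 is added as an extra feature and the weights are then solved like in a plain linear model (the quadratic example data is illustrative):

```python
import numpy as np

# Minimal sketch of Polynomial Regression: add x^2 as an extra feature, then solve a
# plain linear model - the prediction is still a weighted sum of (now polynomial) features.
rng = np.random.default_rng(1)
X = 6 * rng.random(100) - 3
y = 0.5 * X ** 2 + X + 2 + rng.normal(0, 0.1, 100)   # illustrative quadratic data

X_poly = np.c_[np.ones(100), X, X ** 2]              # features: 1, x, x^2
theta_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(theta_hat)                                      # approximately [2, 1, 0.5]
```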
What happens when the regression model fits the data too well and why?
When the model fits the input data too well, this is called overfitting, which means that the model might be very good on the training data but not so much on new data.
Here is an example:
This can be tested with validation.
Explain how Validation works. How can you, from a graph, estimate whether a model is under- or overfitting?
Validation is a technique for checking whether the model is under- or overfitting the data by adjusting hyperparameters.
Validation is performed with data that's not used for training and is instead reserved for validation. Because the model is trained on the training data, it will always perform worse on validation data - meaning the errors are larger. However, by looking at a graph which displays the error vs. the training set size, it can be estimated whether the model is underfitting, overfitting or fitting well.
A big gap between the validation & training data means overfitting, a large error on both means underfitting.
An example here:
Note: In some of the graphs above, the training data curve doesn't go up at 0. This is because for the 2nd graph a 10th-degree model is used, meaning it can fit the first few training instances perfectly (a 10th-degree polynomial can fit up to 11 points exactly) before producing an error. Same thing for the third graph.
What is Cross-Validation? Explain it shortly.
Cross-Validation is a technique used to assess how well a model performs with unseen data. The difference to “normal” validation is that for Cross-Validation, the data isn’t split into train and validation data once, but instead several times in different ways.
The model gets trained on each and the results are averaged.
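A minimal sketch of k-fold cross-validation, assuming scikit-learn is available (the model, scoring choice and data are illustrative):

```python
# Minimal sketch of k-fold Cross-Validation: the data is split into k folds, the model is
# trained k times, each time validated on a different fold, and the scores are averaged.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.random((100, 1))
y = 3 + 2 * X[:, 0] + rng.normal(0, 0.1, 100)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)  # 5 different splits
print(-scores.mean())                                             # average MSE over the folds
```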
To estimate the best possible model, into which categories is the data of a dataset split?
training data: used to fit the model
validation data: used to adjust hyperparameters like degree for polynomial regression
test data: used to test the model in the end. The test error is different from the error the model gets on training data, as that is data it has already seen. Instead, test data checks how well the model generalizes to unseen data.
What can be done to minimize overfitting for polynomial regression?
Apply regularization, for example Ridge Regression or Lasso Regression.
Explain Regularization models for linear models. Why do we use them, which types did we learn about, how do they work?
Regularization models are used to reduce overfitting by constraining the model. In the case of polynomial regression, a high degree is often the cause for overfitting, but because it’s a hyperparameter, the algorithm can’t reduce the degree while training the model. Therefore, Regularization methods can be applied.
Ridge Regression, sometimes called L2 regularization; a good default. It introduces a new hyperparameter (often called alpha) which controls how much the model is regularized. The regularization term, which contains the hyperparameter and the sum of the squared weights (without the bias!), is added to the cost function: J(theta) = MSE(theta) + alpha * sum_i theta_i^2. It keeps the model weights as small as possible, reducing big curves like here:
Lasso Regression: has the same kind of hyperparameter as Ridge Regression, but adds the l1 norm of the weight vector (the sum of the absolute weights, alpha * sum_i |theta_i|) to the cost function as the regularization term instead. It tends to eliminate the weights of the least important features by setting them to 0 and therefore performs feature selection. Useful when we know that only a few features are actually useful. This is seen here (and in the sketch after this list):
Elastic Net: Regularization term is a mix of both Ridge and Lasso regularization; preferred over Lasso Regularization because Lasso reg. might behave erratically in some cases
another method: Early Stopping - stop training as soon as the validation error reaches its minimum so that the model can't overfit at all. See here:
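A minimal sketch of Ridge and Lasso regression, assuming scikit-learn is available; the alpha values and the data (where only the first feature actually matters) are illustrative:

```python
# Minimal sketch of regularized linear models.
# alpha is the regularization hyperparameter: the larger it is, the smaller the weights are kept.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
X = rng.random((100, 3))
y = 4 * X[:, 0] + rng.normal(0, 0.1, 100)  # only feature 0 matters, features 1 and 2 are noise

print(Ridge(alpha=1.0).fit(X, y).coef_)    # all weights shrunk, none exactly 0
print(Lasso(alpha=0.1).fit(X, y).coef_)    # unimportant weights driven to 0 (feature selection)
```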
What are the three regression models we learned about in this class?
Regression models = Fit a hyperplane to input data, output: continuous value
The models we learned about were:
Linear Regression (fitting a straight line)
Polynomial Regression (fitting a polynomial)
Regularized Linear models (constrained models)
Explain the Logistic Regression model.
The Logistic Regression model (also called Logit Regression) is a linear model that solves the classification problem by estimating the probability p (p_hat) that a data instance belongs to a particular class.
It’s most commonly used for binary classification (two possible classes: p >= 0.5 -> instance belongs to class, labelled 1; < 0.5: instance doesn’t belong to class, labelled 0). Example: Is a tumor benign or malignant?
To do this, the model computes a weighted sum of all input features including the bias term (theta^T x, where theta is the parameter vector and x is the feature vector; equivalently theta·x + b, where b is the bias term), and calculates the logistic of this weighted sum: p_hat = sigma(theta^T x).
The logistic sigma is a sigmoid function which takes t (the weighted sum!) as input and outputs a number between 0 and 1 (the probability!): sigma(t) = 1 / (1 + e^(-t)). The steepness of the logistic depends on theta.
[sigmoid function: a smooth, s-shaped curve that takes any real number as input and outputs a value between 0 and 1]
The p_hat function estimates if an instance belongs to class 0 or 1. What the class represents, we have to set for ourselves.
With this function, y_hat, which predicts the class, is: y_hat = 0 if p_hat < 0.5, and y_hat = 1 if p_hat >= 0.5.
Of course, the parameters theta must be learned by the model, which is why it’s a machine learning problem.
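A minimal numpy sketch of the full Logistic Regression prediction (weighted sum -> sigmoid -> threshold); the theta values and the single feature are illustrative:

```python
import numpy as np

# Minimal sketch of the Logistic Regression prediction: weighted sum -> sigmoid -> threshold.
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))        # the logistic: maps any real number to (0, 1)

def predict(theta, X_b):
    p_hat = sigmoid(X_b @ theta)           # estimated probability of belonging to class 1
    return (p_hat >= 0.5).astype(int)      # class 1 if p_hat >= 0.5, otherwise class 0

# Illustrative values: one feature (e.g. tumor size), theta chosen by hand
X_b = np.c_[np.ones(4), np.array([0.5, 1.0, 2.0, 3.0])]
theta = np.array([-4.0, 2.5])              # decision boundary at x = 1.6
print(predict(theta, X_b))                 # [0 0 1 1]
```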
How is a Logistic Regression model trained? Give the equation of the cost function.
A Logistic Regression model, which solves classification problems, estimates a function y_hat which is either 0 or 1 depending on the probability function p_hat; when p_hat < 0.5, y_hat = 0 and otherwise y_hat = 1. The y_hat function estimates to which class a data instance belongs, either class 0 or class 1.
The phat function depends on the input features x and the parameter vector theta. Theta must be learned so that the model outputs either very high probabilities close to 1 or very low probabilities close to 0, so that a classification is more clear (should assign high probabilities to the correct class!).
This is done by using 2 functions:
-log(p_hat) if y = 1 -> the cost grows large for probabilities close to 0 (here, the probability should be close to 1)
-log(1 - p_hat) if y = 0 -> the cost grows large for probabilities close to 1 when it should be 0
These two functions are then combined in a total cost function (the log loss), which computes the average loss (hence the 1/m!) over all training instances: J(theta) = -(1/m) * sum_i [ y^(i) * log(p_hat^(i)) + (1 - y^(i)) * log(1 - p_hat^(i)) ].
This cost function is convex, so the global minimum (global minimum: best theta parameters!) can be found e.g. by computing all partial derivatives and using Batch Gradient Descent.
The found theta values then influence the steepness of the curve of the logistic; a flat transition means the model is uncertain near the decision boundary, a steep transition indicates a clear separation.
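A minimal sketch of this log loss (binary cross-entropy), averaged over all training instances; the labels and probabilities are illustrative:

```python
import numpy as np

# Minimal sketch of the log loss (binary cross-entropy) averaged over all training instances.
def log_loss(y, p_hat, eps=1e-15):
    p_hat = np.clip(p_hat, eps, 1 - eps)                 # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 1, 0, 0])
print(log_loss(y, np.array([0.9, 0.8, 0.1, 0.2])))       # small: high probability for correct class
print(log_loss(y, np.array([0.1, 0.2, 0.9, 0.8])))       # large: confident but wrong
```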
Explain roughly how a machine learning model is trained.
A machine learning model uses theta as a parameter vector to estimate the weight of the input features x.
To know which theta produces the best outcome, a cost function is applied which is high for big mistakes and small for tiny mistakes. To find the best theta, the cost function should be as small as possible.
Due to cost functions being curves (often convex!), the minimum is found by taking the partial derivatives of the cost function which are then used an algorithms like Gradient Descent which decrease the function step by step until it converges to a minimum (goal would be that the partial derivatives are 0!).
What are the differences of (Binary) Logistic Regression and Multinomial Logistic Regression?
(Binary) Logistic Regression puts the data instances into two classes, Multinomial Logistic Regression (also called Softmax Regression) classifies into multiple classes.
Explain Multinomial Logistic Regression.
Multinomial Logistic Regression (also Softmax Regression) is a generalization of Logistic Regression into multiple classes.
For this, a score is computed for each class k: s_k(x) = theta^(k)T x (one parameter vector theta^(k) per class).
Then, the softmax function (also called the normalized exponential) is applied to the scores to estimate the probability that a data instance belongs to each class: p_hat_k = exp(s_k(x)) / sum_j exp(s_j(x)).
The class is predicted by picking the class with the highest estimated probability: y_hat = argmax_k p_hat_k.
The vector of estimated probabilities then looks like [p_hat(class1), p_hat(class2), p_hat(class3), …]. To train the model, the cross entropy cost function is used, which compares these estimated probabilities with the true class of each training instance: J(Theta) = -(1/m) * sum_i sum_k y_k^(i) * log(p_hat_k^(i)).
The cross entropy cost function penalizes the model when it estimates a low probability for the correct class. To train the model, like in Logistic Regression, the gradient vector is computed (here for each class) and Gradient Descent can be used to minimize the cost function.
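A minimal numpy sketch of the softmax step: one score per class, softmax over the scores, then pick the class with the highest probability (the score values are illustrative):

```python
import numpy as np

# Minimal sketch of Softmax Regression prediction: one score per class, softmax over the
# scores, then pick the class with the highest estimated probability.
def softmax(scores):
    exp = np.exp(scores - scores.max())    # subtract the max for numerical stability
    return exp / exp.sum()

# Illustrative scores s_k(x) = theta_k^T x for three classes
scores = np.array([2.0, 1.0, 0.1])
p_hat = softmax(scores)
print(p_hat)                               # e.g. [0.66, 0.24, 0.10] - sums to 1
print(np.argmax(p_hat))                    # predicted class: 0
```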
What cost functions are used for the following problems?
Linear Regression
Logistic Regression
Multinomial Regression
Linear Regression: MSE (mean squared error) or RMSE (root mean squared error)
Logistic Regression: Binary Cross-Entropy (Log loss)
Multinomial Regression: Cross-Entropy
Explain the principle of Support Vector Machine and the two classifications.
Support Vector Machine (SVM) is a model mainly used for binary classification, which can handle non-linear problems with the help of kernels, but it can also be used for linear regression and multi-class classification.
The idea is to create the largest margin possible between the class boundaries, and therefore the best possible line/polynomial (= decision boundary) to separate the classes:
This decision boundary is determined by the data instances closest to it, which are called support vectors. The fitted line is equally influenced by the support vectors on both sides (the margin should be the same size on either side).
There is hard margin classification and soft margin classification:
Hard Margin Classification determines that all training instances must be outside the margin and on the correct side, like on the left picture below. This makes this method sensitive to outliers:
Soft Margin Classification is more flexible and tries to find a good balance between a large margin and a limited number of margin violations. SMC uses a hyperparameter C which determines a penalty in the cost function for each misclassified training sample (small C -> more misclassified samples allowed -> more regularization, which can reduce overfitting):
Therefore, in the (linear) cost function, when a point is outside or on the margin, it doesn’t contribute to the cost; but when it’s in the margin, the cost increases with the distance of the point to the margin (the closer to decision boundary, the higher the cost is).
For non-linear problems, kernel functions need to be used which allow for a non-linear separation but without adding polynomial features. Examples are polynomial and Gaussian Radial Basis Function (RBF).
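A minimal sketch of soft margin classification, assuming scikit-learn is available; the two clusters and the C and gamma values are illustrative:

```python
# Minimal sketch of soft margin classification.
# C controls the trade-off: small C -> wider margin, more margin violations allowed.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))]    # two illustrative clusters
y = np.r_[np.zeros(50), np.ones(50)]

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)                 # linear decision boundary
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)            # non-linear, via the RBF kernel
print(linear_svm.support_vectors_.shape)                           # the instances that define the margin
print(rbf_svm.score(X, y))                                         # training accuracy of the RBF model
```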
What are kernels for SVM? Explain them.
SVM = Support Vector Machine model which can be used for several problems including classification & linear regression.
The kernel functions are needed for non-linear regression & classification when datasets are not linearly separable.
In that case a polynomial combination of the input features would be needed, but explicitly adding those features would be computationally expensive!
Therefore, kernels can be used, which achieve the same effect without explicitly adding the features.
Examples are:
polynomial (adds polynomial features)
RBF (Gaussian Radial Basis Function Kernel)
For RBF, a similarity feature is added which measures how similar each data instance is to a landmark l. This is done with the Gaussian Radial Basis Function (RBF), which is a bell-shaped function ranging in height from 0 (far away from l) to 1 (at l): phi(x, l) = exp(-gamma * ||x - l||^2).
The hyperparameter gamma determines the width of the bell function (narrower with large gamma) and acts like regularization - if the model is overfitting, reduce gamma, if the model is underfitting, increase gamma.
At every instance of the dataset, a landmark is chosen.
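A minimal numpy sketch of this RBF similarity feature (the landmark, instances and gamma values are illustrative):

```python
import numpy as np

# Minimal sketch of the Gaussian RBF similarity feature: how similar is an instance x
# to a landmark l? 1 at the landmark, approaching 0 far away; gamma controls the width.
def rbf(x, landmark, gamma=1.0):
    return np.exp(-gamma * np.sum((x - landmark) ** 2))

landmark = np.array([1.0, 1.0])
print(rbf(np.array([1.0, 1.0]), landmark))          # 1.0  (at the landmark)
print(rbf(np.array([3.0, 3.0]), landmark))          # close to 0 (far away)
print(rbf(np.array([3.0, 3.0]), landmark, gamma=5)) # larger gamma -> narrower bell
```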
What is the difference between parametric and non-parametric machine learning models?
Parametric models like Linear & Logistic Regression summarize the data using a fixed set of parameters, which often is a simplification of the actual data, while non-parametric models like decision trees assume that the data distribution can't be described with a fixed set of parameters, allowing for more complex relationships.
A parametric model doesn't change its number of parameters regardless of how much data is provided; non-parametric models do! They adapt well to the training data but are prone to overfitting if not regularized (in contrast, parametric models are more prone to underfitting!)
Explain the principle of Decision Trees.
Decision trees are non-parametric, supervised machine learning algorithms which check a condition on a feature at each vertex and move on to the next level with every decision - which child comes next depends on whether the condition is true or false (true: left child, false: right child). Doing this creates decision boundaries to differentiate between classes:
The conditions are not chosen by hand; in machine learning, the conditions are learned from training data (supervised learning!). A condition consists of the choice of a feature and a threshold value (e.g. petal length > 3 cm).
To measure the quality of each split (of every decision!), the Gini impurity function is used - it measures the probability of incorrectly classifying a randomly chosen element in the dataset. If it is 0, there is no impurity -> perfect split.
Decision trees are of two types:
Classification
Regression
The classification and regression tree (CART) is an umbrella term referred to both.
How does the Classification and Regression Tree (CART) algorithm work?
The CART algorithm is a decision tree algorithm referring to both classification trees and regression trees. It is not guaranteed to find the optimal solution (which is probably not possible in an efficient way -> NP-complete problem), but produces a reasonably good solution.
It recursively splits the training data set into two subsets based on a feature k and a threshold t. Therefore, it creates binary trees (two children for each node).
Therefore, a pair (k, t) must be found which produces splits with the lowest impurity value (classification) -> the impurity value is calculated for all pairs. This might take a while: for every feature k, there are as many candidate thresholds t as there are data instances in the training set! For regression, the CART algorithm splits the training set in a way which minimizes the MSE.
It stops once it reaches the maximum depth (-> hyperparameter), can't find a split that reduces impurity, or other hyperparameter conditions can't be fulfilled.
The CART cost function for classification calculates the weighted impurities of both sides of a split and therefore determines the condition with the lowest impurity.
For this impurity, either gini or entropy impurity can be used.
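A minimal sketch of training a CART classification tree, assuming scikit-learn is available; the iris dataset, the chosen features and the hyperparameter values are illustrative:

```python
# Minimal sketch of a CART decision tree.
# max_depth and min_samples_leaf are regularization hyperparameters (see the next questions).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, min_samples_leaf=5)
tree.fit(iris.data[:, 2:], iris.target)          # petal length & width as features
print(tree.predict([[5.0, 1.5]]))                # predicted class for one new instance
```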
What is the difference between Gini Impurity and Entropy Impurity?
Gini Impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset.
Entropy, originating from thermodynamics, measures the disorder of molecules. It approaches 0 when molecules are still and well ordered (like ice vs. water; entropy = 0 at 0 K!). In machine learning, a set's entropy is 0 when it contains instances of only one class.
Most of the time, it makes no difference which one is chosen, but:
Gini impurity is slightly faster to compute but tends to isolate the most frequent class, while entropy impurity tends to produce slightly more balanced trees.
List some regularization hyperparameters for Decision Trees. Why are they needed?
Regularization is necessary for Decision Trees because they are non-parametric machine learning models and therefore prone to overfitting if not regularized.
Hyperparameters include:
min. number of samples required to split a vertex
min. number of samples required for a leaf vertex
sum of total weights required for a leaf vertex (here, weights must be provided for each training sample)
max. number of leaf vertices
max. number of features evaluated for splitting a vertex
What is Pruning? Explain.
Pruning is a method of Regularization for Decision trees.
With Pruning, the Decision Tree is trained without restrictions first, and then unnecessary vertices are deleted (pruned).
Unnecessary vertices are internal vertices whose children are all leaf vertices (the children have no children themselves); such a vertex is pruned if the purity improvement it provides is not statistically significant, tested e.g. with a chi²-test!
How is Gini Impurity calculated?
The Gini impurity of a vertex i is G_i = 1 - sum over all classes k of (p_i,k)^2, where p_i,k is the ratio of class-k instances among the training instances in vertex i.
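A minimal sketch of this calculation (the class counts are illustrative):

```python
import numpy as np

# Minimal sketch of the Gini impurity of a vertex: 1 minus the sum of the squared class ratios.
def gini(class_counts):
    p = np.array(class_counts) / np.sum(class_counts)
    return 1.0 - np.sum(p ** 2)

print(gini([50, 0, 0]))    # 0.0   -> pure vertex, perfect split
print(gini([25, 25]))      # 0.5   -> maximally impure for two classes
print(gini([45, 5]))       # 0.18  -> mostly one class
```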
Which of the trees is better and why?
In this case, the 2-layer decision tree is better, because for the last split the impurity value increases (in some children) and the sample size gets very low, which is not ideal because the model then becomes more susceptible to outliers.