Geo Data Science

by Julia S.

What are the different types of Gradient Descent? Explain the differences between them.

There are three types of Gradient Descent. The difference between them is how much data they use to compute the gradient at each step.


Batch Gradient Descent:

  • Uses the whole training set to compute the gradient before each update -> slow on large training sets, but scales well with the number of features.

  • Each step is one epoch (=one complete pass through the entire dataset)
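
A minimal sketch of the idea, assuming a simple linear-regression (MSE) cost and made-up data; the values of eta and n_epochs are illustrative, not from the card:

```python
import numpy as np

# illustrative data: y ≈ 4 + 3x plus noise (assumed)
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))

X_b = np.c_[np.ones((100, 1)), X]   # add bias column
eta = 0.1                            # learning rate (assumed)
n_epochs = 1000
m = len(X_b)

theta = rng.normal(size=(2, 1))      # random initialization
for epoch in range(n_epochs):
    # gradient of the MSE computed on the FULL training set -> one update per epoch
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients
```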


Stochastic Gradient Descent:

  • picks one random training instance at every step and computes the gradient on only that instance -> much faster, but the updates are noisy and irregular

  • jumps around a lot -> might help escape local minima!, but it only ends up very close to the minimum without ever settling exactly on it

  • is good for huge datasets that don’t fit into memory, since only one instance is needed per step. By convention, if there are x training instances, Stochastic Gradient Descent is said to have completed one epoch after x steps (even though that need not be exactly true, because instances are picked at random!)

  • to reduce the drawbacks of this randomness: start training with a large learning rate to escape local minima, then gradually reduce the learning rate so the algorithm settles at the global minimum. This process is akin to simulated annealing
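
A minimal sketch of Stochastic Gradient Descent with a simple learning schedule that shrinks the learning rate over time; the data, t0, t1 and the epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)
n_epochs = 50

# learning schedule: large steps at first, gradually smaller steps later
t0, t1 = 5, 50
def learning_schedule(t):
    return t0 / (t + t1)

theta = rng.normal(size=(2, 1))
for epoch in range(n_epochs):
    for i in range(m):                       # m steps = "one epoch" by convention
        idx = rng.integers(m)                # one random instance per step
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```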


Mini-Batch Gradient Descent:

  • gradient is computed on small sets of random instances, called mini-batches; the algorithm only works on a small subset of the data at each step, therefore fewer problems with huge datasets

  • ends up closer to the minimum than Stochastic Gradient Descent

  • jumps around less -> harder to escape from local minima
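
A minimal sketch of Mini-Batch Gradient Descent on the same illustrative setup; the batch size and learning rate are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)
n_epochs, batch_size, eta = 50, 20, 0.1

theta = rng.normal(size=(2, 1))
for epoch in range(n_epochs):
    perm = rng.permutation(m)                   # shuffle the indices once per epoch
    for start in range(0, m, batch_size):
        batch = perm[start:start + batch_size]  # one small random subset per step
        xb, yb = X_b[batch], y[batch]
        gradients = 2 / len(batch) * xb.T @ (xb @ theta - yb)
        theta = theta - eta * gradients
```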



Explain the principle of Support Vector Machine and the two classifications.

A Support Vector Machine (SVM) is a binary classifier that can handle non-linear problems by using kernels, but it can also be used for (linear and non-linear) regression and for multi-class classification.

The idea is to create the largest possible margin between the classes, and therefore the best possible line/polynomial (= decision boundary) to separate them.

This decision boundary is determined by the data instances closest to it, which are called support vectors. The fitted boundary is equally influenced by the support vectors on both sides (the margin should be the same size on either side).

There is hard margin classification and soft margin classification:

Hard Margin Classification requires that all training instances lie outside the margin and on the correct side. This makes the method sensitive to outliers.

Soft Margin Classification is more flexible and tries to find a good balance between a large margin and a limited number of margin violations. It uses a hyperparameter C which determines a penalty in the cost function for each misclassified training sample (small C -> more misclassified samples allowed -> more regularization, which can reduce overfitting).
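
A minimal sketch of soft margin classification with scikit-learn's LinearSVC, assuming the Iris dataset as an illustrative binary task; the chosen C values are only examples:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# illustrative binary task: Iris virginica vs. the rest, using two features
X, y = load_iris(return_X_y=True)
X = X[:, 2:]                 # petal length, petal width
y = (y == 2).astype(int)

# small C -> wide margin, more violations allowed (more regularization);
# large C -> narrow margin, fewer violations, higher risk of overfitting
for C in (0.01, 100):
    clf = make_pipeline(StandardScaler(), LinearSVC(C=C, loss="hinge", max_iter=10_000))
    clf.fit(X, y)
    print(C, clf.score(X, y))
```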


In the (linear) hinge cost function, a point that lies outside or exactly on the margin does not contribute to the cost; a point inside the margin (or on the wrong side) contributes a cost that grows with its distance from the margin border (the closer it is to, or the further it is past, the decision boundary, the higher the cost).
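
A tiny illustration of that hinge-style cost, assuming labels coded as +1/-1 and decision scores s = w·x + b:

```python
import numpy as np

def hinge_loss(scores, labels):
    # zero cost when the instance is outside or on the margin (labels * scores >= 1),
    # linearly increasing cost the deeper it lies inside the margin or past the boundary
    return np.maximum(0.0, 1.0 - labels * scores)

print(hinge_loss(np.array([2.0, 1.0, 0.5, -0.5]), np.array([1, 1, 1, 1])))
# -> [0.  0.  0.5 1.5]
```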


For non-linear problems, kernel functions are used; they allow a non-linear separation without explicitly adding polynomial features. Examples are the polynomial kernel and the Gaussian Radial Basis Function (RBF) kernel.
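
A minimal sketch of kernel SVMs with scikit-learn's SVC, assuming the toy make_moons dataset; the gamma, degree, coef0 and C values are only illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# illustrative data that is not linearly separable
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# Gaussian RBF kernel: gamma controls how local each instance's influence is
rbf_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=5, C=1))
rbf_clf.fit(X, y)

# polynomial kernel of degree 3, without explicitly adding polynomial features
poly_clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, coef0=1, C=1))
poly_clf.fit(X, y)
```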
