What is the generalization error?

The generalization error is the expected loss on future data for a given model

What is the formula for the generalization error?
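
A common way to write it (sketch; assuming data distribution P and loss function L):

```latex
R(f) = \mathbb{E}_{(x,y) \sim P}\left[ L\big(y, f(x)\big) \right]
```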

What is the formula for the empirical risk?
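
A standard form (sketch; assuming N training samples (x_i, y_i) and loss function L):

```latex
R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big)
```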

What is a common assumption when estimating the generalization error via the empirical risk?

That the data is independent and identically distributed (i.i.d.)

Does a good estimation for the generalization error via the empirical risk require that the data is Gaussian distributed?

No, that does not have to be the case

Does a good estimation for the generalization error via the empirical risk require that there is a large number of samples?

Yes. A high number of samples is crucial for a good estimation of the generalization error

Does a good estimation for the generalization error require a differentiable loss function?

No, in fact not. The 0-1 loss, for example, is not differentiable

Does estimating the risk on samples that have not been used in training avoid the problem of underfitting?

No. Underfitting results from a model that is too coarse to fit the data

Does the model variance decrease with increasing complexity of the model?

No. The model variance increases with increasing complexity

Do higher degrees of freedom imply a lower risk for overfitting?

No, in fact the opposite: the higher the degrees of freedom, the higher the risk of fitting to noise

Does supervised machine learning use explicit knowledge to design models deductively?

No. Supervised machine learning tries to design models inductively with given training data

Does the Optimal Bayes classifier, a probabilistic model, predict the most probable outcome for a new sample and use a loss function?

Yes. The Optimal Bayes Classifier makes the most probable prediction and uses e.g. the 0-1 loss
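
Under the 0-1 loss this prediction rule can be sketched as:

```latex
\hat{y} = \operatorname*{arg\,max}_{y} \; P(y \mid x)
```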

What indicates underfitting?

A large error on train and test set

Do ROC curves allow evaluating classifiers independently of class distribution and misclassification cost?

Yes, they do. ROC curves are designed to assess the general performance of the discriminant function

Is it possible that cross validation helps to optimize hyperparameters?

Yes, cross validation is commonly used to select hyperparameters

Does the Bias increase with increasing model complexity?

No. The bias decreases with increasing model complexity

What does a very low k in the k-nearest-neighbor algorithm lead to?

It leads to overfitting

What does a too high k in the k-nearest-neighbor algorithm lead to?

It leads to underfitting: the classifier would simply assign the class label of the dominant class

Do high k’s in the k-nearest-neighbor algorithm lead to high complexity?

No. k is a hyperparameter, and a higher k smooths the decision boundary, so model complexity decreases rather than increases with k

Does k-nn work for unlabeled data?

k-nn only works for labeled data, but it works for classification as well as for regression
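
A minimal sketch of k-nn classification on labeled data (toy data and function name are illustrative, not from the source; pure Python, majority vote over Euclidean neighbors):

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training samples."""
    # Sort training samples by Euclidean distance to the query point x
    dists = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train)
    )
    # Majority vote over the labels of the k closest samples
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy labeled data: two well-separated clusters
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(X, y, (0.5, 0.5), k=3))  # query near cluster "a"
```

The same neighbor search with an average over the k labels instead of a vote gives k-nn regression.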

Do Kernels transform data into a lower dimensional space where separability can be achieved more easily?

No, in fact Kernels implicitly project data into a higher dimensional space

What is the formula of the Gaussian Kernel
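
One common form (sketch; assuming bandwidth parameter sigma):

```latex
k(x, x') = \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2\sigma^2} \right)
```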

Does the constrained convex optimization problem have a unique global solution?

Yes. This is one of the major advantages of SVMs

Do SVMs allow a probabilistic explanation for classification?

No, they don’t. They either classify as class 1 or -1 (in a binary classification setting)

When is a sample a support vector?

If and only if the corresponding Lagrange multiplier is greater than 0

What are the different KKT conditions?

The complementary slackness conditions

alpha_1 * h_1 = 0

alpha_2 * h_2 = 0

result in four cases:

alpha_1 = 0, h_1 < 0

alpha_1 > 0, h_1 = 0

alpha_2 = 0, h_2 < 0

alpha_2 > 0, h_2 = 0

When do SVMs work effectively?

If the number of dimensions is much larger than the number of samples

Do SVMs use the magnitude of the discriminant function for regression?

Yes. They use both the sign and the magnitude, whereas SVMs in classification use only the sign of the discriminant function

How many classifiers are there in a multiclass classification setting of the one vs. the rest/all approach (SVMs)?

There are M classifiers, one per class

The classifier with the largest value is chosen

How many classifiers are there in a multiclass classification setting of the one vs. one approach (SVMs)?

There are M(M - 1) / 2 classifiers, one per pair of classes

The class with the majority vote is chosen

Do decision trees perform well on large datasets?

Yes, since they narrow down the data at each decision step

Are decision trees only suited for either numerical or categorical data?

No, they can be fit to both types of data

When is a split maximizing the information gain?

If the produced subsets are homogeneous

What do decision trees do recursively?

They recursively split the data into subsets

When is the minimal Gini impurity achieved?

When each produced subset is pure, i.e. contains samples of only one class; the Gini impurity is then 0
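
A minimal sketch of how Gini impurity behaves on pure vs. mixed subsets (illustrative pure-Python helper, not from the source):

```python
def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity(["a", "a", "a"]))  # pure subset -> 0.0 (minimal)
print(gini_impurity(["a", "b"]))       # uniform over 2 classes -> 0.5 (maximal for M = 2)
```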

What is information gain?

It is a splitting criterion for decision trees

Is crossvalidation a method to estimate the generalization error?

Yes. Cross validation splits the training data into folds, uses each fold once as validation set while training on the remaining folds, and averages the resulting errors
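
The fold construction can be sketched as follows (illustrative pure-Python helper, not from the source; assumes n_samples is divisible by k for simplicity):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = indices[start:stop]            # current fold is held out
        train = indices[:start] + indices[stop:]  # the rest is used for training
        yield train, test

for train, test in k_fold_splits(6, 3):
    print(test)  # each sample appears in exactly one test fold
```

Averaging the model's error over the k held-out folds gives the cross-validation estimate of the generalization error.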

Does a too small model class complexity lead to underfitting?

Yes. A model class that is too simple cannot capture the structure of the data

Does regularization increase the gap between training error and generalization error?

No. Regularization reduces overfitting and thereby decreases the gap between training error and generalization error

Can the generalization error be computed exactly by using the test data?

No. Using the test data is just an estimate for the generalization error

Does increasing the value of C in C-SVMs cause the margin to shrink?

Yes. A high value for C means that slack is heavily penalized and therefore the margin shrinks

Do the training and test data set have to be disjoint?

Yes. Training and test data set should not have any sample in common

Does supervised machine learning use supervisory signals for predictive modeling?

Yes, supervised machine learning uses labeled training data to design models for future data

What is the form of the dual problem?
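
For a linear hard-margin SVM the dual can be sketched as (assuming N samples with labels y_i in {-1, +1}):

```latex
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, x_i^{\top} x_j
\quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{N} \alpha_i y_i = 0
```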

What is a common splitting criterion for regression?

Variance reduction

Are decision tree algorithms non-recursive?

No, in fact they use a recursive approach

What can a new input sample in decision trees in a classification task be assigned to?

A class value or a conditional probability

Do shallow trees tend to underfit and deep trees tend to overfit?

Yes, as a deeper tree asks for more conditions and can therefore fit the training data more closely

What is the form of the entropy (important for information gain)?
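
A common form (sketch; assuming M classes with proportions p_m in the subset):

```latex
H(S) = - \sum_{m=1}^{M} p_m \log_2 p_m
```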

What is the form of the Gini impurity?
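
A common form (sketch; assuming M classes with proportions p_m):

```latex
G(S) = \sum_{m=1}^{M} p_m (1 - p_m) = 1 - \sum_{m=1}^{M} p_m^2
```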

What is the form of the primal problem?
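
For a linear hard-margin SVM the primal can be sketched as:

```latex
\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2
\quad \text{s.t.} \quad y_i \big( w^{\top} x_i + b \big) \ge 1, \quad i = 1, \dots, N
```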

What is the margin (SVM)?

It is the shortest distance between observations and the decision threshold
