What is stored in the model M of a machine learning system?
it encodes, stores, and retrieves the outcome of the learning process
To what does the adaptation of M by a learning algorithm A to find the optimal model M* correspond?
search in the hypothesis space
Do we usually search for a global minimum in the hypothesis space?
in most practical applications it is not possible to find the global minimum
=> most learning algorithms search for local optima
What does the hypothesis space H determine?
which aspects of the data are captured
e.g. simple statistics or probability distributions
and how they are represented
e.g. implicit or human-readable
What does the choice of hypothesis space allow for?
incorporation of a priori knowledge about the learning task
(at least to different degrees…)
What are some common hypothesis spaces?
decision trees
polynomials
logic
neural networks
Bayesian networks
Markov models
What are ensemble methods?
simple way to extend hypothesis space
=> by combining a set of hypotheses h1, h2, …, hn ∈ H
to a new hypothesis h* ∈ H^n
What do ensemble methods allow for?
build a strong learner by combining several weak learners
What is boosting?
boosting algorithms compute a strong learner
by incrementally constructing an ensemble of hypotheses
How does boosting roughly work?
assign a weight to every training sample
-> initially all set to the same value
weights of incorrectly learned samples are increased
training of the next hypothesis focuses on samples with high weights (see the sketch below)
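A minimal sketch of this weighting scheme in the style of AdaBoost, assuming binary labels in {-1, +1} and a hypothetical train_weak_learner(X, y, weights) helper that returns a callable hypothesis:

```python
import numpy as np

def boost(X, y, train_weak_learner, n_rounds=10):
    """AdaBoost-style sketch: y in {-1, +1}."""
    n = len(y)
    weights = np.full(n, 1.0 / n)                    # initially all samples weighted equally
    ensemble = []                                    # list of (vote, hypothesis) pairs
    for _ in range(n_rounds):
        h = train_weak_learner(X, y, weights)        # new hypothesis focuses on high-weight samples
        miss = (h(X) != y)                           # incorrectly learned samples
        err = np.dot(weights, miss)                  # weighted error of this hypothesis
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        weights *= np.exp(alpha * np.where(miss, 1.0, -1.0))  # increase weights of misclassified samples
        weights /= weights.sum()                     # renormalize
        ensemble.append((alpha, h))
    # strong learner: weighted majority vote of the weak hypotheses
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in ensemble))
```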
How does an exemplary graphical decision tree regression look with and without boosting?
How can the performance of a hypothesis h ∈ H be characterized?
by how well it predicts
seen data from the dataset
and unseen data
What are the different names for performance on training and on unseen data?
performance of h on training data
i.e. value of the objective function L for h and S
-> Fit
performance of h on unseen data
-> Generalization
What issues w.r.t. performance outcomes can we see? (two options)
underfitting
bad fit
does not fit the underlying process, as the model is not expressive enough
overfitting
good fit
bad generalization
What is the meaning of Occam's razor?
“Of two competing theories, the simpler explanation of an entity is to be preferred”
in general:
overfitting as a result of the model having more parameters than required to capture the properties of the process that generated S
desirable to choose the model as simple as possible
=> Occam's razor states the so-called law of parsimony
How can Occam's razor for example be applied? (image)
What is the difference between generative and discriminative models?
discriminative: based on posterior probabilities
P(y|x) -> compute the prediction y for an input x directly
generative: based on class-conditional and prior probabilities
P(x|y) and P(y) -> model the distribution D that generated S directly
=> compact representation of the training data
having considerably fewer parameters than the dataset S
the posterior P(y|x) can then be computed using Bayes' theorem
What can generative query networks be used for (w.r.t camera views)?
learn internal scene representations in a representation network
=> generate and predict visual appearance of the scene through a generation network
-> kind of learn a world frame and query views from a certain camera angle…
How can overfitting be detected?
apply the model h to unseen data samples
then compute the objective function L
Into what subsets are datasets usually split (especially in supervised learning)?
training set
samples used in the training phase by the learning algorithm
to search for hypothesis h in hypothesis space H
validation set
set of samples used to assess performance of hypothesis h that was computed in training phase
-> based on performance of h, parameters of training phase can be adjusted
test set
set of samples used to assess performance of final model
When is a test set invalidated?
if h (hypothesis) is modified based on the performance of the test set
Why is cross-validation useful?
in typical real-world applications
-> data is generated by a physical process and thus limited
k-fold cross-validation enables efficient use of the data at the cost of computational complexity
How does k-fold cross-validation work?
partition the dataset into k subsets and learn in k iterations
in every iteration, a different subset is selected as validation set
overall performance corresponds to the average performance of the k iterations (see the sketch below)
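A minimal sketch, assuming hypothetical train(samples) and evaluate(h, samples) helpers:

```python
import numpy as np

def k_fold_cv(samples, k, train, evaluate, seed=0):
    """Minimal k-fold cross-validation: average validation performance over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    folds = np.array_split(idx, k)                       # k disjoint subsets
    scores = []
    for i in range(k):
        val_idx = folds[i]                               # this fold is the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        h = train([samples[t] for t in train_idx])       # learn on the remaining k-1 folds
        scores.append(evaluate(h, [samples[v] for v in val_idx]))
    return float(np.mean(scores))                        # overall performance = average over folds
```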
What are some methods to avoid overfitting?
regularization
more training data
increase the size (i.e. complexity) of the dataset
dataset augmentation
if not enough data available
-> increase it by applying transformations to training samples
How does regularization work?
overfitting occurs because the model is complex enough to capture statistical properties that do not explain the data (i.e. noise)
-> introduction of a regularization term in the objective function L
=> that guides the learning process towards simpler solutions by punishing complexity (see the example below)
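A common concrete instance is L2 (ridge) regularization; a sketch of the regularized sum-of-squares objective, where λ controls how strongly large weights are punished:

```latex
\tilde{L}(\mathbf{w}) \;=\;
\underbrace{\tfrac{1}{2}\sum_{n=1}^{N}\bigl(y_n - h_{\mathbf{w}}(\mathbf{x}_n)\bigr)^{2}}_{\text{data fit}}
\;+\;
\underbrace{\tfrac{\lambda}{2}\,\lVert \mathbf{w}\rVert^{2}}_{\text{punishes complexity}}
```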
What different model characteristics are there (or at least did we discuss / mention)?
deterministic vs stochastic
parametric vs nonparametric
generative vs discriminative
What value domains do we differentiate?
discrete (classification)
continuous (regression)
What reasoning categories do we have?
inductive
learn general model from specific examples
deductive
use model to make predictions
transductive
use specific examples to make predictions (at certain points of interest rather than learning a general approximator…)
What defines an analog artificial neuron?
synaptic weights (i.e. weights and bias)
activation function
What is the perceptron?
linear classifier based on single neuron
with a threshold function
What inequality holds in the perceptron?
w^T xn < 0 -> yn = -1
w^T xn > 0 -> yn = +1
=> for correctly classified samples: w^T xn · yn > 0
What is the perceptron criterion?
only punishes incorrectly classified samples n ∈ M (the set of misclassified samples):
E_P(w) = - Σ_{n ∈ M} w^T xn · yn
How does a perceptron learn?
use the perceptron criterion and compute its gradient (basically the loss…)
apply stochastic gradient descent with learning rate η
update the weights: w ← w + η · xn · yn for misclassified samples (see the sketch below)
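A minimal numpy sketch of the perceptron learning rule, assuming labels in {-1, +1} and the bias folded into the weight vector:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """SGD on the perceptron criterion E_P(w) = -sum_{n in M} w^T x_n y_n."""
    X = np.hstack([X, np.ones((len(X), 1))])   # append 1 to each sample -> bias folded into w
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, y_n in zip(X, y):
            if y_n * np.dot(w, x_n) <= 0:      # misclassified sample (perceptron criterion)
                w += eta * y_n * x_n           # gradient step: w <- w + eta * x_n * y_n
                errors += 1
        if errors == 0:                        # converged (dataset is linearly separable)
            break
    return w
```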
What properties does the perceptron learning rule have?
if solution exists (i.e. dataset linearly separable)
-> the perceptron learning algorithm finds a solution
within finite number of steps
-> so called perceptron convergence theorem
the solution calculated by the algorithm depends on
initialization of parameters
order of presentation of the training samples
the algorithm does not converge for datasets that are not linearly separable
Is the xor problem solvable by the perceptron?
no!
-> not linearly separable
What is the definition of the regression task?
input:
dataset S
{(x0,y0),…,(xn,yn)}
with xi ∈ X^K
with yi ∈ Y^L
Y^L continuous space (e.g. R^L)
loss function L: Y^L x Y^L -> R
desired output:
regressor h: X^K -> Y^L
that predicts yn for a given xn with minimum loss L
What are linear basis function models?
class of hypothesis spaces
that are linear combinations of functions of the samples xn:
h_w(x) = Σ_j wj · φ_j(x) = w^T φ(x)
where φ_j are the basis functions of the model
with φ_0(x) = 1
the values φ_j(xn) are called features of xn
What are some exemplary basis functions?
linear functions
powers of x_i
radial basis functions
sigmoid functions
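For reference, typical textbook forms of these basis functions (μ_j and s denote location and scale parameters):

```latex
\phi_j(x) = x^{\,j} \;\;\text{(powers / polynomials)}, \qquad
\phi_j(x) = \exp\!\Bigl(-\tfrac{(x-\mu_j)^2}{2s^2}\Bigr) \;\;\text{(radial basis functions)}, \qquad
\phi_j(x) = \sigma\!\Bigl(\tfrac{x-\mu_j}{s}\Bigr) \;\;\text{(sigmoid)}
```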
What is the sum of squares error function?
L2 loss: L(w) = ½ Σ_n (yn − h_w(xn))²
What is the sum of squares error function with added regularization term?
L2 loss
plus
punishment for large weights, e.g. (λ/2)·||w||²
What is the difference between interpolation and regression?
interpolation:
must be consistent with S: f(xi) = yi
regression:
hypothesis h should minimize L and generalize well to new samples
How can one learn w with least squares?
minimize the L2 loss; the closed-form solution is given by the normal equations: w* = (Φ^T Φ)^{-1} Φ^T y (Moore-Penrose pseudoinverse of the design matrix Φ, see the sketch below)
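A minimal numpy sketch of the closed-form least-squares fit (the helper name and the polynomial basis are illustrative choices):

```python
import numpy as np

def fit_least_squares(Phi, y):
    """Solve w* = argmin_w ||Phi w - y||^2 (normal equations / pseudoinverse)."""
    # np.linalg.lstsq is numerically more stable than inverting Phi^T Phi explicitly
    w_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w_star

# usage: polynomial basis phi_j(x) = x^j of degree 3 on 1-D inputs x
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
Phi = np.vander(x, 4, increasing=True)     # columns: 1, x, x^2, x^3  (phi_0(x) = 1)
w = fit_least_squares(Phi, y)
```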
What is the geometrical interpretation of least squares minimization?
orthogonal projection of the target vector y
onto the vector h_w* (hypothesis with optimal weights)
in the subspace spanned by the column vectors of the matrix Φ
How can we model the prediction error with maximum likelihood?
interpret the prediction error of the hypothesis as a noise term ε
model the noise (usually) using a Gaussian distribution with mean hw(xn) and precision (inverse variance) β
How can we learn with maximum likelihood?
compute the arg max w.r.t. w so that the likelihood of correct predictions is maximized
often easier to maximize the log likelihood
-> as we mostly have minimizers (numerically more stable…), minimize the negative log likelihood…
When is maximum likelihood equivalent to least squares?
for linear basis function models
when assuming a Gaussian distribution of the prediction error
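A short sketch of why this holds (standard derivation, assuming i.i.d. samples and the Gaussian noise model from above):

```latex
-\ln p(\mathbf{y}\mid \mathbf{X},\mathbf{w},\beta)
  = -\ln \prod_{n=1}^{N}\mathcal{N}\!\bigl(y_n \mid h_{\mathbf{w}}(x_n),\,\beta^{-1}\bigr)
  = \frac{\beta}{2}\sum_{n=1}^{N}\bigl(y_n - h_{\mathbf{w}}(x_n)\bigr)^{2}
    - \frac{N}{2}\ln\beta + \frac{N}{2}\ln(2\pi)
```

The last two terms do not depend on w, so minimizing the negative log likelihood w.r.t. w is equivalent to minimizing the sum-of-squares error.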
What is the definition of the classification task?
Input:
labeled dataset
C = {C1, …, CL} -> discrete space of classes C1,…,CL
we only consider assigning inputs exactly one class…!
=> partitioning the input space X^K into decision regions
classes represented with a 1-of-K coding scheme: Cl = (0,0,…,1,…,0) ∈ {0,1}^L
output:
classifier h: X^K -> C that predicts yn for given xn with minimum error
What is binary classification?
space of classes C = {0,1}
How can one generalize linear models for classification?
use linear regression model
add nonlinear activation function f that discretizes the output space
->y_cls is called the generalized linear model
What is a decision surface?
subspaces where the output of the generalized linear model is constant
-> w^T x + w0 = c for a constant c
=> decision surfaces are linear functions of x…
What is the connection betwen artificial neurons and generalized linear models?
artificial neurons are generalized linear models!
What is the most common activation function for linear separation?
Heaviside step function
=> uses the decision surface w^T x + w0 = 0
How can classifiers for more than two classes be constructed?
in general: combining discriminant functions (functions that separate data points based on a decision surface…) that classify subsets of the classes
What are the two most evident approaches for linear multi class classification ?
one-versus-the-rest classifier (top)
one-versus-one-classifier (bottom)
How does the one-versus-the-rest classifier work? What is a problem with it?
separates K classes with K-1 discriminant functions
every discriminant function separates one class from all others
-> some regions are classified ambiguously
-> here: is it C1 or is it C2?
How do one-versus-one classifiers work?
pairwise separation of K classes with K(K-1)/2 binary discriminant functions
class of a data sample assigned by majority vote
-> over all classifications between two classes, which class was assigned most often?
-> problem: some regions classified ambiguously (here due to a draw in the majority vote…!)
Again, what is the problem in one-versus-the-rest and one-versus-one linear multiclass classification?
discriminant functions overlap or do not cover the whole input space
How can one solve the problems of linear multi class classification (one vs…)?
determine partition of the input space
assign a regular linear function (without discretization) to every class (regular decision boundary)
calculate the output for a sample
=> assign the sample to the class whose output is largest…
What is a problem with naively representing classes in the form Class 1 = 1, Class 2 = 2, …? How to solve it?
implies an order that does not exist but might be captured by the learned model
=> use one-hot encoding
How does one-hot encoding work?
for n classes, create an n-dimensional vector
the i-th class is represented as (0,0,…,1,…,0) where only the i-th element is 1 and the rest are 0 (see the sketch below)
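A minimal sketch of one-hot encoding (function name is illustrative):

```python
import numpy as np

def one_hot(labels, n_classes):
    """Map class indices 0..n_classes-1 to 1-of-K target vectors."""
    T = np.zeros((len(labels), n_classes))
    T[np.arange(len(labels)), labels] = 1.0   # i-th class -> 1 at position i, 0 elsewhere
    return T

# usage: 3 classes, labels of 4 samples
one_hot(np.array([0, 2, 1, 2]), 3)
# array([[1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.],
#        [0., 0., 1.]])
```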
What does one-hot-encoding allow for?
enables mapping classification problem to regression problem
-> make use of least squares
How is least squares for classification modeled?
individual classes represented by their own linear model y_Ck(x) = wk^T x
grouped together into one classifier y(x) = W^T x
where W is the matrix composed of the weight vectors wk of the individual linear models y_Ck
How is the loss function in least squares for classification defined?
L2 loss over all classes: L(W) = ½ Tr{(ΦW − T)^T (ΦW − T)}
-> the trace Tr ensures that only the quadratic terms on the diagonal of the product matrix account for the loss
What is the least squares solution for classification?
W* = (Φ^T Φ)^{-1} Φ^T T, i.e. the pseudoinverse of Φ applied to the one-hot target matrix T (see the sketch below)
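A minimal numpy sketch of this solution and the resulting argmax decision rule (function names are illustrative):

```python
import numpy as np

def fit_ls_classifier(Phi, T):
    """Least-squares classification: W* = pinv(Phi) @ T with one-hot targets T."""
    return np.linalg.pinv(Phi) @ T              # closed-form solution of the trace loss

def predict_class(Phi_new, W):
    scores = Phi_new @ W                        # one linear model per class
    return np.argmax(scores, axis=1)            # assign the class whose output is largest
```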
What is a problem with least-squares classification?
sensitive to outliers!
=> as the hyperplane (decision boundary) depends on all samples of the dataset…
=> the algorithm computes only one of infinitely many possible separating hyperplanes!
What is a better alternative to least-squares linear classification?
support vector machines
What is the idea of SVM?
minimize generalization error by maximizing the margin of the classifier
=> margin: smallest distance between the decision boundary and the training samples
-> only the samples closest to the boundary are important => so-called support vectors…
How can one calculate the distance of a sample xn from the decision boundary?
distance = yn·(w^T xn + b) / ||w|| for correctly classified samples
-> with the canonical scaling yn·(w^T xn + b) = 1 for the closest samples, the margin is 1/||w||
-> minimize ||w|| to maximize the margin…
How can one ensure better generalization performance in SVM?
introduce slack variables ξn ≥ 0
-> allowing some training samples to lie inside the margin or on the wrong side of the decision boundary… (see the sketch below)
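A small illustrative sketch with scikit-learn, where the parameter C trades margin width against slack-variable violations (the dataset here is synthetic):

```python
import numpy as np
from sklearn.svm import SVC

# toy 2-D dataset: two slightly overlapping clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# small C -> wide margin, more slack (samples may lie inside the margin);
# large C -> margin violations are punished heavily (closer to the hard-margin SVM)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)   # only these samples determine the decision boundary
```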