What is stored in the model M of a machine learning system?
it encodes, stores, and retrieves the outcome of the learning process
To what does the adaptation of M by a learning algorithm A to find the optimal model M* correspond?
search in the hypothesis space
Do we usually search for a global minimum in the hypothesis space?
in most practical applications it is not possible to find the global minimum
=> most learning algos search for local optima
What does the hypothesis space H determine?
which aspects of the data are captured
e.g. simple statistics or probability distributions
and how they are represented
e.g. implicit or human-readable
What does the choice of hypothesis space allow for?
incorporation of a priori knowledge about the learning task
(at least to different degrees…)
What are some common hypothesis spaces?
decision trees
polynomials
logic
neural networks
Bayesian networks
Markov models
What are ensemble methods?
simple way to extend hypothesis space
=> by combining a set of hypotheses h1, h2, …, hn ∈ H
to a new hypothesis h* ∈ H^n
What do ensemble methods allow for?
build a strong learner by combining several weak learners
What is boosting?
boosting algorithms compute a strong learner
by incrementally constructing an ensemble of hypotheses
How does boosting roughly work?
assign a weight to every training sample
-> initially set to the same value
weights of incorrectly learned samples are increased
training of the next hypothesis focuses on samples with high weights
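A minimal AdaBoost-style sketch of this weight-update idea (assuming numpy arrays, labels in {-1, +1}, and a generic weak learner with fit(X, y, sample_weight) and predict(X); the interface names are illustrative, not from the lecture):

```python
import numpy as np

def boost(X, y, weak_learner_factory, n_rounds=10):
    """Incrementally build an ensemble; y is a numpy array in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # initially all samples have the same weight
    ensemble = []
    for _ in range(n_rounds):
        h = weak_learner_factory()
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)   # weighted training error
        if err >= 0.5:                           # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)           # increase weights of misclassified samples
        w /= w.sum()
        ensemble.append((alpha, h))
    return ensemble

def predict_ensemble(ensemble, X):
    # strong learner: weighted vote of the weak hypotheses
    return np.sign(sum(alpha * h.predict(X) for alpha, h in ensemble))
```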
How does an exemplary graphical decision tree regression look with and without boosting?
How can the performance of a hypothesis h ∈ H be characterized?
by how well it predicts
seen data from the dataset
and unseen data
What are the different names for performance on training and on unseen data?
performance of h on training data
i.e. value of the objective function L for h and S
-> Fit
performance of h on unseen data
-> Generalization
What issues w.r.t. performance outcomes can we see? (two options)
underfitting
bad fit
does not fit the underlying process, as the model is not expressive enough
overfitting
good fit
bad generalization
What is the meaning of Occam's razor?
“Of two competing theories, the simpler explanation of an entity is to be preferred”
in general:
overfitting is the result of the model having more parameters than required to capture the properties of the process that generated S
desirable to choose the model as simple as possible
=> Occam's razor states the so-called law of parsimony
How can Occam's razor be applied, for example? (image)
What is the difference between generative and discriminative models?
discriminative: based on posterior probabilities
P(y|x) -> compute a prediction y for an input x
generative: based on priors and class-conditional probabilities
P(x|y) -> model the distribution D that generated S directly
=> compact representation of the training data
having considerably fewer parameters than the dataset S
the posterior P(y|x) can be computed using Bayes' theorem
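The connection via Bayes' theorem (standard form, using the notation above):

\[ P(y \mid x) = \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')} \]

=> a generative model of P(x|y) and P(y) yields the posterior P(y|x) needed for discrimination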
What can generative query networks be used for (w.r.t camera views)?
learn internal scene representations in a representation network
=> generate and predict visual appearance of the scene through a generation network
-> kind of learn a world frame and query views from a certain camera angle…
How can overfitting be detected?
apply the model h to unseen data samples
then compute the objective function L
Into what subsets are datasets usually split (especially in supervised learning)?
training set
samples used in training phase by learning algo
to search for hypothesis h in hypothesis space H
validation set
set of samples used to assess performance of hypothesis h that was computed in training phase
-> based on performance of h, parameters of training phase can be adjusted
test set
set of samples used to assess performance of final model
When is a test set invalidated?
if h (hypothesis) is modified based on the performance of the test set
Why is cross-validation useful?
in typical real-world applications
-> data generated by physical process and thus limited
k-fold cross-validation enables efficient use of data at the cost of computational complexity
How does k-fold cross-validation work?
partition the dataset into k subsets and learn in k iterations
in every iteration, different subset selected as validation set
overall performance corresponds to average performances of the k iterations
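A minimal sketch of k-fold cross-validation (assuming numpy arrays and generic train/evaluate callables; all names are illustrative):

```python
import numpy as np

def k_fold_cv(X, y, train, evaluate, k=5, seed=0):
    """In each of the k iterations a different fold is the validation set,
    the remaining k-1 folds form the training set."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        h = train(X[train_idx], y[train_idx])
        scores.append(evaluate(h, X[val_idx], y[val_idx]))
    return np.mean(scores)   # overall performance = average over the k iterations
```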
What are some methods to avoid overfitting?
regularization
more training data
increase size (i.e. complexity) of the dataset
dataset augmentation
if not enough data available
-> increase the dataset by applying transformations to training samples
How does regularization work?
overfitting occurs when the model is complex enough to capture statistical structure that does not explain the data (i.e. noise)
-> introduction of a regularization term in the objective function L
=> that guides the learning process towards simpler solutions by punishing complexity
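In general form (a sketch; λ controls the penalty strength, Ω measures model complexity, e.g. the squared weight norm):

\[ L_{\text{reg}}(h) = L(h, S) + \lambda\, \Omega(h), \qquad \text{e.g. } \Omega(h) = \tfrac{1}{2}\lVert w \rVert^2 \]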
What different model characteristics are there (or at least did we discuss / mention)?
deterministic vs stochastic
parametric vs nonparametric
generative vs discriminative
What value domains do we differentiate?
discrete (classification)
continuous (regression)
What reasoning categories do we have?
inductive
learn general model from specific examples
deductive
use model to make predictions
transductive
use specific examples to make predictions (at certain points of interest rather than learning a general approximator…)
What defines an analog artificial neuron?
synaptic weights (weights and bias)
activation function
What is the perceptron?
linear classifier based on single neuron
with a threshold function
What inequality holds in the perceptron?
wTx_n negative -> y_n = -1
wTx_n positive -> y_n = +1
=> for correctly classified samples: wTx_n y_n > 0
What is the perceptron criterion?
only punishes incorrectly classified samples n ∈ M (the set of misclassified samples)
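In standard form (with M the set of misclassified samples and y_n ∈ {-1, +1}):

\[ E_P(w) = -\sum_{n \in M} w^\top x_n\, y_n \]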
How does a perceptron learn?
use the perceptron criterion (basically the loss) and compute its gradient
apply stochastic gradient descent with learning rate
update the weights
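A minimal sketch of this learning rule (assuming numpy arrays, labels y_n ∈ {-1, +1}, and the bias folded into w via a constant-1 feature; names are illustrative):

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, max_epochs=100):
    """Stochastic gradient descent on the perceptron criterion:
    for a misclassified sample (w.x * y <= 0) update w <- w + lr * y * x."""
    X = np.hstack([X, np.ones((len(X), 1))])   # append constant 1 -> bias inside w
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xn, yn in zip(X, y):
            if yn * (w @ xn) <= 0:             # misclassified (or on the boundary)
                w += lr * yn * xn              # gradient step on the perceptron criterion
                errors += 1
        if errors == 0:                        # converged (dataset linearly separable)
            break
    return w
```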
What properties does the perceptron learning rule have?
if solution exists (i.e. dataset linearly separable)
-> perceptron learning algo finds solution
within finite number of steps
-> the so-called perceptron convergence theorem
the solution calculated by the algorithm depends on
initialization of parameters
order of presentation of the training samples
algorithm does not converge for not linearly separable datasets
Is the xor problem solvable by the perceptron?
no!
-> not linearly separable
What is the definition of the regression task?
input:
dataset S
{(x0,y0),…,(xn,yn)}
with xi ∈ X^K
with yi ∈ Y^L
Y^L continuous space (e.g. R^L)
loss function L: Y^L × Y^L -> R
desired output:
regressor h: X^K -> Y^L
that predicts yn for a given xn with minimum loss L
What are linear basis function models?
class of hypothesis spaces
that are linear combinations of functions of the samples xn
where phi_j are the basis functions of the model
and phi_0(x) = 1 (bias)
the values phi_j(xn) are called the features of xn
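Written out (M basis functions, phi_0(x) = 1 absorbing the bias):

\[ h_w(x) = \sum_{j=0}^{M-1} w_j\, \phi_j(x) = w^\top \phi(x) \]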
What are some exemplary basis functions?
linear functions
powers of x_i
radial basis functions
sigmoid functions
What is the sum of squares error function?
L2 loss…
What is the sum of squares error function with added regularization term?
L2 loss
plus
punishment for large weights
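Both error functions written out (λ controls the strength of the weight penalty):

\[ L(w) = \tfrac{1}{2} \sum_{n=1}^{N} \big(y_n - w^\top \phi(x_n)\big)^2, \qquad L_{\text{reg}}(w) = L(w) + \tfrac{\lambda}{2}\, w^\top w \]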
What is the difference between interpolation and regression?
interpolation:
must be consistent with S: f(xi) = yi
regression:
hypothesis h should minimize L and generalize well to new samples
How can one learn w with least squares?
minimize L2 loss with:
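The formula missing on this card is presumably the normal equations; in standard notation (Φ the design matrix with entries Φ_nj = φ_j(x_n), y the vector of targets, assuming Φ^T Φ is invertible):

\[ w^* = \arg\min_w L(w) = \big(\Phi^\top \Phi\big)^{-1} \Phi^\top y = \Phi^{\dagger} y \]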
What is the geometrical interpretation of least squares minimization?
orthogonal projection of the target vector y
onto the vector h_w* (hypothesis with optimal weights)
in the subspace spanned by the column vectors of the matrix Phi
How can we model the prediction error with maximum likelihood?
interpret prediction error of hypothesis as noise term epsilon
model noise (usually) using gaussian distribution with mean hw(xn) and precision (inverse variance) beta
How can we learn with maximum likelihood?
compute the arg max w.r.t. w so that the likelihood of correct predictions is maximized
often easier to maximize the log likelihood
-> since most optimizers are minimizers (and it is numerically more stable), minimize the negative log likelihood…
When is maximum likelihood equivalent to least squares?
for linear basis function models
when assuming a Gaussian distribution of the prediction error
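Sketch of why (Gaussian noise with precision β; only the first term depends on w):

\[ -\ln p(y \mid X, w, \beta) = \frac{\beta}{2} \sum_{n=1}^{N} \big(y_n - w^\top \phi(x_n)\big)^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi) \]

=> minimizing the negative log likelihood w.r.t. w is the same as minimizing the sum-of-squares error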
What is the definition of the classification task?
Input:
labeled dataset
C = {C1, …, CL} -> discrete space of classes C1,…,CL
we only consider assigning inputs exactly one class…!
=> partitioning the input space X^K into decision regions
classes represented with a 1-of-K coding scheme (Cl = (0,0,…,1,…,0) ∈ {0,1}^L)
output:
classifier h: X^K -> C that predicts yn for given xn with minimum error
What is binary classification?
space of classes C = {0,1}
How can one generalize linear models for classification?
use linear regression model
add nonlinear activation function f that discretizes the output space
-> y_cls is called a generalized linear model
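Written out (f the nonlinear activation function, w_0 the bias):

\[ y_{\text{cls}}(x) = f\big(w^\top x + w_0\big) \]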
What is a decision surface?
subspaces where the output of the model is constant: y(x) = c
=> for generalized linear models, decision surfaces are linear functions of x (hyperplanes)
What is the connection between artificial neurons and generalized linear models?
artificial neurons are generalized linear models!
What is the most common activation function for linear separation?
Heaviside step function
=> uses the decision surface where wTx = 0
How can classifiers for more than two classes be constructed?
in general: by combining discriminant functions (functions that separate data points based on a decision surface…) that classify subsets of the classes
What are the two most evident approaches for linear multi class classification ?
one-versus-the-rest classifier (top)
one-versus-one-classifier (bottom)
How do one-versus-the-rest classifiers work? What is a problem with them?
separates K classes with K-1 discriminant functions
every discriminant function separates one class from all others
-> some regions are classified ambiguously
-> here: is it C1 or is it C2?
How do one-versus-one classifiers work?
pairwise separation of K classes with K(K-1)/2 binary discriminant functions
class of a data sample assigned by majority vote
-> for all classifications between two classes, which class was assigned most often?
-> problem: some regions classified ambiguously (here due to draw in majority vote…!)
Again, what is the problem in one-versus-the-rest and one-versus-one linear multiclass classification?
discriminant functions overlap or do not cover the whole input space
How can one solve the problems of linear multi class classification (one vs…)?
determine partition of the input space
assign regular linear function (without discretization) to all classes (regular decision boundary)
calculate the output for a sample
=> assign to class where the output is largest…
What is a problem with naively representing classes in the form Class 1 = 1, Class 2 = 2, …? How to solve it?
implies a non-existent order that might be captured by the learned model
=> use one hot encoding
How does one-hot-encoding work?
for n classes, create an n-dimensional vector
i-th class is represented with (0,0,…,1,…,0) where only the i-th element is 1 and the rest 0
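A minimal sketch (assuming integer class labels 0, …, n-1; numpy-based, names illustrative):

```python
import numpy as np

def one_hot(labels, n_classes):
    """Map integer class labels to 1-of-K target vectors."""
    T = np.zeros((len(labels), n_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# e.g. one_hot([0, 2, 1], 3) -> [[1,0,0], [0,0,1], [0,1,0]]
```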
What does one-hot-encoding allow for?
enables mapping classification problem to regression problem
-> make use of least squares
How is least squares for classification modeled?
individual classes represented by their own linear model
grouped together into a classifier
where W is the matrix composed of the weight vectors wk of the individual linear models y_Ck
How is the loss function in least squares for classification defined?
L2 loss (0.5 (y_hat - y)^2, summed over all samples and classes)
-> the trace Tr ensures that only the quadratic terms on the diagonal of the product matrix account for the loss
What is the least squares solution for classification?
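The solution this card presumably refers to, in pseudoinverse form (X̃ the design matrix with a leading 1 per row, T the matrix of 1-of-K target rows):

\[ \tilde{W}^* = \big(\tilde{X}^\top \tilde{X}\big)^{-1} \tilde{X}^\top T = \tilde{X}^{\dagger} T \]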
What is a problem with least-squares classification?
sensitive to outliers!
=> as the hyperplane (decision boundary) depends on all samples of the dataset…
=> algo computes one of infinitely many solutions!
What is a better alternative to least-squares linear classification?
support vector machines
What is the idea of SVM?
minimize generalization error by maximizing the margin of the classifier
=> margin: smallest distance between the decision boundary and the training samples
-> only closest samples to boundary are important => so called support vectors…
How can one calculate the distance of a sample xn from the decision boundary?
-> minimize ||w|| to maximize the distance…
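For a correctly classified sample, the distance is (standard SVM notation, y(x) = w^T φ(x) + b the signed output):

\[ \frac{y_n \big(w^\top \phi(x_n) + b\big)}{\lVert w \rVert} \]

=> with the canonical scaling y_n (w^T φ(x_n) + b) = 1 for the closest samples, maximizing the margin 1/||w|| corresponds to minimizing ||w||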
How can one ensure better generalization performance in SVM?
introduce slack variables
-> allowing some training samples to lie inside the margin or on the wrong side of the decision boundary…
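The resulting soft-margin objective (a sketch; C trades margin size against slack, ξ_n ≥ 0 measures the margin violation of sample n):

\[ \min_{w, b, \xi}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{n=1}^{N} \xi_n \quad \text{s.t.} \quad y_n\big(w^\top \phi(x_n) + b\big) \geq 1 - \xi_n,\; \xi_n \geq 0 \]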