What is a neural network?
ML model loosely inspired by the structure of the human brain
NN is given an input -> does computations in its ‘system’ -> gives an output
biological neuron vs. artificial neuron
What does a neural network consist of mathematically?
Forward propagation: Lots of matrix multiplications between weights and node values (activations), with bias addition, followed by activation functions that introduce non-linearity.
Backpropagation: Applying calculus (differentiation -> chain rule) to learn from mistakes.
Yields the ‘gradients of weights’ (derivatives of the loss w.r.t. each weight)
Optimization: Applying the gradients to update weights, using an optimizer.
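The forward-propagation step above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the layer sizes, the weight values and the choice of sigmoid as the activation are all assumptions made for the example.

```python
import math

def sigmoid(z):
    # Non-linear activation: squashes any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    # One layer: for each neuron, weighted sum of inputs + bias, then activation.
    # weights[j][i] is the weight from input i to neuron j.
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical toy network: 2 inputs -> 2 hidden neurons -> 1 output.
x = [1.0, 2.0]
W1 = [[0.5, -0.5], [0.25, 0.75]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0]]
b2 = [0.0]

hidden = dense(x, W1, b1, sigmoid)
output = dense(hidden, W2, b2, sigmoid)
print(hidden, output)
```

Without the activation function, the two layers would collapse into a single linear map, which is why the non-linearity matters.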
What does a neural network consist of technically?
Input layer, hidden layer(s), output layer
Between neurons —> weights
What is the difference between inference with and training of a neural network?
Training NN: Preparing the model from scratch to perform the task
e.g. recognizing cat pictures
Inference: Using the trained model to make predictions on new, user-provided inputs
e.g. uploading your pet picture to see if it is a cat
What is a tensor?
Multi-dimensional array for storing data. Can have more than 2 dimensions. Used in DL as the main data type (e.g. torch.Tensor)
Example: A 4x4 RGB image has shape 4x4x3 (height x width x colour channels).
If there are 10 RGB images, then: 10x4x4x3
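The shapes above can be checked with nested Python lists standing in for a tensor. This is only a sketch: `shape` is a hypothetical helper written for this example, and a real DL library would use its own tensor type instead of lists.

```python
def shape(t):
    # Walk down the first element of each nesting level and record its length.
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)

# One 4x4 RGB image: 4 rows, 4 columns, 3 colour channels per pixel.
image = [[[0, 0, 0] for _ in range(4)] for _ in range(4)]

# A batch of 10 such images adds a leading dimension.
batch = [image for _ in range(10)]

print(shape(image))  # (4, 4, 3)
print(shape(batch))  # (10, 4, 4, 3)
```

Note that the order of the dimensions is a convention: some libraries put the channel dimension before height and width.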
What are model weights/parameters?
Internal variables of the model (weights and biases) that are learned during training and affect the output (i.e. value generated)
What are hyperparameters? Give 3 examples.
Variables external to the model, set before training rather than learned. They affect how the model weights are updated from the outside (training) and how the model is structured.
Examples:
Learning rate
Number of hidden layers
Optimizer & activation function
Batch size
Train-test split ratio
What is hyperparameter tuning?
The process of finding the ‘optimal’ hyperparameters (the right combination) from a set of candidate values.
What is a learning rate?
It is a hyperparameter.
Determines how much the model weights are updated at each iteration (the step size)
What is a learning rate scheduler?
Algorithm that adjusts the learning rate during training, rather than keeping it fixed.
Helps training converge better.
Typically higher during the first steps, smaller later.
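One possible scheduler is exponential decay, sketched below. The function name, the initial learning rate and the decay factor `gamma` are assumptions chosen for illustration; many other schedules exist (step decay, warm-up, cosine, etc.).

```python
def exponential_decay(initial_lr, gamma, step):
    # Learning rate shrinks by a constant factor gamma at every step.
    return initial_lr * (gamma ** step)

# Hypothetical settings: start at 0.1 and halve the rate at each step.
schedule = [exponential_decay(0.1, 0.5, step) for step in range(4)]
print(schedule)  # starts at 0.1 and halves each step
```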
What is an optimizer?
Algorithm to update model parameters, based on ‘gradients of weights’
Its goal is to minimize the loss function (find the optimum)
Examples: GD, SGD, RMSprop, AdaGrad…
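The simplest optimizer, plain gradient descent (GD), can be sketched as follows. The loss f(w) = (w - 3)^2, the learning rate and the number of steps are assumptions made for this toy example.

```python
def gradient(w):
    # Derivative of the toy loss f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def gd_step(w, grad, lr):
    # Move against the gradient, scaled by the learning rate.
    return w - lr * grad

w = 0.0
for _ in range(50):
    w = gd_step(w, gradient(w), lr=0.1)

print(w)  # close to 3.0, the minimum of the toy loss
```

Optimizers such as RMSprop or AdaGrad follow the same idea but adapt the step size per parameter.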
What is Batch Size?
Batch: A subset of the dataset that is processed together during training.
Batch size: Hyperparameter that defines how big that subset should be.
What is an epoch?
An epoch is one full pass over the whole training dataset.
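How batch size and epoch relate can be shown with a small sketch. `make_batches` is a hypothetical helper, and the dataset of 10 items with batch size 4 is an assumption for the example.

```python
import math

def make_batches(data, batch_size):
    # Slice the dataset into consecutive chunks; the last one may be smaller.
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

dataset = list(range(10))
batch_size = 4

batches = make_batches(dataset, batch_size)
steps_per_epoch = math.ceil(len(dataset) / batch_size)

print(batches)          # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(steps_per_epoch)  # 3
```

One epoch here means processing all three batches once.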
What is backpropagation and what is it for?
Computation method used during model training.
Runs after the forward pass.
It consists of a series of differentiations (chain rule) applied to the loss function w.r.t. the model weights.
It serves to compute the gradients of weights.
The optimizer uses those gradients to move towards ‘optimal weights’
So that the predictions resemble the desired output more.
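The chain rule can be traced by hand for a single neuron. The sketch below assumes a sigmoid activation and a squared-error loss (both choices made only for this example) and compares the result with a numerical estimate of the gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # Squared error of a single sigmoid neuron.
    return (sigmoid(w * x + b) - y) ** 2

def backprop(w, b, x, y):
    # Forward pass, keeping the intermediate values.
    z = w * x + b
    y_hat = sigmoid(z)
    # Backward pass: multiply the local derivatives (chain rule).
    dL_dyhat = 2.0 * (y_hat - y)      # d(loss)/d(y_hat)
    dyhat_dz = y_hat * (1.0 - y_hat)  # derivative of sigmoid
    dL_dz = dL_dyhat * dyhat_dz
    return dL_dz * x, dL_dz           # dL/dw, dL/db

w, b, x, y = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = backprop(w, b, x, y)

# Numerical check with a central finite difference.
eps = 1e-6
numeric_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(grad_w, numeric_w)
```

In a deeper network the same multiplication of local derivatives is repeated layer by layer, from the output back to the input.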
What is a loss function?
Function that measures numerically how well the model’s predictions perform.
Measures how far the predictions are from the desired output.
During training, the usual aim is to decrease the loss so that we have a well-functioning model.
Many different loss functions are available; the choice depends on the task (e.g. MSE for regression, cross-entropy for classification).
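Two common loss functions, sketched in plain Python: mean squared error and binary cross-entropy. The sample predictions and targets are made up for the example, and the cross-entropy version assumes probabilities strictly between 0 and 1.

```python
import math

def mse(preds, targets):
    # Mean of squared differences; typical for regression.
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def binary_cross_entropy(probs, labels):
    # Penalizes confident wrong probabilities heavily; typical for
    # binary classification. Assumes 0 < p < 1 for every probability.
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(probs, labels)
    ) / len(probs)

print(mse([2.0, 4.0], [1.0, 2.0]))  # 2.5

good = binary_cross_entropy([0.9, 0.1], [1, 0])  # confident and correct
bad = binary_cross_entropy([0.1, 0.9], [1, 0])   # confident and wrong
print(good < bad)  # True
```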
What are training loss and validation loss?
Training loss: The loss computed on the training data (the data the model learns from).
Validation loss: The loss computed on held-out data that the model does not train on; shows how well it generalizes to new (unseen) data.
What is a training loop and what are typical steps within a training loop in the sense of deep learning?
Repeatedly iterating through the whole dataset while the model learns (i.e. adjusts its parameters) from the data.
Steps (for each iteration, until convergence):
Forward pass -> generating output from the input given.
Calculating the loss (from the output).
Backpropagation -> computing gradients based on the loss.
Optimization -> applying gradients to optimize weights.
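The four steps can be put together in a minimal training loop. This sketch fits a line y = w*x + b with plain gradient descent; the toy data (generated from y = 2x + 1), the learning rate and the fixed number of epochs are assumptions for the example, and the gradients are written out by hand instead of using an autograd library.

```python
# Toy dataset generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]

w, b = 0.0, 0.0   # model parameters
lr = 0.05         # learning rate (hyperparameter)
losses = []

for epoch in range(500):
    # 1) Forward pass: predictions from the inputs.
    preds = [w * x + b for x, _ in data]

    # 2) Loss: mean squared error.
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)
    losses.append(loss)

    # 3) Backpropagation: gradients of the loss w.r.t. w and b.
    grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, data)) / len(data)
    grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, data)) / len(data)

    # 4) Optimization: gradient descent update.
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # close to 2.0 and 1.0
```

Here each epoch uses the whole dataset as a single batch; with mini-batches, steps 1 to 4 would run once per batch.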
What is overfitting conceptually?
Fitting the patterns (and noise) of the training data so closely that the model fails when new data is presented.
Not actually ‘learning’
Being ‘lazy’ (memorizing patterns, not able to generalize)
How can overfitting be easily seen during training?
When the training loss is low/decreasing BUT the validation loss is high/not decreasing.
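Explain the concept behind L1-L2 regularization. Why do we need such a thing in the first place?
To prevent overfitting.
Introducing ‘penalties’ on large weights to the loss function, so that the training data is not ‘memorized closely’. L1 penalizes the sum of absolute weight values; L2 penalizes the sum of squared weight values.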
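The penalty idea can be sketched as follows. The function names, the strength `lam` and the example weights are assumptions for illustration; libraries differ in how they scale the penalty term.

```python
def l1_penalty(weights):
    # Sum of absolute values: tends to push some weights to exactly zero.
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    # Sum of squares: punishes large weights more strongly.
    return sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam, kind="l2"):
    # Total loss = loss on the data + lam * penalty on the weights.
    penalty = l1_penalty(weights) if kind == "l1" else l2_penalty(weights)
    return data_loss + lam * penalty

weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights))  # 2.5
print(l2_penalty(weights))  # 4.25
print(regularized_loss(1.0, weights, lam=0.1, kind="l1"))
print(regularized_loss(1.0, weights, lam=0.1, kind="l2"))
```

Because the penalty grows with the size of the weights, the optimizer is discouraged from fitting the training data with extreme weight values.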
What is early stopping?
Regularization technique to prevent overfitting
Training ‘stops’ earlier than planned -> when the validation loss stops improving (or starts increasing) while the training loss keeps decreasing.
Monitoring training & validation processes
train -> validate -> compare ==> train -> validate -> compare …
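The train -> validate -> compare cycle is often implemented with a ‘patience’ counter. The sketch below assumes that variant; the function name, the patience value and the validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    # Return the epoch index at which training would stop: the first epoch
    # where the validation loss has not improved for `patience` epochs in a row.
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: trained to the end

# Hypothetical validation losses: improving until epoch 2, then getting worse.
val_losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.9]
stop_at = early_stopping_epoch(val_losses, patience=2)
print(stop_at)  # 4
```

In practice the weights from the best epoch (epoch 2 here) are usually the ones kept.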
What is the difference between a classification and a regression model?
Classification: Predicting discrete labels. Finding the ‘best separating line’ (decision boundary)
Regression: Predicting continuous values. Finding the ‘best-fit line’
What metrics can be used to evaluate a classification model? Name at least 2. Describe differences
Accuracy: Problematic when there is imbalance in the dataset (distribution of classes).
Recall, FPR, precision, F1-score are better in that case.
Recall: TP / (TP + FN). Important when FN is more costly than FP.
FPR: FP / (FP + TN). Not useful when there aren’t many negatives. Important when FP is more costly than FN.
Precision: TP / (TP + FP). Focuses on how many of the positive predictions are correct.
F1-score: Harmonic mean of precision and recall; balances the two.
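The metrics above can be computed from the four confusion-matrix counts. The counts below are invented to show an imbalanced dataset (10 positives, 990 negatives), and the sketch assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    # Assumes every denominator below is non-zero.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f1": f1,
    }

# Hypothetical imbalanced case: only 10 of 1000 samples are positive.
m = classification_metrics(tp=5, fp=5, fn=5, tn=985)
print(m)
```

Accuracy comes out at 0.99 even though precision, recall and F1 are only 0.5, which is why accuracy alone is misleading on imbalanced data.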
What is the Transformer architecture?
The main NN architecture behind popular NLP models: BERT, GPT
Consists of: ‘encoder’ & ‘decoder’ blocks (BERT uses only the encoder, GPT only the decoder)
Self-attention: Captures contextual relationships better than RNN, LSTM.
Can be parallelized, since tokens are not processed serially: all input tokens are checked simultaneously.
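A stripped-down sketch of self-attention (scaled dot-product attention) is given below. It is a simplification: the token vectors are used directly as queries, keys and values, whereas a real Transformer first applies learned projection matrices and uses several attention heads. The token vectors are made up.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    # X: list of token vectors. Simplification: Q = K = V = X.
    d = len(X[0])
    outputs, all_weights = [], []
    for q in X:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        all_weights.append(weights)
        # New representation: attention-weighted average of all token vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attn = self_attention(tokens)
print(attn[0])  # attention weights of the first token over all three tokens
```

Each row of the loop is independent of the others, which is what makes the computation easy to parallelize.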
What is tokenization in the sense of NLP? Why is it needed?
Splits the text input into digestible chunks (tokens) for the model
Makes it easier for the model to identify patterns
Various methods: Word-based, char-based, subword-based
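The three methods can be compared with toy tokenizers. Everything here is illustrative: the subword function is a simple greedy longest-match over a hand-written vocabulary, whereas real subword tokenizers learn their vocabulary from data.

```python
def word_tokenize(text):
    # Word-based: split on whitespace.
    return text.split()

def char_tokenize(text):
    # Char-based: every character is a token.
    return list(text)

def subword_tokenize(word, vocab):
    # Subword-based (toy version): repeatedly take the longest vocabulary
    # entry that matches at the current position.
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

vocab = {"token", "ization", "un", "believ", "able"}  # hypothetical vocabulary

print(word_tokenize("cats chase mice"))         # ['cats', 'chase', 'mice']
print(char_tokenize("cat"))                     # ['c', 'a', 't']
print(subword_tokenize("tokenization", vocab))  # ['token', 'ization']
```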
What are embeddings?
Numerical (vector) representation of data that was originally in another format (e.g. text, image)
while keeping the semantic information
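‘Keeping the semantic information’ usually means that similar items end up with similar vectors. The sketch below uses cosine similarity to show this; the 3-dimensional vectors are hand-made for the example and do not come from any trained model.

```python
import math

def cosine_similarity(a, b):
    # 1.0 = same direction, 0.0 = orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (real ones have hundreds of dimensions).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(cat_dog > cat_car)  # True
```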
What are the differences between pre-training and fine-tuning of models?
Pre-training: Initializing the weights of a model and training it from scratch.
More general.
Trained on large, unlabeled corpora.
Demanding (computationally and data-wise).
Fine-tuning: Done on the pre-trained model, starting from its parameters.
More task-specific.
Fewer resources needed.
Uses labeled data.