What are three important datatypes?
Tabular data
Images
Sequences
What is tabular data and what are key properties of tabular data?
Tabular data stores data in vertical columns and horizontal rows
Each row and column is uniquely numbered
Key properties
A table can hold an arbitrary number of records
Each record (i.e. row) shares the same set of properties
Each column is usually assigned a header
Each object can be retrieved by a query through key values
What is the suitable terminology regarding tabular data?
sample
feature
feature vector
class
label
Sample: A single entry (row) of the data
Feature: A property that describes each sample
Feature vector: The features of one sample collected into a vector
Class: The category an entry belongs to
Label: The class assigned to each sample
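The terminology above can be illustrated with a small sketch. The array values and the class names "a" and "b" are made up for illustration only.

```python
import numpy as np

# Hypothetical toy dataset: 4 samples (rows), 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
])

# One label per sample; the distinct label values are the classes.
y = np.array(["a", "a", "b", "b"])

n_samples, n_features = X.shape
feature_vector = X[0]      # feature vector of the first sample
classes = np.unique(y)     # classes that occur in the labels
```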
Which problem does dimensionality reduction solve and what are popular algorithms for dimensionality reduction?
Dimensionality reduction makes it possible to visualize data with many features
It projects n-dimensional data down to a form that is easy to visualize (often 2D or 3D) while preserving as much information as possible
Popular algorithms
t-SNE (t-distributed stochastic neighbor embedding)
PCA (Principal component analysis)
What are clustering algorithms?
Clustering algorithms try to group unlabeled data so that similar samples end up in the same cluster and dissimilar samples in different clusters
How are images represented, what is color depth and how does the RGB model work?
Images are represented in three dimensions
Height
Width
Channels (typically: red, green, blue)
Color depth is the number of bits per channel of a pixel, which determines how many values each channel can take
(commonly used: 8-bit with 256 values and 16-bit with 65,536 values)
The RGB model has three channels (red, green, blue). The final color of each pixel results from additively mixing the three channel values
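As a minimal sketch, an RGB image can be held in a NumPy array of shape (height, width, channels). The image size and pixel values here are arbitrary assumptions, and 8 bits per channel is assumed as the color depth.

```python
import numpy as np

height, width = 4, 6

# 8-bit color depth: each channel value is an integer in 0..255.
image = np.zeros((height, width, 3), dtype=np.uint8)

image[..., 0] = 255            # full red channel everywhere -> pure red image
image[0, 0] = (255, 255, 0)    # red + green mixed additively -> yellow pixel

values_per_channel = 2 ** 8    # number of possible values at 8-bit depth
```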
What is data augmentation and what are advantages and downsides of data augmentation?
Data augmentation means creating new artificial samples by modifying existing ones
Advantages
Can increase the number of data points (i.e. samples) with little effort
Reduces overfitting
Increases the robustness of a model
Downsides
Can introduce new artifacts
Can change the task entirely
Heavily dependent on the task, data and model
What are popular data augmentation techniques?
Rotation
Flipping
Zooming/Cropping
Blurring
Noise
Input Dropout
Distortion Effects
Color Jittering
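A few of these techniques can be sketched with plain NumPy. The random 28x28 grayscale image, the noise level and the dropout rate are assumptions chosen only for illustration; in practice a dedicated augmentation library would typically be used.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))   # hypothetical grayscale image with values in [0, 1)

flipped = np.fliplr(image)     # horizontal flip
rotated = np.rot90(image)      # rotation by 90 degrees
cropped = image[2:26, 2:26]    # central crop (zooming would resize it back)

# Additive Gaussian noise, clipped back into the valid value range.
noisy = np.clip(image + rng.normal(0.0, 0.05, image.shape), 0.0, 1.0)

# Input dropout: set roughly 10% of the pixels to zero.
keep_mask = rng.random(image.shape) > 0.1
dropped = image * keep_mask
```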
What is a sequence, which data can be displayed with a sequence and what are possible examples of sequences?
A sequence is a datatype which lists values in a certain order
Essentially any kind of data can be displayed as a sequence, but doing so does not always make sense
Examples
Time series (e.g. weather, stock price)
Positional series (e.g. molecule representation, symbols or words in a language)
What is supervised machine learning and for what is it typically used?
Supervised machine learning is a machine learning technique in which a model learns from input data with corresponding target values
Predictive modeling: use the trained model to predict target values for new inputs whose targets are not yet known
What is the suitable terminology regarding supervised machine learning?
Model: parameterized function/method with specific parameter values (e.g., a trained neural network)
Model class: the class of models in which we search for the model (e.g., neural networks, SVMs, ...)
Parameters: what is adjusted during training (e.g., network weights)
Hyperparameters: settings controlling model complexity or the training procedure (e.g., network learning rate)
Model selection/training: process of finding a model (optimal parameters) from the model class
What are the two most important supervised machine learning tasks and what are their differences?
Classification: target value is a class label (e.g. spam and not spam)
Regression: target value is a numerical value (e.g. house prices)
How does PCA roughly work?
Input: target dimension (e.g. 2D or 3D), unlabeled data
Algorithm: reduces the data to the desired dimension while trying to preserve as much information as possible
Output: the reduced data, which can be shown as a scatter plot of the data points colored by their class labels; note: this requires separate plotting functions
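A minimal sketch of this workflow, assuming scikit-learn is available. The random input data is a placeholder, and plotting is left out.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # hypothetical data: 100 samples, 10 features

pca = PCA(n_components=2)               # desired target dimension
X_2d = pca.fit_transform(X)             # reduced data, one 2D point per sample

# Fraction of the total variance captured by each retained component.
explained = pca.explained_variance_ratio_
```

The array `X_2d` could then be passed to a plotting function to produce the scatter plot.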
How does t-SNE roughly work?
Input: target dimension (e.g. 2D or 3D), unlabeled data, perplexity
Algorithm: reduces the data to the desired dimension while trying to keep neighboring points close together; the result involves a certain randomness, and the perplexity roughly controls how many neighbors each point takes into account
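A minimal sketch with scikit-learn's `TSNE`, assuming random placeholder data. The perplexity value is arbitrary and must be smaller than the number of samples; `random_state` is fixed only to make the stochastic result repeatable.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))           # hypothetical data: 60 samples, 10 features

tsne = TSNE(n_components=2, perplexity=10, random_state=0)
X_2d = tsne.fit_transform(X)            # 2D embedding, one point per sample
```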
How does Affinity propagation roughly work?
Input: unlabeled and reduced data
Algorithm: tries to cluster the data; the number of clusters is determined by the algorithm itself
Output: Scatter plot of the data points and their assigned clusters; note: this requires separate plotting functions, and clusters are not the same as class labels
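A minimal sketch with scikit-learn's `AffinityPropagation`. The synthetic blob data and the `damping` and `max_iter` values are assumptions for illustration; note that no number of clusters is passed in.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Hypothetical 2D data; the generating labels are discarded (unlabeled setting).
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

ap = AffinityPropagation(damping=0.9, max_iter=1000, random_state=0)
cluster_ids = ap.fit_predict(X)         # one cluster id per sample

# The number of clusters is a result of the algorithm, not an input.
n_clusters = len(ap.cluster_centers_indices_)
```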
How does k-means roughly work?
Input: Number of clusters, reduced and unlabeled data
Algorithm: Tries to cluster the data, the number of clusters is defined by the user
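A minimal sketch with scikit-learn's `KMeans`, assuming synthetic blob data. In contrast to affinity propagation, the number of clusters is passed in explicitly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2D data; the generating labels are discarded (unlabeled setting).
X, _ = make_blobs(n_samples=90, centers=3, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)  # user-defined cluster count
cluster_ids = km.fit_predict(X)         # one cluster id per sample
centers = km.cluster_centers_           # one center per cluster
```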
What is the most obvious way to plot time series data?
Line plots, since the data depends on time and can therefore be appropriately visualized that way
What is often problematic with image data and which type of neural networks are helpful with processing image data?
Image data is usually high-dimensional
Convolutional neural networks (CNNs) come in handy when dealing with images
How can an image be portrayed with a feature vector?
The pixels of the image are simply flattened out, so that e.g. a grayscale 28x28 image can be translated into a feature vector with 784 elements (28 · 28 = 784)
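The flattening step can be sketched as follows. The image content is a placeholder, and row-major order (NumPy's default) is assumed.

```python
import numpy as np

# Hypothetical 28x28 grayscale image filled with placeholder values.
image = np.arange(28 * 28).reshape(28, 28)

# Row-major flattening: the rows are concatenated one after another.
feature_vector = image.flatten()
n_elements = feature_vector.size        # 28 * 28
```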
Consider the following code of a convolution:
torch.nn.Conv2d(1, 10, 5)
How many kernels are applied, what size and what dimension do they have?
Dimension of kernels: 1 (number of input channels)
Number of kernels: 10 (number of output channels)
Size of kernels: 5 (i.e. 5x5)
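These numbers can be checked by inspecting the weight tensor of the layer. The 28x28 input used here is an assumption for illustration only.

```python
import torch

# Positional arguments: in_channels=1, out_channels=10, kernel_size=5.
conv = torch.nn.Conv2d(1, 10, 5)

# Weight layout: (out_channels, in_channels, kernel_height, kernel_width).
weight_shape = tuple(conv.weight.shape)

# Hypothetical batch with one single-channel 28x28 image.
x = torch.zeros(1, 1, 28, 28)
out_shape = tuple(conv(x).shape)        # without padding: 28 - 5 + 1 = 24
```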
How does the k-nearest neighbours algorithm roughly work?
Input: Reduced and labeled data, number of nearest neighbours k
Algorithm: The algorithm is trained on a training data set. It assigns a data point to the class that occurs most often among its k nearest data points. It is finally evaluated on the test set
Output: Accuracy on the training and test set
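A minimal sketch with scikit-learn's `KNeighborsClassifier`. The synthetic data, the train/test split ratio and k = 5 are assumptions; note that the training points need class labels for the majority vote.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2D data with three classes.
X, y = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 nearest neighbours
knn.fit(X_train, y_train)                   # stores the labeled training points

train_acc = knn.score(X_train, y_train)     # accuracy on the training set
test_acc = knn.score(X_test, y_test)        # accuracy on the test set
```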