What are the types of Machine Learning Algorithms? Explain them shortly.
supervised learning: labelled training data is used to train the algorithm, which creates a model (hypothesis). This model can then be used to predict the labels of unlabelled data.
unsupervised learning: The algorithm puts unlabelled data into clusters. It can’t predict the labels without training data, but can cluster the data by similarity. To label the data, it has to be interpreted manually.
reinforcement learning: the algorithm learns from rewards it receives when previous decisions were correct. That way, it develops a strategy to fulfill a goal.
What are the types of Machine Learning Problems? Explain them shortly. Which machine learning algorithms can be used to solve them?
Regression: Regression is a supervised learning problem. It uses the input data to fit a hyperplane in space (2D case: a line) which predicts values based on these inputs. Therefore, the answer is a continuous value (e.g. y = 2x + 3). It estimates the relationship between two or more variables. Example: house price vs. size. When, in the 2D case, the fitted line is straight, this is called Linear Regression.
Classification: Classification is a supervised learning problem. It uses input data to fit a hyperplane in space which separates two or more classes. With this model, the class of new, unlabelled data can be predicted, e.g. benign vs. malignant tumors, or predicting the soil type from data. The answer is discrete (x is either in class 1 or 2).
Segmentation: Segmentation is an unsupervised learning problem which provides a set of clusters. It groups similar data into clusters (the number of clusters can be set beforehand or the algorithm determines it itself). Properties of new data can be predicted by its closeness to the cluster centroids. (A short code sketch of all three problem types follows below.)
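As a rough illustration of the three problem types, here is a minimal Python sketch using scikit-learn; the tiny datasets and numbers are invented purely for illustration and are not taken from the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

# Regression: predict a continuous value (e.g. house price from size)
sizes = np.array([[50], [80], [120], [160]])        # m^2 (made-up values)
prices = np.array([150, 240, 360, 480])             # price in thousands
reg = LinearRegression().fit(sizes, prices)
print(reg.predict([[100]]))                          # continuous answer

# Classification: predict a discrete class (e.g. benign vs. malignant tumor)
features = np.array([[1.0], [1.5], [3.0], [3.5]])    # e.g. tumor radius
labels = np.array([0, 0, 1, 1])                      # 0 = benign, 1 = malignant
clf = LogisticRegression().fit(features, labels)
print(clf.predict([[2.8]]))                          # discrete answer: 0 or 1

# Segmentation/clustering: group unlabelled data by similarity
points = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])
km = KMeans(n_clusters=2, n_init=10).fit(points)
print(km.labels_, km.cluster_centers_)               # clusters + their centroids
```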
What is Exploratory Data Analysis and which steps does it include?
Exploratory Data Analysis is the first step towards building a machine learning model: understanding the data.
It includes:
Data manipulation: normalization of data when dealing with huge datasets, data filtering/cleaning, replacing no-data values, exporting to common formats
Data plotting: visualization of data
Descriptive statistics: summarize the data (variances, mean values, …), quantify the dependency of two data sets, for example latitude and temperature (see the sketch after this list)
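As a rough sketch of these steps with pandas; the column names and values below are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "latitude":    [10.0, 25.0, 40.0, 55.0, 70.0],
    "temperature": [27.0, 22.0, 14.0, 6.0, -2.0],   # mean annual temp in degC
})

# Data manipulation: cleaning / replacing no-data values, normalization
df = df.fillna(df.mean())                        # replace missing values
norm = (df - df.min()) / (df.max() - df.min())   # simple 0-1 normalization

# Descriptive statistics: summarize and quantify dependency
print(df.describe())    # mean, std, min, max, quartiles
print(df.corr())        # e.g. correlation of latitude and temperature

# Data plotting: visualization
df.plot.scatter(x="latitude", y="temperature")
plt.show()
```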
What are the three ways to analyze data in terms of analyzing variables separately or together?
univariate: Each variable is analyzed separately
bivariate: Two variables are analyzed together to look for correlation/separation
multivariate: > 2 variables are analyzed together; here it’s usually difficult to visualize the data and the results
What are the 4 types of variables and what are the possible math operations that can be done to them? How can these 4 variables be generalized?
qualitative variables
nominal (equal or not equal): Examples: Is an area a desert or a forest? -> no ranking applied
ordinal (<, >): Has info on rank/hierarchy, for example: cost of living is high, medium or low; seismic scales
quantitative variables
interval (+,-): includes numerical values and information can be arranged along a scale, e.g. temperature in °C. There are only a few examples of this. The difference to ratio data is that interval data has no natural 0 (20°C is not 2x warmer than 10°C, 0°C doesn’t mean “no temperature”.)
ratio (*,/): Like interval data but with natural 0 (either physical or convention like for elevation). Examples: precipitation, elevation, load capacity of roads, temp in Kelvin (because 0 K = particles have no thermal energy, therefore no “temperature”, 200K is twice as hot as 100K in terms of particle energy). This is the most informative scale.
Explain the Linear Regression model.
The Linear Regression model is a linear model that predicts a target (continuous) value by computing a weighted sum of the input parameters (called features) plus the intercept term (also called bias term). The weights are the model parameters theta, and the intercept term is theta0.
This structure comes from the general formula of a straight line, y = ax + b, where b is the bias term and a is the model parameter that determines the slope of the line. These are the values to be determined so that the line fits the training data best.
The function y_hat is then called the hypothesis function, with y_hat = h_theta(x) = theta0 + theta1*x1 + theta2*x2 + ... + thetan*xn = theta^T · x.
With this, there would be one more theta value than x value due to the intercept term, which is why the feature vector must get one extra feature x0 = 1.
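A minimal NumPy sketch of this weighted sum, assuming a toy example with a single feature and the extra x0 = 1 (the numbers are arbitrary):

```python
import numpy as np

theta = np.array([3.0, 2.0])   # [theta0 (intercept), theta1 (slope)]
x = np.array([1.0, 4.0])       # [x0 = 1, x1]; x0 = 1 absorbs the intercept term

y_hat = theta @ x              # weighted sum of features plus intercept
print(y_hat)                   # 3 + 2*4 = 11, i.e. y = 2x + 3 evaluated at x = 4
```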
How do we know which model fits the training data the best? Explain the cost function.
With help of the cost function. There are several cost functions; the most common one is the Root Mean Square Error (RMSE), but the Mean Square Error (MSE) is actually better in this case because it takes less computation time due to the absence of the root. Otherwise, these functions lead to the same best-fitting theta and look like this: MSE(theta) = (1/m) * sum_i (theta^T · x^(i) − y^(i))², and RMSE(theta) = sqrt(MSE(theta)).
In words:
The Mean Square Error is a cost function which calculates the mean of the squared difference between the hypothesis function and the target values of the input.
The result is a scalar. The bigger the difference (=error) between the hypothesis function and the target (=actual) values, the bigger is the MSE. So, if the MSE decreases, the model fits better.
The best fitting model is therefore a model which produces the smallest MSE. This provides the best set of theta (theta_hat) which minimizes the cost function.
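A short sketch of computing the MSE (and RMSE) in NumPy for some made-up predictions and target values:

```python
import numpy as np

def mse(y_hat, y):
    """Mean of the squared differences between predictions and targets."""
    return np.mean((y_hat - y) ** 2)

y = np.array([3.0, 5.0, 7.0])       # target (actual) values
y_hat = np.array([2.5, 5.5, 8.0])   # predictions from the hypothesis function

print(mse(y_hat, y))                # scalar; the smaller, the better the fit
print(np.sqrt(mse(y_hat, y)))       # RMSE: same minimizer, extra square root
```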
Why do we square the error of the MSE/RMSE?
So that every error is positive.
Errors smaller than 1 shrink when squared (e.g. 0.5² = 0.25), while larger errors grow (e.g. 3² = 9). So larger errors have much more impact on the MSE, which determines how well the linear regression model fits. This is important because a model that makes a lot of small errors is preferred over a model that fits well overall but sometimes makes big errors.
What do we need to watch out for when computing a Normal Equation?
The Normal Equation involves inverting an n×n matrix (where n is the number of features), which has a time complexity of about O(n^2.4) to O(n³). This means that if we double the number of features (variables) n in our model, the computation takes roughly 2^2.4 ≈ 5.3 to 2³ = 8 times longer. This is something we need to keep in mind.
In contrast, the Normal Equation is linear in the number of training data instances m.
In a dataset of, say, trees, it is therefore no problem to have lots of trees in the dataset, but if we account not only for height and trunk width but additionally for leaf size and bark thickness, the computation time for the Normal Equation increases a lot.
If the Normal Equation can't be computed easily or the data set is very large, a method like Gradient Descent can be used.
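A small NumPy sketch of the closed-form Normal Equation for linear regression, theta_hat = (X^T·X)^(−1)·X^T·y, on an invented toy dataset:

```python
import numpy as np

# training data: m instances with one feature, roughly following y = 2x + 3
X_raw = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.1, 6.9, 9.2, 10.8])

m = X_raw.shape[0]
X = np.c_[np.ones((m, 1)), X_raw]              # prepend x0 = 1 for the intercept

theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y   # closed-form solution
print(theta_hat)                               # roughly [3, 2] = [intercept, slope]
```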
Explain the principle of Gradient Descent.
Gradient Descent is an optimization algorithm for finding optimal solutions to a wide range of problems by minimizing a function (often a cost function).
It starts by initializing theta with (small) random values and gradually improves theta by decreasing the cost function, similar to Least Squares Adjustment. The algorithm stops when theta converges to a minimum.
For that, the Gradient Vector has to be computed, which is essentially just a vector of the partial derivatives of the cost function with respect to theta. This is done in order to determine how much the cost function changes (the goal is for the derivative to be 0 -> find the minimum).
To determine the next step: theta_next = theta − eta · (gradient vector of the cost function), where eta is the learning rate.
The learning rate eta determines the step size, so it shouldn't be too small to avoid too many iterations, but it also shouldn't be too high because then the algorithm might diverge and a good solution may never be found. To find a good learning rate, grid search can be used (systematically trying out several candidate values).
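A minimal batch Gradient Descent sketch for linear regression with the MSE cost; the toy data, learning rate, and iteration count are example values, not tuned:

```python
import numpy as np

X_raw = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.1, 6.9, 9.2, 10.8])

m = X_raw.shape[0]
X = np.c_[np.ones((m, 1)), X_raw]     # x0 = 1 for the intercept term

eta = 0.05                            # learning rate (step size)
theta = np.random.randn(2)            # start with small random theta values

for _ in range(2000):
    gradient = (2.0 / m) * X.T @ (X @ theta - y)   # partial derivatives of the MSE
    theta = theta - eta * gradient                 # step against the gradient

print(theta)                          # converges towards roughly [3, 2]
```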
When using the Gradient Descent, what could happen when the cost function is not convex?
A convex function is one where a line segment between any two points on the curve never lies below the curve (it does not cross it). Therefore, in a convex function, there aren't any local minima or plateaus.
A non-convex function might have local minima or plateaus. The solution might get stuck in a local minimum or stop on a plateau.
To avoid that, one can use convex cost functions. Other solutions are to choose different initial values or to increase the learning rate so that small ridges can be jumped over.
What happens in Linear Regression when the cost function has two or more parameters? What does it look like, and what problems can arise?
The cost function won’t be a line anymore but a 2D shape (or more-dimensional).
This cost function can be elongated in one direction if the features have different scales (e.g. a grade ranging from 1 to 5, but height ranging from 1.50 to 2.50 m). The features should have similar scales or convergence will take longer!
Therefore, feature scaling can be used to, for example, scale every feature from 0-1.
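A small sketch of min-max feature scaling to the range 0-1, using the two invented example features from above (grade and height):

```python
import numpy as np

X = np.array([
    [1.0, 1.50],    # [grade (1-5), height in m]
    [3.0, 1.80],
    [5.0, 2.50],
])

X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)   # every feature now ranges from 0 to 1

print(X_scaled)                            # similar scales -> faster convergence
```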