What is supervised machine learning?
"Supervised learning is a type of machine learning where the model is trained on labeled data, meaning the input data comes with corresponding correct outputs (labels). During training, the model's predictions are compared with the expected outputs to guide its learning process. The goal of supervised learning is to generalize from the training data so that it can accurately predict outputs for new, unseen data."
What are the advantages and disadvantages of supervised learning?
Advantages:
Effective for solving real-world problems
Supervised learning can effectively address practical issues, such as spam email detection and object recognition, by learning patterns from labeled datasets.
Learns from past data to predict new outcomes
By training on historical data, the model gains the ability to generalize and make accurate predictions on unseen data.
Disadvantages:
High computational cost
Training supervised learning models can require significant time and computational resources, especially for large datasets or complex algorithms.
Data pre-processing is essential and time-consuming
Accurate predictions depend heavily on high-quality input data. Pre-processing tasks like cleaning, normalizing, and structuring data add to the development time and complexity.
Frequent updates required
Supervised learning models must be monitored and updated regularly to maintain accuracy. For example, a spam detection model needs frequent updates to adapt to evolving patterns in spam emails.
Prone to overfitting
Overfitting occurs when the model learns the training data too well, including noise or irrelevant details, and fails to generalize effectively to new data. This leads to high accuracy on training data but poor performance on unseen inputs.
What are some real-life applications for supervised learning?
Evaluate risk in financial services or insurance domains
Image classification & visual recognition
Fraud detection
What are the types of supervised machine learning models and what are their differences?
Classification is used to predict the class or category of input data (independent variables). The output is a discrete label, such as 'spam' or 'not spam,' or binary decisions like 'Yes' or 'No.'
Regression is used to predict continuous numerical values (dependent variables) based on one or more input features (independent variables). It is commonly used for forecasting (e.g., house price, humidity, or stock values) and analyzing cause-and-effect relationships between variables.
Explain the linear regression algorithm and how to evaluate its performance?
Linear Regression
The number of independent variables is one, and the relationship between the independent input variable and the dependent output variable is linear.
The typical method of linear regression is finding the best-fit straight line y = a1·x + a0.
To find the best values for a1 and a0, calculate the mean squared error (MSE) using the cost function: MSE = (1/N) · Σ (yᵢ − (a1·xᵢ + a0))²
The goal is to minimize the MSE (the output of the cost function), e.g. by using the gradient descent method to get the best values of a0 and a1. The gradient is calculated by taking the partial derivative of the cost function with respect to a1 or a0. The gradient determines whether it is necessary to increase or decrease the values of a1 and a0. Furthermore, the learning rate can be adjusted as a hyperparameter determining how big the steps towards the final value should be. A smaller learning rate takes longer than a bigger one but reaches the minimum more precisely.
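As a rough illustration, here is a minimal NumPy sketch of gradient descent for simple linear regression; the synthetic data, learning rate, and iteration count are illustrative assumptions, not values from these notes:

```python
import numpy as np

# Illustrative synthetic data: y is roughly 2*x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)

a1, a0 = 0.0, 0.0      # slope and intercept, starting values
learning_rate = 0.01   # hyperparameter: how big each step is
n_iterations = 1000
n = len(x)

for _ in range(n_iterations):
    y_pred = a1 * x + a0
    error = y_pred - y
    # Partial derivatives of the MSE cost with respect to a1 and a0
    grad_a1 = (2 / n) * np.sum(error * x)
    grad_a0 = (2 / n) * np.sum(error)
    # Step against the gradient to reduce the MSE
    a1 -= learning_rate * grad_a1
    a0 -= learning_rate * grad_a0

mse = np.mean((a1 * x + a0 - y) ** 2)
print(f"a1 = {a1:.3f}, a0 = {a0:.3f}, MSE = {mse:.3f}")
```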
R-squared (R²) can be used for performance evaluation. It compares the sum of the squared differences between the actual and predicted values (the residuals) with the sum of the squared differences between the actual values and their mean: R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²
The result is between 0% and 100% where 100% means that all data points are located on the fitted regression line.
Explain Binary Classification type
In binary classification the input data can only be divided into two distinct classes, which are used for predictions. Therefore it has two output states, one of which is referred to as the normal state and the other as the abnormal state.
Algorithm:
Logistic Regression
Explain Multi-Class Classification type
In multi-class classification a single input data instance is assigned to exactly one of more than two possible labels/classes.
e.g. classifying an image as "cat," "dog," or "bird."
Algorithm:
K Nearest Neighbours
Just further info: when a single instance can carry several labels at once, the task is called multi-label classification, and the labels are often encoded as a binary vector.
Example: For an image that is tagged as "cat" and "outdoor," the vector might look like [1, 0, 1, 0], where each position corresponds to a label like "cat," "dog," "outdoor," "indoor."
Explain Imbalanced Classification type
Refers to a classification problem where the input data is not equally distributed, meaning there might be significantly fewer examples for one or more classes than for the other classes.
Example: Fraud detection, where 99% of transactions are legitimate, and only 1% are fraudulent.
Many machine learning algorithms assume that the classes are balanced. When this is not the case, they tend to favor the majority class, leading to biased predictions.
Algorithms:
One-class SVM
Explain the logistic regression algorithm
Similar to linear regression but used for binary classification tasks. Uses a linear equation with independent input variables to predict a dependent output value, which can be anywhere between -infinity and +infinity. To get a category from this continuous value, a sigmoid function is used to squash the output between 0 and 1.
By defining a threshold (usually 0.5) the output can be assigned to a class (in simple terms 0 or 1)
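A minimal sketch of that idea; the linear outputs z below are made-up example values:

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued linear output into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Made-up linear outputs (a1*x + a0) for four samples
z = np.array([-3.0, -0.2, 0.1, 4.0])
probabilities = sigmoid(z)
predicted_class = (probabilities >= 0.5).astype(int)  # threshold at 0.5

print(probabilities)    # approx. [0.047 0.450 0.525 0.982]
print(predicted_class)  # [0 0 1 1]
```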
Explain K-Nearest Neighbor algorithm
This algorithm classifies input data based on the surrounding data points. The idea behind it is that a data point will most likely belong to the same class as the majority of the data points nearby, so the data point is assigned the class most common among its K nearest neighbours.
Pros:
no training needed
easy to implement
Cons:
Slow for large datasets, since it must calculate the distance between a new data point and all existing data points
Data preprocessing such as standardisation and normalisation is needed to predict correctly, because the distance calculation is sensitive to feature scales
Explain the confusion matrix.
A tool used to evaluate the performance of a machine learning model. It shows the number of correct predictions (true positives and true negatives) along with the incorrect predictions (false positives and false negatives).
The matrix can then be used to calculate further performance indicators such as precision, recall, F1-score, and accuracy:
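Using the four counts from the matrix (TP, TN, FP, FN), these indicators are calculated as:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 · Precision · Recall / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)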
Explain the ROC - Curve
The ROC Curve is a visualization tool used to evaluate the performance of a binary classification model. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
Threshold of 0: At a threshold of 0, the model classifies all predictions as positive (i.e., every instance is predicted to be in the positive class). This results in:
TPR = 100% (All true positives are classified as positive).
FPR = 100% (All true negatives are incorrectly classified as positive).
Increasing the Threshold: As the decision threshold increases:
The True Positive Rate (TPR) may decrease because fewer instances are classified as positive.
The False Positive Rate (FPR) decreases as well, since fewer negative instances are misclassified as positive.
Ideally, TPR remains high, while FPR reduces as the threshold is adjusted. The optimal threshold usually balances between these two metrics, depending on the specific problem.
The Area Under the ROC Curve (AUC) is a summary measure of the classifier's ability to distinguish between classes.
AUC = 1: Perfect classifier (perfect separation between classes).
AUC = 0.5: The classifier performs no better than random guessing.
AUC < 0.5: The classifier is performing worse than random, i.e., it is making more incorrect predictions than correct ones.
What is cross validation and why is it important?
Cross-validation is a method used to validate the performance and stability of a machine learning model during the training phase. The training data is divided into segments (folds) for learning and validation. This approach allows the model to be validated on previously unseen data, ensuring the evaluation is robust.
If the model's accuracy does not change significantly across different folds, it indicates that the model generalizes well to find patterns and, therefore, has good stability.
Problems which cause a loss of stability:
Underfitting
Overfitting
What is underfitting and Overfitting?
Underfitting occurs when the model does not fit well to the training data, leading to poor performance on both the training and testing data. This typically happens when the model is too simple or lacks the capacity to capture the underlying patterns in the data.
Overfitting occurs when the model captures even small variations or noise in the training data. As a result, it performs well on the training data but fails to generalize to new, unseen data that does not exhibit the same variations.
Both underfitting and overfitting result in decreased stability.
Explain the following Cross-Validation Models:
K-fold
Leave one out
Stratified K-fold
K-fold: Divides the training data into K similar-sized folds. The model trains on K-1 folds and validates using the remaining fold. This process of training and validating is repeated K times, ensuring that each fold is used for validation exactly once while the remaining folds are used for training. Finally, the average of all the validation results is calculated to provide an overall performance metric.
Leave one out: Uses a single sample for validation from the N samples in the dataset, while all other N-1 samples are used for training. This process is repeated N times, such that each sample is used for validation exactly once. At the end, the average of all the validation results is calculated to evaluate the model's performance.
Stratified K-fold: Used for imbalanced datasets where one class has significantly more samples than another. This method forms folds that preserve the class distribution of the input data.
Example: Given 80 images of dogs and 20 images of cats, the method ensures each fold contains the same proportion of classes as the original dataset. For K = 10, each fold would contain 8 dogs and 2 cats, maintaining the 80:20 ratio.
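A minimal scikit-learn sketch of this behaviour; the 80 "dog" / 20 "cat" labels are synthetic, chosen only to mirror the example above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels mirroring the example above: 80 dogs, 20 cats
y = np.array(["dog"] * 80 + ["cat"] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for train_idx, val_idx in skf.split(X, y):
    fold = y[val_idx]
    # Each validation fold keeps the original 80:20 ratio (8 dogs, 2 cats)
    print((fold == "dog").sum(), (fold == "cat").sum())
```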
What are hyperparameters and how can they be tuned?
Hyperparameters are variables that are manually set and that change/influence the learning process of a machine learning model.
The goal is to find the best values for the specified Hyperparameters to get the best result from cross validation.
Basic options for tuning the hyperparameters are:
GridSearchCV:
Tries all the different combinations of hyperparameters that are specified in a grid during training and finds the combination that works best.
e.g. C = {1, 10, 100}, gamma = {0.1, 0.01, 0.001}
RandomizedSearchCV
Samples a fixed number of random combinations from the parameter space, based on the specified probability distributions.
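A minimal sketch of both options on an SVC with the C/gamma values above; the breast-cancer toy dataset and the sampling distributions are illustrative assumptions:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# GridSearchCV: exhaustively tries every combination in the grid
grid = GridSearchCV(SVC(), {"C": [1, 10, 100], "gamma": [0.1, 0.01, 0.001]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# RandomizedSearchCV: samples a fixed number of combinations from distributions
distributions = {"C": loguniform(1e-1, 1e3), "gamma": loguniform(1e-4, 1e-1)}
rand = RandomizedSearchCV(SVC(), distributions, n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```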
Explain the Support Vector Machine and its purpose?
Support Vector Machine (SVM) is a machine learning algorithm that can be used for classification, regression, and outlier detection tasks.
In classification, SVM identifies the optimal boundary (hyperplane) that separates distinct classes. The goal is to find the hyperplane with the maximum margin, which is the largest distance between the hyperplane and the nearest data points of each class (called support vectors). A wider margin reduces the risk of misclassification and improves the model's ability to generalize to new, unseen data.
In simple cases, the hyperplane is a straight line (in 2D space) or a flat plane (in higher dimensions). However, when the data is not linearly separable (e.g., samples overlap), SVM uses the kernel trick. The kernel trick transforms the data into a higher-dimensional space where the classes become separable, without explicitly calculating the transformation. This allows SVM to handle complex, non-linear relationships efficiently.
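A minimal sketch of the kernel trick; make_circles produces a synthetic dataset that a straight line cannot separate (the dataset choice is an assumption for illustration):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)  # kernel trick: implicit higher-dimensional mapping

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))
```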
Creates a scatter matrix to visualize bivariate relationships between combinations of variables.
Creates a list color_list where each element is 'red' if the corresponding value in the 'class' column of the data DataFrame is 'Abnormal', otherwise 'green'.
Uses pd.plotting.scatter_matrix to create a scatter matrix plot of all columns in data except 'class', with points colored according to color_list.
Sets the figure size to 15x15, uses histograms on the diagonal, sets transparency to 0.5, point size to 200, and marker style to '*'.
Saves the plot as 'Scatter_matrix.png'.
Displays the plot using plt.show().
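A sketch of what the described cell might look like; the data DataFrame is assumed to have been loaded earlier (that step is not shown in these notes), and the exact styling may differ from the original notebook:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: data was loaded earlier, e.g. data = pd.read_csv(...) with a 'class' column
color_list = ['red' if cls == 'Abnormal' else 'green' for cls in data.loc[:, 'class']]

# Scatter matrix of all columns except 'class', colored by class, histograms on the diagonal
pd.plotting.scatter_matrix(data.loc[:, data.columns != 'class'],
                           c=color_list,
                           figsize=(15, 15),
                           diagonal='hist',
                           alpha=0.5,
                           s=200,
                           marker='*')
plt.savefig('Scatter_matrix.png')
plt.show()
```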
Creates a count plot to visualize which values appear in the 'class' column and their frequencies.
data.loc[0:9,'class'].value_counts() -> returns the values and their frequencies for rows 0 to 9
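A sketch of the described count plot; using seaborn here is an assumption, since the notes only say "count plot":

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='class', data=data)  # one bar per class value, height = frequency
plt.show()

print(data.loc[0:9, 'class'].value_counts())  # class frequencies for rows 0 to 9
```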
Uses K-Nearest Neighbors classification from the sklearn library. The class of a data point is assigned based on the class of its 3 nearest neighbors. For training, the data is split into x (all columns except the "class" column) and y (the "class" column). The classifier is trained using knn.fit(x, y), and predictions are made on the same data.
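A sketch of the described cell (the data DataFrame is assumed from the earlier cells):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)  # class decided by the 3 nearest neighbours
x = data.loc[:, data.columns != 'class']   # features: every column except 'class'
y = data.loc[:, 'class']                   # target: the 'class' column
knn.fit(x, y)
prediction = knn.predict(x)                # predictions on the same data used for training
print('Prediction:', prediction)
```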
* train: use train dataset for fitting
* test: make prediction on test dataset
* With train and test datasets, fitted data and tested data are completely different
* train_test_split(x,y,test_size = 0.3,random_state = 1)
* x: features
* y: target variables (normal, abnormal)
* test_size: percentage of test size, e.g. test_size = 0.3, meaning test size = 30% and train size = 70%
* random_state: sets a seed; if this seed is the same number, train_test_split() produces the exact same split at each run
* fit(x_train,y_train): fit on train datasets
* score(x_test,y_test): predict and give accuracy on test datasets
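A sketch of the described workflow, reusing the x and y from the previous cell:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 70% train / 30% test; random_state=1 makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)                      # fit on the train split only
print('Accuracy:', knn.score(x_test, y_test))  # predict and score on the unseen test split
```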
Shows the confusion_matrix between the predicted outcome and the real outcome. It shows true positives and true negatives as well as false negatives and false positives, giving more information than the accuracy alone.
The classification report gives further information about:
precision, recall, F1-score and support (the number of true instances per class), as well as accuracy
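A sketch of how these are typically produced with scikit-learn, continuing from the train/test split above:

```python
from sklearn.metrics import confusion_matrix, classification_report

y_pred = knn.predict(x_test)
print(confusion_matrix(y_test, y_pred))       # TP, TN, FP, FN counts
print(classification_report(y_test, y_pred))  # precision, recall, F1-score, support, accuracy
```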
This code performs linear regression and visualizes the results:
Imports LinearRegression from sklearn.linear_model.
Initializes the linear regression model.
Extracts the pelvic_incidence column as x_linear and sacral_slope column as y_linear from data1, reshaping them to 2D arrays.
Creates a prediction space using np.linspace based on the range of x.
Fits the linear regression model to x_linear and y_linear.
Predicts values using the fitted model over the prediction space.
Prints the R^2 score of the model.
Plots the regression line and scatter plot of the data.
Saves the plot as 'Scatter_matrix_reg_lin.png' and displays it.
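A sketch of the described cell; data1 is assumed to be a DataFrame prepared earlier (the notes do not show how), and the plot styling is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Assumption: data1 is a DataFrame prepared earlier with these two columns
x_linear = data1.loc[:, 'pelvic_incidence'].values.reshape(-1, 1)
y_linear = data1.loc[:, 'sacral_slope'].values.reshape(-1, 1)

reg = LinearRegression()
predict_space = np.linspace(x_linear.min(), x_linear.max()).reshape(-1, 1)

reg.fit(x_linear, y_linear)
predicted = reg.predict(predict_space)
print('R^2 score:', reg.score(x_linear, y_linear))

plt.plot(predict_space, predicted, color='black', linewidth=3)  # fitted regression line
plt.scatter(x_linear, y_linear)
plt.xlabel('pelvic_incidence')
plt.ylabel('sacral_slope')
plt.savefig('Scatter_matrix_reg_lin.png')
plt.show()
```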
The code implements Ridge regression, which includes L2 regularization to prevent overfitting by penalizing large coefficients. This helps in creating a more generalized model that performs better on unseen data.
alpha is a hyperparameter for adjusting the "strength" of the penalty. High values will lead to underfitting, low values will lead to overfitting.
The pipeline is used to apply the transformations and the final estimator in sequence.
The StandardScaler is used to apply standardisation to the features. (Standardizing the features ensures that each feature contributes equally to the model.)
The R^2 score indicates how well the model's predictions match the actual data.
The code implements Lasso regression, which includes L1 regularization to prevent overfitting by penalizing the absolute values of the coefficients.
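A sketch of how such pipelines are commonly set up; the alpha values, the train/test split, and reusing x_linear/y_linear from the cell above are assumptions:

```python
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x_train, x_test, y_train, y_test = train_test_split(
    x_linear, y_linear.ravel(), test_size=0.3, random_state=42)

# Pipeline: standardise the features first, then fit the regularised regression
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))  # L2 penalty
lasso_pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))  # L1 penalty

ridge_pipe.fit(x_train, y_train)
lasso_pipe.fit(x_train, y_train)
print('Ridge R^2:', ridge_pipe.score(x_test, y_test))
print('Lasso R^2:', lasso_pipe.score(x_test, y_test))
```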
Binary Class Conversion:
Creates a new column class_binary in the data DataFrame where 'Abnormal' is converted to 1 and any other value is converted to 0.
Feature and Target Separation:
x: All columns except 'class' and 'class_binary'.
y: The 'class_binary' column.
Data Splitting:
Splits the data into training and testing sets with 30% of the data reserved for testing and a random state of 42 for reproducibility.
Logistic Regression Initialization:
Initializes a logistic regression model.
Model Fitting:
Fits the logistic regression model to the training data (x_train and y_train).
Prediction of Probabilities:
predict_proba method is used to predict the probabilities of the test data (x_test). This method returns an array with two columns: the first column contains the probabilities of the class being 0, and the second column contains the probabilities of the class being 1.
[:,1] is used to select the probabilities of the class being 1 (i.e., 'Abnormal').
roc_curve is used to compute the false positive rate (FPR), true positive rate (TPR), and thresholds for different classification thresholds based on y_test and y_pred_prob.
Plots a diagonal line representing a random classifier (plt.plot([0, 1], [0, 1], 'k--')).
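A sketch of the described steps (the data DataFrame is assumed from the earlier cells):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Binary class conversion: 'Abnormal' -> 1, everything else -> 0
data['class_binary'] = [1 if cls == 'Abnormal' else 0 for cls in data['class']]

x = data.drop(['class', 'class_binary'], axis=1)
y = data['class_binary']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

# Probability of class 1 ('Abnormal') for each test sample
y_pred_prob = logreg.predict_proba(x_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')  # diagonal line: random classifier
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
```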
Uses the cross_val_score method to compute K-fold cross-validation together with the LinearRegression algorithm, predicting a continuous numerical value ('sacral_slope') from the independent variable ('pelvic_incidence'). K = 5 means that the training data is split into 5 equally sized folds. K-1 folds are used for training and 1 fold is used for validation. The process of learning is repeated 5 times, changing the validation fold each time. Returns the R^2 score for each iteration and the average R^2 score of all iterations.
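A sketch of the described cell, assuming the x_linear ('pelvic_incidence') and y_linear ('sacral_slope') arrays from the linear-regression cell above:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()
# 5-fold cross-validation: 4 folds for training, 1 for validation, rotated 5 times
cv_scores = cross_val_score(reg, x_linear, y_linear, cv=5)  # default scoring for regressors is R^2
print('R^2 per fold:', cv_scores)
print('Average R^2:', cv_scores.mean())
```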
Uses K-Nearest-Neighbor classification together with GridSearch for hyperparameter tuning, trying different values for K ("n_neighbors") to see which value gives the best performance. Furthermore, cross-validation is used, splitting the training data into 3 equally sized folds, 2 of which are used for training and the other for validation, changing the validation fold 3 times.
Printing the best hyperparameters together with the corresponding best cross-validation score.
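A sketch of the described tuning; the range of candidate K values and reusing the x_train/y_train split from the logistic-regression cell are assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': np.arange(1, 50)}  # candidate values for K (assumed range)
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv=3)    # 3-fold cross-validation per candidate
knn_cv.fit(x_train, y_train)

print('Tuned hyperparameter:', knn_cv.best_params_)
print('Best cross-validation score:', knn_cv.best_score_)
```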