What does a basic machine learning algorithm look like?
Weights are assigned to features in the input data.
Output is predicted based on these weights.
Prediction is compared to the real output.
The algorithm adjusts the weights if the prediction differs from the real output.
This process is repeated until the weights are optimized.
This is the general idea behind many machine learning algorithms, especially those involving supervised learning. The iterative process of adjusting weights based on the error between the predicted and actual outputs is a key component of algorithms like linear regression and neural networks.
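A minimal sketch of this loop for a one-feature linear regression; the data, learning rate, and iteration count are made up purely for illustration:
import numpy as np

# toy data roughly following y = 2*x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])

w, b = 0.0, 0.0      # initial weights (arbitrary starting values)
lr = 0.05            # learning rate (illustrative choice)

for _ in range(2000):
    y_pred = w * x + b                 # predict the output from the current weights
    error = y_pred - y                 # compare the prediction to the real output
    w -= lr * (error * x).mean()       # adjust the weights in proportion to the error
    b -= lr * error.mean()

# after enough iterations w ends up close to 2 and b close to 0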
What does a deep learning algorithm look like?
Each neuron in a neural network has connections to all the features of the input data, but the weights of these connections are different for each neuron.
Here's how it works:
Connections to Features: Each neuron in a layer is connected to every feature in the input data (or previous layer).
Different Weights: The weights of these connections are different for each neuron. These weights determine how strongly each feature influences the neuron's output.
For example:
Edge Detection Neuron: This neuron might have higher weights for features that indicate edges (sharp changes in pixel values).
Color Detection Neuron: This neuron might have higher weights for features that indicate specific colors (specific ranges of pixel values).
So, even though both neurons are connected to the same set of features, they focus on different aspects of the input data because their weights are different. This allows the network to learn and identify various patterns in the data, such as edges and colors in an image.
Neural networks can learn to represent complex relationships by adjusting these weights during training, enabling them to perform tasks like image recognition, language translation, and much more.
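A small sketch of one layer in which two neurons see the same input features but apply their own weights; all numbers are made up for illustration:
import numpy as np

features = np.array([0.2, 0.9, 0.4])        # the same input features for every neuron

# one row of weights per neuron
weights = np.array([[0.8, -0.7, 0.1],       # "edge detection" neuron: large weights on change-like features
                    [0.1,  0.2, 0.9]])      # "color detection" neuron: large weights on value-like features
biases = np.array([0.0, 0.0])

outputs = weights @ features + biases       # each neuron computes its own weighted sum of all features
activations = np.maximum(outputs, 0)        # ReLU non-linearity
print(activations)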
What makes machine learning testing different from traditional software testing?
Deterministic vs. Probabilistic Outputs: ML systems produce probabilistic outputs, making specific test cases challenging to write -> the logic behind the output is not known.
Data Dependency: High-quality and diverse data are crucial for ML models, affecting their performance and reliability.
Output Variability: ML systems can produce different outputs for the same input, necessitating statistical validation methods.
Continuous Monitoring: ML systems need ongoing monitoring and retraining to adapt to new data and maintain performance.
Understanding these key differences helps in designing appropriate testing strategies for ML systems, such as:
Using statistical methods for validation.
Emphasizing data quality and diversity.
Implementing continuous monitoring and retraining processes.
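A possible sketch of such statistical validation: instead of asserting an exact output for a given input, assert that an aggregate metric over a test set stays above a threshold (model, data, and threshold are placeholders, not from a specific framework):
def test_model_accuracy(model, X_test, y_test, threshold=0.90):
    predictions = model.predict(X_test)
    accuracy = (predictions == y_test).mean()
    # no assertion on any individual prediction -- only on the aggregate metric
    assert accuracy >= threshold, f"accuracy {accuracy:.2f} is below the threshold {threshold}"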
What are the focus areas of Machine learning testing?
Why is data preprocessing necessary?
The input data heavily impacts the performance of the machine learning model. Since the goal is to produce outputs with high accuracy and precision, it is necessary that the input data can be interpreted correctly. Input data with missing values or outliers might lead to false assumptions (false weighting in the model). Therefore it is advised that the data, especially the data used for training, is equally distributed and covers a variety of scenarios.
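A possible way to spot such outliers before training is the IQR rule; the HP column is just an example, matching the snippets further below:
q1, q3 = data['HP'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = data[(data['HP'] < q1 - 1.5 * iqr) | (data['HP'] > q3 + 1.5 * iqr)]
print(outliers)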
What are some possibilities for data pre-processing?
Data Cleaning: Improves data quality by handling missing values, correcting errors, and removing duplicates.
e.g. use the mean of the other values of the same feature
Example: the age data of some participants in a test is missing -> just take the mean of the ages of the participants we do know
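In pandas this could look like the following (the Age column is an assumed example):
data['Age'] = data['Age'].fillna(data['Age'].mean())   # replace missing ages with the mean of the known ages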
Data Transformation: Converts data into a suitable format through scaling, encoding, and feature engineering.
e.g. changing the pixel size of input images so that all images have the same size, or using one-hot encoding for zip codes (categorical data)
Data Integration: Combines data from different sources to provide a unified view.
e.g. merging different tables based on a common feature
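A minimal pandas sketch of such a merge; table and column names are illustrative:
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ada', 'Bob']})
orders = pd.DataFrame({'customer_id': [1, 1, 2], 'amount': [10, 25, 5]})

merged = pd.merge(customers, orders, on='customer_id')   # join the tables on the common feature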
Data Standardization/Normalization: Ensures consistent scale and format, improving model performance.
Normalization -> scales the features of the dataset to a range, typically [0, 1] or [-1, 1].
Standardization -> scales the features of the dataset to have a mean of 0 and a standard deviation of 1.
Standardization is used for algorithms that assume a standard normal distribution of the data, such as linear regression and logistic regression, while normalization is used for algorithms that do not require a standard normal distribution, such as k-nearest neighbors and neural networks.
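A minimal sketch with scikit-learn's scalers; the feature values are made up:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

X_norm = MinMaxScaler().fit_transform(X)     # normalization: values scaled into [0, 1]
X_std = StandardScaler().fit_transform(X)    # standardization: mean 0, standard deviation 1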
data.info()
General Information:
data type
number of samples (rows)
number of features (columns)
feature types and memory usage
print(data['Type 1'].value_counts(dropna=False))
displays the different categories that are within the column vector ‘Type 1’ and the corresponding frequencies
data.describe()
Used to display basic statistics of the data such as:
mean
std
min
max
1st quartile, median, 3rd quartile
data1 = data.head()
data2 = data.tail()
conc_data_row = pd.concat([data1, data2], axis=0, ignore_index=True)
conc_data_row
Concatenates the first 5 rows of the data with the last 5 rows of the data.
axis = 0 -> means concatenate rows (1 -> columns)
ignore_index = True -> the original indexes are discarded and the result is re-indexed from 0 to n-1
data['Type 2'].isna().sum()
data.isna().sum()
data1 = data.dropna(axis='columns')
data1 = data.dropna()
1. checks the amount of NaN values in the Type 2 column of the data
2. checks the amount of NaN values of each column in the data separately
data.isna().sum().sum() will take the NaN counts found for each column (feature) and sum them up
3. drops all columns that have at least one NaN value in the column
4. drops all rows which have at least one NaN value
values = {"Name": 'Unknown', "Type 2": 0}
data2 = data.fillna(value=values)
Replaces all NaN values of the column “Name” with ‘Unknown’ and all NaN values of the column “Type 2” with 0
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(data[['Type 1']]).toarray())
final_df = data.join(encoder_df)
Takes the categorical strings of the Type 1 column and converts them into a one-hot encoded representation, which is joined to the data. Machine learning algorithms can learn better with one-hot encoded values than with strings.
data["HP"][1]
data.HP[1]
data.loc[1,["HP"]]
All code snippets extract the 2nd element of the column “HP” (indexing starts at 0)
boolean = data.HP > 200 # filtering data vectors where HP > 200
print(boolean)
data[boolean]
1. creates a true/false vector from the HP column depending on whether the value is bigger or smaller than 200
2. data[boolean] will return only the rows which are labeled True
first_filter = data.HP > 150
second_filter = data.Speed > 35
data[first_filter & second_filter]
extracts only the rows for which both filters resolve to True