Data Transmission
Definition: transfer of data between devices via streams/channels.
Increasing importance due to growing data volumes.
Reliability & high-speed transfer critical for businesses.
Example (IoT in logistics): GPS/sensors transmit location, temperature, humidity → enable real-time monitoring & fast responses to issues.
Basics of Information Theory (Shannon, 1948)
Mathematical study of data transmission & processing limits.
Any information can be represented in binary (0/1).
Shannon’s contributions:
Defined channel capacity considering noise.
Showed that suitable encoding schemes can make transmission approach the channel capacity.
Developed communication model with 6 elements:
Source (origin of message).
Encoder (converts the message into signals and adapts it to the channel).
Channel (medium: wires, air, fiber).
Noise (interference: sounds, weather).
Decoder (signal → understandable format).
Receiver (final entity).
entropy
measure of information content; lower bound on the average number of bits per symbol needed to encode the data without loss: H = −Σ p(x) · log₂ p(x).
Shannon’s Source coding theorem
states that lossless compression cannot reduce the average code length below the entropy of the source.
noisy-channel coding theorem
states that data can be sent over a noisy channel with arbitrarily low error probability, as long as the transmission rate stays below the channel capacity.
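A minimal Python sketch of how entropy quantifies information content (bits per symbol); the message strings are arbitrary examples, not from the notes:

```python
import math
from collections import Counter

def shannon_entropy(message: str) -> float:
    """Entropy in bits per symbol: H = -sum(p * log2(p)) over symbol probabilities."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A skewed symbol distribution carries less information per symbol than a uniform one,
# so it can be compressed further (source coding theorem).
print(shannon_entropy("aaaaaaab"))  # ~0.54 bits/symbol
print(shannon_entropy("abcdabcd"))  # 2.0 bits/symbol
```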
OSI Stack Model (ISO Standard)
Framework describing interaction of hardware & software for network communication.
7 layers of OSI Stack Model
Physical layer → physical resources (cables, hubs, modems), transmission of signals.
Data link layer → LLC (protocols, flow/error control) + MAC (controls access to the shared medium, links the physical layer to LLC). Detects/corrects transmission errors.
Network layer → IP addressing, routing, forwarding packets.
Transport layer → packet order, protocols: TCP (reliable) & UDP (fast, no guarantee).
Session layer → maintains device communication (authentication, reconnections).
Presentation layer → “translator” (formatting, syntax, encryption/decryption).
Application layer → user interface for network resources (browsers, email, FTP).
Data Quality & Cleansing
Data from multiple sources often suffers from issues: noisy, inaccurate, incomplete, inconsistent, missing, duplicate, or outlying values.
Key quality factors: accuracy, interpretability, consistency, completeness.
Poor quality affects ML models → inaccurate predictions & unreliable results.
data preparation steps
Data cleaning (fix missing values, outliers, duplicates).
Data transformation.
Dimensionality reduction.
Missing Values & Outliers
Missing values: unobserved or incorrectly recorded data.
Outliers:
“True” outliers → real but unusual events.
“Fake” outliers → data errors.
Methods to handle missing/outlier values
Removal of records (valid for large datasets; risky if important info is lost).
Linear interpolation (estimate missing values from neighboring points).
Mean imputation (replace with average of variable).
Mode imputation (replace with most frequent value).
Tracking changes
add indicator variable (0 = original, 1 = modified).
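A possible pandas sketch of these methods (pandas is one common choice, not prescribed by the notes); the column name and sensor values are made up, with 400.0 standing in for a "fake" outlier:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"temperature": [21.0, np.nan, 23.5, 400.0, 22.0]})

# Tracking changes: indicator variable (0 = original, 1 = modified)
df["temperature_modified"] = df["temperature"].isna().astype(int)

# Treat the implausible reading as missing and mark that record as modified too
mask = df["temperature"] > 60
df.loc[mask, "temperature_modified"] = 1
df.loc[mask, "temperature"] = np.nan

# Option 1: linear interpolation from neighboring points
interpolated = df["temperature"].interpolate(method="linear")

# Option 2: mean imputation (replace with the average of the variable)
mean_imputed = df["temperature"].fillna(df["temperature"].mean())

# Option 3: removal of records (reasonable for large datasets)
dropped = df.dropna(subset=["temperature"])
```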
Duplicate Records
Common in datasets from multiple sources (e.g., separate regional customer databases).
Duplicates can distort analysis → e.g., splitting customer history across records.
Leads to inefficiency, wrong insights, longer computing time.
Good practice: detect & remove duplicates to improve accuracy & efficiency.
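A short sketch of duplicate detection with pandas, assuming two made-up regional customer tables that share one record:

```python
import pandas as pd

region_a = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Bob"]})
region_b = pd.DataFrame({"customer_id": [2, 3], "name": ["Bob", "Cleo"]})

customers = pd.concat([region_a, region_b], ignore_index=True)

# Detect and remove duplicates so one customer's history is not split across records
duplicates = customers.duplicated(subset=["customer_id"], keep="first")
customers = customers.drop_duplicates(subset=["customer_id"], keep="first")
```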
data normalization
performed during the preprocessing stage; removes the effect of the unit of measurement
transforms the data in such a way that the new values fall within a smaller or common range, e.g. between 0 and 1
gives all the attributes an equal weight
prevents one attribute from dominating the results due to its larger magnitude
min-max normalization
linear transformation of the original data
scales the data into a fixed range, typically between the values 0 (min) and 1 (max)
min-max normalization formula: x' = (x − min) / (max − min) · (new_max − new_min) + new_min; for the usual [0, 1] range this reduces to x' = (x − min) / (max − min)
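A minimal NumPy sketch of the formula above; the price values are taken from the binning example further down and used here only for illustration:

```python
import numpy as np

def min_max_normalize(x: np.ndarray, new_min: float = 0.0, new_max: float = 1.0) -> np.ndarray:
    """Linearly rescale x into [new_min, new_max]."""
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

prices = np.array([1, 1, 3, 4, 5, 6, 7, 7, 9, 9], dtype=float)
print(min_max_normalize(prices))  # 1 -> 0.0, 9 -> 1.0
```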
z-score normalization
transforms numerical data so that it has a mean of 0 and a standard deviation of 1
how many standard deviations a particular data point is away from the mean
z-score normalization formula: z = (x − μ) / σ, where μ is the mean and σ the standard deviation of the attribute
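A small sketch of z-score normalization on made-up values, checking that the result has mean 0 and standard deviation 1:

```python
import numpy as np

def z_score_normalize(x: np.ndarray) -> np.ndarray:
    """Center to mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

values = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
z = z_score_normalize(values)
print(z.mean(), z.std())  # ~0.0 and 1.0
```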
decimal scaling
shift the decimal point of the original data value by a factor of 10: x' = x / 10^j, where j is the smallest integer such that max(|x'|) < 1
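A minimal sketch of decimal scaling; the input values are made up:

```python
import numpy as np

def decimal_scale(x: np.ndarray) -> np.ndarray:
    """Divide by the smallest power of 10 that brings every absolute value below 1."""
    j = int(np.floor(np.log10(np.abs(x).max()))) + 1
    return x / (10 ** j)

print(decimal_scale(np.array([-475.0, 310.0, 982.0])))  # -> [-0.475, 0.31, 0.982]
```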
logarithm transformation
Definition: x' = log(x) (commonly base 10 or the natural logarithm).
Purpose: applied e.g. before linear regression when the relationship between variables is not linear.
Example: Population data → log transformation makes population vs. area relationship more linear.
Effect:
Compresses large values.
Stretches small values.
Produces a more balanced range of values.
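A short sketch of the effect, using made-up population figures and a base-10 logarithm:

```python
import numpy as np

population = np.array([8_000, 120_000, 3_500_000, 39_000_000], dtype=float)

# log compresses large values and stretches small ones, giving a more balanced range
log_population = np.log10(population)
print(log_population)  # ~[3.9, 5.1, 6.5, 7.6]
```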
Data Discretization
Definition: Converts continuous values (e.g., age) into intervals (0–18, 19–30) or conceptual labels (child, adult).
Purpose:
Handles non-linear continuous variables.
Required when algorithms need categorical inputs.
Types:
Supervised: Uses class information.
Unsupervised: No class information used.
Data Discretization Main Techniques
Binning (unsupervised)
Histogram Analysis (unsupervised)
Binning
Groups continuous values into intervals (“bins”).
Values replaced by bin mean or median (smoothing).
Example: Prices [1,1,3,4,5,6,7,7,9,9] → bins.
Equal width: same interval size (1–3, 4–6, 7–9).
Equal frequency: bins contain same number of values.
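A possible pandas sketch of both binning variants and smoothing by bin mean, using the price example above (pandas is an assumption, not prescribed by the notes):

```python
import pandas as pd

prices = pd.Series([1, 1, 3, 4, 5, 6, 7, 7, 9, 9])

# Equal-width bins: intervals of the same size
equal_width = pd.cut(prices, bins=3)

# Equal-frequency bins: each bin holds (roughly) the same number of values
equal_freq = pd.qcut(prices, q=3)

# Smoothing: replace every value with the mean of its (equal-width) bin
smoothed = prices.groupby(equal_width, observed=True).transform("mean")
```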
Histogram Analysis
Graphical representation of value distribution.
Divides values into disjoint buckets.
Buckets that represent a single attribute value (and its frequency) are called “singleton buckets.”
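A minimal sketch of splitting the same price values into three disjoint buckets with NumPy:

```python
import numpy as np

prices = np.array([1, 1, 3, 4, 5, 6, 7, 7, 9, 9])

# Three disjoint buckets over the value range; the counts describe the distribution
counts, edges = np.histogram(prices, bins=3)
print(counts)  # [3 3 4]
print(edges)   # bucket boundaries
```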
Data Dimensionality Reduction
Goal: Reduce number of variables without losing main dataset properties.
Benefits:
Less execution time & storage.
Removes irrelevant features → higher model accuracy.
Easier interpretation & visualization.
Techniques of Dimensionality Reduction
Feature Selection
Correlation Analysis
Feature Extraction
Feature Selection
Selects a subset of relevant features.
Removes noisy or biased data.
Lowers model complexity → more efficient & accurate.
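A possible feature-selection sketch using scikit-learn (an assumed library choice) and its bundled iris dataset: keep only the two features that relate most strongly to the class label.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 most relevant features, drop the rest
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (150, 4) -> (150, 2)
```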
Correlation Analysis
Measures relationships between feature pairs.
Correlation coefficient (ρ):
Range: -1 to 1.
ρ = 1 → perfect positive correlation.
ρ = 0 → no correlation.
ρ = -1 → perfect negative correlation.
Highly correlated features → one can be removed.
Methods: Pearson, Kendall, Spearman, Point-Biserial, Chi-squared test.
correlation coefficient formula: ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y)
covariance formula: cov(X, Y) = (1/n) Σ (x_i − x̄)(y_i − ȳ)
standard deviation formula: σ_X = √((1/n) Σ (x_i − x̄)²)
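A short Pearson-correlation sketch with pandas on made-up housing features: flat size and room count are constructed to be almost perfectly correlated, so one of them could be dropped.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
size_m2 = rng.uniform(30, 200, 100)
df = pd.DataFrame({
    "size_m2": size_m2,
    "rooms": (size_m2 / 25).round(),        # nearly redundant with size_m2
    "age_years": rng.uniform(0, 80, 100),   # unrelated feature
})

# Pairwise Pearson correlation coefficients; values close to 1 or -1 flag redundancy
print(df.corr(method="pearson"))
```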
Feature Extraction
Transforms original features into new, lower-dimensional ones.
Goal: Keep relevant info, reduce overfitting & complexity.
Examples:
PCA (Principal Component Analysis): Creates new “principal components” capturing maximum variance.
LDA (Linear Discriminant Analysis).
KPCA (Kernel PCA).
Applications: healthcare data, image resizing, stock data analysis.
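A minimal PCA sketch with scikit-learn (an assumed library choice) on its bundled iris dataset: standardize first so no feature dominates by scale, then project onto two principal components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)               # (150, 2): 4 original features -> 2 principal components
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```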