Data Processing – Definition & Purpose
Extracts useful info from raw data.
Similar to an industrial process: input (raw data) → process → output (insights).
Applications: office automation, ticketing, time tracking, resource planning, forecasting, optimization.
Benefits of Systematic Data Processing
Better data analysis & presentation.
Reduces data to meaningful info.
Easier storage & distribution.
Simplified report creation.
Improved productivity & profits.
More accurate decision-making.
Types of Data Processing
Batch Processing
Online Processing
Real-time Processing
Distributed Processing
Time-sharing Processing
Batch Processing
Processes stored data in groups (batch jobs).
Efficient for repetitive tasks.
Examples: monthly billing, reports, inventory.
Online Processing
Interactive; the system processes data as it becomes available.
Uses internet, data centers, cloud computing.
Real-time Processing
Immediate response to inputs/events.
Provides instant outputs.
Higher costs but fast results.
Example: banking transactions.
Distributed Processing
Operations divided across multiple servers.
High fault tolerance (workload reallocated if node fails).
Example: Hadoop Distributed File System (HDFS).
Time-sharing Processing
Multiple users share one CPU in time slots.
Appears as exclusive access.
Typical for mainframes/supercomputers.
Main Database Systems
OLTP
OLAP
OLTP Systems
Online Transaction Processing
Handle reads/writes of individual records at a high rate.
Transactional databases → serve many users simultaneously.
Used for day-to-day operations.
Example: bank account balance updates.
Not suitable for large data scans.
OLAP Systems
Online Analytical Processing
Designed for large analytical queries.
Not efficient for single record searches.
Focus: business insights, data summarization, decision-making.
Used by knowledge workers & decision-makers.
Good for interactive analytics.
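For illustration, a minimal sketch (using Python's built-in sqlite3 and a made-up accounts table) contrasting an OLTP-style single-record update with an OLAP-style scan and aggregation:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 250.0)])

# OLTP-style: read/write one individual record (e.g., a balance update)
conn.execute("UPDATE accounts SET balance = balance - 20 WHERE id = 1")

# OLAP-style: scan and summarize many records for analysis
total, avg = conn.execute("SELECT SUM(balance), AVG(balance) FROM accounts").fetchone()
print(total, avg)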
ETL Layer
Extract, Transform, Load
Integration layer between multiple structured/unstructured data sources & data warehouse.
Steps:
Extract data from sources.
Transform (clean, standardize, format).
Load into warehouse.
Purpose: serve specific business goals (e.g., CRM, web analytics, partner data).
ETL tools automate data flow & enable regular updates.
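A minimal ETL sketch in Python (assuming pandas is installed; the source data and warehouse file name are made up):

import pandas as pd

# Extract: pull records from a source (here an in-memory stand-in for a CRM export)
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "email": [" Ada@Example.com ", "bob@example.com", "bob@example.com", "carol@example.com"],
})

# Transform: clean, standardize, format
raw["email"] = raw["email"].str.strip().str.lower()
clean = raw.dropna(subset=["customer_id"]).drop_duplicates()

# Load: write the result into the warehouse (here just a local CSV file)
clean.to_csv("warehouse_customers.csv", index=False)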
Data Preparation After ETL
Step before data analysis.
Tasks:
Handle missing values.
Remove redundant, incomplete, duplicate, incorrect records.
Goal: create final clean dataset for analysis.
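A small pandas sketch of these preparation tasks (toy values):

import pandas as pd

df = pd.DataFrame({
    "age": [20, None, 35, 35],
    "city": ["Berlin", "Hamburg", None, None],
})

df = df.drop_duplicates()                   # remove duplicate records
df = df.dropna(subset=["age"])              # drop records missing a key value
df["city"] = df["city"].fillna("unknown")   # or impute remaining missing values
print(df)                                   # final clean dataset for analysis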
Data Analysis & Model Building Time
Time depends on:
Processing device power.
Complexity & amount of input data.
Storage location (e.g., data lake, database).
Desired final output.
Machine Learning Basics
Algorithms extract knowledge, uncover properties, predict outcomes.
Supervised learning → future predictions (labeled data).
Unsupervised learning → explore data, hidden patterns (unlabeled data).
Key factors: data size/type, desired insights, use of results.
Problem Types
Classification (supervised)
Regression (supervised)
Clustering (unsupervised)
Classification
Uses labeled data.
Goal: assign data points to classes.
Examples: spam detection, image classification, fraud detection.
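A minimal classification sketch (assuming scikit-learn; toy features and labels):

from sklearn.linear_model import LogisticRegression

X = [[0, 1], [1, 1], [2, 0], [3, 0]]   # labeled feature vectors
y = [0, 0, 1, 1]                       # class labels (e.g., 1 = spam)
model = LogisticRegression().fit(X, y)
print(model.predict([[1.5, 0.5]]))     # assigns a new data point to a class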
Regression
Predicts continuous numerical values.
Fits model to data to find dependencies.
Examples: house prices, energy forecasting, customer lifetime value.
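A minimal regression sketch (assuming scikit-learn; made-up size/price numbers):

from sklearn.linear_model import LinearRegression

X = [[50], [80], [120], [150]]             # e.g., house size in m²
y = [150_000, 230_000, 330_000, 410_000]   # e.g., price
model = LinearRegression().fit(X, y)
print(model.predict([[100]]))              # predicts a continuous value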
Clustering
Groups unlabeled data based on similarity.
Finds hidden patterns/structures.
Examples: customer segmentation, document clustering, market segmentation.
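A minimal clustering sketch (assuming scikit-learn; toy points, number of clusters chosen by hand):

from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [8, 8], [9, 9]]                              # unlabeled points
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                                                     # groups found from similarity alone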
Insight Implementation
Translate machine learning outcomes into actions → guide company decisions.
Must be presented in a user-friendly form (tables, audio, video, images).
Actions (automatic or human-based) are more valuable than just insights.
Model performance judged via:
Evaluation metrics
Key Performance Indicators (KPIs).
Data Visualization
Enhances understanding via graphical representation.
Key skill for data scientists.
Purposes: exploration, error detection, communication of results.
Visualization prevents misleading conclusions from raw numbers.
Common visualization types: tables, histograms, scatter plots, geomaps, charts, heat maps.
Tables
Useful for many variables or mix of numerical & categorical attributes.
Tips:
Sort/group rows by important column.
Sort columns by importance & relation.
Highlight important values (colors, fonts).
Histogram
Shows frequency distribution with bars.
X-axis = bins (value ranges), Y-axis = frequency.
Best for continuous numeric variables.
Key aspects:
Equal bin sizes recommended.
Absolute vs. relative frequency.
Experiment with bin numbers for best clarity.
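A minimal histogram sketch (assuming matplotlib/numpy; random example values):

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=170, scale=10, size=1000)   # e.g., body heights
plt.hist(values, bins=20)        # experiment with the bin count; density=True gives relative frequency
plt.xlabel("value range (bins)")
plt.ylabel("frequency")
plt.show()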
Scatter Plots
Plot two variables → show correlation (linear/nonlinear, pos/neg).
Common in regression analysis.
Characteristics:
Adjust dot size to dataset.
Limited to 2D → use PCA or multiple plots for multivariate data.
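A minimal scatter-plot sketch (assuming matplotlib/numpy; synthetic, positively correlated data):

import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(200)
y = 2 * x + np.random.normal(scale=0.2, size=200)   # roughly linear, positive correlation
plt.scatter(x, y, s=10)                             # s adjusts the dot size to the dataset
plt.xlabel("variable 1")
plt.ylabel("variable 2")
plt.show()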
Maps
Show data using spatial arrangements & color scales.
Geomap → geographic regions with color scales.
Cartogram → distorts regions to reflect a variable.
Periodic table = special type of conceptual map.
Line & Area Charts
Show changes over time.
Line = trends, Area = filled version for quantitative comparison.
Can compare multiple variables (e.g., sales vs. expenses).
Bar Charts
Use bar lengths for categorical data.
Easy comparison of values, trends, minima & maxima.
Pie Charts
Show proportions (sum = 100%).
Each slice = category (with color & legend).
Often annotated with percentages.
Combo Charts
Combine bar & line graphs.
Useful when variables differ significantly.
Highlights comparisons across datasets.
Bubble Charts
Extend scatter plot → add color & size as variables (2–4D).
Used for analyzing patterns & correlations.
Risk: hard to read with too much data.
Can include interactivity (hover/click for details).
Heat Maps
Represent values via colors.
Uses:
Geographic (activity zones, e.g., football field).
Web analytics (user clicks & engagement).
Light = low correlation/activity, Dark = high.
Useful in feature selection (drop less relevant variables).
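A small sketch of a correlation heat map used for feature selection (assuming pandas/matplotlib; random toy features, column names made up):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(100, 4), columns=["f1", "f2", "f3", "f4"])
df["f4"] = df["f1"] * 0.9 + np.random.rand(100) * 0.1   # make f4 strongly correlated with f1

corr = df.corr()
plt.imshow(corr, cmap="Blues")                # light = low, dark = high correlation
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.show()                                    # highly correlated features are candidates to drop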
Visualization Tools
Exploratory → Excel, Jupyter, R, Mathematica.
Presentation → Python, R (libraries for tables & charts).
Interactive → Dashboards (Python, Tableau).
Output Formats – General
Processed data must be saved in understandable formats → enables communication, sharing & decision-making.
Data format = structure & organization of data (rules, syntax, conventions).
Ensures compatibility across systems, software & platforms.
Importance of Choosing the Right Format
Should be machine-readable (type, size, structure info).
Should be human-readable for analysis & visualization.
Should be popular/standardized → interoperability.
Clear & simple formats help identify redundant or correlated data → more accurate processing.
Common Data Formats
XLS
CSV
XML
JSON
Protobuf
Apache Parquet
XLS (Excel Spreadsheet)
Microsoft Excel format.
Stores data in tables (rows = records, columns = variables).
Cells can contain text, numbers, formulas.
Supports summarization & visualization.
Longstanding, widely used in organizations.
CSV (Comma-Separated Values)
Plain text format, tabular structure.
Each line = data record, values separated by commas.
First row = optional header.
Simple, easy for both humans & machines.
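For illustration, a tiny CSV file with a header row and two records (values made up):

name,lastname,age
Charlie,Wood,20
Dana,Smith,22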
XML (Extensible Markup Language)
Markup language, flexible & standard for communication.
Human & machine-readable.
Uses user-defined tags (not predefined).
Must follow rules (opening/closing tags, schema/DTD for validation).
<student>
  <name>Charlie</name>
  <lastname>Wood</lastname>
  <age>20</age>
</student>
JSON (JavaScript Object Notation)
Lightweight, text-based, easy for humans & machines.
Widely used in web apps & APIs.
Written as key-value pairs inside curly braces { }.
Arrays use square brackets [ ].
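For illustration, the same kind of student record as in the XML example above, written as JSON:

{
  "student": {
    "name": "Charlie",
    "lastname": "Wood",
    "age": 20
  }
}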
Protobuf (Protocol Buffers)
Google’s system for serializing structured data.
Language & platform neutral.
Similar to XML but smaller & faster.
Structure defined first, then source code auto-generated for read/write.
Used in Java, Python, C++.
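A minimal sketch of a .proto definition for such a record (proto3 syntax; field numbers illustrative); read/write code in the target language is then generated from this definition:

syntax = "proto3";

message Student {
  string name = 1;
  string lastname = 2;
  int32 age = 3;
}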
Apache Parquet
Column-oriented format in the Hadoop ecosystem.
Efficient for big data & datasets with many columns.
Reads only required columns → faster & reduces I/O.
Provides better compression for same-type data.
Can store nested structures & access fields individually.
Best for large-scale data processing.
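A small pandas sketch of writing a Parquet file and reading back only selected columns (assuming pandas with the pyarrow engine; file and column names are made up):

import pandas as pd

df = pd.DataFrame({"name": ["Charlie", "Dana"], "age": [20, 22], "city": ["Bonn", "Hamburg"]})
df.to_parquet("students.parquet")

# Read back only the required columns → less I/O than scanning the whole file
subset = pd.read_parquet("students.parquet", columns=["name", "age"])
print(subset)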
Data Storage – General
Defines how information is stored for quick access & retrieval.
Increasingly complex due to large volumes & access patterns.
Shift from single server → distributed storage:
Uses multiple servers.
Provides redundancy & scalability.
Requires coordination (store, retrieve, process).
Storage Options
Data Warehouse
Data Lake
Data Mart
Data Warehouse
Stores relational data from business apps/systems.
Provides consolidated historical data for analysis.
Kept separate from operational systems.
Well-defined structure (schema on write).
Based on OLAP architecture.
Used for:
Regular reporting (daily/weekly).
Visualizations.
Business Intelligence (BI).
Data Lake
Stores relational + non-relational data (e.g., IoT, social media).
Schema on read → flexible, all kinds of data.
Supports:
SQL queries.
Real-time analytics.
Machine learning & predictive analytics.
Challenges: cataloging & securing unstructured data.
Best deployed in cloud → security, elasticity, scalability.
Data Mart
Subset of a data warehouse.
Stores information specific to a group of users/department.
Solves specific organizational problems.