Data Processing – Definition & Purpose
Extracts useful info from raw data.
Similar to industrial process: input (raw data) → process → output (insights).
Applications: office automation, ticketing, time tracking, resource planning, forecasting, optimization.
Benefits of Systematic Data Processing
Better data analysis & presentation.
Reduces data to meaningful info.
Easier storage & distribution.
Simplified report creation.
Improved productivity & profits.
More accurate decision-making.
Data Processing Pipeline
Data collection
Data preprocessing
Data analysis & model building
Insight implementation
Parallel: Data storage
Types of Data Processing
Batch Processing
Online Processing
Real-time Processing
Distributed Processing
Time-sharing Processing
Batch Processing
Processes stored data in groups (batch jobs).
Efficient for repetitive tasks.
Examples: monthly billing, reports, inventory.
Online Processing
Interactive; the system processes data as it becomes available.
Uses the internet, data centers, and cloud computing.
Real-time Processing
Immediate response to inputs/events.
Provides instant outputs.
Higher costs but fast results.
Example: banking transactions.
Distributed Processing
Operations divided across multiple servers.
High fault tolerance (workload reallocated if node fails).
Example: Hadoop Distributed File System (HDFS).
Time-sharing Processing
Multiple users share one CPU in time slots.
Appears as exclusive access.
Typical for mainframe/supercomputers.
Main Database Systems
OLTP
OLAP
OLTP Systems
Online Transaction Processing
Handle reads/writes of individual records at a high rate.
Transactional databases → many users simultaneously.
Used for day-to-day operations.
Example: bank account balance updates.
Not suitable for large data scans.
OLAP Systems
Online Analytical Processing
Designed for large analytical queries.
Not efficient for single record searches.
Focus: business insights, data summarization, decision-making.
Used by knowledge workers & decision-makers.
Good for interactive analytics.
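A minimal sketch contrasting the two query styles with Python's built-in sqlite3 module (the accounts table and its columns are illustrative assumptions, not from the source):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                 [(1, "Alice", 100.0), (2, "Bob", 250.0)])

# OLTP-style: read/write one record at a time (e.g., update a single balance)
conn.execute("UPDATE accounts SET balance = balance - 20 WHERE id = 1")
conn.commit()

# OLAP-style: scan and aggregate many records for an analytical summary
total, avg = conn.execute("SELECT SUM(balance), AVG(balance) FROM accounts").fetchone()
print(total, avg)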
ETL Layer
Extract, Transform, Load
Integration layer between multiple structured/unstructured data sources & data warehouse.
Steps:
Extract data from sources.
Transform (clean, standardize, format).
Load into warehouse.
Purpose: serve specific business goals (e.g., CRM, web analytics, partner data).
ETL tools automate data flow & enable regular updates.
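A minimal ETL sketch in Python with pandas, assuming a hypothetical source file orders.csv and an SQLite database standing in for the warehouse:

import pandas as pd
import sqlite3

# Extract: read raw data from a source (hypothetical file)
raw = pd.read_csv("orders.csv")

# Transform: clean, standardize, format
raw = raw.dropna(subset=["order_id"])              # drop rows without a key
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["country"] = raw["country"].str.upper()        # standardize categorical values

# Load: write the cleaned table into the warehouse
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("orders", conn, if_exists="replace", index=False)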
Data Preparation After ETL
Step before data analysis.
Tasks:
Handle missing values.
Remove redundant, incomplete, duplicate, incorrect records.
Goal: create final clean dataset for analysis.
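A minimal sketch of these preparation steps with pandas (file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("raw_data.csv")                    # hypothetical input file

df = df.drop_duplicates()                           # remove duplicate records
df = df.dropna(subset=["customer_id"])              # drop rows missing a mandatory field
df["age"] = df["age"].fillna(df["age"].median())    # impute remaining missing values
df = df[df["age"].between(0, 120)]                  # discard obviously incorrect records

df.to_csv("clean_data.csv", index=False)            # final clean dataset for analysis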
Data Analysis & Model Building
Time depends on:
Processing device power.
Complexity & amount of input data.
Storage location (e.g., data lake, database).
Desired final output.
Machine Learning Basics
Algorithms extract knowledge, uncover properties, predict outcomes.
Applicable to many model types, simple or complex.
Supervised learning → future predictions (labeled data).
Unsupervised learning → explore data, hidden patterns (unlabeled data).
No universal best method → experimentation needed.
Key factors: data size/type, desired insights, use of results.
Problem Types
Classification (supervised)
Regression (supervised)
Clustering (unsupervised)
Classification
Uses labeled data.
Goal: assign data points to classes.
Examples: spam detection, image classification, fraud detection.
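A minimal classification sketch with scikit-learn, using its bundled iris dataset as stand-in labeled data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                   # labeled data: features + class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)                           # learn from labeled examples
print(accuracy_score(y_test, clf.predict(X_test)))  # assign classes to unseen points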
Regression
Predicts continuous numerical values.
Fits model to data to find dependencies.
Examples: house prices, energy forecasting, customer lifetime value.
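A minimal regression sketch with scikit-learn on synthetic data, illustrating prediction of a continuous value:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(50, 200, size=(100, 1))             # e.g., house size in m^2 (synthetic)
y = 3000 * X[:, 0] + rng.normal(0, 20000, 100)      # e.g., price with noise (synthetic)

model = LinearRegression().fit(X, y)                # fit model to find the dependency
print(model.predict([[120]]))                       # predict a continuous value for 120 m^2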
Clustering
Groups unlabeled data based on similarity.
Finds hidden patterns/structures.
Examples: customer segmentation, document clustering, market segmentation.
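A minimal clustering sketch with scikit-learn's k-means on synthetic, unlabeled points:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# unlabeled data: two synthetic groups of points
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5], km.cluster_centers_)          # group assignments & group centers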
Insight Implementation
Translate machine learning outcomes into actions → guides company decisions.
Must be presented in a user-friendly form (tables, audio, video, images).
Actions (automatic or human-based) are more valuable than just insights.
Model performance judged via:
Evaluation metrics
Key Performance Indicators (KPIs).
Data Visualization
Enhances understanding via graphical representation.
Key skill for data scientists.
Purposes: exploration, error detection, communication of results.
Visualization prevents misleading conclusions from raw numbers.
Common visualization types: tables, histograms, scatter plots, geomaps, charts, heat maps.
Tables
Useful for many variables or mix of numerical & categorical attributes.
Tips:
Sort/group rows by important column.
Sort columns by importance & relation.
Highlight important values (colors, fonts).
Histogram
Shows frequency distribution with bars.
X-axis = bins (value ranges), Y-axis = frequency.
Best for continuous numeric variables.
Key aspects:
Equal bin sizes recommended.
Absolute vs. relative frequency.
Experiment with bin numbers for best clarity.
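A minimal histogram sketch with matplotlib on synthetic values, illustrating bin count and relative frequency:

import numpy as np
import matplotlib.pyplot as plt

values = np.random.default_rng(0).normal(loc=170, scale=10, size=1000)  # synthetic data

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(values, bins=10)                       # few bins: coarse picture
axes[1].hist(values, bins=50, density=True)         # more bins + relative frequency
plt.show()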
Scatter Plots
Plot two variables → show correlation (linear/nonlinear, pos/neg).
Common in regression analysis.
Characteristics:
Adjust dot size to dataset.
Limited to 2D → use PCA or multiple plots for multivariate data.
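A minimal scatter-plot sketch with matplotlib on synthetic, positively correlated data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, 2, 200)                   # positive linear relationship with noise

plt.scatter(x, y, s=10)                             # small dot size for a denser dataset
plt.xlabel("x"); plt.ylabel("y")
plt.show()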
Maps
Show data using spatial arrangements & color scales.
Geomap → geographic regions with color scales.
Cartogram → distorts regions to reflect a variable.
Periodic table = special type of conceptual map.
Line & Area Charts
Show changes over time.
Line = trends, Area = filled version for quantitative comparison.
Can compare multiple variables (e.g., sales vs. expenses).
Bar Charts
Use bar lengths for categorical data.
Easy comparison of values, trends, minima & maxima.
Pie Charts
Show proportions (sum = 100%).
Each slice = category (with color & legend).
Often annotated with percentages.
Combo Charts
Combine bar & line graphs.
Useful when variables differ significantly.
Highlights comparisons across datasets.
Bubble Charts
Extend scatter plots → size & color encode extra variables (3–4 dimensions).
Used for analyzing patterns & correlations.
Risk: hard to read with too much data.
Can include interactivity (hover/click for details).
Heat Maps
Represent values via colors.
Uses:
Geographic (activity zones, e.g., football field).
Web analytics (user clicks & engagement).
Light = low correlation/activity, Dark = high.
Useful in feature selection (drop less relevant variables).
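A minimal sketch of a correlation heat map for feature selection, using pandas and matplotlib on synthetic columns:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 0.9 + rng.normal(0, 0.1, 100)   # strongly correlated with "a"
df["c"] = rng.normal(size=100)                      # unrelated column

corr = df.corr()                                    # pairwise correlations
plt.imshow(corr.values, cmap="viridis")
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns)
plt.yticks(range(len(corr)), corr.columns)
plt.show()                                          # extreme cells reveal redundant features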
Visualization Tools
Exploratory → Excel, Jupyter, R, Mathematica.
Presentation → Python, R (libraries for tables & charts).
Interactive → Dashboards (Python, Tableau).
Output Formats – General
Processed data must be saved in understandable formats → enables communication, sharing & decision-making.
Data format = structure & organization of data (rules, syntax, conventions).
Ensures compatibility across systems, software & platforms.
Importance of Choosing the Right Format
Should be machine-readable (type, size, structure info).
Should be human-readable for analysis & visualization.
Should be popular/standardized → interoperability.
Clear & simple formats help identify redundant or correlated data → more accurate processing.
Common Data Formats
XLS
CSV
XML
JSON
Protobuf
Apache Parquet
XLS (Excel Spreadsheet)
Microsoft Excel format.
Stores data in tables (rows = records, columns = variables).
Cells can contain text, numbers, formulas.
Supports summarization & visualization.
Longstanding, widely used in organizations.
CSV (Comma-Separated Values)
Plain text format, tabular structure.
Each line = data record, values separated by commas.
First row = optional header.
Simple, easy for both humans & machines.
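A minimal example of the format (made-up values):

name,lastname,age
Charlie,Wood,20
Dana,Smith,22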
XML (Extensible Markup Language)
Markup language; a flexible, standard format for data exchange.
Human & machine-readable.
Uses tags (not predefined, user-defined).
Must follow rules (opening/closing tags, schema/DTD for validation).
<student>
<name>Charlie</name>
<lastname>Wood</lastname>
<age>20</age>
</student>
JSON (JavaScript Object Notation)
Lightweight, text-based, easy for humans & machines.
Widely used in web apps & APIs.
Written as key-value pairs inside curly braces { }.
Arrays use square brackets [ ].
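A minimal example (same student record as the XML above, with a made-up courses array to show the bracket syntax):

{
  "name": "Charlie",
  "lastname": "Wood",
  "age": 20,
  "courses": ["math", "physics"]
}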
Protobuf (Protocol Buffers)
Google’s system for serializing structured data.
Language & platform neutral.
Similar to XML but smaller & faster.
Structure defined first, then source code auto-generated for read/write.
Used in Java, Python, C++.
Apache Parquet
Column-oriented format in the Hadoop ecosystem.
Efficient for big data & datasets with many columns.
Reads only required columns → faster & reduces I/O.
Provides better compression for same-type data.
Can store nested structures & access fields individually.
Best for large-scale data processing.
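A minimal Python/pandas sketch of writing a Parquet file and reading back a single column (assumes the pyarrow or fastparquet package is installed; data is made up):

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "city": ["Rome", "Oslo", "Lima"],
                   "sales": [10.5, 3.2, 7.8]})
df.to_parquet("sales.parquet")                                 # columnar, compressed storage

cities = pd.read_parquet("sales.parquet", columns=["city"])    # read only the needed column
print(cities)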
Data Storage General
Defines how information is stored for quick access & retrieval.
Increasingly complex due to large volumes & access patterns.
Shift from single server → distributed storage:
Uses multiple servers.
Provides redundancy & scalability.
Requires coordination (store, retrieve, process).
Storage Options
Data Warehouse
Data Lake
Data Mart
Data Warehouse
Stores relational data from business apps/systems.
Provides consolidated historical data for analysis.
Kept separate from operational systems.
Well-defined structure (schema on write).
Based on OLAP architecture.
Used for:
Regular reporting (daily/weekly).
Visualizations.
Business Intelligence (BI).
Data Lake
Stores relational + non-relational data (e.g., IoT, social media).
Schema on read → flexible, all kinds of data.
Supports:
SQL queries.
Real-time analytics.
Machine learning & predictive analytics.
Challenges: cataloging & securing unstructured data.
Best deployed in cloud → security, elasticity, scalability.
Data Mart
Subset of a data warehouse.
Stores information specific to a group of users/department.
Solves specific organizational problems.