Raw Data
Unprocessed, unorganized data collected from sources; needs processing to become useful
Key Concepts
Data Mining: Discovering patterns in large datasets.
Business Intelligence (BI): Focuses on historical and real-time data for descriptive analytics; uses tools like Excel, SQL.
Data Science vs. BI: Data science focuses on predictive/prescriptive analytics, algorithms, programming, and machine learning.
Data → Information → Knowledge:
Data: Raw facts without context.
Information: Organized, contextualized data.
Knowledge: Insights and understanding from information.
Artificial Intelligence (AI)
Goal: Create systems performing tasks needing human-like intelligence.
Types:
ANI: Narrow AI (e.g., NLP, facial recognition).
AGI: General AI (not yet achieved; handles any task).
ASI: Superintelligence (future potential; surpasses human intelligence).
Machine Learning (ML)
Definition: Systems learn from data to make predictions.
Supervised (classification, regression).
Unsupervised (clustering, pattern detection).
Reinforcement learning (goal-oriented learning via rewards/penalties).
Data Splits: Training set (model building) vs. Testing set (performance testing)
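A minimal sketch of a training/testing split, assuming a simple random shuffle is acceptable for the data at hand. The helper name `train_test_split`, the 80/20 ratio, and the toy dataset are illustrative assumptions, not part of the notes.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and return (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled records are held out for testing.
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))  # hypothetical dataset of 10 records
train, test = train_test_split(data, test_fraction=0.2)
```

The model is built only on `train`; `test` stays unseen until performance is measured.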
Model Evaluation
Classification Metrics:
Type I Error: False positives.
Type II Error: False negatives.
Sensitivity: True positive rate.
Specificity: True negative rate.
Regression Metrics: Absolute error, mean square error, relative error.
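The metrics above can be sketched in a few lines. The confusion-matrix counts and the actual/predicted values are made-up illustration data, and averaging the absolute and relative errors over all samples is an assumption about how the notes intend them to be aggregated.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the classification metrics from confusion-matrix counts."""
    return {
        "type_i_errors": fp,             # false positives
        "type_ii_errors": fn,            # false negatives
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


def regression_metrics(actual, predicted):
    """Average absolute, squared, and relative errors over all samples."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    return {
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_squared_error": sum(e * e for e in errors) / n,
        # Relative error is undefined when an actual value is 0.
        "mean_relative_error": sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n,
    }


cls = classification_metrics(tp=40, fp=10, tn=45, fn=5)
reg = regression_metrics(actual=[2.0, 4.0, 5.0], predicted=[1.0, 4.0, 7.0])
```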
Applications of Data Science
Industry: Process automation & optimization (e.g., Shell.ai).
Business: Customer insights, sales predictions, HR analytics (e.g., Google People Analytics).
Text Data: Sentiment analysis, chatbots (e.g., Airbnb AI assistant).
Image Data: Object detection, scene classification (e.g., smart city transport planning).
Medical Data: Disease detection, drug development, remote monitoring (e.g., Pfizer predictive analytics).
Data science activities: Data Flow
Goal: Manage movement of data from source to access.
Data Collection:
Identify sources, relevant attributes, and involved people.
Methods: observation, interviews, surveys, existing records.
Ensure accuracy & reliability.
Data Storage:
Requirements: accessibility, transparency, security.
Consider type, form, frequency of use, and cost.
Options: distributed storage, cloud (watch for security risks).
Data Access:
Structured data → queries, XML.
Unstructured data → NoSQL technologies.
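As an illustration of query-based access to structured data, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ada", "Berlin"), ("Grace", "Hamburg"), ("Alan", "Berlin")],
)

# Structured data can be filtered and sorted with a declarative query.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Berlin",)
).fetchall()
berlin_customers = [name for (name,) in rows]
conn.close()
```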
Data science activities: Data Curation
Goal: Refine and prepare data for use.
Data Preservation:
Remove noise (errors), fix missing values.
Depends on domain knowledge.
Storage lifespan example: flash drives (80–90 yrs), HDDs (3–6 yrs).
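Removing noise and fixing missing values could look like the sketch below. It assumes that domain knowledge supplies a plausible value range and that filling gaps with the median is acceptable; both are illustrative choices, and the sensor readings are invented.

```python
from statistics import median


def clean_readings(readings, low, high):
    """Replace missing (None) and out-of-range values with the median of valid ones."""
    # Values outside [low, high] are treated as noise; the bounds come from domain knowledge.
    valid = [r for r in readings if r is not None and low <= r <= high]
    if not valid:
        raise ValueError("no valid readings to impute from")
    fill = median(valid)
    return [r if r is not None and low <= r <= high else fill for r in readings]


raw = [21.5, None, 22.0, 999.0, 20.5]  # hypothetical temperature readings in °C
cleaned = clean_readings(raw, low=-40.0, high=60.0)
```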
Data Description:
Use schemes & metadata for meaning.
Enables understanding and better use.
Data Publication:
Make data available for others.
Clean, format, describe for usability & value.
Data Security:
Identify threats: physical (infrastructure damage), human error, malicious activity.
Measures: monitoring, firewalls, intrusion detection, encryption, physical safety.
Data science activities: Data Analytics
Goal: Extract insights & predict future events for decision-making.
Statistical Analysis:
Methods: matrix-valued analysis, prediction models, PCA, clustering, sampling.
Modeling & Simulation:
Test real-world scenarios with large datasets.
Not 100% accurate, but useful.
Examples: Monte Carlo, Markov Chain Monte Carlo.
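A classic toy Monte Carlo simulation estimates π by sampling random points in the unit square and counting how many fall inside the quarter circle. It is not taken from the notes, but it shows why such results are approximate yet useful. The function name and sample size are illustrative.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi from the share of random points inside the quarter unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of quarter circle / area of unit square = pi / 4.
    return 4 * inside / n_samples


pi_estimate = estimate_pi(100_000)
```

More samples generally narrow the error, but the estimate never becomes exact.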
Visual Techniques:
Present results for better understanding.
Examples: box plots, histograms, word clouds, charts.
Common Data Sources
Trustworthiness is key for robust, high-quality data.
Organizational / Trademarked Data:
Data from company activities: transactions, customers, employees, products.
Examples: Google, Facebook collect massive user data.
Reluctance to share due to competition risk & customer privacy concerns.
Government Data:
Many governments provide open data (e.g., demographics, economics).
Example: USA’s Data.gov → 300,000+ datasets, privacy protected.
Academic Data:
Large datasets from research in medicine, economics, history.
Often publicly available via journals & databases (e.g., Google Scholar).
Social Media Data:
Text, videos showing user interactions.
APIs give structured access (e.g., Twitter API for tweet attributes).
Useful for sentiment analysis & pattern detection.
Data Types
Quantitative Data (measurable)
Discrete: Fixed numbers (e.g., number of students, age).
Continuous: Any value within a range.
Interval → no true zero point, values can be negative (e.g., °C/°F temperature).
Ratio → true zero point, values ≥ 0 (e.g., height, weight).
Qualitative Data (descriptive)
Nominal: No order (e.g., eye color).
Ordinal: Ordered categories (e.g., salary grade).
Binary: Two categories (e.g., on/off).
Data Shapes
Structured: Organized, tabular (rows & columns).
Unstructured: Unknown form (text, images, audio, video).
Semi-Structured: Partial organization (e.g., metadata, tags; in e-mails, the To and Cc fields are structured, the body is not).
Streaming Data: Continuous, real-time flow from multiple sources (e.g., social media feeds, sensors).
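The e-mail example of semi-structured data can be sketched with Python's standard `email` module: the header fields parse into named values, while the body remains free text. The message content is invented for illustration.

```python
from email import message_from_string

raw_email = """\
From: alice@example.com
To: bob@example.com
Cc: carol@example.com
Subject: Quarterly numbers

Hi Bob, the numbers look good. Let's talk tomorrow.
"""

msg = message_from_string(raw_email)

# Structured part: header fields are addressable by name.
structured = {"to": msg["To"], "cc": msg["Cc"], "subject": msg["Subject"]}

# Unstructured part: the body is just a string that needs further processing.
body = msg.get_payload()
```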
The 5 Vs of Data
Volume: Large scale (terabytes → zettabytes). Example: plane sensors generate 10 GB/sec.
Variety: Multiple types (structured, unstructured).
Velocity: Speed of creation & processing (e.g., 500 hrs video uploaded to YouTube/min).
Veracity / Validity: Trustworthiness & relevance (remove noise, avoid outdated data).
Value: Business benefits (new services, improved customer experience & operations).
Data Processing Definition and Goal
Definition: Extract useful information from raw data (often unstructured, incomplete, inconsistent).
Goal: Identify patterns, trends, correlations → enable analysis, decision-making, actionable insights.
Data Processing Pipeline
Data collection
Data preprocessing
Data analysis / model building
Insight implementation
Data storage (parallel to all stages).
Descriptive statistics
statistical method that enables us to summarize and describe the properties and characteristics of a given dataset or sample
provide end users with an understandable description
Central tendency measures
describe the center of a dataset’s distribution
mean: arithmetic average of the variable’s values
mode: most repeated element in a dataset
median: value located exactly at the middle point of a variable’s sorted values
Variation measures
range: difference between highest and lowest value, heavily influenced by outliers
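standard deviation: square root of the variance; typical distance of the values from the mean, in the same units as the data
variance: average of the squared differences between each value and the mean (equal to the square of the standard deviation)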
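The central tendency and variation measures can be computed with Python's standard `statistics` module. The sample values are hypothetical, and the population variants (`pvariance`, `pstdev`) are used here as an assumption; for a sample drawn from a larger population, `variance` and `stdev` would apply instead.

```python
import statistics

values = [2, 4, 4, 5, 7, 8, 40]  # hypothetical dataset; 40 is an outlier

mean = statistics.mean(values)
median = statistics.median(values)   # middle of the sorted values
mode = statistics.mode(values)       # most repeated element

value_range = max(values) - min(values)   # strongly affected by the outlier
variance = statistics.pvariance(values)   # mean squared difference from the mean
std_dev = statistics.pstdev(values)       # square root of the variance
```

The outlier pulls the mean well above the median, which is why the median is often preferred for skewed data.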
mutually exclusive events
a particular event (N) cannot occur at the same time as another event (M)
mutually independent events
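the occurrence of one event does not change the probability of the other; unlike mutually exclusive events, they can occur simultaneously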
conditional probability
events can be correlated with each other —> dependent on each other
conditional probability: probability of event A, given that event B has already occurred
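A small worked example with two fair dice, assuming the standard definition P(A | B) = P(A and B) / P(B). The events and helper names are chosen purely for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event as the share of outcomes that satisfy it."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def conditional(event_a, event_b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(lambda o: event_a(o) and event_b(o)) / prob(event_b)


def sum_is_8(o):
    return o[0] + o[1] == 8


def first_is_6(o):
    return o[0] == 6


def first_is_even(o):
    return o[0] % 2 == 0


def second_is_even(o):
    return o[1] % 2 == 0


p_a = prob(sum_is_8)                             # unconditional probability
p_a_given_b = conditional(sum_is_8, first_is_6)  # knowing B changes the probability

# Independent events: the joint probability equals the product of the two.
p_both_even = prob(lambda o: first_is_even(o) and second_is_even(o))
```

Here P(sum is 8) is 5/36, but it rises to 1/6 once the first die is known to show 6, so the two events are dependent. The two "even" events, by contrast, are independent.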
categories of random variables
discrete random variables: can only take distinct values, e.g., number of students in a class, number of family members
continuous random variables: can take any value in an interval, e.g., height of students
probability density function
describes the relative likelihood (density) of a continuous random variable around a given value
the probability of any single exact value is zero → calculate the probability of a value falling within an interval as the area under the curve over that interval
probability mass function
probability that a discrete random variable takes a given value
possible to find a probability for each individual value
normal distribution
bell-curve shape
mean: center of the bell
standard deviation: width of the bell shape
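For a normally distributed variable, the probability of falling in an interval is the area under the density curve, which `statistics.NormalDist` gives via the cumulative distribution function. The mean of 170 cm and standard deviation of 10 cm are hypothetical values.

```python
from statistics import NormalDist

heights = NormalDist(mu=170, sigma=10)  # hypothetical student heights in cm

# Area under the curve between 160 and 180 cm (mean ± one standard deviation).
p_interval = heights.cdf(180) - heights.cdf(160)

# The density at a single point is a curve height, not a probability.
density_at_mean = heights.pdf(170)
```

About 68% of the values lie within one standard deviation of the mean, regardless of the specific mean and standard deviation chosen.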
binomial distribution
whether or not an event occurs —> two possible outcomes
probability of getting a given outcome a precise number of times
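formula: $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$, with $n$ trials, $k$ occurrences of the event, and probability $p$ of the event in each trial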
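A minimal sketch of the binomial probability using the standard formula with `math.comb`. The coin-toss numbers are an illustrative assumption.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 3 heads in 10 tosses of a fair coin.
p_three_heads = binomial_pmf(k=3, n=10, p=0.5)

# The probabilities over all possible counts add up to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```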
Poisson distribution
probability of a given number of independent events occurring within a fixed interval, given a known average rate
describes how often an event occurs in a specific period of time, not whether a single event occurs
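formula: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$, with $\lambda$ the average number of events per interval and $k$ the number of events observed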
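The Poisson probability can be sketched directly from the standard formula. The average rate of 3 events per interval is a made-up example (e.g., support calls per hour).

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events in an interval with average rate lam."""
    return lam**k * exp(-lam) / factorial(k)


# Probability of exactly 2 events when 3 are expected on average.
p_two_events = poisson_pmf(k=2, lam=3.0)

# Summing over a wide range of counts approaches 1.
total = sum(poisson_pmf(k, 3.0) for k in range(50))
```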
Bayesian Statistics
the knowledge available about the given parameters is updated with the new information gathered from the data observed
we want to know how event A is conditioned by event B but we only know how event B is conditioned by event A
formula (Bayes' theorem): $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
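A worked sketch of Bayes' theorem with a hypothetical medical test: the test's behavior given the disease status is known, but the quantity of interest is the probability of disease given a positive result. All rates below are invented for illustration.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


p_disease = 0.01             # prior: 1% of people have the disease (assumed)
p_pos_given_disease = 0.95   # sensitivity of the test (assumed)
p_pos_given_healthy = 0.05   # false positive rate (assumed)

# Total probability of a positive result across both groups.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Updated belief after observing a positive test.
p_disease_given_pos = bayes(p_pos_given_disease, p_disease, p_pos)
```

Under these assumptions the updated probability is only about 16%, because the disease is rare and false positives outnumber true positives.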
Data science Pipeline
Collection → Preprocessing → Analysis/Modeling → Insights Implementation → Storage.
Goal of data science
Extract knowledge from data to make predictions and guide decision-making
Main drivers of data science
Better computing power
More available data
Improved storage capabilities