1. Introduction

Buffl

Data Science

by Marie R.

Raw Data

Unprocessed, unorganized data collected from sources; needs processing to become useful

Key Concepts

Data Mining: Discovering patterns in large datasets.
Business Intelligence (BI): Focuses on historical and real-time data for descriptive analytics; uses tools like Excel, SQL.
Data Science vs. BI: Data science focuses on predictive/prescriptive analytics, algorithms, programming, and machine learning.
Data → Information → Knowledge:
- Data: Raw facts without context.
- Information: Organized, contextualized data.
- Knowledge: Insights and understanding from information.

Artificial Intelligence (AI)

Goal: Create systems performing tasks needing human-like intelligence.
Types:
- ANI: Narrow AI (e.g., NLP, facial recognition).
- AGI: General AI (not yet achieved; handles any task).
- ASI: Superintelligence (future potential; surpasses human intelligence).

Machine Learning (ML)

Definition: Systems learn from data to make predictions.
Types:
- Supervised (classification, regression).
- Unsupervised (clustering, pattern detection).
- Reinforcement learning (goal-oriented learning via rewards/penalties).
Data Splits: Training set (model building) vs. Testing set (performance testing)
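
A minimal sketch of the training/testing split, using only the standard library. The function name `train_test_split`, the 80/20 ratio, and the fixed seed are illustrative assumptions, not something the notes prescribe.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(rows)              # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))                 # stand-in for 10 labelled examples
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))           # 8 2
```

The model is fitted on `train` only; `test` stays unseen until the performance check.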

Model Evaluation (Classification and Regression Metrics)

Classification Metrics:
- Type I Error: False positives.
- Type II Error: False negatives.
- Sensitivity: True positive rate.
- Specificity: True negative rate.
Regression Metrics: Absolute error, mean squared error, relative error.
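
The metrics above can be computed by hand on small examples. This is a sketch with made-up labels and predictions. The relative error shown (absolute error divided by the actual value) is one common definition and is an assumption here.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1                    # Type I error
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1                    # Type II error
    return tp, fp, tn, fn

# Hypothetical classification results
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)           # true positive rate
specificity = tn / (tn + fp)           # true negative rate

# Hypothetical regression results
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae = sum(errors) / len(errors)                              # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)              # mean squared error
mre = sum(e / abs(a) for e, a in zip(errors, actual)) / len(errors)  # mean relative error
```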

Applications of Data Science

Industry: Process automation & optimization (e.g., Shell.ai).
Business: Customer insights, sales predictions, HR analytics (e.g., Google People Analytics).
Text Data: Sentiment analysis, chatbots (e.g., Airbnb AI assistant).
Image Data: Object detection, scene classification (e.g., smart city transport planning).
Medical Data: Disease detection, drug development, remote monitoring (e.g., Pfizer predictive analytics).

Data science activities: Data Flow

Goal: Manage movement of data from source to access.
Data Collection:
- Identify sources, relevant attributes, and involved people.
- Methods: observation, interviews, surveys, existing records.
- Ensure accuracy & reliability.
Data Storage:
- Requirements: accessibility, transparency, security.
- Consider type, form, frequency of use, and cost.
- Options: distributed storage, cloud (watch for security risks).
Data Access:
- Structured data → queries, XML.
- Unstructured data → NoSQL technologies.
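
As an illustration of query-based access to structured data, here is a small sketch using Python's built-in `sqlite3` module. The `customers` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")     # throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers (name, country) VALUES (?, ?)",
    [("Ada", "DE"), ("Ben", "FR"), ("Cleo", "DE")],
)

# A parameterized query selects rows by a known column
rows = conn.execute(
    "SELECT name FROM customers WHERE country = ? ORDER BY name", ("DE",)
).fetchall()
conn.close()
print(rows)                            # [('Ada',), ('Cleo',)]
```

The fixed schema (rows and columns) is what makes this kind of query possible; unstructured data has no such schema to query against.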

Data science activities: Data Curation

Goal: Refine and prepare data for use.
Data Preservation:
- Remove noise (errors), fix missing values.
- Depends on domain knowledge.
- Storage lifespan example: flash drives (80–90 yrs), HDDs (3–6 yrs).
Data Description:
- Use schemes & metadata for meaning.
- Enables understanding and better use.
Data Publication:
- Make data available for others.
- Clean, format, describe for usability & value.
Data Security:
- Identify threats: physical (infrastructure damage), human error, malicious activity.
- Measures: monitoring, firewalls, intrusion detection, encryption, physical safety.
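
The noise-removal and missing-value steps listed under Data Preservation might look like the sketch below. The sensor readings, the plausible range, and the choice of median imputation are all assumptions for illustration. In practice they depend on domain knowledge, as the notes say.

```python
from statistics import median

readings = [21.5, None, 22.0, 250.0, 21.8, None, 22.3]   # hypothetical temperatures in °C
valid_range = (-40.0, 60.0)                               # assumed plausible range

# Step 1: treat out-of-range values as noise and mark them as missing
cleaned = [
    x if x is not None and valid_range[0] <= x <= valid_range[1] else None
    for x in readings
]

# Step 2: fill missing values with the median of the remaining observations
observed = [x for x in cleaned if x is not None]
fill = median(observed)
imputed = [fill if x is None else x for x in cleaned]
```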

Data science activities: Data Analytics

Goal: Extract insights & predict future events for decision-making.
Statistical Analysis:
- Methods: matrix-valued analysis, prediction models, PCA, clustering, sampling.
Modeling & Simulation:
- Test real-world scenarios with large datasets.
- Not 100% accurate, but useful.
- Examples: Monte Carlo, Markov Chain Monte Carlo.
Visual Techniques:
- Present results for better understanding.
- Examples: box plots, histograms, word clouds, charts.
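
A classic small example of the Monte Carlo method mentioned under Modeling & Simulation is estimating π from random points. The function name, sample count, and seed are illustrative choices.

```python
import random

def estimate_pi(n_samples, seed=42):
    """Estimate pi from the share of random points in the unit square
    that fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

pi_estimate = estimate_pi(100_000)
```

The result is close to π but not exact, which matches the "not 100% accurate, but useful" point. More samples narrow the error.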

Common Data Sources

Trustworthiness is key for robust, high-quality data.
Organizational / Trademarked Data:
- Data from company activities: transactions, customers, employees, products.
- Examples: Google, Facebook collect massive user data.
- Reluctance to share due to competition risk & customer privacy concerns.
Government Data:
- Many governments provide open data (e.g., demographics, economics).
- Example: USA’s Data.gov → 300,000+ datasets, privacy protected.
Academic Data:
- Large datasets from research in medicine, economics, history.
- Often publicly available via journals & databases (e.g., Google Scholar).
Social Media Data:
- Text, videos showing user interactions.
- APIs give structured access (e.g., Twitter API for tweet attributes).
- Useful for sentiment analysis & pattern detection.

Data Types

Quantitative Data (measurable)
- Discrete: Countable values (e.g., number of students, age in completed years).
- Continuous: Any value within a range
  - Interval → no true zero point, values can be negative (e.g., °C/°F temperature).
  - Ratio → true zero point, values cannot be negative (e.g., height, weight).
Qualitative Data (descriptive)
- Nominal: No order (e.g., eye color).
- Ordinal: Ordered categories (e.g., salary grade).
- Binary: Two categories (e.g., on/off).

Data Shapes

Structured: Organized, tabular (rows & columns).
Unstructured: Unknown form (text, images, audio, video).
Semi-Structured: Partial organization (e.g., metadata, tags, e-mails: the To and Cc fields are structured, the body is not).
Streaming Data: Continuous, real-time flow from multiple sources (e.g., social media feeds, sensors).

The 5 Vs of Data

Volume: Large scale (terabytes → zettabytes). Example: plane sensors generate 10 GB/sec.
Variety: Multiple types (structured, unstructured).
Velocity: Speed of creation & processing (e.g., 500 hrs video uploaded to YouTube/min).
Veracity / Validity: Trustworthiness & relevance (remove noise, avoid outdated data).
Value: Business benefits (new services, improved customer experience & operations).

Data Processing Definition and Goal

Definition: Extract useful information from raw data (often unstructured, incomplete, inconsistent).

Goal: Identify patterns, trends, correlations → enable analysis, decision-making, actionable insights.

Data Processing Pipeline

Data collection
Data preprocessing
Data analysis / model building
Insight implementation
Data storage (parallel to all stages).

Central tendency measures

describe the center of a dataset’s distribution
mean: arithmetic average of the variable’s values
mode: most repeated element in a dataset
median: value located exactly at the middle point of a variable’s sorted values
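
The three measures can be checked with Python's `statistics` module. The sample values are made up.

```python
from statistics import mean, median, mode

values = [2, 3, 3, 5, 7, 10]

center_mean = mean(values)      # 30 / 6 = 5
center_median = median(values)  # middle of the sorted values: (3 + 5) / 2 = 4.0
center_mode = mode(values)      # 3 appears most often
```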

Variation measures

range: difference between highest and lowest value, heavily influenced by outliers
standard deviation: square root of the variance; indicates how far values typically lie from the mean
variance: average of the squared differences between each value and the mean (the square of the standard deviation)
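
A short sketch of the variation measures using the `statistics` module. It uses the population versions (`pvariance`, `pstdev`), which divide by the number of values. The sample versions (`variance`, `stdev`) divide by one less. The data are invented.

```python
from statistics import pstdev, pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]          # mean is 5

value_range = max(values) - min(values)    # 9 - 2 = 7
var = pvariance(values)                    # mean squared deviation = 32 / 8 = 4
std = pstdev(values)                       # square root of the variance = 2.0

# A single outlier changes the range drastically
range_with_outlier = max(values + [100]) - min(values + [100])   # 98
```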

mutually exclusive events

a particular event (N) cannot occur at the same time as another event (M)

mutually independent events

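the occurrence of one event does not change the probability of the other; independent events can occur simultaneously, with P(A and B) = P(A) · P(B)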

conditional probability

events can be correlated with each other → dependent on each other
conditional probability: probability of event A, given that event B has already occurred; P(A | B) = P(A and B) / P(B)
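
Mutually exclusive events, independent events, and conditional probability can all be checked by enumerating the 36 outcomes of two fair dice. The chosen events are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely rolls

def prob(event):
    """Exact probability of an event given as a predicate on (die1, die2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

first_is_six = lambda o: o[0] == 6
second_is_six = lambda o: o[1] == 6
sum_is_seven = lambda o: o[0] + o[1] == 7
sum_is_two = lambda o: o[0] + o[1] == 2
sum_at_least_ten = lambda o: o[0] + o[1] >= 10

# Mutually exclusive: a sum of 7 and a sum of 2 can never occur together
p_exclusive_both = prob(lambda o: sum_is_seven(o) and sum_is_two(o))     # 0

# Independent: the joint probability equals the product of the probabilities
p_joint = prob(lambda o: first_is_six(o) and second_is_six(o))           # 1/36
p_product = prob(first_is_six) * prob(second_is_six)                     # 1/36

# Conditional: P(sum >= 10 | first die is 6) = P(both) / P(first die is 6)
p_cond = prob(lambda o: sum_at_least_ten(o) and first_is_six(o)) / prob(first_is_six)
p_uncond = prob(sum_at_least_ten)                                        # 1/6
```

Knowing the first die shows a six raises the probability of a sum of at least ten from 1/6 to 1/2, so those two events are dependent.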

categories of random variables

discrete random variables: can only take distinct values, e.g., students in a class, number of family members
continuous random variables: can take all values in an interval, e.g., height of students

probability density function

describes the relative likelihood of a continuous random variable taking values near a given point
the probability of any single exact value is zero → calculate the probability that the value falls within an interval as the area under the curve over that interval
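
The area-under-the-curve idea can be sketched numerically. The example assumes a standard normal variable and approximates the area between -1 and 1 with the trapezoidal rule, then compares it with the exact value from the cumulative distribution function. The helper `area_under_pdf` is hypothetical.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)   # assumed example distribution

def area_under_pdf(d, a, b, steps=10_000):
    """Trapezoidal estimate of the area under d.pdf between a and b."""
    width = (b - a) / steps
    total = 0.5 * (d.pdf(a) + d.pdf(b))
    for i in range(1, steps):
        total += d.pdf(a + i * width)
    return total * width

approx = area_under_pdf(dist, -1.0, 1.0)
exact = dist.cdf(1.0) - dist.cdf(-1.0)   # about 0.6827
```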

probability mass function

likelihood of a discrete random variable to take a given value
possible to find a probability for each individual value

normal distribution

bell-curve shape
mean: center of the bell
standard deviation: width (spread) of the bell shape

binomial distribution

whether or not an event occurs in each of a fixed number of independent trials → two possible outcomes per trial
probability of getting a given outcome exactly a specified number of times
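
A minimal sketch of a binomial probability, with `n` trials, `k` successes, and success probability `p` as assumed symbol names. The coin-flip scenario is illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 10 fair coin flips: 120 / 1024
p_three_heads = binomial_pmf(3, 10, 0.5)
print(p_three_heads)                   # 0.1171875

# The probabilities over all possible k add up to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```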

formula to calculate binomial distribution
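
The standard form of the binomial probability mass function is given below. The symbols are assumptions chosen for this card: n is the number of independent trials, k the number of times the outcome occurs, and p the probability of the outcome in a single trial.

```latex
\[
P(X = k) = \binom{n}{k}\, p^{k} (1 - p)^{n - k},
\qquad k = 0, 1, \dots, n,
\qquad \binom{n}{k} = \frac{n!}{k!\,(n - k)!}
\]
```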

Poisson distribution

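models how often an event occurs within a specific interval (e.g., a period of time), assuming occurrences are independent and happen at a constant average rate
gives the probability of a given number of occurrences in that interval, rather than the probability of a single event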
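
A small sketch of a Poisson probability. The parameter name `lam` (average number of events per interval) and the call-centre scenario are assumptions for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events occur on average per interval."""
    return lam ** k * exp(-lam) / factorial(k)

# Hypothetical: a help desk receives 3 calls per hour on average.
# Probability of exactly 5 calls in one hour (about 0.1008):
p_five_calls = poisson_pmf(5, 3.0)
```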

formula Poisson distribution
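
The standard form of the Poisson probability mass function is given below. The symbols are assumptions chosen for this card: λ is the average number of occurrences per interval and k the number of occurrences whose probability is wanted.

```latex
\[
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!},
\qquad k = 0, 1, 2, \dots
\]
```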

Bayesian Statistics

the knowledge available about the given parameters is updated with the new information gathered from the data observed
we want to know how event A is conditioned by event B but we only know how event B is conditioned by event A
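
A worked sketch of this reversal, using a made-up medical test: the known quantity is how likely a positive test is given the disease, and the wanted quantity is how likely the disease is given a positive test. All three input numbers are hypothetical.

```python
# Hypothetical inputs
p_disease = 0.01               # prior: P(A), share of people with the disease
p_pos_given_disease = 0.95     # P(B | A), test sensitivity
p_pos_given_healthy = 0.05     # false positive rate

# Total probability of a positive test, P(B)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

With these numbers the posterior is roughly 0.16: the positive test updates the prior of 0.01 substantially, yet most positive results still come from healthy people.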

formula Bayes Statistics
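
Bayes' theorem in its standard form, with A and B as the two events described above:

```latex
\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad P(B) > 0
\]
```

Here P(A) is the prior knowledge, P(B | A) the likelihood of the observed data, and P(A | B) the updated (posterior) probability.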

Data science Pipeline

Collection → Preprocessing → Analysis/Modeling → Insights Implementation → Storage.

Goal of data science

Extract knowledge from data to make predictions and guide decision-making
