Characteristics: Science, Theory and Facts:
Science: use of evidence, testable explanations and predictions
Theory: comprehensive explanation supported by a vast body of evidence
Facts: tested and confirmed so many times that there is no reasonable doubt
Inductivism
Inductivism: Bottom-up reasoning → from observations to general laws (creates theories, but not certain).
-> infer laws from examined cases
Problem: no non-circular way to justify predictions
Deduction: Top-down reasoning → from general laws to specific cases (tests/applies theories, certain if premises are true).
guess → predict → test → try to disprove.
Deductive research approach
Formulate research question
Theory & hypotheses
Method selection
Data gathering
Data cleaning
Exploratory data analysis
Modelling
Criteria for good research question:
• (new, interesting, relevant)
• Clarity
• Focus
• Feasibility
• Testability
• Should not be leading
Research methods:
Qualitative
Data collection techniques: interviews, Delphi, textual analysis
Data analysis techniques: case writing, mind maps, word frequency counting
Quantitative
Data collection techniques: survey, experiment, web scraping
Data analysis techniques: machine learning, cluster analysis, factor analysis
Empirical research
Econometrics vs. Machine Learning
Econometrics: helps us understand why something happens. Theory-driven, causal inference, interpretability.
Machine learning: helps us predict what will happen. Data-driven, predictive power, flexibility.
Variables
Name: explanation (examples)
Qualitative variables: cannot be ordered in a logical or natural way (hair color, names)
Quantitative variables: measurable quantities (body height, temperature)
Discrete variables: take a finite or countable number of values (ratings from 1 to 5)
Continuous variables: can take any of infinitely many values within a range
Grouped variables: also called categorical (income)
Binary variables: grouped variables with only two values (adult/child)
Scales
Empirical cumulative distribution function (CDF), for ordinal values:
10% have a satisfaction (st) of 1
23% have a st of exactly 5
25% have a st of 2 or less
Empirical cumulative distribution function (CDF), for metric values:
40% watch 1 hour or less
40% watch more than 1 and up to 2 hours
80% watch 2 hours or less
The small spike at the end means that all values > 10 were combined
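How such percentages are read off an empirical CDF can be sketched as follows. This is a minimal illustration, not the data behind the notes: the `hours` list and the helper `ecdf` are made up, chosen so that the shares match the 40% / 40% / 80% reading above.

```python
def ecdf(values, x):
    """Share of observations that are less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)

# Hypothetical daily viewing hours for 10 people (made-up data).
hours = [0.5, 0.5, 1, 1, 1.5, 2, 2, 2, 3, 4]

share_up_to_1 = ecdf(hours, 1)                 # watch 1 hour or less
share_up_to_2 = ecdf(hours, 2)                 # watch 2 hours or less
share_between = share_up_to_2 - share_up_to_1  # more than 1, up to 2 hours
print(share_up_to_1, share_up_to_2, share_between)
```

The share in an interval is the difference of two CDF values, which is why the "between 1 and 2 hours" figure is 80% minus 40%.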
Arithmetic mean:
“Average”
the sum of all deviations from the mean is zero
gives a good intuition if the data is normally distributed
Caution! Outliers can severely bias this intuition
Median
(measure of central tendency)
compared to the mean, the median is more robust to outliers
Mean vs. median, interpretation:
Mean and median are similar if the data is symmetrically
distributed.
If the data has more than one center, neither mean
nor median have meaningful interpretations.
Mean and median may differ if the data is skewed.
If the data has outliers it is better to use the median
as it is robust to outliers.
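The robustness of the median can be checked with a small sketch. The income figures are hypothetical and only serve to show what a single extreme value does to each measure.

```python
from statistics import mean, median

# Hypothetical incomes in thousands (made-up data).
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [1000]  # add one extreme value

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)  # both measures agree on symmetric data
print(mean_after, median_after)    # the mean jumps, the median barely moves
```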
Quantiles & percentiles
Measures of position (the median is the 50th percentile, which makes it a measure of central tendency)
Q-Q plots:
To compare the distribution of two variables
Luigi is slower than Domenico
Mario and Salvadore deliver equally fast -> the variables have the same distribution
Mode
the value that occurs most frequently
Mean vs. Median vs. Mode
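As the distribution is right/positively skewed, the mean is typically greater than the median: the few very large values in the right tail pull the mean upward, whereas the median depends only on the order of the values and is barely affected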
What is skewness?
measure of asymmetry
How is this plot skewed?
right/ positively skewed
“Where the tail is”, so the following chart is right-skewed
Why positive? Because in a right-skewed distribution:
The mean is pulled to the right (towards larger, positive deviations from the median/mode).
When you calculate the skewness formula, those big positive deviations dominate, so the result is > 0.
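The sign argument can be made concrete with a short sketch. It assumes the moment-based skewness (third central moment divided by the 1.5th power of the second central moment); the notes do not say which formula is meant, and the data is made up.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (one common definition)."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])  # second central moment
    m3 = mean([(v - m) ** 3 for v in values])  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data: one large value forms the right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)                  # large positive deviations dominate
print(mean(right_skewed) > median(right_skewed))   # mean is pulled to the right
print(skewness(symmetric))                         # deviations cancel out
```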
Measures of dispersion:
Absolute deviation
Mean squared error
Variance
Standard deviation
Absolute deviation:
We take absolute values because positive and negative deviations from the mean would otherwise cancel out (their sum is always zero)
Pros:
less sensitive to outliers than the MSE or Variance
easy to interpret
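Cons: mathematically less convenient (the absolute value function is not differentiable at zero)
Variance:
mean squared error (MSE) around the mean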
Pros: Mathematically convenient
Cons: hard to interpret and sensitive to outliers
Standard deviation:
square root of variance
Pros: Interpretable, widely used
Cons: Sensitive to outliers
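The dispersion measures above can be sketched in a few lines. This assumes the population versions (dividing by n rather than n − 1), which the notes do not specify; the data is made up.

```python
from statistics import mean

def mean_absolute_deviation(values):
    m = mean(values)
    return mean([abs(v - m) for v in values])

def variance(values):
    """Population variance: mean squared deviation from the mean."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])

def standard_deviation(values):
    """Square root of the variance, in the same unit as the data."""
    return variance(values) ** 0.5

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5

mad = mean_absolute_deviation(data)
var = variance(data)
sd = standard_deviation(data)
print(mad, var, sd)
```

Squaring weights large deviations more heavily, which is why the variance and standard deviation react more strongly to outliers than the absolute deviation.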
Standardization:
A variable is standardized if its mean is 0 and its variance is 1
To standardize a variable, we subtract its mean
and divide by its standard deviation
Why standardization:
makes values easier to compare
faster convergence (convergence means reaching the optimal solution in iterative algorithms)
computing distances becomes more appropriate
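Standardization (subtract the mean, divide by the standard deviation) can be sketched as follows. The helper `standardize` and the height values are illustrative assumptions; the population standard deviation is used.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    return [(v - m) / s for v in values]

heights_cm = [160, 170, 180]  # hypothetical data
z = standardize(heights_cm)

print(z)
print(mean(z), pstdev(z))  # approximately 0 and 1 after standardization
```

After this step, variables measured in different units (cm, kg, €) are on the same scale, so distances between observations are no longer dominated by the variable with the largest numbers.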
Box Plots
How can we interpret the box plot for the 8th month?
IQR (interquartile range): quite large → high variability. Note that the IQR is not the variance: it only covers the middle 50% of the data
No extreme outliers visible for this month
The median being closer to Q1 means the lower half of the data (Q1 → median) is more concentrated (shorter spread)
If this were shown as a density curve: The longer spread above the median suggests a right/positive skew
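The quantities read off a box plot can be computed directly. The sketch below uses made-up data and assumes the common 1.5 × IQR rule for flagging outliers, which the notes do not state explicitly.

```python
from statistics import quantiles

# Hypothetical observations for a single month (made-up data).
data = [2, 3, 3, 4, 4, 5, 6, 8, 11, 40]

# Quartiles; method="inclusive" treats min and max as the 0th and 100th percentiles.
q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle 50% of the data

# Assumed whisker rule: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

# Median closer to Q1 than to Q3 hints at a right (positive) skew.
right_skew_hint = (med - q1) < (q3 - med)
print(q1, med, q3, iqr, outliers, right_skew_hint)
```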
Kurtosis with example
measures the tailedness of a distribution (skewness measured the asymmetry)
A portfolio with a low kurtosis value indicates a more stable and predictable return profile which may indicate lower risk. Investors may intentionally seek investments with lower kurtosis values when they're building safer, less volatile portfolios
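Tailedness can be illustrated with a small sketch. It assumes the moment-based kurtosis (fourth central moment divided by the squared variance, where a normal distribution gives 3); both return series are made up.

```python
from statistics import mean

def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2 (a normal distribution gives 3)."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m4 = mean([(v - m) ** 4 for v in values])
    return m4 / m2 ** 2

# Hypothetical returns in percent (made-up data).
calm_returns = [-1, -1, 0, 0, 0, 0, 1, 1]     # many moderate moves
spiky_returns = [-10, 0, 0, 0, 0, 0, 0, 10]   # mostly flat, rare extreme moves

k_calm = kurtosis(calm_returns)
k_spiky = kurtosis(spiky_returns)
print(k_calm, k_spiky)
```

The series with rare extreme moves has the higher kurtosis, which matches the reading that low-kurtosis portfolios have fewer surprises in the tails.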
Covariance
Measures the degree to which two variables vary
together (values range from −∞ to +∞)
The unit of the covariance is the product
of the units of both variables → difficult to interpret
Positive covariance: the variables increase or decrease together
Negative covariance: if one increases, the other decreases
Zero covariance: no linear relationship
For example, the covariance between hours watched and income is a number in units of hours × €, which is difficult to interpret
Pearson’s correlation coefficient
Correlation measures both the strength and direction of
the linear relationship between two variables
normalized and unitless version of covariance (-1 to 1)
Interpretation:
correlation close to +1 or −1 indicates a strong linear
relationship
A correlation close to 0 indicates a weak or no linear
relationship
While covariance indicates only the direction of the relationship,
correlation also indicates its strength
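The difference between covariance and correlation shows up as soon as the units change. A minimal sketch with made-up data (population versions, dividing by n):

```python
from statistics import mean

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def pearson(xs, ys):
    """Covariance divided by the product of both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

# Hypothetical data: hours watched and income in EUR.
hours = [1, 2, 3, 4, 5]
income_eur = [2000, 2500, 2700, 3500, 3800]
minutes = [h * 60 for h in hours]  # same information, different unit

cov_hours = covariance(hours, income_eur)      # unit: hours x EUR
cov_minutes = covariance(minutes, income_eur)  # unit: minutes x EUR, 60 times larger
r_hours = pearson(hours, income_eur)
r_minutes = pearson(minutes, income_eur)       # unchanged by the unit switch
print(cov_hours, cov_minutes, r_hours, r_minutes)
```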
Spearman’s rank correlation coefficient
The values of R (unitless) lie between −1 and +1 and measure the degree of correlation between the ranks of X and Y.
R = 1 → every observation has the same rank in X and in Y
R = -1 → the ranks in Y are exactly the reverse of the ranks in X
Spearman vs Pearson
Pearson’s correlation looks at raw values → best for linear, normally distributed data.
Spearman’s correlation looks at ranks → useful for ordinal data, non-linear monotonic trends, robust to outliers.
Example: If exam scores increase with study hours but in a curved (non-linear) way, Pearson might be low, but Spearman will still be high, since the ranking order is preserved
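The exam-score example can be sketched as follows. The saturating score curve is an assumption made up for illustration, and the simple `ranks` helper assumes there are no ties.

```python
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """Rank 1 = smallest value; assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's coefficient as Pearson's correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
# Hypothetical curved relationship: scores rise quickly, then flatten out.
exam_scores = [100 * (1 - 0.5 ** h) for h in study_hours]

pearson_r = pearson(study_hours, exam_scores)
spearman_r = spearman(study_hours, exam_scores)
print(pearson_r, spearman_r)
```

Because the ordering of the scores follows the ordering of the hours exactly, the rank correlation is perfect even though the relationship is not linear.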
Random Variables and Probability Density Function (PDF)
Discrete random variables
Example: number of heads appearing in 10 flips of a coin
Function: the probabilities can be written as a function of the individual values (all probabilities sum to 1)
Continuous random variables
Examples: unemployment rate, max. temperature on a given day
Function: PDF. As a continuous variable takes any single real value with probability zero, we use the density instead.
The density describes the relative likelihood of values near a given point. To obtain the probability that the variable falls into an interval, we integrate the density over this interval (this gives us the area under the curve).
Probability density function
Cumulative distribution function
The cumulative distribution function (CDF) gives the probability that a random variable takes a value less than or equal to a given number
The CDF is defined for both continuous and discrete random variables
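The link between density, area and CDF can be checked numerically. The sketch below uses the standard normal distribution as an example and a simple midpoint rule for the integral; the function names are illustrative.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal variable, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the area under f between a and b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

area = integrate(normal_pdf, -1, 1)           # area under the density
via_cdf = normal_cdf(1) - normal_cdf(-1)      # same probability from the CDF
print(round(area, 4), round(via_cdf, 4))
```

Both routes give the probability that the variable falls between −1 and 1, roughly 68% for a standard normal variable.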
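Independent Variables
(Joint distribution)
Two random variables X and Y are independent if and only if their joint distribution equals the product of their marginal distributions: f(x, y) = f_X(x) · f_Y(y) for all x and y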
Conditional Distribution
We are usually interested in how one random variable is related to one or more other random variables.
The conditional distribution of Y given X tells us the distribution of Y conditional on us having information about X
Ex. • Probability of blue eyes given that a person is Portuguese
Conditional Distribution Example
Bayes’ Theorem Calculation Example:
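A minimal sketch of such a calculation, reusing the eye-color example. All three input probabilities are hypothetical numbers chosen only to show the mechanics of Bayes' theorem, P(A | B) = P(B | A) · P(A) / P(B).

```python
# All probabilities below are hypothetical, chosen only to illustrate the formula.
p_blue = 0.10                   # P(blue eyes)
p_portuguese = 0.02             # P(Portuguese)
p_portuguese_given_blue = 0.01  # P(Portuguese | blue eyes)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_blue_given_portuguese = p_portuguese_given_blue * p_blue / p_portuguese
print(round(p_blue_given_portuguese, 4))
```

With these made-up inputs, knowing that a person is Portuguese lowers the probability of blue eyes from 10% to 5%.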
Interesting features of probability distributions
Measures of central tendency
Measures of variability or spread
Measures of association between two random variables
Expected Value
Measure of central tendency
The expected value of a random variable X, E(X), is a weighted average of all possible values of X.
Measure of variability
The variance of a random variable X is a measure of its dispersion around the expected value, μ ≡ E(X): Var(X) = E[(X − μ)²]
The standard deviation of a random variable, sd(X), is simply the square root of the variance: sd(X) = √Var(X)
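For a discrete random variable these definitions can be evaluated directly. The sketch uses a fair six-sided die as an illustrative example (not taken from the notes) and exact fractions to avoid rounding.

```python
from fractions import Fraction

# Fair six-sided die: each value has probability 1/6 (illustrative example).
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

# E(X): probability-weighted average of all possible values.
expected = sum(p * x for x, p in zip(values, probs))

# Var(X) = E[(X - mu)^2]: weighted average of squared deviations.
variance = sum(p * (x - expected) ** 2 for x, p in zip(values, probs))

# sd(X): square root of the variance.
sd = float(variance) ** 0.5
print(expected, variance, round(sd, 3))
```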
Measure of association
It is useful to have summary measures of how two random variables vary with one another
Correlation
The correlation coefficient is a measure of the degree of linear relationship between X and Y
Covariance vs Correlation
Covariance = raw measure of joint variability, scale-dependent.
Correlation = normalized covariance, scale-independent, bounded between -1 and 1.
-> Think of correlation as the unit-free, standardized version of covariance
Deepdive correlation:
Zero correlation only rules out linear association.
Independence always implies zero correlation.
Uncorrelated does not imply independence. (counterexample: Y=X^2).
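The counterexample Y = X^2 can be verified exactly. The sketch assumes X takes the values −1, 0 and 1 with equal probability, which is one convenient choice.

```python
from fractions import Fraction

# Assumed distribution: X is -1, 0 or 1, each with probability 1/3; Y = X^2.
xs = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in xs)             # E[X]
e_y = sum(p * x ** 2 for x in xs)        # E[Y] = E[X^2]
e_xy = sum(p * x * x ** 2 for x in xs)   # E[XY] = E[X^3]
cov_xy = e_xy - e_x * e_y                # Cov(X, Y) = E[XY] - E[X]E[Y]

p_y1 = sum(p for x in xs if x ** 2 == 1)  # P(Y = 1) without knowing X
p_y1_given_x1 = Fraction(1)               # Y = 1 with certainty when X = 1

print(cov_xy)               # zero covariance, hence zero correlation
print(p_y1, p_y1_given_x1)  # knowing X changes the distribution of Y
```

The covariance is zero, yet Y is completely determined by X, so the two variables are uncorrelated but clearly not independent.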
Standard continuous distributions
Uniform distribution
Normal distribution
Exponential distribution
Uniform distribution
Waiting time for a bus (if buses arrive every 10 minutes, and you arrive at a random time → Uniform(0,10)).
Randomly selecting a decimal number between 0 and 1 → Uniform(0,1).
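The bus example can be simulated in a few lines. The 3-minute threshold, the seed and the sample size are arbitrary choices for illustration; for Uniform(0, 10) the theoretical values are P(wait ≤ 3) = 3/10 and an expected wait of 5 minutes.

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

# Waiting times in minutes when arriving at a random moment: Uniform(0, 10).
waits = [random.uniform(0, 10) for _ in range(100_000)]

share_short = sum(w <= 3 for w in waits) / len(waits)  # theory: 3 / 10
mean_wait = sum(waits) / len(waits)                    # theory: (0 + 10) / 2
print(round(share_short, 2), round(mean_wait, 1))
```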
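Normal distribution
N(μ, σ²) -> normal distribution with mean μ and variance σ²; N(0, 1) is the standard normal distribution
μ = mean (where the peak is; changing μ shifts the curve left or right)
σ² = variance -> how wide the curve is (a larger σ² stretches the curve and lowers the peak, a smaller σ² compresses it and raises the peak)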
Central Limit Theorem (CLT)
The Central Limit Theorem states that when you take a sufficiently large sample size n from a population with any shape of distribution (with a finite mean and variance), the distribution of the standardized sample mean will approximate a standard normal distribution. This approximation improves as the sample size increases, regardless of the original distribution's shape.
CLT Example, Coin Flip with sample n=3 and 10:
There are 2^3 = 8 equally likely outcomes. Enumerating all of them and computing the sample mean for each (heads = 1, tails = 0) gives the possible means 0, 1/3, 2/3 and 1 with probabilities 1/8, 3/8, 3/8 and 1/8.
n=10 -> more like a bell curve
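The enumeration for n = 3 and n = 10 can be reproduced with a short sketch. It assumes a fair coin with heads coded as 1 and tails as 0; the function name is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sample_mean_distribution(n):
    """Exact distribution of the sample mean of n fair coin flips (heads = 1)."""
    # Enumerate all 2**n equally likely outcomes and count heads in each.
    counts = Counter(sum(flips) for flips in product([0, 1], repeat=n))
    total = 2 ** n
    return {Fraction(heads, n): Fraction(c, total)
            for heads, c in sorted(counts.items())}

dist3 = sample_mean_distribution(3)
dist10 = sample_mean_distribution(10)

for m, prob in dist3.items():
    print(m, prob)

# For n = 10 the probabilities pile up around 1/2 and fall off symmetrically,
# which is the bell shape mentioned above.
for m, prob in dist10.items():
    print(float(m), "#" * round(float(prob) * 100))
```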
Last modified 2 days ago