Statistical inference:
allows us to draw conclusions about a population of interest from sample data,
at a given, pre-specified uncertainty level,
using knowledge about the random process generating the data.
Population:
a well-defined group of subjects, such as individuals, firms, cities, and so on
Parametric estimation:
we make some assumptions about the distribution and model the distribution with parameters (e.g., a normal distribution with μ (mean) and σ² (variance) as parameters)
Example:
You assume people’s heights follow a Normal distribution.
You estimate the mean and std from your sample.
Once you know those two parameters, you can describe the entire distribution.
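A minimal Python sketch of this parametric workflow, using simulated data (the true values μ = 175 and σ = 7 are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=175, scale=7, size=200)  # simulated heights (cm)

mu_hat = heights.mean()          # estimate of the population mean
sigma_hat = heights.std(ddof=1)  # estimate of the population std (sample version)
# These two numbers now describe the entire fitted Normal distribution.
```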
Pros:
Very efficient if your assumption is correct (small amount of data gives good results).
Simple formulas and interpretation.
Cons:
If your assumption is wrong (e.g., data aren’t Normal), results can be misleading.
Non-parametric estimation:
we make no distributional assumptions
You estimate the density of incomes using a histogram or kernel density estimate (KDE) — no assumption of Normality.
Or: you use the sample median as a measure of “typical” value instead of assuming the mean of a Normal distribution.
More flexible; works even if the data are skewed or irregular.
Fewer risks from making the wrong assumption.
Often need more data to get stable results.
Methods can be more computationally intensive and harder to interpret.
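The same idea without distributional assumptions, sketched in Python (the income data are simulated and deliberately skewed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=10, sigma=0.5, size=500)  # skewed, clearly non-Normal

median_income = np.median(incomes)  # robust "typical" value, no Normality assumed
kde = stats.gaussian_kde(incomes)   # kernel density estimate of the income density
density_at_median = kde(median_income)[0]
```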
Estimator and estimate:
Estimator = a rule or formula for how you’re going to estimate something.
Example: “Take the average of my sample” is an estimator of the population mean.
Estimate = the actual number you get when you apply that rule to your data.
Example: “The mean of my 100 people is 40” is an estimate.
So: estimator = recipe, estimate = dish. 🍝
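In code, the recipe/dish distinction looks like this (the ages are toy numbers):

```python
import numpy as np

def sample_mean(sample):
    """Estimator: the rule 'take the average of my sample'."""
    return np.mean(sample)

ages = np.array([38, 41, 42, 39, 40])  # hypothetical data
estimate = sample_mean(ages)           # estimate: the concrete number produced
```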
Properties of estimators
Bias (systematic error)
Efficiency
Consistency
Sufficiency
Bias of an estimator
Bias(W) = E(W) − θ
It follows that an estimator is said to be unbiased if its bias is zero. This implies that an estimator is unbiased if:
E(W) = θ
Note:
Unbiased doesn’t mean each estimate equals the true value.
It means if you repeated the experiment many times and averaged all your estimates, the average would equal the truth.
So: unbiasedness is about being “right in expectation,” not “right every time.”
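This "right in expectation" idea can be checked by simulation (true_mu = 5 and the sampling setup are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = 5.0

# Repeat the experiment many times: draw a sample, record its mean.
estimates = [rng.normal(true_mu, 2.0, size=30).mean() for _ in range(10_000)]

# Individual estimates scatter around 5.0, but their average is very close to it.
avg_of_estimates = float(np.mean(estimates))
```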
How to demonstrate that an estimator is consistent?
(From Jupyter notebook)
Unbiasedness: The expected value of the sample mean converges to the true population mean as the sample size increases.
Convergence: The variance of the sample mean decreases as the sample size increases, approaching zero, which means that as the sample size grows, the sample mean becomes increasingly concentrated around the population mean.
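Both points can be demonstrated with a small simulation (the Normal(0, 1) population is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
var_of_mean = {}
for n in [10, 100, 1000]:
    means = [rng.normal(0.0, 1.0, size=n).mean() for _ in range(2000)]
    var_of_mean[n] = float(np.var(means))  # empirical variance of the sample mean
# The variance shrinks roughly like sigma^2 / n as the sample size grows.
```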
What kind of variance is preferred for an estimator?
Sampling variance of estimators (efficiency)
If an estimator has a large variance, we learn little about the parameter we want to estimate
If the estimator has a small variance, we can be more confident about our estimate of the parameter
Both estimators are unbiased, but using W1 it is more likely that we will get an estimate closer to 𝜽
Variance efficiency
If Var(W1) < Var(W2), then W1 is more efficient relative to W2
Comparing biased estimators
A common way to compare biased estimators is the mean squared error (MSE):
MSE(W) = E[(W − θ)²] = Var(W) + Bias(W)²
The MSE measures how far the estimator is away from θ, on average.
If we compare unbiased estimators, we only care about their efficiency, because their bias is zero (the MSE reduces to the variance).
Bias = systematic error, Variance = random error
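A simulation of the decomposition MSE = variance + bias², using a deliberately biased estimator W2 = 0.9·Ȳ (an artificial example):

```python
import numpy as np

rng = np.random.default_rng(4)
true_mu = 10.0
samples = rng.normal(true_mu, 3.0, size=(5000, 20))  # 5000 repeated samples of n=20

w1 = samples.mean(axis=1)  # unbiased: the plain sample mean
w2 = 0.9 * w1              # biased: shrinks every estimate by 10%

def mse(est, theta):
    return float(np.mean((est - theta) ** 2))

mse_w1 = mse(w1, true_mu)  # ≈ Var(w1) = 9/20 = 0.45
mse_w2 = mse(w2, true_mu)  # ≈ Var(w2) + bias² ≈ 0.36 + 1.0

# The decomposition holds exactly for the simulated values:
decomp_w2 = float(np.var(w2) + (np.mean(w2) - true_mu) ** 2)
```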
An estimator W is consistent if both its bias and variance tend to zero as the sample size increases
(Crosslink to CLT)
Central Limit Theorem combined with estimators
We have seen that the sample average Ȳ is a consistent and unbiased estimator of the population mean μ; by the CLT, for large n, Ȳ is approximately distributed N(μ, σ²/n).
!!! With the CLT we have information about the distribution of an estimator even without knowing the distribution of the original population !!!
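A quick demonstration: even for a clearly non-Normal (exponential) population, the sample means behave approximately like Normal(μ, σ²/n):

```python
import numpy as np

rng = np.random.default_rng(5)
# Exponential population: mean = 1, std = 1, heavily skewed.
means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# CLT: distribution of the mean ≈ Normal(1, 1/50), regardless of population shape.
center = float(means.mean())  # ≈ 1.0
spread = float(means.std())   # ≈ 1 / sqrt(50) ≈ 0.141
```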
Sufficiency
In general, we can say that if all the information about μ contained in the sample of size n can be obtained, for example, through the sample mean, then it is sufficient to use this one-dimensional summary statistic to make an inference about μ.
Types of estimations
Point Estimation (The average user satisfaction is 3.25)
best guess in terms of the expected value
considers only the expected value
gives no information about the accuracy of the estimate
Interval estimation (The average user satisfaction lies between 3 and 3.5.)
Considers expectation and variation of the data
Incorporates information about the estimate’s accuracy
Procedure to maximize the likelihood function
Pick a distribution that best describes your data
Write down the likelihood function for the observed data
Compute the log-likelihood function
Differentiate the log-likelihood with respect to each parameter (take the partial derivative with respect to each variable)
Solve the resulting equations to find the parameter estimates (→ this yields the maximum)
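The steps above, sketched for a Bernoulli model (7 successes in 10 trials is a made-up dataset; analytically the maximum is at p̂ = k/n = 0.7):

```python
import numpy as np
from scipy.optimize import minimize_scalar

k, n = 7, 10  # hypothetical data: 7 successes in 10 Bernoulli trials

def neg_log_likelihood(p):
    # log L(p) = k*log(p) + (n-k)*log(1-p); we minimize its negative
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
p_hat = res.x  # numerical maximizer; matches the analytic solution k/n
```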
What does the log do to the likelihood function?
Taking the log doesn’t change where the peak is — it just makes the curve easier to work with mathematically and numerically.
Plot
On the left, the likelihood curve is very squished (values are tiny).
On the right, the log-likelihood stretches it out and makes it easier to see the peak.
Both peak at the same place: around p ≈ 0.70.
What is the advantage of using logs?
Numerical stability: Probabilities are often very small; logs avoid underflow.
Simplifies math: products of probabilities become sums of logs (differentiating is easier)
Same maximizer: the value of p that maximizes L(p) also maximizes log L(p).
Explanation Underflow:
Likelihoods are often products of many probabilities.
Example: a probability like 0.7.
If you multiply 0.7 by itself 1000 times:
0.7^1000 ≈ 1.26 × 10^−155
Still fine.
But if you multiply 0.01 by itself 1000 times:
0.01^1000 = 10^−2000
That’s so small the computer can’t represent it → it stores it as 0.
Once the likelihood becomes zero, you’ve lost all the information — you can’t maximize it anymore.
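The underflow and its fix are easy to reproduce (1000 observations with probability 0.01 each is an artificial worst case):

```python
import numpy as np

probs = np.full(1000, 0.01)  # 1000 observations, each with probability 0.01

likelihood = np.prod(probs)             # 0.01**1000 = 10**-2000 -> underflows to 0.0
log_likelihood = np.sum(np.log(probs))  # = 1000 * log(0.01) ≈ -4605.17, perfectly fine
```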
Interval estimation and confidence intervals (Theory)
A confidence interval (CI) is a random interval with lower and upper bounds (the bounds are random)
such that the unknown parameter θ is covered with a pre-specified probability of at least 1 − α: P(L ≤ θ ≤ U) ≥ 1 − α
1 − 𝛼 is called the confidence level
Therefore, we say: The parameter is covered by the CI with a probability of 1 − 𝛼
Interval estimation and confidence intervals (Example plot)
A population that follows a normal distribution with μ = 10 and σ² = 1. We draw 6 samples.
Interpretation of CIs:
Most but not all CIs cover the population mean
The confidence level indicates how often the CI covers the true mean
Calculating the confidence interval
σ is the standard deviation, sometimes also written as s → the std is the square root of the variance σ²
How to calculate the z value for the interval α = 5%
1-(⍺/2) -> 1-(0.05/2) -> 1-0.025 -> 0.975
Looking up in a standard normal table (or using software):
What does a confidence interval of e.g. 95% mean?
If we repeated the sampling many times, 95% of those intervals would contain the true population mean μ.
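This long-run interpretation can be verified by simulation (population Normal(10, 1) as in the example plot; σ is treated as known):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
mu, sigma, n, alpha = 10.0, 1.0, 25, 0.05
z = stats.norm.ppf(1 - alpha / 2)  # ≈ 1.96
me = z * sigma / np.sqrt(n)        # margin of error (sigma known)

reps = 5000
covered = 0
for _ in range(reps):
    m = rng.normal(mu, sigma, size=n).mean()
    if m - me <= mu <= m + me:     # does this CI cover the true mean?
        covered += 1
coverage = covered / reps          # close to the nominal 0.95
```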
Confidence interval vs. prediction interval:
CI: captures the variability of the mean only.
PI: captures the variability of the mean (as in the CI), plus the natural variability of individual observations around the mean → a PI is always wider than the corresponding CI.
Margin of error and how to compute the upper and lower border of our CI
Formula (Var known): ME =z_score * standard_error
Formula (Var unknown): ME =t_score * standard_error
(standard error = population_std/np.sqrt(sample_size))
z_score = stats.norm.ppf(1-alpha/2)
Lower and upper border for CI (Python code):
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error
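Putting the formulas above together in one runnable snippet (the sample summary values are made up; the variance is treated as known, so the z quantile applies):

```python
import numpy as np
from scipy import stats

sample_mean, population_std, sample_size = 3.25, 0.8, 100  # assumed example values
alpha = 0.05

standard_error = population_std / np.sqrt(sample_size)
z_score = stats.norm.ppf(1 - alpha / 2)  # 0.975 quantile ≈ 1.96
margin_of_error = z_score * standard_error

lower_bound = sample_mean - margin_of_error  # ≈ 3.09
upper_bound = sample_mean + margin_of_error  # ≈ 3.41
```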
Confidence interval if σ² (the variance) is unknown
The only thing that changes is that we need to compute 𝑡 from a different distribution and estimate S.
Is the new confidence interval wider or more narrow? (compared to when variance is known) -> wider
T-Distribution:
also known as Student’s t-distribution
statistical function that creates a probability distribution
similar to the normal distribution, with its bell shape, but it has heavier tails -> greater chance of extreme values than normal distributions
used for estimating population parameters for small sample sizes or unknown variances
The t-distribution is the basis for computing t-tests in statistics
predict how likely certain outcomes are when only a limited amount of data is available
(Plot: t-distribution with 4 degrees of freedom)
the heaviness of the tails is determined by a parameter of the t-distribution called degrees of freedom
with smaller values giving heavier tails
and with higher values making the t-distribution resemble a standard normal distribution with a mean of 0 and a standard deviation of 1.
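This df behaviour is easy to check against the standard normal quantile:

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)          # ≈ 1.96
t_df4 = stats.t.ppf(0.975, df=4)        # ≈ 2.78: heavy tails -> much wider intervals
t_df1000 = stats.t.ppf(0.975, df=1000)  # ≈ 1.96: nearly identical to the normal
```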
Hypotheses testing Procedure
Define distributional assumption
Formulate a hypothesis about the population
called the null hypothesis 𝑯𝟎 and the alternative
hypothesis 𝑯𝟏
Assess how likely it is that H0 is true.
If H0 is rejected, we assume H1 to be true.
If H0 cannot be rejected, it does not mean that H0 is true.
one-sided
two sided
one sample problem
two sample problem
Hypotheses testing example:
Your sales “expert” tries to convince you that the
average satisfaction on your platform is at least 4.
How are H0 and H1 formulated? (one-sided and one-sample)
H0: the average satisfaction is smaller or equal to 4.
H1: the average satisfaction is greater than 4
How are H0 and H1 formulated? (two-sided and one-sample)
H0: the average satisfaction is equal to 4.
H1: the average satisfaction is not equal to 4.
The expert also claims that Portuguese, on
average, are more satisfied with the platform than
Germans.
How are H0 and H1 formulated? (one-sided and two-sample)
H0: the average satisfaction of Portuguese is smaller or equal to the average satisfaction of Germans
H1: the average satisfaction of Portuguese is greater than the average satisfaction of Germans
How are H0 and H1 formulated? (two-sided and two-sample)
H0: the average satisfaction of Portuguese equals the
average satisfaction of Germans
H1: the average satisfaction of Portuguese does not
equal the average satisfaction of Germans.
Graphic for one- and two-sided tests
For one-sided tests, H0 is easier to reject if the effect is in the expected direction. If not, it will never reject H0, even if the effect is large
→ two-sided is more conservative and protects against surprises in either direction.
“Effect in the expected direction” means the observed data support the specific one-sided hypothesis you set up in advance (e.g., “new drug is better” → higher mean). It does not cover the case that the new drug is worse.
One-tailed tests are fine if you truly only care about one side (e.g., proving improvement).
Two-tailed tests are safer when effects in both directions matter (e.g., the new drug might be worse, which is also crucial to detect).
Type I and II errors:
Type 1 error:
False positive:
Type I error = Rejecting H_0 when H_0 is true.
In plain words: You claim there is an effect (positive result) when in fact there is none.
Medical test says: “You have the disease”
Reality: You’re actually healthy.
The test gave you a positive result, but it was false.
-> False Positive (seeing something that isn’t there)
!!!This is the error we want to prevent !!!
Type II error
False Negative
Type II error = Failing to reject H_0 when H_0 is false.
In plain words: You fail to detect an effect when there actually is one.
Medical test says: “You don’t have the disease”
Reality: You do have the disease.
The test gave you a negative result, but it was false.
-> failing to see something that is there
What does significance mean when talking about errors?
The significance level α is the probability of a Type I error:
→ H1 falsely accepted and H0 falsely rejected although it is true
Why can’t we minimize both α and β (the probability of a Type II error)?
Courtroom analogy (H0: the defendant is innocent):
If we make the rules very strict (tiny α), it’s harder to convict → fewer innocent people are wrongly convicted, but more guilty people go free (β increases).
If we make the rules lenient (larger α), it’s easier to convict → fewer guilty people go free, but more innocent people risk conviction.
Test decision using the p-value
The p-value is the probability of obtaining test results at least as extreme as the ones observed during the test, assuming that the null hypothesis H0 is true.
The p-value is the area under the curve of the hypothesized (H0) distribution beyond the observed test statistic.
If this value is very unlikely (i.e., p-value less than 5%), we reject H0.
Test decision using the p-value Example
H_0: The new drug is no better than the old one.
Data: Patients on the new drug live, on average, 2 years longer than those on the old drug.
p-value: The probability of seeing such a large difference (or bigger) if the new drug actually does nothing special.
If p = 0.03 (3%), this means “there’s only a 3% chance we’d see this result if the drug had no effect.”
Since 3% < 5%, it’s unlikely the difference is just by chance → we reject H_0 and conclude the drug is effective.
If p = 0.30 (30%), this means “it’s not unusual to see this result just by chance.”
Since 30% > 5%, we cannot reject H_0 → we don’t have strong evidence the drug works.
Test decision using the confidence interval:
If the appropriate confidence interval (100(1 − α)%) does not cover the value 𝜃0 targeted in the hypothesis, then H0 is rejected.
Imagine we test whether the average height of students is 170 cm.
H_0: Mean height = 170 cm.
From our data, we calculate a 95% confidence interval: [172 cm, 180 cm].
Since 170 is not inside [172, 180], we reject H_0.
If instead the CI was [168 cm, 176 cm], then 170 is inside → we do not reject H_0.
Comparing sample means: one sample t-test, for the following question:
Is our users’ satisfaction level above the target level of 3.5?
sample mean = 3.7
n = 40
sample std = 0.6
Hypothesized mean = 3.5
Critical Value from t distribution:
𝐻0 ∶ 𝜇 ≤ 3.5
𝐻1: 𝜇 > 3.5
t = 2.11 > 1.68 (critical t value for α = 5%, df = 39) → we reject H0
Conclusion: The data provide significant evidence that user satisfaction is above 3.5.
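The same test computed in Python from the summary statistics given above:

```python
import numpy as np
from scipy import stats

sample_mean, sample_std, n, mu0 = 3.7, 0.6, 40, 3.5
alpha = 0.05

t_stat = (sample_mean - mu0) / (sample_std / np.sqrt(n))  # ≈ 2.11
t_crit = stats.t.ppf(1 - alpha, df=n - 1)                 # one-sided critical value
p_value = stats.t.sf(t_stat, df=n - 1)                    # ≈ 0.02
reject_h0 = t_stat > t_crit                               # True -> reject H0
```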
Comparing sample means: two sample t-test, for the following question:
Do Dutch and Spanish users differ regarding their satisfaction?
What are the distribution assumptions behind the t-test?
For large samples (n > 30 per group), the t-test works even if data are not normal → thanks to the Central Limit Theorem.
For small samples, normality matters more.
Be cautious with heavily skewed data or outliers → the mean may not be a meaningful measure.
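A sketch with scipy (the satisfaction scores are invented; Welch's variant is used because it does not assume equal variances):

```python
from scipy import stats

# Hypothetical satisfaction scores for two user groups
dutch   = [3.9, 3.7, 4.1, 3.5, 3.8, 4.0, 3.6, 3.9]
spanish = [3.4, 3.6, 3.2, 3.5, 3.7, 3.3, 3.5, 3.6]

# Welch's two-sample t-test (equal_var=False): H0 = equal means, two-sided
t_stat, p_value = stats.ttest_ind(dutch, spanish, equal_var=False)
```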
Practical vs. statistical significance:
statistical significance (ss):
a way of testing whether it is likely to observe a value assuming a hypothesis is true
sometimes ss has little practical implication → we should always look at the practical implications of our conclusions
You have a sample of 20,000 Netflix users. You are interested in the relationship between the average hours users watch movies and whether the users have a premium subscription. To test this relationship you want to formulate different hypotheses.
Please write down
(1) a one sample, one-sided hypothesis
(2) a one sample, two-sided hypotheses
(3) a two-sample, one-sided hypothesis
(4) and a two-sample, two-sided hypothesis.
For each of the hypothesis specify a null hypothesis and an alternative hypothesis.
(1) H0: the avgHoursWatched ≤ 2 hours
H1: the avgHoursWatched > 2 hours
(2) H0: the avgHoursWatched = 2 hours
H1: the avgHoursWatched ≠ 2 hours
(3) H0: people with premSub have a lower or equal average than people without
H1: people with premSub have a higher average than people without
(4) H0: people with and without premSub have the same average
H1: people with and without premSub do NOT have the same average
How to interpret the following result for these hypothesis:
H₀: The average income of Dutch users is less than or equal to €50,000.
H₁: The average income of Dutch users is greater than €50,000.
For this hypothesis only the first part of the test output is relevant:
t = 29.48 → your sample mean is 29.5 standard errors above 50,000. Far away in the positive direction
p-value ≈ 0: referring to H0 (“Dutch users earn less or equal”).
If the true mean were ≤ 50,000, the chance of observing a result as extreme as 29.48 SE above is essentially zero.
Therefore, reject H₀ → conclude the mean is greater than 50,000.
P-value:
p = P(observing a test statistic as extreme or more extreme than the one you got | H0 is true)
It assumes the null hypothesis H_0 is true.
It asks: how unusual are my observed results under that assumption? (values near 0 = very unusual)