Linear regression
The simplest model
used to study the linear relationship between two variables
β0 = intercept, β1 = slope
Can this be reversed?
Yes, but the regression line will look different
Why? Because OLS (ordinary least squares) minimizes squared errors of the dependent variable only.
In the first model, we minimize errors in predicting avgHoursWatched.
In the reversed model, we minimize errors in predicting income.
How to calculate the regression line?
Data (4 students):
Student A: (x=2, y=65)
Student B: (x=3, y=70)
Student C: (x=5, y=75)
Student D: (x=7, y=85)
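The least-squares formulas are β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and β̂0 = ȳ − β̂1·x̄. A minimal sketch of the calculation for the four points above (what x and y stand for is not stated here, so treat them as generic variables):

import numpy as np

x = np.array([2, 3, 5, 7])
y = np.array([65, 70, 75, 85])

# slope: sum of cross-deviations divided by sum of squared x-deviations
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# intercept: the line passes through (x_mean, y_mean)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)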
How to interpret this regression line?
On average, the model predicts that without any training, the event occurs after about 100.28 days. For each additional hour of training, the expected time decreases by 1.22 days. However, this interpretation only makes sense within the range of the observed data — extrapolating too far leads to unrealistic predictions (e.g., negative days).
Properties of a regression line
The regression line should be trusted only for x values between the smallest and largest observed values ([x_{(1)}, x_{(n)}]) -> otherwise we are extrapolating and get unrealistic predictions
Basically it is just the equation of a straight line
A fundamental property: the regression line always goes through the averages of x and y
Residuals are the vertical distances from the points to the line. The least squares method guarantees that they sum to exactly zero
The average of predicted values = average of actual values
The least squares estimate has a direct relationship with the Pearson correlation coefficient: the slope is proportional to the correlation coefficient. A positive correlation gives a positive (increasing) slope, a negative correlation a negative slope. BUT a stronger correlation does not imply a steeper slope:
A strong correlation can produce a small slope if x has huge variability compared to y, and vice versa (it depends on the scale/ratio of the variabilities)
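Concretely, the least-squares slope is β̂1 = r · (s_y / s_x), where s_x and s_y are the standard deviations of x and y. The same r can therefore give very different slopes depending on the ratio s_y / s_x.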
Gauss-Markov Theorem assumptions so that the OLS estimator is BLUE:
Best
Linear
Unbiased
Estimator
Linearity in parameters
No perfect multicollinearity
Exogeneity of independent variables
Homoscedasticity
Independence of errors
Linearity
Gauss-Markov Theorem:
The regression model must be linear in the coefficients (ß0, ß1)
Why? Linear structure allows OLS (Ordinary Least Squares) to derive closed-form solutions -> OLS finds the coefficients ß that minimize the sum of squared residuals
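In matrix notation, the closed-form OLS solution is β̂ = (XᵀX)⁻¹ Xᵀ y.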
Gauss-Markov Theorem
Regression needs each variable to bring in new information.
This ensures that the matrix used to compute the estimates (XᵀX) is invertible, so the estimates are unique (perfect multicollinearity = one variable can be expressed as a linear combination of the other variables; in practice, very high correlation between covariates already causes problems)
To check for multicollinearity, we can compute the VIF (variance inflation factor) of the j-th variable as:
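VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing the j-th independent variable on all the other independent variables. As a common rule of thumb, VIF values well above 5-10 indicate problematic multicollinearity.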
Example:
Regression is like figuring out how much each ingredient affects taste.
If two ingredients always move together (e.g., chocolate = 2 × sugar), regression can’t tell their effects apart (perfect multicollinearity).
If they move almost always together, regression can still work but the answers get messy and unreliable (high multicollinearity).
-> this is why we dropped one country at the end of the exercise in week 3
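A minimal sketch of how the VIFs could be computed with statsmodels (the DataFrame df and its column names are hypothetical):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df is a hypothetical DataFrame holding the covariates of the model
X = sm.add_constant(df[["income", "satisfaction", "avgHoursWatched"]])
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # large values (rule of thumb: > 5-10) signal multicollinearity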
Exogeneity
The error term has an expected value of zero (and is uncorrelated with the independent variables)
Imagine you’re trying to figure out how ice cream sales depend on temperature.
Independent variable (X): temperature
Dependent variable (y): ice cream sales
Error term (ε): everything we didn’t measure but still affects sales (like holidays, advertising, or if there’s a beach nearby).
👉 The rule “Exogeneity” says:
The extra stuff (error) should not be secretly connected to your main ingredient (X).
If holidays (in the error) happen mostly on hot days (X), then temperature and error are tangled together.
That makes it impossible to tell: “Is it really the hot weather that drives ice cream sales, or is it just the holiday?”
heteroscedasticity
Heteroscedasticity refers to a situation in regression analysis where the variance of the error terms (residuals) is not constant across all levels of the independent variables.
-> even under heteroscedasticity the OLS estimator can still be unbiased (but it is no longer efficient, and the usual standard errors are wrong)
Imagine you’re measuring how tall kids are at different ages.
At age 5, kids are all kind of close in height (not much spread).
At age 15, some are very short, some very tall (a lot of spread).
So:
If the “spread” (scatter) of heights is about the same for every age → homoscedasticity (fancy word for “equal spread”).
If the spread gets bigger or smaller depending on the age → heteroscedasticity (“different spread”).
Independence of errors -> no autocorrelation
Autocorrelation -> errors depend on the previous ones
(no autocorrelation) – ensures that the estimates are efficient and that the standard errors are correctly estimated.
Imagine you’re watching the weather every day 🌦️.
If today’s error (how wrong your prediction was) has nothing to do with yesterday’s error, that’s independence of errors.
But if today’s error is similar to yesterday’s (e.g., every time you predict too low today, you also predict too low tomorrow), then the mistakes are linked → that’s autocorrelation.
👉 Why is this bad?
Because your model thinks it’s making “independent” mistakes, but in reality the mistakes are connected like dominoes. That makes you too confident in your results — like thinking you guessed right 10 times in a row, when in fact you only made 1 good guess and copied it 9 times.
-> Because the data has momentum or trends that the regression line doesn’t capture, the residuals can be shaped like a wave
Goodness-of-fit: R^2
Total variance = variance explained + variance that cannot be explained; R^2 is the explained share of the total variance
R2 ranges from zero to one.
R2 larger than 70% is considered a good fit.
-> 70% of the variance in y can be explained by the model
R2 is the variance explained by our regression divided by the observed (total) variance:
Decomposed Variance
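The decomposition behind this: total sum of squares = explained sum of squares + residual sum of squares (SST = SSR + SSE), and R^2 = SSR / SST = 1 − SSE / SST.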
relationship between R2 and the Pearson correlation coefficient (r)
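In simple linear regression (one covariate) this relationship is exact: R^2 = r^2, the squared Pearson correlation between x and y.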
Different types of linear regressions
binary covariates
transformed covariates
multiple covariates
categorical covariates
transformed outcome variable
What is the influence of a premium subscription (binary variable) on average hours watched (continuous variable)?
What is the influence of squared (or log) income on average hours watched?
Depending on the data, we apply either a square (^2) or a log transformation.
We only change the plotting: the axis now shows x^2 instead of x, and in this new space the data are perfectly linear.
The same works for the log.
If we are interested in the effect of more than one variable on our dependent variable, we need a multiple linear regression.
What if we want to include a categorical variable (with k categories) like the country in our regression model?
Procedure:
We create k-1 new binary variables and call them dummy variables
These dummy variables equal 1 for units that belong to the category and 0 otherwise.
The category for which we did not create a dummy variable is called the reference category.
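A minimal sketch with pandas (hypothetical DataFrame df with a 'country' column):

import pandas as pd

# k categories -> k-1 dummy variables; drop_first=True drops the reference category
dummies = pd.get_dummies(df["country"], prefix="country", drop_first=True)
df = pd.concat([df.drop(columns="country"), dummies], axis=1)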
While in many situations transformations make interpretation quite difficult, a log transformation is quite common and easy to interpret. Consider the log-linear model:
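log(y) = β0 + β1·x + ε. A one-unit increase in x then changes y by a factor of e^{β1}, i.e., by approximately 100·β1 percent for small β1.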
It also makes interpretation more natural: instead of “absolute changes” in y, we now talk about percentage changes.
significance of individual coefficients
Hypotheses testing
Two options
p-value
confidence interval
The p-value is the probability of observing your data (or something more extreme) if the null hypothesis is true.
p-value < 0.05 → reject H_0, coefficient is significant.
p-value ≥ 0.05 → not enough evidence, fail to reject H_0.
H0 always assumes there is no “effect” -> the coefficient is 0 -> the variable has no influence on the dependent variable -> that’s what we typically want to reject
Confidence intervals
We test coefficients against 0 because 0 means “no effect”.
If 0 is ruled out, then the predictor likely has a real influence on the outcome.
If we repeated this experiment many times, 95% of the intervals we build would contain the true coefficient.
How to interpret regression tables (Focus on upper part)
This is the outcome we are trying to predict.
The model explains 35.6% of the variance in avgHoursWatched.
That’s not bad for social science data (where behavior is influenced by many factors), but it also means 64.4% is unexplained.
Adjusted R² corrects for the number of predictors in the model.
Here it’s almost the same as R² → this suggests the model is not overfitting badly (adding predictors is really contributing some explanatory power).
The F-test checks whether the model, as a whole, is better than a model with no predictors.
Since p-value = 0.00 (less than 0.05), the model is statistically significant overall.
👉 At least one predictor is useful.
You have a large sample size → more reliable estimates and narrower confidence intervals.
These are information criteria used to compare models (lower values = better fit).
You’d only use them if you were comparing multiple regression models.
Standard errors are computed under the assumption of homoscedasticity (constant variance of errors).
If there is heteroscedasticity, these SEs could be misleading — often researchers rerun with robust standard errors.
How to interpret regression tables (Focus on lower part)
The regression estimates show that the baseline (intercept) average hours watched is 2.91 for the reference group (likely Austria, since it is not listed as a dummy variable). The country dummy variables are interpreted relative to this reference group:
per country:
Users in Belgium watch on average 1.49 more hours compared to the reference, with a 95% confidence interval of [1.42, 1.56]. This effect is statistically significant (p < 0.001).
Users in France watch on average 1.01 more hours compared to the reference (95% CI [0.94, 1.08], p < 0.001).
Users in Germany watch on average 1.89 more hours compared to the reference (95% CI [1.82, 1.96], p < 0.001).
Other predictors:
Satisfaction has a positive effect of 0.24 hours per unit increase (95% CI [0.23, 0.26], p < 0.001).
Income has a very small but statistically significant negative effect (coefficient ≈ -2.45e-05, p < 0.001). This effect is negligible in practical terms.
Having a premium subscription (premSub) shows a coefficient of -0.0064, which is not statistically significant (p = 0.719). Therefore, premium subscription status does not have a meaningful effect on hours watched.
Intercept:
An Austrian user (because Austria is the baseline), with satisfaction = 0, income = 0, premSub = 0, has a predicted avgHoursWatched of 2.9057.
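A minimal sketch of how a table like this could be produced with statsmodels (the DataFrame df and its column names avgHoursWatched, country, satisfaction, income, premSub are assumed from the description above):

import statsmodels.formula.api as smf

# C(country) creates k-1 dummy variables and keeps one country as the reference category
model = smf.ols("avgHoursWatched ~ C(country) + satisfaction + income + premSub", data=df).fit()
print(model.summary())  # R-squared, adjusted R-squared, F-test, coefficients, 95% CIs, p-values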
The problems of a binary dependent variable:
If we fitted a normal linear regression line, we would predict values different from zero and one.
Could we interpret these values as probabilities? Even then, we would predict probabilities greater than 1 and smaller than 0.
-> Solution: Logistic regression
Logistic regression
Instead of predicting directly “yes” or “no,” logistic regression predicts a probability:
This is much better because:
Probabilities always stay between 0 and 1.
We can still decide yes/no by setting a threshold (e.g., predict “yes” if probability > 0.5).
First Step:
You start with a linear regression–like idea by combining your input variables (like income, satisfaction, etc.) with weights (coefficients) to get a single number called z (the logit), which can take any value from -\infty to +\infty.
Second Step:
To turn z into a probability, we feed it into the logistic function, which squashes any value of z into the range between 0 and 1, avoiding impossible probabilities like –0.5 or 1.3.
-> In logistic regression, the independent variables combine linearly to form z, but once z is passed through the logistic function, the relationship between the inputs and the probability of y=1 becomes non-linear.
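In formulas: z = β0 + β1·x1 + … + βp·xp, and P(y=1) = 1 / (1 + e^{-z}).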
Estimation of a logistic regression
OLS does not work here, as it only works for linear models
Instead, we use Maximum Likelihood Estimation (MLE):
We start with some guesses for the coefficients.
Using them, we compute the probability of seeing each observed outcome (like “this person subscribed” = 1, “this person didn’t” = 0).
We combine all these probabilities into one big “likelihood score.”
Then we adjust the coefficients (βs) until this likelihood is as high as possible.
In short: OLS minimizes errors, MLE maximizes the chance of the observed data.
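A minimal sketch of the MLE idea for a logistic regression with one covariate (the toy data here are made up purely for illustration):

import numpy as np
from scipy.optimize import minimize

# toy data: first column of X is the constant, second the covariate
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0], [1, 3.0]])
y = np.array([0, 1, 0, 1, 1])

def neg_log_likelihood(beta):
    p = 1 / (1 + np.exp(-(X @ beta)))            # logistic function
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# start from a guess and adjust the betas until the likelihood is maximal
result = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]))
print(result.x)  # estimated coefficients (beta0, beta1)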
pseudo R2
Goodness-of-fit
Pseudo R² metrics provide an indication of how well the model fits the data, but they don't have the same interpretation as R² in linear regression.
McFadden’s R²: compares your model (with predictors) to a null model (with no predictors).
-> R² = 0.3 : your model fits much better than the null model. (1 best)
Cox & Snell R²: similar idea, but its values never quite reach 1 (because of math limitations).
-> R² = 0.3 : your model explains 30% of the possible improvement compared to the null model
Nagelkerke’s R²: fixes the Cox & Snell problem by stretching it so it can range from 0 to 1.
-> R² = 0.6 : strong fit.
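McFadden’s version is computed as R² = 1 − LL(full model) / LL(null model); with the log-likelihoods from the logit output below, 1 − (−13437)/(−13840) ≈ 0.029, which matches the reported Pseudo R².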
Likelihood Ratio test
LR test: we compare the full model with a “null model” that contains only the constant term, 𝛽0. That is, we impose the restriction 𝛽1 = ⋯ = 𝛽𝑝 = 0.
Resulting test statistic: LR = 2·(LL[full model] − LL[restricted model])
It is chi²-distributed, with degrees of freedom equal to the number of coefficients (β) that are constrained
The LR test corresponds to the F test in an OLS regression
-> So the formula gives you the LR value, and you use df when you look it up in the chi² distribution, to compute the p-value
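With the (rounded) log-likelihoods from the logit output below: LR ≈ 2·(−13437 − (−13840)) = 806, with df = 3 constrained coefficients (income, avgHoursWatched, satisfaction); looked up in the chi² distribution this gives the tiny LLR p-value reported there.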
classification table
We compute what percentage of observations are correctly predicted by the full model and compare this to the null model (i.e., a model that assigns all observations to the most frequent outcome).
Example confusion matrix:
Although we lose some true positives we gain substantially more true negatives.
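A minimal sketch of this comparison (assuming a fitted statsmodels Logit result res and the observed outcome vector y):

import numpy as np

pred = (res.predict() > 0.5).astype(int)       # full model: predict 1 if P(y=1) > 0.5
acc_full = np.mean(pred == y)                  # share of correctly predicted observations
acc_null = max(np.mean(y), 1 - np.mean(y))     # null model: always predict the most frequent outcome
print(acc_full, acc_null)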
Interpretation of coefficients
𝛽 > 0 and p-value below the significance level
→ positive influence on P(Y=1)
𝛽 < 0 and p-value below significance level
→ negative influence on P(Y=1)
p-value above significance level
→ we cannot reject H0
Interpretation of the following Logit regression result, Upper part
Dependent variable: premSub (Premium subscription, likely 0 = no, 1 = yes).
Pseudo R² = 0.029 → This is very low. It means the model explains only ~3% of the variation in subscription behavior. Logistic regression R² values are usually lower than in linear regression, but this still suggests weak predictive power.
Log-Likelihood: −13437 (full model).
LL-Null: −13840 (null model, only intercept).
LLR p-value = 3.526e−174 → Extremely small → the model as a whole fits significantly better than the null model (so at least one predictor matters).
Interpretation of the following Logit regression result, Lower part
Intercept: Coefficient: −1.024 (significant, p < 0.001).
Income: Coefficient: 1.206e−05 (very small but highly significant, p < 0.001).
Average hours watched and satisfaction: not significant (p-value > 0.05)
Odds ratio
Interpretation of regression results becomes simpler if we consider not P(y=1) but the “odds”:
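Here odds = P(y=1) / (1 − P(y=1)), and the odds ratio for a one-unit increase in x_j is OR_j = e^{β_j}.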
Interpretation:
OR = 1: No effect. The odds of the outcome occurring are the same, regardless of the value of the predictor.
OR > 1: Positive association. A one-unit increase in the predictor variable increases the odds of the outcome occurring. For example, an OR of 1.5 means that for every one-unit increase in the predictor, the odds of the outcome increase by 50%.
OR < 1: Negative association. A one-unit increase in the predictor variable decreases the odds of the outcome. For example, an OR of 0.5 means that for every one-unit increase in the predictor, the odds of the outcome decrease by 50%.
-> the odds increase by 49% for each unit increase in x_1.
Interpret the following result:
Model Fit:
premSub is binary -> either 0 or 1 -> we apply the logistic function -> Logit regression result
Pseudo R² = 0.029
→ The predictors explain about 2.9% of the variation in the log-odds of subscribing. This is weak explanatory power.
Log-Likelihood
Null model: -13840
Fitted model: -13437
→ The improvement is significant (goal: get closer to zero)
Likelihood Ratio (LR) test p-value = 3.5e-174
→ The model overall is statistically significant (at least one predictor helps explain premSub).
Coefficients:
Intercept = -1.0240, p < 0.001
→ When all predictors = 0, the log-odds of subscribing is -1.02 (probability < 0.5).
Income = 1.206e-05, p < 0.001
→ Statistically significant.
→ Each unit increase in income increases the log-odds of subscribing by 0.00001206.
→ Since income is probably measured in whole currency units, this is tiny per unit, but grows with scale.
→ In odds ratio terms: e^{0.00001206} ≈ 1.000012.
A 10,000-unit increase in income → odds of subscribing increase by about 12%.
avgHoursWatched = 0.0137, p = 0.194
→ Not statistically significant.
→ Watching more hours does not meaningfully predict premium subscription.
Satisfaction = 0.0083, p = 0.501
→ Satisfaction scores also do not explain subscription behavior here.
Confidence Intervals:
Income CI = [1.1e-05, 1.31e-05] → clearly positive, reinforcing significance.
Hours watched CI includes 0 → not significant.
Satisfaction CI includes 0 → not significant.
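A minimal sketch of how this logit output and the corresponding odds ratios could be produced (same assumed DataFrame and column names as above):

import numpy as np
import statsmodels.formula.api as smf

logit_res = smf.logit("premSub ~ income + avgHoursWatched + satisfaction", data=df).fit()
print(logit_res.summary())        # Pseudo R-squared, log-likelihood, LLR p-value, coefficients, CIs
print(np.exp(logit_res.params))   # odds ratios, e.g. exp(1.206e-05) for income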