Linear regression
The simplest model
used to study the linear relationship between two variables
β0 = intercept, β1 = slope
Can this be reversed?
Yes, but the regression line will look different
Why? Because OLS (ordinary least squares) minimizes squared errors of the dependent variable only.
In the first model, we minimize errors in predicting avgHoursWatched.
In the reversed model, we minimize errors in predicting income.
How to calculate the regression line?
Data (4 students):
Student A: (x=2, y=65)
Student B: (x=3, y=70)
Student C: (x=5, y=75)
Student D: (x=7, y=85)
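The least-squares formulas are β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and β̂0 = ȳ − β̂1·x̄. A minimal sketch of the calculation for the four points above (what x and y stand for is not stated here, so treat them as generic variables):

import numpy as np

x = np.array([2, 3, 5, 7])
y = np.array([65, 70, 75, 85])

# slope: sum of cross-deviations divided by sum of squared x-deviations
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# intercept: the line passes through (x_mean, y_mean)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)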
How to interpret this regression line?
On average, the model predicts that without any training, the event occurs after about 100.28 days. For each additional hour of training, the expected time decreases by 1.22 days. However, this interpretation only makes sense within the range of the observed data — extrapolating too far leads to unrealistic predictions (e.g., negative days).
Properties of a regression line
The regression line should be trusted only for x values between the smallest and largest observed values ([x_{(1)}, x_{(n)}]) -> otherwise we are extrapolating and get unrealistic predictions
Basically it is just the equation of a straight line
A fundamental property: the regression line always goes through the averages of x and y
Residuals are the vertical distances from the points to the line. The least squares method guarantees that they sum to exactly zero
The average of predicted values = average of actual values
The least squares estimate has a direct relationship with the Pearson correlation coefficient: the slope is proportional to the correlation coefficient. A positive correlation gives a positive (increasing) slope, a negative correlation a negative slope. BUT a stronger correlation does not imply a steeper slope:
A strong correlation can produce a small slope if x has huge variability compared to y, and vice versa (it depends on the scale/ratio of the variabilities)
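Concretely, the least-squares slope is β̂1 = r · (s_y / s_x), where s_x and s_y are the standard deviations of x and y. The same r can therefore give very different slopes depending on the ratio s_y / s_x.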
Gauss-Markov Theorem assumptions so that the OLS estimator is BLUE:
Best
Linear
Unbiased
Estimator
Linearity in parameters
No perfect multicollinearity
Exogeneity of independent variables
Homoscedasticity
Independence of errors
Linearity
Gauss-Markov Theorem:
The regression model must be linear in the coefficients (ß0, ß1)
Why? Linear structure allows OLS (Ordinary Least Squares) to derive closed-form solutions -> OLS finds the coefficients ß that minimize the sum of squared residuals
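In matrix notation, the closed-form OLS solution is β̂ = (XᵀX)⁻¹ Xᵀ y.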
Gauss-Markov Theorem
Regression needs each variable to bring in new information.
This ensures that the matrix used to compute the estimates (XᵀX) is invertible, so the estimates are unique (perfect multicollinearity = one variable can be expressed as a linear combination of the other variables; in practice, very high correlation between covariates already causes problems)
To check for multicollinearity, we can compute the VIF (variance inflation factor) of the j-th variable as:
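VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing the j-th independent variable on all the other independent variables. As a common rule of thumb, VIF values well above 5-10 indicate problematic multicollinearity.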
Example:
Regression is like figuring out how much each ingredient affects taste.
If two ingredients always move together (e.g., chocolate = 2 × sugar), regression can’t tell their effects apart (perfect multicollinearity).
If they move almost always together, regression can still work but the answers get messy and unreliable (high multicollinearity).
-> this is why we dropped one country at the end of the exercise in week 3
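A minimal sketch of how the VIFs could be computed with statsmodels (the DataFrame df and its column names are hypothetical):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df is a hypothetical DataFrame holding the covariates of the model
X = sm.add_constant(df[["income", "satisfaction", "avgHoursWatched"]])
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # large values (rule of thumb: > 5-10) signal multicollinearity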
Exogeneity
The error term has an expected value of zero (and is uncorrelated with the independent variables)
Imagine you’re trying to figure out how ice cream sales depend on temperature.
Independent variable (X): temperature
Dependent variable (y): ice cream sales
Error term (ε): everything we didn’t measure but still affects sales (like holidays, advertising, or if there’s a beach nearby).
👉 The rule “Exogeneity” says:
The extra stuff (error) should not be secretly connected to your main ingredient (X).
If holidays (in the error) happen mostly on hot days (X), then temperature and error are tangled together.
That makes it impossible to tell: “Is it really the hot weather that drives ice cream sales, or is it just the holiday?”
heteroscedasticity
Heteroscedasticity refers to a situation in regression analysis where the variance of the error terms (residuals) is not constant across all levels of the independent variables.
-> even under heteroscedasticity the OLS estimator can still be unbiased (but it is no longer efficient, and the usual standard errors are wrong)
Imagine you’re measuring how tall kids are at different ages.
At age 5, kids are all kind of close in height (not much spread).
At age 15, some are very short, some very tall (a lot of spread).
So:
If the “spread” (scatter) of heights is about the same for every age → homoscedasticity (fancy word for “equal spread”).
If the spread gets bigger or smaller depending on the age → heteroscedasticity (“different spread”).
Independence of errors -> no autocorrelation
Autocorrelation -> errors depend on the previous ones
(no autocorrelation) – ensures that the estimates are efficient and that the standard errors are correctly estimated.
Imagine you’re watching the weather every day 🌦️.
If today’s error (how wrong your prediction was) has nothing to do with yesterday’s error, that’s independence of errors.
But if today’s error is similar to yesterday’s (e.g., every time you predict too low today, you also predict too low tomorrow), then the mistakes are linked → that’s autocorrelation.
👉 Why is this bad?
Because your model thinks it’s making “independent” mistakes, but in reality the mistakes are connected like dominoes. That makes you too confident in your results — like thinking you guessed right 10 times in a row, when in fact you only made 1 good guess and copied it 9 times.
-> Because the data has momentum or trends that the regression line doesn’t capture, the residuals can be shaped like a wave
Goodness-of-fit: R^2
Total variance = variance explained + variance that cannot be explained; R^2 is the explained share of the total variance
R2 ranges from zero to one.
R2 larger than 70% is considered a good fit.
-> 70% of the variance in y can be explained by the model
R2 is the variance explained by our regression divided by the observed (total) variance:
Decomposed Variance
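The decomposition behind this: total sum of squares = explained sum of squares + residual sum of squares (SST = SSR + SSE), and R^2 = SSR / SST = 1 − SSE / SST.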
relationship between R2 and the Pearson correlation coefficient (r)
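In simple linear regression (one covariate) this relationship is exact: R^2 = r^2, the squared Pearson correlation between x and y.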
Different types of linear regressions
binary covariates
transformed covariates
multiple covariates
categorical covariates
transformed outcome variable
What is the influence of a premium subscription (binary variable) on average hours watched (continuous variable)?
What is the influence of squared (or log) income on average hours watched?
Depending on the data, we apply either a square (^2) or a log transformation.
We only change the plotting: the axis now shows x^2 instead of x, and in this new space the data are perfectly linear.
The same works for the log.
If we are interested in the effect of more than one variable on our dependent variable, we need a multiple linear regression.
What if we want to include a categorical variable (with k categories) like the country in our regression model?
Procedure:
We create k-1 new binary variables and call them dummy variables
These dummy variables equal 1 for units that belong to the category and 0 otherwise.
The category for which we did not create a dummy variable is called the reference category.
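A minimal sketch with pandas (hypothetical DataFrame df with a 'country' column):

import pandas as pd

# k categories -> k-1 dummy variables; drop_first=True drops the reference category
dummies = pd.get_dummies(df["country"], prefix="country", drop_first=True)
df = pd.concat([df.drop(columns="country"), dummies], axis=1)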
While in many situations transformations make interpretation quite difficult, a log transformation is quite common and easy to interpret. Consider the log-linear model:
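log(y) = β0 + β1·x + ε. A one-unit increase in x then changes y by a factor of e^{β1}, i.e., by approximately 100·β1 percent for small β1.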
It also makes interpretation more natural: instead of “absolute changes” in y, we now talk about percentage changes.
significance of individual coefficients
Hypotheses testing
Two options
p-value
confidence interval
The p-value is the probability of observing your data (or something more extreme) if the null hypothesis is true.
p-value < 0.05 → reject H_0, coefficient is significant.
p-value ≥ 0.05 → not enough evidence, fail to reject H_0.
H0 always assumes there is no “effect” -> the coefficient is 0 -> the variable has no influence on the dependent variable -> that’s what we typically want to reject
Confidence intervals
We test coefficients against 0 because 0 means “no effect”.
If 0 is ruled out, then the predictor likely has a real influence on the outcome.
If we repeated this experiment many times, 95% of the intervals we build would contain the true coefficient.
How to interpret regression tables (Focus on upper part)
This is the outcome we are trying to predict.
The model explains 35.6% of the variance in avgHoursWatched.
That’s not bad for social science data (where behavior is influenced by many factors), but it also means 64.4% is unexplained.
Adjusted R² corrects for the number of predictors in the model.
Here it’s almost the same as R² → this suggests the model is not overfitting badly (adding predictors is really contributing some explanatory power).
The F-test checks whether the model, as a whole, is better than a model with no predictors.
Since p-value = 0.00 (less than 0.05), the model is statistically significant overall.
👉 At least one predictor is useful.
You have a large sample size → more reliable estimates and narrower confidence intervals.
These are information criteria used to compare models (lower values = better fit).
You’d only use them if you were comparing multiple regression models.
Standard errors are computed under the assumption of homoscedasticity (constant variance of errors).
If there is heteroscedasticity, these SEs could be misleading — often researchers rerun with robust standard errors.
How to interpret regression tables (Focus on lower part)
The regression estimates show that the baseline (intercept) average hours watched is 2.91 for the reference group (likely Austria, since it is not listed as a dummy variable). The country dummy variables are interpreted relative to this reference group:
per country:
Users in Belgium watch on average 1.49 more hours compared to the reference, with a 95% confidence interval of [1.42, 1.56]. This effect is statistically significant (p < 0.001).
Users in France watch on average 1.01 more hours compared to the reference (95% CI [0.94, 1.08], p < 0.001).
Users in Germany watch on average 1.89 more hours compared to the reference (95% CI [1.82, 1.96], p < 0.001).
Other predictors:
Satisfaction has a positive effect of 0.24 hours per unit increase (95% CI [0.23, 0.26], p < 0.001).
Income has a very small but statistically significant negative effect (coefficient ≈ -2.45e-05, p < 0.001). This effect is negligible in practical terms.
Having a premium subscription (premSub) shows a coefficient of -0.0064, which is not statistically significant (p = 0.719). Therefore, premium subscription status does not have a meaningful effect on hours watched.
Intercept:
An Austrian user (because Austria is the baseline), with satisfaction = 0, income = 0, premSub = 0, has a predicted avgHoursWatched of 2.9057.
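A minimal sketch of how a table like this could be produced with statsmodels (the DataFrame df and its column names avgHoursWatched, country, satisfaction, income, premSub are assumed from the description above):

import statsmodels.formula.api as smf

# C(country) creates k-1 dummy variables and keeps one country as the reference category
model = smf.ols("avgHoursWatched ~ C(country) + satisfaction + income + premSub", data=df).fit()
print(model.summary())  # R-squared, adjusted R-squared, F-test, coefficients, 95% CIs, p-values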
The problems of a binary dependent variable:
If we fitted a normal linear regression line, we would predict values different from zero and one.
Could we interpret these values as probabilities? Even then, we would predict probabilities greater than 1 and smaller than 0.
-> Solution: Logistic regression
Logistic regression
Instead of predicting directly “yes” or “no,” logistic regression predicts a probability:
This is much better because:
Probabilities always stay between 0 and 1.
We can still decide yes/no by setting a threshold (e.g., predict “yes” if probability > 0.5).
First Step:
You start with a linear regression–like idea by combining your input variables (like income, satisfaction, etc.) with weights (coefficients) to get a single number called z (the logit), which can take any value from -\infty to +\infty.
Second Step:
To turn z into a probability, we feed it into the logistic function, which squashes any value of z into the range between 0 and 1, avoiding impossible probabilities like –0.5 or 1.3.
-> In logistic regression, the independent variables combine linearly to form z, but once z is passed through the logistic function, the relationship between the inputs and the probability of y=1 becomes non-linear.
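In formulas: z = β0 + β1·x1 + … + βp·xp, and P(y=1) = 1 / (1 + e^{-z}).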
Estimation of a logistic regression
OLS does not work here, as it only works for linear models
Instead, we use Maximum Likelihood Estimation (MLE):
We start with some guesses for the coefficients.
Using them, we compute the probability of seeing each observed outcome (like “this person subscribed” = 1, “this person didn’t” = 0).
We combine all these probabilities into one big “likelihood score.”
Then we adjust the coefficients (βs) until this likelihood is as high as possible.
In short: OLS minimizes errors, MLE maximizes the chance of the observed data.
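A minimal sketch of the MLE idea for a logistic regression with one covariate (the toy data here are made up purely for illustration):

import numpy as np
from scipy.optimize import minimize

# toy data: first column of X is the constant, second the covariate
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0], [1, 3.0]])
y = np.array([0, 1, 0, 1, 1])

def neg_log_likelihood(beta):
    p = 1 / (1 + np.exp(-(X @ beta)))            # logistic function
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# start from a guess and adjust the betas until the likelihood is maximal
result = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]))
print(result.x)  # estimated coefficients (beta0, beta1)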
pseudo R2
Goodness-of-fit
Pseudo R² metrics provide an indication of how well the model fits the data, but they don't have the same interpretation as R² in linear regression.
McFadden’s R²: compares your model (with predictors) to a null model (with no predictors).
-> R² = 0.3 : your model fits much better than the null model. (1 best)
Cox & Snell R²: similar idea, but its values never quite reach 1 (because of math limitations).
-> R² = 0.3 : your model explains 30% of the possible improvement compared to the null model
Nagelkerke’s R²: fixes the Cox & Snell problem by stretching it so it can range from 0 to 1.
-> R² = 0.6 : strong fit.
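McFadden’s version is computed as R² = 1 − LL(full model) / LL(null model); with the log-likelihoods from the logit output below, 1 − (−13437)/(−13840) ≈ 0.029, which matches the reported Pseudo R².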
Likelihood Ratio test
LR test: we compare the full model with a “null model” that contains only the constant term, 𝛽0. That is, we impose the restriction 𝛽1 = ⋯ = 𝛽𝑝 = 0.
Resulting test statistic: LR = 2·(LL[full model] − LL[restricted model])
It is chi²-distributed, with degrees of freedom equal to the number of coefficients (β) that are constrained
The LR test corresponds to the F test in an OLS regression
-> So the formula gives you the LR value, and you use df when you look it up in the chi² distribution, to compute the p-value
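With the (rounded) log-likelihoods from the logit output below: LR ≈ 2·(−13437 − (−13840)) = 806, with df = 3 constrained coefficients (income, avgHoursWatched, satisfaction); looked up in the chi² distribution this gives the tiny LLR p-value reported there.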
classification table
We compute what percentage of observations are correctly predicted by the full model and compare this to the null model (i.e., a model that assigns all observations to the most frequent outcome).
Example confusion matrix:
Although we lose some true positives we gain substantially more true negatives.
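A minimal sketch of this comparison (assuming a fitted statsmodels Logit result res and the observed outcome vector y):

import numpy as np

pred = (res.predict() > 0.5).astype(int)       # full model: predict 1 if P(y=1) > 0.5
acc_full = np.mean(pred == y)                  # share of correctly predicted observations
acc_null = max(np.mean(y), 1 - np.mean(y))     # null model: always predict the most frequent outcome
print(acc_full, acc_null)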
Interpretation of coefficients
𝛽 > 0 and p-value below the significance level
→ positive influence on P(Y=1)
𝛽 < 0 and p-value below significance level
→ negative influence on P(Y=1)
p-value above significance level
→ we cannot reject H0
Interpretation of the following Logit regression result, Upper part
Dependent variable: premSub (Premium subscription, likely 0 = no, 1 = yes).
Pseudo R² = 0.029 → This is very low. It means the model explains only ~3% of the variation in subscription behavior. Logistic regression R² values are usually lower than in linear regression, but this still suggests weak predictive power.
Log-Likelihood: −13437 (full model).
LL-Null: −13840 (null model, only intercept).
LLR p-value = 3.526e−174 → Extremely small → the model as a whole fits significantly better than the null model (so at least one predictor matters).
Interpretation of the following Logit regression result, Lower part
Intercept: Coefficient: −1.024 (significant, p < 0.001).
Income: Coefficient: 1.206e−05 (very small but highly significant, p < 0.001).
Average hours watched and satisfaction: not significant (p-value > 0.05)
Odds ratio
Interpretation of regression results becomes simpler if we consider not P(y=1) but the “odds”:
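Here odds = P(y=1) / (1 − P(y=1)), and the odds ratio for a one-unit increase in x_j is OR_j = e^{β_j}.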
Interpretation:
OR = 1: No effect. The odds of the outcome occurring are the same, regardless of the value of the predictor.
OR > 1: Positive association. A one-unit increase in the predictor variable increases the odds of the outcome occurring. For example, an OR of 1.5 means that for every one-unit increase in the predictor, the odds of the outcome increase by 50%.
OR < 1: Negative association. A one-unit increase in the predictor variable decreases the odds of the outcome. For example, an OR of 0.5 means that for every one-unit increase in the predictor, the odds of the outcome decrease by 50%.
-> the odds increase by 49% for each unit increase in x_1.
Interpret the following result:
Model Fit:
premSub is binary -> either 0 or 1 -> we apply the logistic function -> Logit regression result
Pseudo R² = 0.029
→ The predictors explain about 2.9% of the variation in the log-odds of subscribing. This is weak explanatory power.
Log-Likelihood
Null model: -13840
Fitted model: -13437
→ The improvement is significant (goal: get closer to zero)
Likelihood Ratio (LR) test p-value = 3.5e-174
→ The model overall is statistically significant (at least one predictor helps explain premSub).
Coefficients:
Intercept = -1.0240, p < 0.001
→ When all predictors = 0, the log-odds of subscribing is -1.02 (probability < 0.5).
Income = 1.206e-05, p < 0.001
→ Statistically significant.
→ Each unit increase in income increases the log-odds of subscribing by 0.00001206.
→ Since income is probably measured in whole currency units, this is tiny per unit, but grows with scale.
→ In odds ratio terms: e^{0.00001206} ≈ 1.000012.
A 10,000-unit increase in income → odds of subscribing increase by about 12%.
avgHoursWatched = 0.0137, p = 0.194
→ Not statistically significant.
→ Watching more hours does not meaningfully predict premium subscription.
Satisfaction = 0.0083, p = 0.501
→ Satisfaction scores also do not explain subscription behavior here.
Confidence Intervals:
Income CI = [1.1e-05, 1.31e-05] → clearly positive, reinforcing significance.
Hours watched CI includes 0 → not significant.
Satisfaction CI includes 0 → not significant.
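A minimal sketch of how this logit output and the corresponding odds ratios could be produced (same assumed DataFrame and column names as above):

import numpy as np
import statsmodels.formula.api as smf

logit_res = smf.logit("premSub ~ income + avgHoursWatched + satisfaction", data=df).fit()
print(logit_res.summary())        # Pseudo R-squared, log-likelihood, LLR p-value, coefficients, CIs
print(np.exp(logit_res.params))   # odds ratios, e.g. exp(1.206e-05) for income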