An introduction to regression
correlation => measures relationship between 2 variables
Regression => predict one variable from another
Regression analysis: we fit a model to our data and use it to predict values of the dependent variable from one or more independent variables
Outcome = (model) + error
Mathematical technique: method of least squares
Regression Line
Slope (gradient) of the line = b1
The point at which the line crosses the vertical axis of the graph = intercept = b0
Y = outcome we want to predict
X = participant's score on the predictor variable
b1 and b0 are the regression coefficients
Residual term (e) = represents the difference between the score predicted by the line for participant i and the score that participant i actually obtained
=> positive gradient => positive relationship
=> negative gradient => negative relationship
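Putting these definitions together, the regression line plus residual can be written compactly (subscript i indexes participants, matching the notation above):

```latex
Y_i = b_0 + b_1 X_i + e_i
```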
Method of least squares
Interested in the vertical differences between the line and the actual data because the line is our model —> we use it to predict values of Y from values of the X variable
In regression these differences are usually called residuals rather than deviations
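A minimal numpy sketch of the idea, on made-up data: the least-squares b1 and b0 are the values that make the sum of squared vertical residuals as small as possible.

```python
import numpy as np

# Illustrative data (hypothetical predictor X and outcome Y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.7])

# Least-squares estimates of the regression coefficients
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept

predicted = b0 + b1 * x       # values predicted by the line
residuals = y - predicted     # vertical differences between data and line

print(b0, b1)
print("sum of squared residuals:", np.sum(residuals ** 2))
```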
Assessing the goodness of fit
deviation = Σ(observed - model)^2
compare data to most basic model we can find
use this equation to calculate the fit of the most basic model (the mean) and then the fit of the best model —> if the best model is any good, it should fit the data significantly better than the basic model
=> sum of squared residuals (SSr) = represents the degree of inaccuracy when the best model is fitted to the data
Sums of squares (SSt, SSr, SSm)
SSt = total sum of squares: differences between the observed data and the mean of the outcome
SSr = residual sum of squares: differences between the observed data and the regression model
SSm = model sum of squares: differences between the mean of the outcome and the regression model
—> The improvement in prediction resulting from using the regression model rather than the mean is obtained by calculating the difference between SSt and SSr
=> this difference shows us the reduction in the inaccuracy of the model resulting from fitting the regression model to the data => improvement is the model sum of squares (SSm)
SSm large => then regression model is very different from using the mean to predict the outcome variable —> implies the regression model made a big improvement on how well the outcome variable can be predicted
SSm small, then using the regression model is a little better than using the mean
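Continuing the same kind of sketch (made-up data again), the three sums of squares follow directly from their definitions:

```python
import numpy as np

# Same hypothetical data as the least-squares sketch above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.7])
b1, b0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
predicted = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)     # SSt: observed data vs. the mean (basic model)
ssr = np.sum((y - predicted) ** 2)    # SSr: observed data vs. the regression model
ssm = sst - ssr                       # SSm: improvement from fitting the regression model

print(sst, ssr, ssm)
```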
Improvement of the model
=> R^2 represents the amount of variance in the outcome explained by the model (SSm) relative to how much variation there was to explain in the first place (SSt)
=> therefore, as a percentage, it represents the percentage of the variation in the outcome that can be explained by the model
R^2 = SSm / SSt
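Since SSt = SSm + SSr for a least-squares fit with an intercept, an equivalent form is:

```latex
R^2 = \frac{SS_M}{SS_T} = 1 - \frac{SS_R}{SS_T}
```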
F-Ratio
=> F is based upon the ratio of improvement due to the model (SSm) and the differences between the model and the observed data (SSr)
=> measure of how much the model has improved the prediction of the outcome compared to the level of inaccuracy of the model
=> a good model should have a large F-ratio (greater than 1 at least)
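More precisely, F compares mean squares, i.e. each sum of squares divided by its degrees of freedom (k = number of predictors, N = number of cases):

```latex
F = \frac{MS_M}{MS_R} = \frac{SS_M / k}{SS_R / (N - k - 1)}
```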
Assessing individual predictors
=> b1 = gradient of the regression line
value of b represents the change in the outcome resulting from a unit change in the predictor
regression coefficient of 0 means
a unit change in the predictor results in no change in the predicted value of the outcome
the gradient of the regression line is 0, meaning that the regression line is flat
Looking for b-values that differ significantly from 0
t-Test tests the H0 that the value of b is 0
if it is significant, we gain confidence in the hypothesis that the b-value is significantly different from 0 and that the predictor variable contributes significantly to our ability to estimate values of the outcome
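The test statistic is simply the estimated b divided by its standard error (with the same residual degrees of freedom as above):

```latex
t = \frac{b_{\text{observed}}}{SE_b}
```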
ANOVA / Model parameters
ANOVA tells us whether the model (overall) results in a significantly good degree of prediction of the outcome variable
HOWEVER —> ANOVA doesn't tell us about the individual contribution of variables in the model
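A sketch of how both pieces of information show up in practice, assuming statsmodels is available and using simulated data: the overall F-ratio assesses the model as a whole, the t-values assess each b individually.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: one outcome, one predictor
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)

X = sm.add_constant(x)            # adds the intercept column (b0)
fit = sm.OLS(y, X).fit()

print(fit.fvalue, fit.f_pvalue)   # overall F-ratio and its p-value (the ANOVA)
print(fit.params)                 # b0 and b1
print(fit.tvalues, fit.pvalues)   # t-test for each coefficient (H0: b = 0)
```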
Multiple Regression
—> each predictor has its own coefficient and the outcome variable is predicted from a combination of all the variables multiplied by their respective coefficients plus a residual term
Multiple R^2: square of the correlation between the observed values of Y and the values of Y predicted by the model
—> large R2 represents a large correlation between predicted and observed values of the outcome
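A small sketch with two hypothetical predictors: the coefficients come from a least-squares fit of the design matrix, and multiple R^2 is computed literally as the squared correlation between observed and predicted Y.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 - 0.3 * x2 + rng.normal(size=n)   # hypothetical outcome

# Design matrix: intercept plus the two predictors
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)        # least-squares b0, b1, b2
predicted = X @ coefs

multiple_r2 = np.corrcoef(y, predicted)[0, 1] ** 2   # squared correlation of observed vs. predicted
print(coefs, multiple_r2)
```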
Measure of fit
—> The big problem with R^2 is that it always goes up when more variables are added to the model
AIC (Akaike information criterion)
=> a measure of fit which penalizes the model for having more variables (like adjusted R^2)
a larger value of AIC indicates a worse fit, corrected for the number of variables
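A sketch of an AIC comparison, assuming statsmodels (its fitted OLS models expose .rsquared and .aic): adding a junk predictor nudges R^2 up but should not improve (lower) the AIC.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 150
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                 # unrelated "junk" predictor
y = 1.0 + 0.6 * x1 + rng.normal(size=n)

fit1 = sm.OLS(y, sm.add_constant(x1)).fit()
fit2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# R^2 almost always increases with the extra predictor; AIC penalizes it
print(fit1.rsquared, fit2.rsquared)
print(fit1.aic, fit2.aic)               # lower AIC = better fit after the penalty
```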
Outliers and residuals
an outlier is a case that differs substantially from the main trend of the data
outliers can bias your model because they affect the values of the estimated regression coefficients
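A rough sketch of spotting such a case by standardizing the residuals (the data and the |z| > 3 cut-off are illustrative rules of thumb, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.9 * x + rng.normal(scale=1.0, size=50)
y[10] += 12.0                          # plant one case far from the main trend

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Standardized residuals: residuals divided by their standard deviation;
# cases with |z| above roughly 3 are worth inspecting
z = residuals / residuals.std(ddof=1)
print(np.where(np.abs(z) > 3)[0])      # should flag the planted case (index 10)
```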
Checking assumptions
Variable types (quantitative or categorical)
Non-zero variance (predictors should have some variance)
no perfect multicollinearity (=> predictor variables should not correlate too highly)
homoscedasticity —> at each level of the predictor variable, the variance of the residual terms should be constant
independent errors (uncorrelated)
normally distributed errors
independence
linearity
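A sketch of quick checks for a few of these assumptions, assuming matplotlib and statsmodels: a residuals-vs-fitted plot for homoscedasticity and linearity, a histogram of residuals for normality, and the Durbin-Watson statistic for independent errors.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(size=100)     # hypothetical data

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid, fitted = fit.resid, fit.fittedvalues

# Homoscedasticity / linearity: residuals vs. fitted values should look like a
# shapeless horizontal band (no funnel, no curve)
plt.scatter(fitted, resid)
plt.axhline(0)
plt.xlabel("fitted values"); plt.ylabel("residuals")
plt.show()

# Normally distributed errors: the residuals should look roughly bell-shaped
plt.hist(resid, bins=20)
plt.show()

# Independent errors: Durbin-Watson close to 2 suggests uncorrelated residuals
print(durbin_watson(resid))
```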
Multicollinearity
exists when there is a strong correlation between two or more predictors in a regression model
poses a problem only for multiple regression
perfect collinearity exists when at least one predictor is a perfect linear combination of the others
as collinearity increases, 3 problems arise
Untrustworthy b's
Limits the size of R
importance of predictors
=> one way of identifying multicollinearity is to scan a correlation matrix of the predictors (values of r above .8-.9 are concerning)
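A sketch of that scan with three hypothetical predictors, two of them deliberately near-duplicates; off-diagonal correlations above the rule-of-thumb cut-off get flagged.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1 -> near-perfect collinearity
x3 = rng.normal(size=n)

predictors = np.column_stack([x1, x2, x3])
corr = np.corrcoef(predictors, rowvar=False)   # correlation matrix of the predictors
print(np.round(corr, 2))

# Flag pairs of predictors whose correlation exceeds the rule-of-thumb cut-off
high = np.argwhere(np.triu(np.abs(corr) > 0.8, k=1))
print(high)                                    # expect the (x1, x2) pair here
```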