An introduction to regression
correlation => measures relationship between 2 variables
Regression => predict one variable from another
Regression analysis: we fit a model to our data and use it to predict values of the dependent variable from one or more independent variables
Outcome = (model) + error
Mathematical technique: method of least squares
Regression Line
Slope (gradient) of the line = b1
The point at which the line crosses the vertical axis of the graph = intercept = b0
Y = outcome we want to predict
X = participant's score on the predictor variable
b1 and b0 are the regression coefficients
Residual term (e) = represents the difference between the score predicted by the line for participant i and the score that participant i actually obtained
=> positive gradient => positive relationship
=> negative gradient => negative relationship
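Putting these definitions together, the regression line plus residual can be written compactly (subscript i indexes participants, matching the notation above):

```latex
Y_i = b_0 + b_1 X_i + e_i
```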
Method of least squares
Interested in the vertical differences between the line and the actual data because the line is our model —> we use it to predict values of Y from values of the X variable
In regression these differences are usually called residuals rather than deviations
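A minimal numpy sketch of the idea, on made-up data: the least-squares b1 and b0 are the values that make the sum of squared vertical residuals as small as possible.

```python
import numpy as np

# Illustrative data (hypothetical predictor X and outcome Y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.7])

# Least-squares estimates of the regression coefficients
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept

predicted = b0 + b1 * x       # values predicted by the line
residuals = y - predicted     # vertical differences between data and line

print(b0, b1)
print("sum of squared residuals:", np.sum(residuals ** 2))
```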
Assessing the goodness of fit
deviation = Σ(observed - model)^2
compare data to most basic model we can find
use this equation to calculate the fit of the most basic model (the mean) and then the fit of the best model —> if the best model is any good, it should fit the data significantly better than the basic model
=> sum of squared residuals (SSr) = represents the degree of inaccuracy when the best model is fitted to the data
Sums of squares (SSt, SSr, SSm)
SSt = total sum of squares: differences between the observed data and the mean of the outcome
SSr = residual sum of squares: differences between the observed data and the regression model
SSm = model sum of squares: differences between the mean of the outcome and the regression model
—> The improvement in prediction resulting from using the regression model rather than the mean is obtained by calculating the difference between SSt and SSr
=> this difference shows us the reduction in the inaccuracy of the model resulting from fitting the regression model to the data => improvement is the model sum of squares (SSm)
SSm large => then regression model is very different from using the mean to predict the outcome variable —> implies the regression model made a big improvement on how well the outcome variable can be predicted
SSm small, then using the regression model is a little better than using the mean
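Continuing the same kind of sketch (made-up data again), the three sums of squares follow directly from their definitions:

```python
import numpy as np

# Same hypothetical data as the least-squares sketch above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.7])
b1, b0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
predicted = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)     # SSt: observed data vs. the mean (basic model)
ssr = np.sum((y - predicted) ** 2)    # SSr: observed data vs. the regression model
ssm = sst - ssr                       # SSm: improvement from fitting the regression model

print(sst, ssr, ssm)
```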
Improvement of the model
=> R^2 represents the amount of variance in the outcome explained by the model (SSm) relative to how much variation there was to explain in the first place (SSt)
=> therefore, as a percentage, it represents the percentage of the variation in the outcome that can be explained by the model
R^2 = SSm / SSt
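Since SSt = SSm + SSr for a least-squares fit with an intercept, an equivalent form is:

```latex
R^2 = \frac{SS_M}{SS_T} = 1 - \frac{SS_R}{SS_T}
```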
F-Ratio
=> F is based upon the ratio of improvement due to the model (SSm) and the differences between the model and the observed data (SSr)
=> measure of how much the model has improved the prediction of the outcome compared to the level of inaccuracy of the model
=> a good model should have a large F-ratio (greater than 1 at least)
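More precisely, F compares mean squares, i.e. each sum of squares divided by its degrees of freedom (k = number of predictors, N = number of cases):

```latex
F = \frac{MS_M}{MS_R} = \frac{SS_M / k}{SS_R / (N - k - 1)}
```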
Assessing individual predictors
=> b1 = gradient of the regression line
value of b represents the change in the outcome resulting from a unit change in the predictor
regression coefficient of 0 means
a unit change in the predictor results in no change in the predicted value of the outcome
the gradient of the regression line is 0, meaning that the regression line is flat
Looking for b-values that differ significantly from 0
t-Test tests the H0 that the value of b is 0
if it is significant, we gain confidence in the hypothesis that the b-value is significantly different from 0 and that the predictor variable contributes significantly to our ability to estimate values of the outcome
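The test statistic is simply the estimated b divided by its standard error (with the same residual degrees of freedom as above):

```latex
t = \frac{b_{\text{observed}}}{SE_b}
```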
ANOVA / Model parameters
ANOVA tells us whether the model (overall) results in a significantly good degree of prediction of the outcome variable
HOWEVER —> ANOVA doesn't tell us about the individual contribution of variables in the model
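A sketch of how both pieces of information show up in practice, assuming statsmodels is available and using simulated data: the overall F-ratio assesses the model as a whole, the t-values assess each b individually.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: one outcome, one predictor
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)

X = sm.add_constant(x)            # adds the intercept column (b0)
fit = sm.OLS(y, X).fit()

print(fit.fvalue, fit.f_pvalue)   # overall F-ratio and its p-value (the ANOVA)
print(fit.params)                 # b0 and b1
print(fit.tvalues, fit.pvalues)   # t-test for each coefficient (H0: b = 0)
```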
Multiple Regression
—> each predictor has its own coefficient and the outcome variable is predicted from a combination of all the variables multiplied by their respective coefficients plus a residual term
Multiple R^2: square of the correlation between the observed values of Y and the values of Y predicted by the model
—> large R2 represents a large correlation between predicted and observed values of the outcome
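A small sketch with two hypothetical predictors: the coefficients come from a least-squares fit of the design matrix, and multiple R^2 is computed literally as the squared correlation between observed and predicted Y.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 - 0.3 * x2 + rng.normal(size=n)   # hypothetical outcome

# Design matrix: intercept plus the two predictors
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)        # least-squares b0, b1, b2
predicted = X @ coefs

multiple_r2 = np.corrcoef(y, predicted)[0, 1] ** 2   # squared correlation of observed vs. predicted
print(coefs, multiple_r2)
```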
Measure of fit
—> The big problem with R^2 is that it always goes up when more variables are added to the model
AIC (Akaike information criterion)
=> a measure of fit which penalizes the model for having more variables (like adjusted R^2)
a larger value of AIC indicates a worse fit, corrected for the number of variables
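A sketch of an AIC comparison, assuming statsmodels (its fitted OLS models expose .rsquared and .aic): adding a junk predictor nudges R^2 up but should not improve (lower) the AIC.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 150
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                 # unrelated "junk" predictor
y = 1.0 + 0.6 * x1 + rng.normal(size=n)

fit1 = sm.OLS(y, sm.add_constant(x1)).fit()
fit2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# R^2 almost always increases with the extra predictor; AIC penalizes it
print(fit1.rsquared, fit2.rsquared)
print(fit1.aic, fit2.aic)               # lower AIC = better fit after the penalty
```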
Outliers and residuals
an outlier is a case that differs substantially from the main trend of the data
outliers can bias your model because they affect the values of the estimated regression coefficients
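A rough sketch of spotting such a case by standardizing the residuals (the data and the |z| > 3 cut-off are illustrative rules of thumb, not from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.9 * x + rng.normal(scale=1.0, size=50)
y[10] += 12.0                          # plant one case far from the main trend

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Standardized residuals: residuals divided by their standard deviation;
# cases with |z| above roughly 3 are worth inspecting
z = residuals / residuals.std(ddof=1)
print(np.where(np.abs(z) > 3)[0])      # should flag the planted case (index 10)
```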
Checking assumptions
Variable types (quantitative or categorical)
Non-zero variance (predictors should have some variance)
no perfect multicollinearity (=> predictor variables should not correlate too highly)
homoscedasticity —> at each level of the predictor variable, the variance of the residual terms should be constant
independent errors (uncorrelated)
normally distributed errors
independence
linearity
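A sketch of quick checks for a few of these assumptions, assuming matplotlib and statsmodels: a residuals-vs-fitted plot for homoscedasticity and linearity, a histogram of residuals for normality, and the Durbin-Watson statistic for independent errors.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(size=100)     # hypothetical data

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid, fitted = fit.resid, fit.fittedvalues

# Homoscedasticity / linearity: residuals vs. fitted values should look like a
# shapeless horizontal band (no funnel, no curve)
plt.scatter(fitted, resid)
plt.axhline(0)
plt.xlabel("fitted values"); plt.ylabel("residuals")
plt.show()

# Normally distributed errors: the residuals should look roughly bell-shaped
plt.hist(resid, bins=20)
plt.show()

# Independent errors: Durbin-Watson close to 2 suggests uncorrelated residuals
print(durbin_watson(resid))
```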
Multicollinearity
exists when there is a strong correlation between two or more predictors in a regression model
poses a problem only for multiple regression
perfect collinearity exists when at least one predictor is a perfect linear combination of the others
as collinearity increases, 3 problems arise
Untrustworthy b's
Limits the size of R
importance of predictors
=> one way of identifying multicollinearity is to scan a correlation matrix of the predictors (values of r above .8-.9 are concerning)
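A sketch of that scan with three hypothetical predictors, two of them deliberately near-duplicates; off-diagonal correlations above the rule-of-thumb cut-off get flagged.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1 -> near-perfect collinearity
x3 = rng.normal(size=n)

predictors = np.column_stack([x1, x2, x3])
corr = np.corrcoef(predictors, rowvar=False)   # correlation matrix of the predictors
print(np.round(corr, 2))

# Flag pairs of predictors whose correlation exceeds the rule-of-thumb cut-off
high = np.argwhere(np.triu(np.abs(corr) > 0.8, k=1))
print(high)                                    # expect the (x1, x2) pair here
```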