Separate the data into two datasets, e.g. by a dummy variable with values 0 and 1
> ifemale <- which(data$group==0)   # row indices where dummy == 0
> data_female <- data[ifemale,]     # subset for group 0 (female)
> imale <- which(data$group==1)     # row indices where dummy == 1
> data_male <- data[imale,]         # subset for group 1 (male)
Check residuals vs. leverage plot
rule of thumb (Cook’s distance)
preparation (code)
identification of leveraged observations
Rule of thumb: an observation has high influence if its Cook's distance exceeds 4/(n - p - 1)
n <- dim(data)[1]     # number of observations
p <- dim(data)[2]-1   # number of predictors (all columns except the response)
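A minimal sketch of the identification step, assuming the fitted model is named linreg_clean as in the plot call below:
cd <- cooks.distance(linreg_clean)   # Cook's distance per observation
cutoff <- 4/(n - p - 1)              # rule of thumb from above
which(cd > cutoff)                   # indices of high-influence observations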
plot(linreg_clean) -> even though these points have the highest leverage, that does not mean they are outliers
ANOVA
What is it used for?
Formula
Interpretation
Multicollinearity
anova(reg)
Depending on the significance level
Variable is significant
Variable is not significant
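A short sketch of this interpretation, assuming a fitted model named reg:
pvals <- anova(reg)[["Pr(>F)"]]   # p-value per predictor (the Residuals row has none)
pvals < 0.05                      # TRUE = variable is significant at the 5% level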
Jarque-Bera Test
Normality
library(tseries)
jarque.bera.test(summary(reg)$residuals)
Hypothesis test
H0: Errors follow a normal distribution
H1: Errors don't follow a normal distribution
F-Test
var.test(x, y)   # x, y: the two samples to compare
H0: population variances are equal
H1: population variances are not equal
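For example, applied to the split from the beginning (the column name sales is a placeholder):
var.test(data_female$sales, data_male$sales)   # F-test of equal variances between the two groups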
Durbin-Watson Test
Autocorrelation
library(car)
durbinWatsonTest(reg)
Hypothesis tests
H0: residuals are not autocorrelated
H1: residuals are positively autocorrelated
Breusch-Pagan Test
Heteroscedasticity
library(lmtest)
bptest(reg)
H0: data is not heteroscedastic (homoscedastic)
H1: data is heteroscedastic
Boxplot
boxplot(data)
Correlation Formulas
Correlation Matrix
Correlation Plots
cor(data)
pairs(data) -> pairwise scatterplots of all variables
VIF
library(car)
vif(reg)
a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity
VIF barplot incl. line
barplot(vif(reg), horiz = TRUE)
abline(v = 5, lwd = 3, lty = 2)
v: add vertical line at x = 5
lwd: define linewidth
lty: define line type
clean data from outliers
Option: Boxplot
hout <- boxplot(data)$out                # values flagged as outliers by the boxplot
iout <- which(data$variable %in% hout)   # row indices of those values (%in% handles multiple outliers)
data_clean <- data[-iout,]               # drop the outlier rows
Option: Scatterplot - Maximum point
iout <- which.max(data$variable)   # index of the maximum point
data_clean <- data[-iout,]
stepAIC
Formula (incl. arguments)
Variable selection
library(MASS)
stepAIC(reg, direction="backward")
backward: removes predictors sequentially from the given model with decreasing complexity
forward: adds predictors sequentially to the given model with increasing complexity
both: a forward-backward search that, at each step, decides whether to include or exclude a predictor
the last displayed model contains the selected variables; the removed variables are those that did not improve the model fit (their removal lowered the AIC)
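A minimal sketch, assuming a fitted full model named reg:
library(MASS)
selected <- stepAIC(reg, direction = "both")   # prints each inclusion/exclusion step
summary(selected)                              # the final selected model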
Steps for Transformation
Transform data & plot in transformed space
> logreg=lm(log(sales)~log(price),data)
> plot(log(data$price),log(data$sales), pch = 16, col = "blue")
> abline(logreg,col="red")
Plot in original space
> plot(data$price,data$sales, pch = 16, col = "blue")
> x=seq(from=min(data$price),to=max(data$price),by=0.01)
> y=predict(logreg,list(price=x))   # predictions on the log scale
> matlines(x,exp(y),col="red")      # back-transform with exp() for the original scale
Plot the residuals
Analyze R²
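A short sketch for these two steps, assuming the logreg model from above:
plot(fitted(logreg), residuals(logreg))   # residuals vs. fitted values in the transformed space
summary(logreg)$r.squared                 # R² of the transformed model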
Color Coding
plot(x,y,col=data$x2+3)   # +3 shifts the 0/1 values to two visible palette colors
data$x2: the variable by which the points should be separated by color
Confidence interval with predicted variables
How would the arguments change if we wanted a prediction interval instead of a confidence interval?
> predict(lm, newdata = data.frame(income=50, competitor=3, mallvisitors=2000), interval="confidence")
> predict(lm, newdata = data.frame(income=50, competitor=3, mallvisitors=2000), interval="prediction")
CI plot
formula
what does it mean if observations are outside the prediction interval?
library(HH)
ci.plot(lm,conf.level=0.90)
Prediction is too optimistic
Name the Seven Assumptions
Linearity
Strict Exogeneity
No Correlation of X and the Errors
No Correlation of the Errors
No Perfect Multicollinearity
Homoscedasticity
Normality of the Error Terms
Multicollinearity - Tests
Correlation matrix (cor)
VIF
Non-Normality of the Error Terms - Tests
Analyze the QQ-plot
Jarque-Bera Test
Heteroscedasticity - Tests
Breusch-Pagan Test
Autocorrelation - Tests
Durbin-Watson Test
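A compact sketch running the tests listed above on one fitted model reg (packages as introduced earlier):
library(car); library(tseries); library(lmtest)
vif(reg)                           # multicollinearity
jarque.bera.test(residuals(reg))   # normality of the error terms
bptest(reg)                        # heteroscedasticity
durbinWatsonTest(reg)              # autocorrelation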