Explain what “missing data” means and when it typically occurs.
Missing data = information that is not available for a subject while other information about that subject is available. It typically occurs when a respondent fails to answer one or more questions in a survey
What are the two types of missing data?
Systematic missing data (e.g. a survey asks how much money somebody earns => not everybody will answer these types of private questions => systematic missing data)
Random missing data
What is the impact of missing data?
can distort results
Reduce sample size available for analysis
Name the 4-step-process for identifying missing data
determine the type of missing data
Determine the extent of missing data
Diagnose the randomness of the missing data
Select the imputation method
When you determine the type of missing data, there are two types - name and explain them
ignorable missing data
Questionnaire design - not all questions are answered (=> question about children, but you don't have kids => you don't answer it)
Not ignorable missing data
Errors in data entry (known process)
Refusal to respond to certain questions (unknown process)
Name the rules of thumb for determining the extent of missing data
missing data under 10% for an individual case or observation can generally be ignored
Number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted for the missing data
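A minimal sketch of how the extent of missing data could be measured, per variable, per case, and as the number of complete cases. The survey rows and all function names are hypothetical and only serve as an illustration.

```python
# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"age": 34, "income": 52000, "children": 2},
    {"age": 29, "income": None, "children": 0},
    {"age": 41, "income": 61000, "children": None},
    {"age": 55, "income": None, "children": 3},
]

def missing_share_by_variable(rows):
    """Fraction of missing values for each variable (column)."""
    return {
        name: sum(r[name] is None for r in rows) / len(rows)
        for name in rows[0]
    }

def missing_share_by_case(rows):
    """Fraction of missing values for each case (row)."""
    return [sum(v is None for v in r.values()) / len(r) for r in rows]

def complete_cases(rows):
    """Only the rows that have no missing value at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

by_variable = missing_share_by_variable(rows)
by_case = missing_share_by_case(rows)
n_complete = len(complete_cases(rows))
```

The shares can then be compared with the 10% rule of thumb, and `n_complete` shows whether enough complete cases remain for the chosen technique.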
What are the two types of missing data when you diagnose the randomness of the missing data - name them, explain them and give an example
missing at random (MAR)
Missing values of Y depend on X, but not on Y
Example: Y = income // X = gender => Within each gender, missing income values are random, but they occur at a much higher frequency for males than for females. As males may earn more than females, the observed Y is biased
Missing completely at random (MCAR)
missing values of Y are completely random => the observed values of Y are a random sample of all Y values
Which of MAR and MCAR are truly random missing data?
MCAR is truly random
Describe the diagnostic test for randomness of the missing data
Partition the cases into 2 groups - Y missing and Y not missing
Compare the averages of these groups on other variables
Significant differences indicate the possibility of a nonrandom missing data process
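The diagnostic steps above can be sketched as follows: split the cases by whether Y is missing and compare the group means of another variable X. The data, the function names and the use of Welch's t statistic are assumptions for illustration; in practice a statistics package would also give the p-value (e.g. `scipy.stats.ttest_ind`).

```python
from math import sqrt
from statistics import mean, variance

def split_by_missingness(rows, y, x):
    """Values of x, split by whether y is missing in the same row."""
    x_when_missing = [r[x] for r in rows if r[y] is None and r[x] is not None]
    x_when_present = [r[x] for r in rows if r[y] is not None and r[x] is not None]
    return x_when_missing, x_when_present

def welch_t(a, b):
    """Welch's t statistic for the difference between two group means."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical data: income is missing mostly for older respondents.
rows = [
    {"income": None, "age": 52}, {"income": None, "age": 58},
    {"income": None, "age": 61}, {"income": None, "age": 55},
    {"income": 41000, "age": 31}, {"income": 38000, "age": 27},
    {"income": 45000, "age": 35}, {"income": 39000, "age": 29},
]

age_missing, age_present = split_by_missingness(rows, y="income", x="age")
t = welch_t(age_missing, age_present)
```

A large absolute `t` (here the two age groups differ clearly) points to a nonrandom missing data process; a value near zero is consistent with MCAR on that variable.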
Name the 4 imputation methods for missing data
hot or cold deck imputation
Case substitution
Mean substitution
Regression imputation
Explain the hot or cold deck imputation
hot deck => value comes from another observation in the sample that is deemed similar
cold deck => value is obtained from an external source
Explain the case substitution
An entire observation with missing data is replaced by another, nonsampled observation
Explain the mean substitution
Replaces missing values of a variable with mean value of that variable calculated from all valid observations
Explain the regression imputation
Regression analysis is used to predict the missing values
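A minimal sketch contrasting mean substitution and regression imputation on the same data. The variables (`hours`, `income`), the numbers and the function names are hypothetical, and the regression is a simple one-predictor least-squares fit chosen for brevity.

```python
from statistics import mean

def mean_substitution(values):
    """Replace None with the mean of all valid observations."""
    valid_mean = mean(v for v in values if v is not None)
    return [valid_mean if v is None else v for v in values]

def regression_imputation(x, y):
    """Fit y = a + b*x on complete pairs, then predict the missing y values."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean(xi for xi, _ in pairs)
    my = mean(yi for _, yi in pairs)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
             / sum((xi - mx) ** 2 for xi, _ in pairs))
    intercept = my - slope * mx
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]

# Hypothetical data: weekly hours worked and income, last income missing.
hours = [10, 20, 30, 40, 50]
income = [100, 200, 300, 400, None]

by_mean = mean_substitution(income)
by_regression = regression_imputation(hours, income)
```

Mean substitution fills the gap with 250 regardless of the hours worked, while the regression uses the relationship with `hours` and predicts 500 for the same case.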
What are the rules of thumb for choosing the imputation method?
under 10% missing data
Any of the imputation methods can be applied
10% - 20% missing data
Increased presence of missing data makes the mean substitution not very useful
Over 20% missing data
Regression method for MCAR situation
Model based methods when MAR missing data occurs
What is an outlier
An observation / response with a unique combination of characteristics identifiable as distinctly different from the other observations / responses
What are the reasons for outliers and what is their effect on analyses?
Occur due to a procedural error
or
an extraordinary event or observation
=> can have a strong effect on analyses
What kind of distributions are there in a multivariate analysis - explain them
normality
Metric variables are normally distributed
Statistical tests are invalid if the departure from normality is sufficiently large, because normality is required to use the t and F statistics
Nonnormality
The severity of nonnormality is based on the shape of the distribution and sample size, shape is determined by…
… Kurtosis:
“Peakedness” or “flatness” of distribution
Kurtosis value near zero = shape close to normal
+ value = distribution more peaked
- value = distribution flatter
Kurtosis value within +/- 1 = very good
Kurtosis value within +/- 2 = acceptable
…Skewness:
Balance of a distribution
Value = 0 => distribution is symmetric
Positive skewness = greater number of smaller values
Negative skewness = greater number of larger values
Skewness value within +/- 1 = very good
Skewness value within +/- 2 = acceptable
What do a positively and a negatively skewed distribution look like?
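A minimal sketch of how skewness and kurtosis could be computed and rated against the +/- 1 and +/- 2 rules of thumb. It uses the simple moment-based formulas with excess kurtosis (normal distribution = 0); statistical packages often apply bias-corrected sample formulas, so their values can differ slightly. The data, function names and the label for values beyond +/- 2 are assumptions.

```python
def central_moment(data, k):
    """k-th central moment of the data (population version)."""
    m = sum(data) / len(data)
    return sum((v - m) ** k for v in data) / len(data)

def skewness(data):
    """Moment-based skewness: 0 for a symmetric distribution."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal shape is near 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

def rate(value):
    """Rate a skewness or kurtosis value with the rules of thumb."""
    if abs(value) <= 1:
        return "very good"
    if abs(value) <= 2:
        return "acceptable"
    return "outside the acceptable range"  # wording is an assumption

symmetric = [1, 2, 3, 4, 5]            # flat and symmetric
right_skewed = [1, 1, 1, 2, 2, 3, 10]  # many small values, one large value

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)
kurt_symmetric = excess_kurtosis(symmetric)
```

The flat, symmetric sample gives a skewness of 0 and a negative kurtosis, and the sample with many small values gives a positive skewness, matching the sign conventions listed above.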
What do “homoscedasticity” and “heteroscedasticity” mean?
Homoscedasticity
The errors in a regression model have the same spread, no matter how large the values of the explanatory variables are
Example: if you want to predict how much money people earn based on their working hours, the deviations from the prediction should stay evenly spread, whether someone works 5 h or 50 h per week.
Heteroscedasticity:
The errors in a regression model do not have the same spread. Instead, they are larger or smaller depending on where you are in the data
Example: if people with higher incomes earn far more varied amounts than people with lower incomes, the errors are larger at high incomes and smaller at low incomes.
How can you test homoscedasticity?
Can be tested graphically with a scatter plot or with a box plot
Can be tested statistically with the Levene test
The Levene test is used to assess whether the variances of a single metric variable are equal across any number of groups
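A minimal sketch of the Levene statistic W, computed by hand to show what the test measures: the absolute deviations from each group mean are compared across groups, like in a one-way ANOVA. The groups and the function name are hypothetical. W is compared against an F distribution with k-1 and N-k degrees of freedom; in practice a library routine such as `scipy.stats.levene` (with `center='mean'` for this variant) also returns the p-value.

```python
from statistics import mean

def levene_w(*groups):
    """Levene's W using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every value from its own group mean.
    z = [[abs(v - mean(g)) for v in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = mean(v for zi in z for v in zi)
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical groups with the same mean but very different spread.
narrow = [9, 10, 11, 10, 10]
wide = [2, 18, 10, 5, 15]
w_unequal = levene_w(narrow, wide)

# Two groups with identical spread around different means.
w_equal = levene_w([1, 2, 3], [11, 12, 13])
```

A large W (first pair) suggests unequal variances, i.e. heteroscedasticity, while W = 0 (second pair) means the groups have exactly the same spread.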
There is normality / homo- and heteroscedasticity and the last assumption is… => explain it
… linearity (straight line)
How does nonlinearity affect correlation and the outcomes of multivariate analysis
Correlations only represent linear relationships; nonlinear effects are not captured. This leads to an underestimation of the actual strength of the relationship between variables if a nonlinear relationship exists
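A small illustration of this underestimation, using a hand-written Pearson correlation and hypothetical data: y depends perfectly on x, but the relationship is quadratic, so the linear correlation misses it completely.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]  # y is fully determined by x, but not linearly

r = pearson_r(x, y)
```

Here `r` is 0 although y can be predicted exactly from x, which is the extreme case of a correlation understating a nonlinear relationship.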
What should be done if a nonlinear relationship is detected in the data?
Data transformations should be applied to linearize the relationship, so that the assumptions of multivariate techniques are met
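A minimal sketch of such a transformation, assuming a hypothetical exponential relationship: taking the logarithm of y turns the curve into a straight line, and the Pearson correlation rises accordingly. The log transform is only one possible choice and depends on the shape of the relationship.

```python
from math import exp, log, sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5, 6]
y = [exp(v) for v in x]      # hypothetical exponential growth
log_y = [log(v) for v in y]  # transformed variable, linear in x

r_raw = pearson_r(x, y)
r_transformed = pearson_r(x, log_y)
```

On this data `r_raw` is roughly 0.85, while `r_transformed` is practically 1, because after the transformation the relationship is a straight line.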
Why are transformations used in multivariate analysis?
Correcting violations of statistical assumptions underlying multivariate techniques
Improving the correlation between variables to enhance analysis accuracy
What do you have to do to get normality in a distribution