2_examining your data

by Immanuel F.

Explain what “missing data” means and when it typically occurs.

Missing data = information that is not available for a subject while other information about that subject is available. It typically occurs when a respondent fails to answer one or more questions in a survey

What are the two types of missing data?

Systematic missing data (e.g. a survey asks how much money somebody earns => not everybody will answer these types of private questions => systematic missing data)
Random missing data

What is the impact of missing data?

can distort results
Reduce sample size available for analysis

Name the 4-step-process for identifying missing data

determine the type of missing data
Determine the extent of missing data
Diagnose the randomness of the missing data
Select the imputation method

When you determine the type of missing data, there are two types - name and explain them

ignorable missing data
- Questionnaire design - not all questions apply to everyone (=> questions about children, but you don't have kids => you don't answer them)
Not ignorable missing data
- Errors in data entry (known process)
- Refusal to respond to certain questions (unknown process)

Name the rules of thumb for determining the extent of missing data

missing data under 10% for an individual case or observation can generally be ignored
Number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted for the missing data
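Both rules of thumb can be checked with a few lines of Python. This is only a minimal sketch on made-up survey rows, where `None` marks a missing answer; the names `rows`, `missing_share` and `flagged` are illustrative and not part of the course material.

```python
# Hypothetical survey data: each dict is one respondent, None = missing answer.
rows = [
    {"age": 34, "income": 52000, "children": 2, "city": "Bonn"},
    {"age": 29, "income": None, "children": 0, "city": "Kiel"},
    {"age": None, "income": None, "children": None, "city": "Ulm"},
]

def missing_share(row):
    """Return the fraction of variables that are missing for one case."""
    return sum(value is None for value in row.values()) / len(row)

shares = [missing_share(row) for row in rows]

# Cases with no missing data at all (what remains if nothing is imputed).
complete_cases = [row for row in rows if missing_share(row) == 0]

# Cases above the 10% rule of thumb deserve a closer look.
flagged = [i for i, share in enumerate(shares) if share > 0.10]
```

Whether the number of complete cases is "sufficient" still depends on the analysis technique you plan to use.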

What are the two types of missing data when you diagnose the randomness of the missing data - name them, explain them and give an example

missing at random (MAR)
- Missing values of Y depend on X, but not on Y
  - Example: Y = income // X = gender => Within each gender, the missing income data are random, but they occur at a much higher frequency for males than for females. As males may earn more than females, the observed Y is biased
Missing completely at random (MCAR)
- missing values of Y are completely random => the observed values of Y are a random sample of all Y values

Which of MAR and MCAR is truly random missing data?

MCAR is truly random

Describe the diagnostic test for randomness of the missing data

Partition your variable Y into 2 groups - missing and non-missing
Compare the averages of these groups on other variables
Significant differences indicate the possibility of a nonrandom missing data process
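As a rough sketch of this diagnostic, the snippet below splits hypothetical cases by whether income (Y) is missing and compares the two groups on age. Welch's t statistic is used here as one possible way to compare the averages; this choice, the data and the helper `welch_t` are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical data: (age, income); income is None when missing.
cases = [
    (25, 30000), (31, None), (38, 45000), (44, None),
    (52, 61000), (58, None), (29, 33000), (61, None),
]

# Step 1: partition on whether Y (income) is missing.
age_missing = [age for age, income in cases if income is None]
age_observed = [age for age, income in cases if income is not None]

def welch_t(a, b):
    """Welch's t statistic for the difference between two group means."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Step 2: compare the group averages on another variable (age).
t_stat = welch_t(age_missing, age_observed)
```

A t statistic far from zero (judged against a t distribution) would hint that missingness is related to age, i.e. that the data are not MCAR.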

Name the 4 imputation methods for missing data

hot or cold deck imputation
Case substitution
Mean substitution
Regression imputation

Explain the hot or cold deck imputation

hot deck => value comes from another observation in the sample that is deemed similar
cold deck => value is obtained from an external source
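A minimal hot deck sketch, assuming "similar" means same gender and closest age. That similarity rule, the data and the function name `hot_deck` are hypothetical; in practice you choose the matching variables yourself.

```python
# Hypothetical sample: income is missing for the last respondent.
sample = [
    {"age": 30, "gender": "f", "income": 41000},
    {"age": 45, "gender": "m", "income": 58000},
    {"age": 47, "gender": "m", "income": None},
]

def hot_deck(target, donors):
    """Copy income from the most similar donor (same gender, closest age).

    Raises ValueError if no donor with a known income matches.
    """
    candidates = [d for d in donors
                  if d["income"] is not None and d["gender"] == target["gender"]]
    donor = min(candidates, key=lambda d: abs(d["age"] - target["age"]))
    return donor["income"]

# Fill each missing income from a donor inside the same sample.
for person in sample:
    if person["income"] is None:
        person["income"] = hot_deck(person, sample)
```

For cold deck imputation the donor value would instead come from outside the sample, for example a prior study.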

Explain the case substitution

Entire observations with missing data are replaced by choosing another, nonsampled observation

Explain the mean substitution

Replaces missing values of a variable with mean value of that variable calculated from all valid observations
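In code, mean substitution is a two-step operation. The sketch below uses made-up income values with `None` for missing entries.

```python
from statistics import mean

incomes = [40000, None, 52000, 46000, None]  # hypothetical, None = missing

# Mean of all valid (non-missing) observations.
valid_mean = mean(x for x in incomes if x is not None)

# Replace every missing value with that mean.
imputed = [valid_mean if x is None else x for x in incomes]
```

The mean of the variable stays the same, but its variance shrinks because every imputed value sits exactly at the mean.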

Explain the regression imputation

Regression analysis is used to predict the missing values
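A small sketch of the idea, assuming a single predictor and ordinary least squares fitted on the complete cases only. The variables `hours` and `income` and their values are invented for illustration.

```python
# Hypothetical data: predict missing income (y) from hours worked (x).
hours = [10, 20, 30, 40, 50]
income = [15000, 25000, None, 45000, 55000]

# Fit y = intercept + slope * x by least squares on the complete pairs only.
pairs = [(x, y) for x, y in zip(hours, income) if y is not None]
n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
         / sum((x - mean_x) ** 2 for x, _ in pairs))
intercept = mean_y - slope * mean_x

# Replace each missing income with the model's prediction.
income_imputed = [intercept + slope * x if y is None else y
                  for x, y in zip(hours, income)]
```

With more predictors the same logic applies, only the regression model becomes a multiple regression.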

What are the rules of thumb for when to use which imputation method?

under 10% missing data
- Any of the imputation methods can be applied
10% - 20% missing data
- Increased presence of missing data makes the mean substitution not very useful
Over 20% missing data
- Regression method for MCAR situation
- Model based methods when MAR missing data occurs
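These rules can be written as a small lookup function. This is just one way to encode them: the function name `suggest_imputation`, the return strings and the treatment of the exact 10% and 20% boundaries are assumptions.

```python
def suggest_imputation(missing_share, mechanism):
    """Map the rules of thumb to a suggestion.

    missing_share: fraction of missing data, 0.0 to 1.0.
    mechanism: "MCAR" or "MAR" (only used above 20% missing data).
    """
    if missing_share < 0.10:
        return "any imputation method"
    if missing_share <= 0.20:
        return "prefer methods other than mean substitution"
    if mechanism == "MCAR":
        return "regression imputation"
    return "model-based method"
```

For example, `suggest_imputation(0.30, "MAR")` returns `"model-based method"`.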

What is an outlier?

An observation / response with a unique combination of characteristics identifiable as distinctly different from the other observations / responses

What are the reasons for outliers and what is their effect on analyses?

They occur due to a procedural error
Or due to an extraordinary event or observation
=> Outliers can have a strong effect on analyses

What kind of distributions are there in a multivariate analysis - explain them

normality
- Metric variables are normally distributed
- Statistical tests are invalid if the departure from normality is sufficiently large, because normality is required to use the t and F statistics
Nonnormality

The severity of nonnormality is based on the shape of the distribution and the sample size. Shape is determined by…

… Kurtosis:
- “Peakedness” or “flatness” of distribution
- Kurtosis value near zero = shape close to normal
- Positive value = distribution more peaked
- Negative value = distribution flatter
- Kurtosis value within +/- 1 = very good
- Kurtosis value within +/- 2 = acceptable
…Skewness:
- Balance of a distribution
- Value = 0 => distribution is symmetric
- Positive value = greater number of smaller values
- Negative value = greater number of larger values
- Skewness value within +/- 1 = very good
- Skewness value within +/- 2 = acceptable
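Both shape measures can be computed directly from their definitions. The sketch below uses the simple population formulas (standardized third and fourth moments); statistics packages usually apply small-sample corrections, so their values can differ slightly. The two data sets are made up.

```python
from statistics import mean

def skewness(data):
    """Population skewness: third standardized moment."""
    m = mean(data)
    n = len(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(data):
    """Fourth standardized moment minus 3, so a normal shape gives about 0."""
    m = mean(data)
    n = len(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 2, 3, 10]  # many small values, one large one

skew_symmetric = skewness(symmetric)      # symmetric data => skewness of 0
skew_right = skewness(right_tailed)       # long right tail => positive skewness
kurt_symmetric = excess_kurtosis(symmetric)  # evenly spread data => negative (flat)
```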

What do a positively and a negatively skewed distribution look like?

What do “homoscedasticity” and “heteroscedasticity” mean?

Homoscedasticity
- The errors in a regression model are always about the same size (constant variance), no matter how large the values of the explanatory variables are
  - Example: if you want to predict how much money people earn based on their working hours, the deviations from the prediction should stay uniform, whether someone works 5 h or 50 h a week.
Heteroscedasticity:
- The errors in a regression model are not always the same size. Instead they are larger or smaller depending on where you are in the data
  - Example: if people with higher incomes earn much more varied amounts than people with lower incomes, the errors are larger at high incomes and smaller at low incomes.

How can you test homoscedasticity?

Can be tested graphically with a scatter plot or with a box plot

Can be tested statistically with the Levene test
- The Levene test is used to assess whether the variances of a single metric variable are equal across any number of groups
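To show what the Levene test measures, here is a sketch that computes Levene's W statistic by hand, using absolute deviations from each group mean. The function name `levene_w` and the two groups are hypothetical; statistics packages provide the test ready-made, including the p-value.

```python
from statistics import mean

def levene_w(*groups):
    """Levene's W statistic using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Z values: absolute deviation of each observation from its group mean.
    z_groups = [[abs(y - mean(g)) for y in g] for g in groups]
    z_means = [mean(z) for z in z_groups]
    z_grand = mean(z for zg in z_groups for z in zg)
    # Spread of the group Z means around the grand Z mean.
    between = sum(len(zg) * (zm - z_grand) ** 2
                  for zg, zm in zip(z_groups, z_means))
    # Spread of the Z values inside their own groups.
    within = sum((z - zm) ** 2
                 for zg, zm in zip(z_groups, z_means) for z in zg)
    return (n_total - k) / (k - 1) * between / within

tight = [10, 11, 9, 10, 12, 8]   # small spread around 10
wide = [2, 18, 10, 25, -5, 10]   # large spread around 10
w = levene_w(tight, wide)
```

W is compared against an F distribution with (k - 1, N - k) degrees of freedom; a large W speaks against equal variances, i.e. against homoscedasticity.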

There is normality / homo- and heteroscedasticity and the last assumption is… => explain it

… linearity: the relationship between the variables can be represented by a straight line

How does nonlinearity affect correlation and the outcomes of multivariate analysis?

Correlations only represent linear relationships; nonlinear effects are not captured. This leads to an underestimation of the actual strength of the relationship between variables if a nonlinear relationship exists
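An extreme case makes the underestimation visible. In the sketch below y is completely determined by x (y = x squared), yet the Pearson correlation is zero because the relationship is not linear. The helper `pearson_r` and the data are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y depends perfectly on x, but not linearly

r = pearson_r(xs, ys)  # the positive and negative halves cancel out
```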

What should be done if a non linear relationship is detected in the data?

Data transformations should be applied to linearize the relationship so that it meets the assumptions of multivariate techniques
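As an illustration, assume a hypothetical exponential relationship between x and y. Taking the logarithm of y is one possible transformation that makes the relationship linear, which shows up as a higher correlation. The helper `pearson_r` is defined here only to keep the sketch self-contained.

```python
from math import log, sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5, 6]
ys = [2 ** x for x in xs]  # hypothetical exponential relationship

r_raw = pearson_r(xs, ys)                    # understates the relationship
r_log = pearson_r(xs, [log(y) for y in ys])  # log(y) is linear in x
```

Which transformation fits (log, square root, inverse, ...) depends on the shape of the relationship in your own data.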

Why are transformations used in multivariate analysis?

Correcting violations of statistical assumptions underlying multivariate techniques
Improving the correlation between variables to enhance analysis accuracy

What do you have to do to get normality in a distribution?

Last changed
9 months ago
