Which kinds of variables exist?
categorical
quantitative
Kendall's coefficient (Kendall's tau)
= (number of concordant pairs − number of discordant pairs) / total number of unique pairs, where n observations give n(n − 1)/2 unique pairs
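The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming the simplest variant (often called tau-a), in which tied pairs count as neither concordant nor discordant; the function name `kendall_tau` and the data are made up for this example.

```python
from itertools import combinations

def kendall_tau(x, y):
    """(concordant pairs - discordant pairs) / number of unique pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # both variables differ in the same direction
            concordant += 1
        elif product < 0:    # the variables differ in opposite directions
            discordant += 1
        # product == 0 is a tie and counts as neither
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# Made-up data: 8 concordant pairs, 2 discordant pairs, 10 pairs in total
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau(x, y)
print(tau)
```

With these five observations there are 5 · 4 / 2 = 10 unique pairs, so the coefficient is (8 − 2) / 10.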
What are categorical variables with only two categories called?
binary variables
dummy variables (take the value 0 or 1)
What are ordinal variables?
Ordinal variables can be ranked (unlike nominal categorical variables) but have no unit (unlike quantitative/numerical variables).
Which graphs can we use for categorical variables?
pie chart
barplot
What are the (dis)advantages of a pie chart?
Pie charts may be better if the audience has less experience with statistics. Bar plots are often preferred by a technically skilled audience. In a bar plot, it is easier to see which group is larger, especially if the bars are arranged in order of size.
Which graphs can we use for quantitative variables?
histogram/density plot
boxplot
What information can we read from a histogram?
mode
symmetry (skewness)
outliers
What does one need to do with outliers?
They need to be investigated, since they may be due to errors in the data.
If they are removed, this needs to be documented.
They can have a significant effect on the analysis.
How do we measure the centre of a distribution?
median
mean
How do we measure the spread of a distribution?
IQR (interquartile range)
standard deviation (variance)
What does a mosaic plot do?
A mosaic plot provides a more complete picture of a joint distribution than a regular bar chart. Each rectangle has an area that corresponds to the proportion of observations that the rectangle represents.
How can one illustrate two categorical variables?
mosaic plot
bar plots
What is a hidden (lurking) variable? Example?
Variables that might explain a relationship but are not part of the dataset.
E.g., if there is a relationship between more television sets in a country and higher life expectancy, is it because watching television is beneficial, or is there some other underlying variable at play?
What is Simpson’s paradox?
Simpson’s paradox means that a relationship between two variables can disappear, or even reverse, when the dataset is divided into different groups.
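A small numeric sketch can make the paradox concrete. The counts below are entirely made up for illustration, and they show the strongest form of the effect: the treatment looks better inside each group but worse when the groups are pooled, because the two arms are spread very unevenly over the groups.

```python
# Made-up counts: (successes, trials) per group and arm
counts = {
    "mild cases":   {"treatment": (9, 10),   "control": (80, 100)},
    "severe cases": {"treatment": (30, 100), "control": (2, 10)},
}

def rate(successes, trials):
    return successes / trials

# Success rate of each arm within each group
within = {
    group: {arm: rate(*pair) for arm, pair in arms.items()}
    for group, arms in counts.items()
}

# Success rate of each arm when the groups are pooled
overall = {}
for arm in ("treatment", "control"):
    successes = sum(counts[group][arm][0] for group in counts)
    trials = sum(counts[group][arm][1] for group in counts)
    overall[arm] = rate(successes, trials)

print(within)
print(overall)
```

Here the group (mild or severe) plays the role of the variable that must be taken into account before comparing the two arms.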
Which measures are visible in a boxplot?
Q1
median (Q2)
Q3
IQR
min
max
outliers (observations below Q1 − 1.5 · IQR or above Q3 + 1.5 · IQR)
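The boxplot measures listed above can be computed directly. The sketch below uses Python's standard `statistics` module on made-up data; the choice of `method="inclusive"` is an assumption, since different software packages use slightly different quartile conventions and may give slightly different values.

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]  # made-up observations

# Three cut points that divide the data into four parts: Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Observations outside these limits are drawn as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_limit or v > upper_limit]

print(q1, q2, q3, iqr, outliers)
```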
Explain Q1, Q2, Q3
• Q1: A value that is greater than 25% of the observations and less than the remaining 75% of the observations.
• Q2: A value that is greater than 50% of the observations and less than the remaining 50% of the observations. Q2 is the same as the median.
• Q3: A value that is greater than 75% of the observations and less than the remaining 25% of the observations.
What is a frequency table?
A frequency table shows the number of observations in each category.
What is a contingency table?
A cross table (contingency table) shows the relationship between two categorical variables by giving the number of observations for each combination of their categories.
What is a joint distribution?
A joint distribution divides observations into groups based on two or more variables.
What is a marginal distribution?
The marginal distribution of a categorical variable shows the number of observations per category without considering the other variable.
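What is a conditional distribution?
The conditional distribution of a categorical variable shows how the observations are distributed over its categories within a single category of the other variable.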
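Joint, marginal, and conditional distributions can all be read off the same table of counts. The sketch below is a minimal illustration on hypothetical data (smoking status and exercise habit are invented variables), using only the standard library.

```python
from collections import Counter

# Hypothetical observations: (smoking status, exercise habit)
data = [
    ("smoker", "active"), ("smoker", "inactive"), ("smoker", "inactive"),
    ("non-smoker", "active"), ("non-smoker", "active"),
    ("non-smoker", "active"), ("non-smoker", "inactive"),
    ("smoker", "inactive"),
]

# Joint distribution: counts per combination (the cells of a contingency table)
joint = Counter(data)

# Marginal distribution of smoking status: ignores the exercise variable
marginal_smoking = Counter(status for status, _ in data)

# Conditional distribution of exercise habit among smokers only
conditional_given_smoker = {
    exercise: count / marginal_smoking["smoker"]
    for (status, exercise), count in joint.items()
    if status == "smoker"
}

print(joint)
print(marginal_smoking)
print(conditional_given_smoker)
```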
What does the z-score measure?
The Z-value (z-score) measures the deviation from the mean for a variable, expressed in the number of standard deviations.
What do 0 and 1 mean for z-scores?
After standardization, the z-scores have mean 0 and standard deviation 1.
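A short sketch of the z-score calculation on made-up values. It assumes the population standard deviation (`statistics.pstdev`); with the sample standard deviation (`statistics.stdev`) the idea is the same, but the numbers differ slightly.

```python
import statistics

values = [10, 12, 14, 16, 18]  # made-up observations

mean = statistics.mean(values)
sd = statistics.pstdev(values)  # population standard deviation (an assumption)

# z-score: distance from the mean, expressed in standard deviations
z_scores = [(v - mean) / sd for v in values]

print(z_scores)
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

The observation equal to the mean gets z = 0, and the standardized values as a whole have mean 0 and standard deviation 1.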
What percentage of the observations does the area under the normal distribution curve represent?
The area under the normal distribution curve represents 100% of our observations.
Which names exist for the x- and y-variables?
There are several different names used for the x- and y-variables.
• The y-variable is sometimes called the response variable.
• The x-variable is sometimes called the explanatory variable.
• Another common name for the y-variable is the dependent variable, and the x-variable is then called the independent variable.
• Other commonly used names for the x-variable are predictor and covariate.
• In machine learning, explanatory variables are called features.
What do we use the correlation coefficient for?
The measure we use to assess the linear relationship between two variables is called the correlation coefficient.
Which values can the correlation coefficient take?
The correlation is always a number between −1 and 1.
If r is close to 1 or −1, it indicates a strong correlation. If r is close to 0, it indicates a weak correlation.
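The range of r can be checked with a small calculation. The sketch below implements the usual Pearson correlation formula by hand; the function name `pearson_r` and all data are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_perfect_positive = pearson_r(x, [2, 4, 6, 8, 10])   # points on a rising line
r_perfect_negative = pearson_r(x, [10, 8, 6, 4, 2])   # points on a falling line
r_noisy = pearson_r(x, [2, 1, 4, 3, 5])               # rising, but not exactly

print(r_perfect_positive, r_perfect_negative, r_noisy)
```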
What is a lurking variable? Give an example.
With some relationships, it may also be the case that there is no causality in either direction. Instead, there may be a third variable that explains both x and y. Such an invisible variable is called a lurking variable.
Example: the number of drownings and the production of ice cream are correlated. Neither causal interpretation (ice cream causes drownings, or drownings cause ice cream production) is particularly good, but we have an obvious lurking variable. The season affects both the number of drownings and the production of ice cream. People swim more and eat more ice cream during the summer months.
What does predict mean?
To predict means to estimate a value when we cannot make a direct observation (i.e., when we cannot perform a measurement).
What does ŷ mean?
Writing ŷ instead of y indicates that it is an estimate, not the actual value of y.
What does e stand for and how is it calculated?
In the formula e = y − ŷ, e stands for the residual. It is "e" as in "error".
What is the goal of the regression line?
Our goal is to find a regression line with the smallest possible residuals.
What is the least squares method?
The straight line that minimizes the sum of squared residuals, Σ e², is considered the best regression line.
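For simple linear regression, the line with the smallest sum of squared residuals has a well-known closed-form solution. The sketch below applies it to made-up data; the function name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_of_squared_residuals(x, y, intercept, slope):
    # residual e = y - y_hat for each observation, then squared and summed
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Made-up data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = fit_line(x, y)
sse = sum_of_squared_residuals(x, y, intercept, slope)
print(intercept, slope, sse)

# Any other line, e.g. one with a slightly different slope, has a larger sum
sse_other = sum_of_squared_residuals(x, y, intercept, slope + 0.1)
```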
What does it mean when a result is significant or not significant?
When we say a relationship is not significant, we mean the relationship in our data could very well be caused by chance.
When we say a relationship is significant, we mean the relationship in our data is most likely not a result of randomness.
Which are the main assumptions in a regression analysis?
residuals are normally distributed
residuals have a constant variance
linear relationship
What does R² do?
R² indicates how well the model explains the variation in the response variable.
Which concepts do we need to calculate R²?
SST (Sum of Squares Total): The total variation in the response variable.
SSE (Sum of Squares Error): The variation that is not explained by the model.
What is R² in a simple linear regression analysis?
For simple linear regression, it holds that R² = r², where r is the correlation coefficient for the relationship between the response variable and the explanatory variable.
What does SSR mean?
A third concept is SSR (Sum of Squares Regression), which measures the variation of ŷ around the mean value ȳ of the response variable.
What is the relationship between SST, SSR, and SSE?
The relationship between SST, SSR, and SSE is:
SST = SSR + SSE
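The decomposition can be verified numerically. The sketch below fits a least squares line to made-up data and computes the three sums of squares. It assumes the common definition R² = 1 − SSE / SST, which for a least squares line with an intercept equals SSR / SST.

```python
# Made-up data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares line
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
y_hat = [intercept + slope * a for a in x]

sst = sum((b - mean_y) ** 2 for b in y)              # total variation
ssr = sum((p - mean_y) ** 2 for p in y_hat)          # explained by the model
sse = sum((b - p) ** 2 for b, p in zip(y, y_hat))    # not explained

r_squared = 1 - sse / sst
r = sxy / (sxx * sst) ** 0.5                         # correlation coefficient

print(sst, ssr, sse, r_squared, r ** 2)
```

The same run also illustrates that R² equals r² in simple linear regression.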
Turkey’s Circle
What happens to R² when more variables are added in a regression analysis?
R² never decreases when we add additional explanatory variables, and in practice it almost always becomes larger, even if the new variables are irrelevant.
What does adjusted R² do in relation to R²?
One strategy to determine which variables should be included can be to maximize the measure called adjusted R². This measure increases with R², but at the same time decreases with the number of explanatory variables.
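The notes do not state a formula, so the sketch below assumes the commonly used definition, adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), with n observations and p explanatory variables. The function name and the numbers are made up for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison on n = 50 observations:
# three extra variables raise R² only marginally, from 0.600 to 0.605
small_model = adjusted_r2(0.600, n=50, p=2)
large_model = adjusted_r2(0.605, n=50, p=5)

print(small_model, large_model)
```

In this made-up case the larger model has the higher R² but the lower adjusted R², so the extra variables would not be judged worth including.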
What is important to remember when interpreting the coefficients of a multiple regression analysis?
In multiple linear regression, we can say that the relationship between the response variable and an explanatory variable is conditional on the other explanatory variables.
What does overfitted mean?
We say that a flexible model with poor generalizability is overfitted.
What does underfitted mean?
An inflexible model that does not capture the pattern in our data is underfitted.
What is a good method to assess generalizability?
A good way to assess if a model is generalizable is simply to test how well its predictions perform on new data.
To do this, we split our dataset into two parts:
• One part is training data. This is used to fit the model, i.e., to find the values of the model’s coefficients.
• The other part is test data. This is used to evaluate the model.
What is the procedure for evaluating a model on test data?
The procedure for evaluating a model on test data is:
1. Split the observations into training data and test data.
2. Fit the regression model using the training data.
3. Evaluate the model using the test data.
Which measure is used on test data to check how good the model is?
RMSE (root mean squared error)
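The three-step procedure and the RMSE can be sketched together. Everything below is an illustration under stated assumptions: the data are simulated from y = 3 + 2x plus noise, the split is 80% training and 20% test, and the model is a simple linear regression fitted by hand.

```python
import math
import random

random.seed(0)  # fixed seed so the split is reproducible

# Simulated data (an assumption for illustration): y = 3 + 2x + noise
x = [i / 10 for i in range(100)]
y = [3 + 2 * xi + random.gauss(0, 1) for xi in x]

# 1. Split the observations into training data (80%) and test data (20%)
indices = list(range(len(x)))
random.shuffle(indices)
split = len(indices) * 4 // 5
train_idx, test_idx = indices[:split], indices[split:]

# 2. Fit the regression model using the training data only
x_train = [x[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
mean_x = sum(x_train) / len(x_train)
mean_y = sum(y_train) / len(y_train)
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x_train, y_train))
         / sum((a - mean_x) ** 2 for a in x_train))
intercept = mean_y - slope * mean_x

# 3. Evaluate the model using the test data: RMSE = sqrt(mean of squared errors)
squared_errors = [(y[i] - (intercept + slope * x[i])) ** 2 for i in test_idx]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(f"test RMSE: {rmse:.3f}")
```

RMSE is expressed in the same unit as the response variable, which makes it easy to interpret as a typical prediction error.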
What is the advantage of cross-validation?
Cross-validation is a method for evaluating models that allows us to use all observations as both training data and test data.
Key advantages:
• Reduces variability in performance estimates compared to single train-test splits
• Maximizes data usage - all observations contribute to both training and evaluation
What is the procedure in cross-validation?
1. Split the dataset into training and test data multiple times.
2. Each split uses a new set of observations as test data.
3. Every observation becomes part of the test data in exactly one split and part of the training data in the others.
What is each split in cross-validation called?
Each split is called a fold.
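The splitting step can be sketched as a small generator that yields the training and test indices for each fold. The function name `k_fold_indices` is hypothetical, and the sketch assigns observations to folds in order; in practice the observations are usually shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n observations."""
    indices = list(range(n))
    # Fold i gets every k-th observation starting at position i
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

A model would then be fitted on `train_idx` and evaluated on `test_idx` for each fold, and the k evaluation results averaged.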
What does concordant mean?
A pair of observations is concordant if the observation with the larger value in one variable also has the larger value in the other variable (the variables move together).
What is a measure of variability?
A measure of variability tells you how spread out or scattered the values of a variable are. In other words, it shows how much the values differ from each other.
What is a robust statistic?
A robust statistic is a measure that doesn’t change much when the data contains outliers or other extreme values.
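A quick comparison on made-up data shows the idea: adding a single extreme value moves the mean a lot but the median only a little, so the median is the more robust of the two.

```python
import statistics

clean = [1, 2, 3, 4, 5]        # made-up observations
with_outlier = clean + [500]   # the same data plus one extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(f"mean moves by {mean_shift:.2f}, median moves by {median_shift:.2f}")
```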