Remove rows with NA in the value column of a data.table dt
dt <- dt[!is.na(value)]
What’s the difference between a histogram and a barplot? Give an example of (sketch) each of them.
# A histogram shows the distribution of a single (mostly continuous) variable,
# while a barplot compares values between groups.
# In a histogram, each bar represents a bin of quantitative data,
# while in a barplot, each bar represents a discrete category.
What’s the difference between Spearman rank correlation and Pearson correlation? When to use which?
# Pearson correlation measures the linear relationship between two continuous variables, while Spearman rank correlation measures the monotonic relationship between two continuous or ordinal variables.
Use Pearson correlation when the two variables are linearly related; otherwise use Spearman correlation.
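A minimal sketch of the difference, using made-up data in which y is a monotonic but clearly nonlinear function of x. The vectors x and y are illustrative assumptions, not data from these notes.

```r
# Hypothetical data: y increases with x, but not linearly
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")   # linear association of the raw values
spearman <- cor(x, y, method = "spearman")  # Pearson correlation of the ranks

pearson
spearman
```

Because the ranks of x and y agree perfectly, the Spearman correlation is 1, while the Pearson correlation stays below 1.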
Explain two plots you would use to compare two continuous sample distributions
# 1. Boxplot, with computed statistics (quantiles, median, outliers)
# 2. Violin plot
# 3. Histogram or density plot
# 4. Ecdf plot
K-means steps
Choose the K initial centroids (one for each cluster). Different methods such as sampling random observations are available for this task.
Assign each observation xi to its nearest centroid by computing the Euclidean distance between each observation to each centroid.
Update the centroids μk by taking the mean value of all of the observations assigned to each previous centroid.
Repeat steps 2 and 3 until the difference between new and former centroids is less than a previously defined threshold.
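The four steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: simple_kmeans is a hypothetical helper name, the initial centroids are sampled observations, and the stopping threshold tol is an arbitrary choice. In practice one would typically call the built-in kmeans().

```r
# Minimal k-means sketch (simple_kmeans is a hypothetical helper, not a library function)
simple_kmeans <- function(x, k, tol = 1e-8, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: choose K initial centroids by sampling random observations
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance of every observation to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    cluster <- max.col(-d, ties.method = "first")  # index of the nearest centroid
    # Step 3: move each centroid to the mean of its assigned observations
    new_centroids <- centroids
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        new_centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
    # Step 4: stop once the centroids move less than the threshold
    shift <- max(abs(new_centroids - centroids))
    centroids <- new_centroids
    if (shift < tol) break
  }
  list(cluster = cluster, centers = centroids)
}

# Hypothetical data: two well-separated groups of 20 points each
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(x, k = 2)
table(fit$cluster)
```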
assumptions when performing k-Means
The number of clusters K is properly selected
The clusters are isotropically distributed, i.e., in each cluster the variables are not correlated and have equal variance
The clusters have equal (or similar) variance
The clusters are of similar size
The data are normalized, so that all variables are on comparable scales
A major limitation of the K-means algorithm
it relies on a predefined number of clusters.
Hierarchical clustering
We describe bottom-up (a.k.a. agglomerative) hierarchical clustering:
Initialization: Compute all the n(n−1)/2 pairwise dissimilarities between the n observations. Treat each observation as its own cluster. A typical dissimilarity measure is the Euclidean distance. Other dissimilarities can be used, e.g. 1 − correlation, Manhattan distance, etc.
For i = n, n−1, ..., 2:
Fuse the two clusters that are least dissimilar. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
Compute the new pairwise inter-cluster dissimilarities among the i−1 remaining clusters using the linkage rule.
The linkage rules define dissimilarity between clusters. Here are four popular linkage rules:
Complete: The dissimilarity between cluster A and cluster B is the largest dissimilarity between any element of A and any element of B.
Single: The dissimilarity between cluster A and cluster B is the smallest dissimilarity between any element of A and any element of B. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time.
Average: The dissimilarity between cluster A and cluster B is the average dissimilarity between any element of A and any element of B.
Centroid: The dissimilarity between cluster A and cluster B is the dissimilarity between the centroids (mean vector) of A and B. Centroid linkage can result in undesirable inversions.
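A short sketch of the procedure with the four linkage rules, using base R's dist(), hclust() and cutree(). The data are simulated for illustration only. For centroid linkage the sketch passes squared Euclidean distances, which is what the hclust() documentation recommends for that method.

```r
# Hypothetical data: two groups of 10 observations in 2 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 6), ncol = 2))

# Initialization: the n(n-1)/2 pairwise Euclidean dissimilarities
d <- dist(x, method = "euclidean")

# One dendrogram per linkage rule
linkages <- c("complete", "single", "average", "centroid")
trees <- lapply(linkages, function(m) {
  dd <- if (m == "centroid") d^2 else d  # centroid linkage: squared distances
  hclust(dd, method = m)
})
names(trees) <- linkages

# Cut every tree into two clusters
clusters <- lapply(trees, cutree, k = 2)
```

Calling plot(trees$single) and plot(trees$complete) side by side is one way to look for the trailing clusters that single linkage can produce.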
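Differences between K-means and hierarchical clustering
The time complexity of K-means clustering is linear in the number of observations (per iteration), while that of hierarchical clustering is at least quadratic, since all pairwise dissimilarities must be computed.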
In K-Means clustering, we start with a random choice of centroids for each cluster. Hence, the results produced by the algorithm depend on the initialization.
K-Means clustering requires the number of clusters a priori.
Rand index
The Rand index is a measure of the similarity between two partitions.
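One common way to compute it is as the fraction of observation pairs on which the two partitions agree, i.e. pairs that are grouped together in both or separated in both. A minimal sketch, where rand_index is a hypothetical helper name and the label vectors are made up:

```r
# rand_index is a hypothetical helper; a and b are cluster labels of the same observations
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")   # TRUE if a pair shares a cluster in partition a
  same_b <- outer(b, b, "==")   # TRUE if a pair shares a cluster in partition b
  pairs <- upper.tri(same_a)    # count every unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

ri_same <- rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))  # same partition, labels swapped
ri_diff <- rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))  # partitions that mostly disagree
```

Here ri_same is 1, because relabeling clusters does not change a partition, and ri_diff is 1/3, because only 2 of the 6 pairs are treated the same way.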
PCA def
For Principal Component Analysis (Pearson, 1901), this representation is the projection of the data on the subspace of dimension q that is closest to the data according to the sums of the squared Euclidean distances.
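A sketch of this projection with base R's prcomp(), on simulated data with q = 2. The variable names and the data are assumptions for illustration. The reconstruction error computed at the end is the sum of squared Euclidean distances that PCA minimizes.

```r
# Hypothetical data: x2 is almost a multiple of x1, x3 is independent noise
set.seed(1)
x1 <- rnorm(100)
x <- cbind(x1, x2 = 2 * x1 + rnorm(100, sd = 0.1), x3 = rnorm(100))

pca <- prcomp(x, center = TRUE, scale. = FALSE)
q <- 2
scores <- pca$x[, 1:q]                        # coordinates in the q-dimensional subspace
x_hat <- scores %*% t(pca$rotation[, 1:q])    # projected points, still centered
x_hat <- sweep(x_hat, 2, pca$center, "+")     # add the column means back

# Sum of squared Euclidean distances between the data and their projection
recon_error <- sum((x - x_hat)^2)
```

With q = 2 out of 3 components, this error equals (n − 1) times the variance of the dropped third component.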
P-value
The P-value is the probability of obtaining a test statistic the same as or more extreme than the one we actually observed, under the assumption that the null hypothesis is true.
The formal definition of the P-value depends on whether we take “more extreme” to mean greater, less, or either way:
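The three cases can be written out as below. The notation is an assumption: T denotes the test statistic, T_obs its observed value, and H_0 the null hypothesis. The two-sided form shown is one common convention.

```latex
\begin{align*}
\text{right-tailed (greater): } & P = p(T \geq T_{\text{obs}} \mid H_0) \\
\text{left-tailed (less): }     & P = p(T \leq T_{\text{obs}} \mid H_0) \\
\text{two-sided (either way): } & P = 2 \min\{\, p(T \leq T_{\text{obs}} \mid H_0),\; p(T \geq T_{\text{obs}} \mid H_0) \,\}
\end{align*}
```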
For univariate data, these are histograms, single boxplots or violin plots. For multivariate data, these are clustered heatmaps, PCA projections, etc.
Associative plots show how a variable depends on another variable.
Suitable plots are side-by-side boxplots, scatter plots, etc.
Elementary causal diagrams
This phenomenon, where a variable X seems to relate to a second variable Y in a certain way, but the relationship flips direction when stratifying by another variable Z (grouping by Z), has since been referred to as Simpson’s paradox.
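A tiny made-up dataset shows the flip. The values are chosen by hand for illustration: within each group of z, y decreases with x, but the group with the larger x values also has the larger y values.

```r
# Hypothetical data constructed to show the paradox
d <- data.frame(
  z = rep(c("a", "b"), each = 5),
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  y = c(5, 4, 3, 2, 1, 10, 9, 8, 7, 6)
)

overall  <- cor(d$x, d$y)                                      # ignoring z
by_group <- sapply(split(d, d$z), function(g) cor(g$x, g$y))   # stratified by z
```

The overall correlation is positive, while the correlation within each group is −1.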
What does Reverse causality state?
Reverse causality states that if A and B correlate, one might falsely conclude that A causes B even though in reality B causes A.
Case resampling bootstrap
We take a sample of size n, with replacement, from our observed data, to make a new dataset.
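A minimal sketch of case resampling in R. The observed data x, the number of replicates B, and the use of the mean as the statistic are assumptions for illustration. The percentile interval at the end is one simple way to use the resampled statistics and is not something these notes prescribe.

```r
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)   # hypothetical observed data
n <- length(x)
B <- 1000                           # number of bootstrap replicates (assumed)

# Each replicate: draw n cases with replacement and recompute the statistic
boot_means <- replicate(B, mean(sample(x, size = n, replace = TRUE)))

# Percentile interval from the bootstrap distribution
ci <- quantile(boot_means, c(0.025, 0.975))
```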
confidence interval
A confidence interval of confidence level 1−α for a parameter θ is an interval C=(a,b) which, were the data generation process repeated, would contain the parameter with probability 1−α, i.e. p(θ∈C)=1−α. A typical value is α=0.05, which leads to 95% confidence intervals.
1. Let p be the (true) probability that a coin lands on heads. You flip the coin n times. Using the resulting data, you compute an estimate for p, which we call pˆ, and a valid 95% confidence interval. This confidence interval is [0.48, 0.52]. Indicate which, if any, of the following statements are correct:
a) There is a 95% chance that 0.48 ≤ p ≤ 0.52
b) Assuming the null hypothesis is correct, there is a 95% chance that 0.48 ≤ p ≤ 0.52
c) Assuming the null hypothesis is incorrect, there is a 95% chance that 0.48 ≤ p ≤ 0.52
d) If we were to repeat the experiment 100 times, and compute a confidence interval for each replicate, we expect that only around 5 of the computed confidence intervals will not contain p
(a) is wrong. p is not random and thus it either is in the interval or it is not.
(b) and (c) are wrong for the same reason as (a). Moreover, we did not even specify a null hypothesis. This is not necessary to compute a confidence interval.
(d) is correct. In fact, this is a restatement of the definition of a 95% confidence interval
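Statement (d) can be checked by simulation. The sketch below assumes a fair coin, n = 1000 flips per experiment and 1000 repeated experiments, and uses the exact interval returned by binom.test(); all of these choices are illustrative.

```r
set.seed(1)
p <- 0.5      # true probability of heads (assumed)
n <- 1000     # flips per experiment (assumed)
reps <- 1000  # number of repeated experiments (assumed)

covered <- replicate(reps, {
  heads <- rbinom(1, size = n, prob = p)
  ci <- binom.test(heads, n)$conf.int   # 95% confidence interval for p
  ci[1] <= p && p <= ci[2]
})
coverage <- mean(covered)   # fraction of intervals that contain the true p
```

The fraction of intervals containing p should come out close to 0.95, so roughly 5 in 100 intervals miss it.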
Consider the following (extreme) 2x2 contingency table and assume we want to test the association of taking antiviral medicine with having symptoms from a viral disease. Indicate which, if any, of the following statements are correct:
a) A Fisher’s test, with alternative = "two.sided", applied to this table will return a low P-value
b) A Fisher’s test, with alternative = "greater", applied to this table will return a low P-value
c) A Fisher’s test, with alternative = "less", applied to this table will return a low P-value
(a) is correct
(b) is wrong. The alternative=greater
# asks whether the upper left cell
# is significantly bigger than we expect
# This is not the case here
(c) is correct
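The table this card refers to is not reproduced here, so the sketch below uses a made-up extreme table with the same pattern of answers: a very small upper left cell. The counts and the dimension names are assumptions.

```r
# Hypothetical extreme 2x2 table (not the one from the original card)
tab <- matrix(c(1, 20, 20, 1), nrow = 2,
              dimnames = list(medicine = c("yes", "no"),
                              symptoms = c("yes", "no")))

p_two     <- fisher.test(tab, alternative = "two.sided")$p.value
p_greater <- fisher.test(tab, alternative = "greater")$p.value
p_less    <- fisher.test(tab, alternative = "less")$p.value
```

For this table the two-sided and "less" P-values are very small, while the "greater" P-value is close to 1, because the upper left cell is much smaller than expected under independence.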
Correct or wrong?
The t-statistic is defined based on the difference in median between groups
is wrong. It depends on the difference in mean
If one of the groups follows a bimodal distribution, then the t-test can still be applied without issue, because the t-statistic does not depend on the mode
is wrong. Bimodality is a deviation from normality # In a bimodal distribution, the mean might not be meaningful
If both groups follow a normal distribution, the t-test will be more powerful (i.e. more likely to detect deviations from the null) than the Wilcoxon
Correct
If the Wilcoxon test returns a very large P-value, e.g. P > 0.9999, then we can conclude that P(X > Y ) = P(Y > X), i.e. the ranks of our two groups follow the same distribution.
is wrong. We never accept the null! # There may be a difference # But we may have too little data # Or too much noise to detect it
What’s the devil’s advocate?
X and Y come from the same distribution, i.e. values do not depend on the group.
For two quantitative variables: X and Y do not correlate.
multiple testing
Multiple testing refers to the issue that, if we test enough hypotheses at a given significance level, say α=0.05, we are bound to eventually get a significant result, even if the null hypothesis is always true.
Family-wise error rate (FWER)
Family-wise error rate (FWER): p(V>0), the probability of one or more false positives, where V is the number of false positives
What does the Benjamini-Hochberg correction control?
The false discovery rate
What is the effect of controlling the Family-wise error rate?
Controlling the Family-wise error rate ensures we have few false positives, but it comes at the cost of many false negatives.
False discovery rate
The expected fraction of false positives among all discoveries:
FDR = E[V / max(R, 1)], where V is the number of false positives, R is the number of discoveries, and max(R, 1) ensures that the denominator is never 0.
Overview P-value, Bonferroni, Benjamini-Hochberg
As expected, we see that the nominal P-value cutoff is the most lenient, the FWER one (Bonferroni) is the most stringent, and the FDR one (Benjamini-Hochberg) is intermediate.
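The ordering can be reproduced with base R's p.adjust(). The P-values below are simulated under an assumed mix of 900 true nulls and 100 true effects, purely for illustration.

```r
set.seed(1)
# Hypothetical P-values: 900 true nulls (uniform) and 100 true effects (mostly small)
p <- c(runif(900), rbeta(100, 0.1, 10))
alpha <- 0.05

n_nominal    <- sum(p <= alpha)                                     # no correction
n_bonferroni <- sum(p.adjust(p, method = "bonferroni") <= alpha)    # controls the FWER
n_bh         <- sum(p.adjust(p, method = "BH") <= alpha)            # controls the FDR

c(nominal = n_nominal, BH = n_bh, bonferroni = n_bonferroni)
```

The number of rejections always satisfies Bonferroni ≤ Benjamini-Hochberg ≤ nominal.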
Why do we need to be careful when we apply hypothesis testing procedures in a big data context?
As the sample size increases, even very small effects may become significant, but that does not mean that they actually matter.
Moreover, we saw that when we run many tests, some are bound to reject the null, even if the null is always true. We thus need to apply a correction.
Effect size
The effect size is a quantitative measure of the magnitude of a phenomenon. Examples of effect sizes are the correlation between two variables, the regression coefficient in a regression, the mean difference, or even the risk with which something happens, such as how many people survive after a heart attack for every one person that does not survive.
The effect size determines whether an effect is actually important.
Assume the null hypothesis is always true. If we do 200 tests, we will expect to have around … false positives when using α = 0.05 as our threshold of significance
10
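The arithmetic behind this answer is 200 × 0.05 = 10. It can also be checked by simulation, using the fact that P-values are uniformly distributed when the null is true. The number of simulation runs below is an arbitrary choice.

```r
m <- 200       # number of tests
alpha <- 0.05  # significance threshold
expected_fp <- m * alpha   # expected number of false positives

# Simulation: under the null, P-values are uniform on [0, 1]
set.seed(1)
sim_fp <- replicate(2000, sum(runif(m) <= alpha))
mean(sim_fp)   # should be close to expected_fp
```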
Assume we are using a permutation-based approach. If we use a Bonferroni correction, we will in general need to do more permutations to be able to reject the null than if we used a Benjamini-Hochberg correction.
is correct.
# The bonferroni is generally
# more conservative # Thus it demands lower P-values to reject
# Lower P-values require more permutations
Assume we are doing 1000 tests. If we let α = 0.01 and use a Bonferroni correction, then the probability of one or more false positives (falsely rejecting the null) will be …?
less than 1%
Assume we are doing 1000 tests. If we let α = 0.01 and use a Benjamini-Hochberg correction, then in expectation 1% of the tests we perform will reject the null
wrong. We expect that at most 1% of the positives will be false positives
linear regression: The conditional expectation of y
Linear regression can be used for various purposes:
To test conditional dependence. This is done by testing the null hypothesis:
H0:βj=0.
To estimate the effects of one variable on the response variable. This is done by providing an estimate of the coefficient βj.
To predict the value of the response variable given values of the explanatory variables. The predicted value is then an estimate of the conditional expectation E[y|x].
To quantify how much variation of a response variable can be explained by a set of explanatory variables.
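The four uses can be sketched with base R's lm() on simulated data. The variables x1, x2, y and the true coefficients are assumptions chosen for illustration; x2 has no effect by construction.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)   # hypothetical model: x2 does not influence y

fit <- lm(y ~ x1 + x2)

coefs <- summary(fit)$coefficients   # estimates and P-values for H0: beta_j = 0
pred  <- predict(fit, newdata = data.frame(x1 = 1, x2 = 0))   # estimate of E[y | x]
r2    <- summary(fit)$r.squared      # proportion of variance explained

# R^2 recomputed by hand as 1 - RSS / TSS
r2_manual <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
```

The column "Pr(>|t|)" of coefs holds the P-values of the tests, and the column "Estimate" holds the estimated effects.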
Coefficient of determination or R²
The proportion of variance explained by the model
The assumptions of linear regressions are:
The expected values of the response are a linear combination of the explanatory variables.
Errors are identically and independently distributed.
Errors follow a normal distribution with constant variance across all explanatory variable values
An implication of the errors following a Gaussian distribution is that
the residuals also follow a Gaussian distribution. A Q-Q plot of the residuals can be used to check this.
Interpret the plot from above. How can we explain the different slopes of the two linear models and the pca?
# Linear models minimize the RSS along the predicted coordinate.
# In this specific case, that coordinate is either the male student's height (first model, predicting the students' heights) or the father's height (second model, predicting the fathers' heights).
# PCA minimizes the sum of squared distances perpendicular to the first principal component, i.e. along both coordinates.
# PC1 is the line for which the sum of the squared distances of every point (x,y) to it is minimal across all possible lines.
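The three slopes can be compared on simulated data. The heights below are made up and only mimic the kind of data in the plot; the variable names are assumptions.

```r
set.seed(1)
x <- rnorm(500, mean = 175, sd = 7)               # hypothetical fathers' heights
y <- 175 + 0.5 * (x - 175) + rnorm(500, sd = 6)   # hypothetical students' heights

# Model 1: predict y from x (minimizes vertical distances)
slope_y_on_x <- coef(lm(y ~ x))[["x"]]

# Model 2: predict x from y (minimizes horizontal distances),
# expressed as a slope in the same (x, y) plane
slope_x_on_y <- 1 / coef(lm(x ~ y))[["y"]]

# PC1: minimizes perpendicular distances
rot <- prcomp(cbind(x, y))$rotation
slope_pc1 <- rot["y", "PC1"] / rot["x", "PC1"]

c(y_on_x = slope_y_on_x, pc1 = slope_pc1, x_on_y = slope_x_on_y)
```

For positively correlated data, the PC1 slope lies between the slopes of the two regression lines.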
sensitivity
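The sensitivity refers to the fraction of actual positives that is predicted to be positive: sensitivity = TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives.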
The sensitivity is also referred to as “recall,” “true positive rate,” or “power.”
specificity
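The specificity refers to the fraction of actual negatives that is predicted to be negative: specificity = TN / (TN + FP), where TN is the number of true negatives and FP the number of false positives.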
The specificity is also known as “true negative rate” or “sensitivity of the negative class”
precision
The precision refers to the fraction of predicted positives that are indeed positives: precision = TP / (TP + FP), where TP is the number of true positives and FP the number of false positives.
The precision is also called the positive predictive value. Note that, in the hypothesis testing context, we discussed a related concept, the false discovery rate (FDR). The FDR relates to the precision as follows: FDR = E[1 − precision].
The receiver operating characteristic curve or ROC curve is a way of evaluating the quality of a binary classifier at different cutoffs. It describes on the x axis the false positive rate (1-specificity), and on the y axis the true positive rate (sensitivity).
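A small sketch that computes sensitivity, specificity and precision at a given cutoff and collects the points of a ROC curve. The labels and scores are made up, and metrics is a hypothetical helper name.

```r
# Hypothetical true labels (1 = positive, 0 = negative) and classifier scores
truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
score <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1)

# metrics is a hypothetical helper: predict positive when score >= cutoff
metrics <- function(cutoff) {
  pred <- as.integer(score >= cutoff)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  tn <- sum(pred == 0 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  c(cutoff = cutoff,
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision = tp / (tp + fp))
}

m <- metrics(0.5)

# One row per cutoff; plotting 1 - specificity against sensitivity gives the ROC curve
roc <- t(sapply(sort(unique(score), decreasing = TRUE), metrics))
```

At the cutoff 0.5 there are 3 true positives, 1 false positive, 1 false negative and 5 true negatives, which gives a sensitivity of 0.75, a specificity of 5/6 and a precision of 0.75.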
Which of the following is true for logistic regression?
a) It assigns classes to the datapoints.
b) There is an analytical solution for the estimation of the parameters.
c) It predicts probabilities for each of the two classes.
c
Suppose you are given a fair coin, p(heads) = 0.5. Which of the following are true about odds and log-odds of head?
a) The odds are 0, and the log-odds are 1.
b) The odds are 0.5, and the log-odds are approximately -0.693.
c) The odds are 1, and the log-odds are 0.
d) The odds are 1, and the log-odds are 1.
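The options can be checked by plugging p = 0.5 into the usual definition of the odds, p / (1 − p), and taking the natural logarithm for the log-odds:

```r
p <- 0.5                 # fair coin
odds <- p / (1 - p)      # odds of heads
log_odds <- log(odds)    # natural log of the odds
```

This gives odds of 1 and log-odds of 0, which corresponds to option c). The value −0.693 in option b) is log(0.5), the log of the probability rather than of the odds.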
odds
The odds for a binary variable y are defined as odds = p / (1 − p), where p = p(y = 1).
The specific steps for building a random forest can be formulated as follows:
residual
The residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean)
error
The error of an observation is the deviation of the observed value from the true value of a quantity of interest (for example, a population mean).
residual sum of squares
What is the difference between Student's t-test and the Welch test?
Student's t-test assumes equal variances in the two groups, while the Welch test allows unequal variances.
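In R, both tests are available through t.test(), which runs the Welch test by default and Student's t-test when var.equal = TRUE. The two simulated groups below, with deliberately different variances, are assumptions for illustration.

```r
set.seed(1)
x <- rnorm(30, mean = 0, sd = 1)     # hypothetical group 1
y <- rnorm(30, mean = 0.5, sd = 3)   # hypothetical group 2, larger variance

welch   <- t.test(x, y)                     # default var.equal = FALSE: Welch test
student <- t.test(x, y, var.equal = TRUE)   # pooled variance: Student's t-test

welch$method
student$parameter   # degrees of freedom: n1 + n2 - 2
welch$parameter     # Welch's approximate degrees of freedom, never larger
```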
Wilcoxon rank-sum idea
The idea of the Wilcoxon rank-sum test is that under the null hypothesis, the xi’s and yi’s should be well interleaved in the joint ranking.
It shows the residuals epsilon against some predicted values yˆ. N = 100. Which of the assumptions of the model doesn’t hold?
How would you solve this problem?
Variance is not constant: heteroscedasticity
# transformation of the response y
# - log transformation
# - square root transformation
# - variance stabilizing transformation
# Use a generalized linear model
Linearity
# investigate further terms e.g. x**2
Why is a boxplot not suitable for some data? Name two reasons
Boxplots are not well suited for bimodal data, since they summarize the distribution by a single central value (the median) and hide the two modes
Boxplots are also not suited for categorical data and discrete data with very few values, for which bar plots are preferred