remove rows with NA in column value of a data.table dt
dt <- dt[!is.na(value)]
What’s the difference between a histogram and a barplot? Give an example of (sketch) each of them.
# A histogram is used to show the distribution of a (mostly continuous) variable,
# while a barplot is used to compare values between groups.
# In a histogram, each bar represents a bin of quantitative data,
# while in a barplot, each bar represents a discrete category.
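For illustration, a minimal base-R sketch with made-up data:
x <- rnorm(100)                         # a continuous variable
hist(x, breaks = 20)                    # histogram: each bar is a bin of x
counts <- table(c("A", "B", "B", "C"))  # counts per discrete category
barplot(counts)                         # barplot: one bar per category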
What’s the difference between spearman rank correlation and pearson correlation? When to use which?
# Pearson correlation measures the linear relationship between two continuous variables, while Spearman rank correlation measures the monotonic relationship between two continuous or ordinal variables.
Use Pearson correlation when the two variables are linearly related; otherwise use Spearman correlation.
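For illustration, a small sketch with made-up data where the relationship is monotonic but not linear:
x <- 1:100
y <- x^3                          # perfectly monotonic, but not linear
cor(x, y, method = "pearson")     # < 1, since the relationship is not linear
cor(x, y, method = "spearman")    # exactly 1, since it is perfectly monotonic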
Explain two plots you would use to compare two continuous sample distributions
# 1. Boxplot, with computed statistics (quantiles, median, outliers)
# 2. Violin plot
# 3. Histogram or density plot
# 4. Ecdf plot
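For illustration, a base-R sketch with two made-up samples x and y:
x <- rnorm(100); y <- rnorm(100, mean = 0.5)
boxplot(list(x = x, y = y))                              # side-by-side boxplots
plot(density(x)); lines(density(y), lty = 2)             # overlaid density plots
plot(ecdf(x)); plot(ecdf(y), add = TRUE, col = "grey")   # overlaid ECDFs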
K-means steps
1. Choose the K initial centroids (one for each cluster). Different methods, such as sampling random observations, are available for this task.
2. Assign each observation xi to its nearest centroid by computing the Euclidean distance between each observation and each centroid.
3. Update the centroids μk by taking the mean value of all observations assigned to each previous centroid.
4. Repeat steps 2 and 3 until the difference between new and former centroids is less than a previously defined threshold.
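A minimal sketch of these steps with kmeans() on made-up data:
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))   # two made-up clusters
fit <- kmeans(X, centers = 2, nstart = 10)   # nstart repeats the random initialization
fit$centers                                  # final centroids
table(fit$cluster)                           # cluster sizes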
assumptions when performing k-Means
The number of clusters K is properly selected
The clusters are isotropically distributed, i.e., in each cluster the variables are not correlated and have equal variance
The clusters have equal (or similar) variance
The clusters are of similar size
The data are normalized (variables are on comparable scales)
A major limitation of the K-means algorithm
it relies on a predefined number of clusters.
Hierarchical clustering
We describe bottom-up (a.k.a. agglomerative) hierarchical clustering:
Initialization: Compute all the n(n−1)/2 pairwise dissimilarities between the n observations. Treat each observation as its own cluster. A typical dissimilarity measure is the Euclidean distance. Other dissimilarities can be used: 1 − correlation, Manhattan distance, etc.
For i = n, n−1, ..., 2:
Fuse the two clusters that are least dissimilar. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
Compute the new pairwise inter-cluster dissimilarities among the i−1 remaining clusters using the linkage rule.
The linkage rules define dissimilarity between clusters. Here are four popular linkage rules:
Complete: The dissimilarity between cluster A and cluster B is the largest dissimilarity between any element of A and any element of B.
Single: The dissimilarity between cluster A and cluster B is the smallest dissimilarity between any element of A and any element of B. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time.
Average: The dissimilarity between cluster A and cluster B is the average dissimilarity between any element of A and any element of B.
Centroid: The dissimilarity between cluster A and cluster B is the dissimilarity between the centroids (mean vector) of A and B. Centroid linkage can result in undesirable inversions.
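A minimal sketch with dist() and hclust() on made-up data; the method argument selects the linkage rule:
X <- matrix(rnorm(40), ncol = 2)        # 20 made-up observations
d <- dist(X)                            # Euclidean pairwise dissimilarities
hc <- hclust(d, method = "complete")    # or "single", "average", "centroid"
plot(hc)                                # dendrogram; fusion heights = dissimilarities
cutree(hc, k = 3)                       # cut the tree into 3 clusters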
Differences between k-means and hierarchical clustering
The time complexity of K-Means clustering is linear, while that of hierarchical clustering is quadratic.
In K-Means clustering, we start with a random choice of centroids for each cluster. Hence, the results produced by the algorithm depend on the initialization.
K-Means clustering requires the number of clusters a priori.
Rand index
The Rand index is a measure of the similarity between two partitions of the same set of observations: the fraction of all pairs of observations on which the two partitions agree (the pair is grouped together in both partitions, or separated in both).
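A small sketch computing the Rand index directly from this definition (the function name rand_index is made up):
rand_index <- function(a, b) {
  pairs <- combn(length(a), 2)                # all pairs of observations
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]    # pair in same cluster in partition a?
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]    # pair in same cluster in partition b?
  mean(same_a == same_b)                      # fraction of pairs on which a and b agree
}
rand_index(c(1, 1, 2, 2), c(1, 1, 1, 2))      # 3 of 6 pairs agree -> 0.5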
PCA def
Principal Component Analysis (Pearson, 1901) represents the data by their projection onto the subspace of dimension q that is closest to the data according to the sum of squared Euclidean distances.
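A minimal prcomp() sketch, using the built-in iris data for illustration:
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)         # proportion of variance explained per component
head(pca$x[, 1:2])   # projection of the data onto the first two PCs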
P-value
The P-value is the probability of obtaining a test statistic the same as or more extreme than the one we actually observed, under the assumption that the null hypothesis is true.
The formal definition of the P-value depends on whether we take “more extreme” to mean greater, less, or either way: P = P(T ≥ t_obs | H0) for “greater”, P = P(T ≤ t_obs | H0) for “less”, and the two-sided P-value counts both directions.
Descriptive plots show the distribution of the data itself. For univariate data these are histograms, single boxplots, or violin plots. For multivariate data these are clustered heatmaps, PCA projections, etc.
Associative plots show how a variable depends on another variable.
Suitable plots are side-by-side boxplots, scatter plots, etc.
Elementary causal diagrams
This phenomenon, where a variable X seems to relate to a second variable Y in a certain way, but the relationship flips direction when stratifying by another variable Z (grouping by Z), has since been referred to as Simpson’s paradox.
What does Reverse causality state?
Reverse causality states that if A and B correlate, one might falsely interpret that A causes B even though in reality B causes A.
Case resampling bootstrap
We take a sample of size n, with replacement, from our observed data, to make a new dataset.
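A minimal sketch bootstrapping the median of a made-up sample x:
x <- rexp(50)                                            # made-up observed data
boot_medians <- replicate(1000, median(sample(x, replace = TRUE)))
quantile(boot_medians, c(0.025, 0.975))                  # percentile bootstrap 95% CI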
confidence interval
A confidence interval of confidence level 1−α for a parameter θ is an interval C = (a, b) that, were the data generation process repeated, would contain the parameter with probability 1−α, i.e. P(θ ∈ C) = 1−α. A typical value is α = 0.05, which leads to 95% confidence intervals.
1. Let p be the (true) probability that a coin lands on heads. You flip the coin n times. Using the resulting data, you compute an estimate for p, which we call pˆ, and a valid 95% confidence interval. This confidence interval is [0.48, 0.52]. Indicate which, if any, of the following statements are correct:
a) There is a 95% chance that 0.48 ≤ p ≤ 0.52
b) Assuming the null hypothesis is correct, there is a 95% chance that 0.48 ≤ p ≤ 0.52
c) Assuming the null hypothesis is incorrect, there is a 95% chance that 0.48 ≤ p ≤ 0.52
d) If we were to repeat the experiment 100 times, and compute a confidence interval for each replicate, we expect that only around 5 of the computed confidence intervals will not contain p
(a) is wrong. p is not random and thus it either is in the interval or it is not.
(b) and (c) are wrong for the same reason as (a). Moreover, we did not even specify a null hypothesis; one is not necessary to compute a confidence interval.
(d) is correct. In fact, this is a restatement of the definition of a 95% confidence interval.
Consider the following (extreme) 2x2 contingency table and assume we want to test the association of taking antiviral medicine with having symptoms from a viral disease. Indicate which, if any, of the following statements are correct:
a) A Fisher’s test, with alternative = "two.sided", applied to this table will return a low P-value
b) A Fisher’s test, with alternative = "greater", applied to this table will return a low P-value
c) A Fisher’s test, with alternative = "less", applied to this table will return a low P-value
(a) is correct
(b) is wrong. alternative = "greater" asks whether the upper-left cell is significantly bigger than we expect, which is not the case here.
(c) is correct
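For illustration only, fisher.test() calls on a made-up extreme table (not the table from the question):
tab <- matrix(c(0, 50, 50, 0), nrow = 2,
              dimnames = list(medicine = c("yes", "no"),
                              symptoms = c("yes", "no")))   # hypothetical counts
fisher.test(tab, alternative = "two.sided")$p.value
fisher.test(tab, alternative = "greater")$p.value   # enrichment of the upper-left cell
fisher.test(tab, alternative = "less")$p.value      # depletion of the upper-left cell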
Correct or wrong?
The t-statistic is defined based on the difference in median between groups
is wrong. The t-statistic depends on the difference in means
If one of the groups follows a bimodal distribution, then the t-test can still be applied without issue, because the t-statistic does not depend on the mode
is wrong. Bimodality is a deviation from normality. In a bimodal distribution, the mean might not be meaningful.
If both groups follow a normal distribution, the t-test will be more powerful (i.e. more likely to detect deviations from the null) than the Wilcoxon
Correct
If the Wilcoxon test returns a very large P-value, e.g. P > 0.9999, then we can conclude that P(X > Y ) = P(Y > X), i.e. the ranks of our two groups follow the same distribution.
is wrong. We never accept the null! There may be a difference, but we may have too little data or too much noise to detect it.
What’s the devil’s advocate?
The null hypothesis plays the devil’s advocate. When comparing groups: X and Y come from the same distribution, i.e. values do not depend on the group.
For two quantitative variables: X and Y do not correlate.
multiple testing
multiple testing refers to the issue that, if we test enough hypotheses at a given significance level, say α=0.05, we are bound to eventually get a significant result, even if the null hypothesis is always true.
Family-wise error rate (FWER)
Family-wise error rate (FWER): P(V > 0), the probability of one or more false positives, where V is the number of false positives.
What does the Benjamini-Hochberg correction control?
The false discovery rate
What is the effect of controlling the Family-wise error rate?
Controlling the Family-wise error rate ensures we have few false positives, but it comes at the cost of many false negatives.
False discovery rate
the expected fraction of false positives among all discoveries: FDR = E[V / max(R, 1)], where V is the number of false positives, R is the number of discoveries (rejections), and max(R, 1) ensures the denominator is not 0.
Overview: P-value, Bonferroni, Benjamini-Hochberg
As expected, we see that the nominal P-value cutoff is the most lenient, the FWER one (Bonferroni) is the most stringent, and the FDR one (Benjamini-Hochberg) is intermediate.
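A minimal sketch comparing the three cutoffs with p.adjust() on simulated P-values (900 null, 100 non-null, all made up):
set.seed(1)
p <- c(runif(900), rbeta(100, 1, 50))            # 900 null + 100 non-null P-values
sum(p < 0.05)                                    # nominal cutoff: most lenient
sum(p.adjust(p, method = "bonferroni") < 0.05)   # FWER control: most stringent
sum(p.adjust(p, method = "BH") < 0.05)           # FDR control: intermediate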
Why do we need to be careful when we apply hypothesis testing procedures in a big data context?
As the sample size increases, even very small effects may become significant, but that does not mean that they actually matter.
Moreover, we saw that when we run many tests, some are bound to reject the null, even if the null is always true. We thus need to apply a correction.
Effect size
The effect size is a quantitative measure of the magnitude of a phenomenon. Examples of effect sizes are the correlation between two variables, the regression coefficient in a regression, the mean difference, or even the risk with which something happens, such as how many people survive after a heart attack for every one person that does not survive.
The effect size determines whether an effect is actually important.
Assume the null hypothesis is always true. If we do 200 tests, we will expect to have around … false positives when using α = 0.05 as our threshold of significance
10
Assume we are using a permutation-based approach. If we use a Bonferroni correction, we will in general need to do more permutations to be able to reject the null than if we used a Benjamini-Hochberg correction.
is correct.
# Bonferroni is generally more conservative.
# Thus it demands lower P-values to reject.
# Lower P-values require more permutations.
Assume we are doing 1000 tests. If we let α = 0.01 and use a Bonferroni correction, then the probability of one or more false positives (falsely rejecting the null) will be …?
less than 1%
Assume we are doing 1000 tests. If we let α = 0.01 and use a Benjamini-Hochberg correction, then in expectation 1% of the tests we perform will reject the null
wrong. Benjamini-Hochberg controls the FDR: in expectation, at most 1% of the rejections (not of all tests) are false positives.
linear regression models the conditional expectation of y given the explanatory variables, E[y|x], as a linear function of x.
Linear regression can be used for various purposes:
To test conditional dependence. This is done by testing the null hypothesis:
H0:βj=0.
To estimate the effects of one variable on the response variable. This is done by providing an estimate of the coefficient βj.
To predict the value of the response variable given values of the explanatory variables. The predicted value is then an estimate of the conditional expectation E[y|x].
To quantify how much variation of a response variable can be explained by a set of explanatory variables.
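A minimal lm() sketch covering these uses, with the built-in mtcars data for illustration:
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)   # coefficient estimates, t-tests of H0: beta_j = 0, and R^2
coef(fit)      # estimated effects of the explanatory variables
predict(fit, newdata = data.frame(wt = 3, hp = 120))   # estimate of E[y | x]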
coefficient of determination or R2
The proportion of variance of the response explained by the model: R² = 1 − RSS/TSS (residual sum of squares over total sum of squares).
The assumptions of linear regressions are:
The expected values of the response are a linear combination of the explanatory variables.
Errors are identically and independently distributed.
Errors follow a normal distribution with constant variance across all explanatory variable values
An implication of the errors following a Gaussian distribution is that the residuals also follow a Gaussian distribution. A Q-Q plot of the residuals can be used to check this.
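A minimal Q-Q plot sketch for the residuals of a hypothetical model:
fit <- lm(mpg ~ wt, data = mtcars)        # example model
qqnorm(resid(fit)); qqline(resid(fit))    # points close to the line support normality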
Interpret the plot from above. How can we explain the different slopes of the two linear models and the PCA?
# Linear models minimize the RSS along the predicted coordinate.
# In this specific case, that coordinate is either the male student's height (first model, predicting the students' heights) or the father's height (second model, predicting the fathers' heights).
# PCA minimizes the sum of squares along both coordinates, and hence the distance perpendicular to the first principal component.
# PC1 represents the line such that the sum of the squared distances of every point (x, y) to it is the minimum across all possible lines.
sensitivity
The sensitivity refers to the fraction of actual positives that is predicted to be positive: sensitivity = TP / (TP + FN).
The sensitivity is also referred to as “recall,” “true positive rate,” or “power.”
specificity
The specificity refers to the fraction of actual negatives that is predicted to be negative: specificity = TN / (TN + FP).
The specificity is also known as “true negative rate” or “sensitivity of the negative class.”
precision
The precision refers to the fraction of predicted positives that are indeed positives: precision = TP / (TP + FP).
The precision is also called the positive predictive value. Note that, in the hypothesis testing context, we discussed a related concept, the false discovery rate (FDR). The FDR relates to the precision as follows: FDR = 1 − precision.
The receiver operating characteristic curve or ROC curve is a way of evaluating the quality of a binary classifier at different cutoffs. It describes on the x axis the false positive rate (1-specificity), and on the y axis the true positive rate (sensitivity).
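A minimal sketch computing these metrics from made-up labels and predictions:
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)
pred  <- c(1, 0, 1, 0, 0, 1, 0, 1)
TP <- sum(pred == 1 & truth == 1); FP <- sum(pred == 1 & truth == 0)
TN <- sum(pred == 0 & truth == 0); FN <- sum(pred == 0 & truth == 1)
TP / (TP + FN)   # sensitivity / recall / true positive rate
TN / (TN + FP)   # specificity / true negative rate
TP / (TP + FP)   # precision; FDR = 1 - precision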
Which of the following is true for logistic regression?
a) It assigns classes to the datapoints.
b) There is an analytical solution for the estimation of the parameters.
c) It predicts probabilities for each of the two classes.
c) is correct. Logistic regression predicts probabilities for the two classes; assigning class labels requires an additional cutoff, and there is no analytical solution for the parameter estimates.
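A minimal glm() sketch, using mtcars for illustration (am is a binary outcome):
fit <- glm(am ~ wt, data = mtcars, family = binomial)   # logistic regression
head(predict(fit, type = "response"))                   # predicted probabilities, not class labels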
Suppose you are given a fair coin, p(heads) = 0.5. Which of the following are true about odds and log-odds of head?
a) The odds are 0, and the log-odds are 1.
b) The odds are 0.5, and the log-odds are approximately -0.693.
c) The odds are 1, and the log-odds are 0.
d) The odds are 1, and the log-odds are 1.
c) is correct: odds = 0.5 / (1 − 0.5) = 1 and log-odds = log(1) = 0.
odds
The odds for a binary variable y are defined as odds = P(y = 1) / (1 − P(y = 1)).
The specific steps for building a random forest can be formulated as follows:
1. Draw a bootstrap sample of the observations.
2. Grow a decision tree on this sample; at each split, consider only a random subset of the features and pick the best split among them.
3. Repeat steps 1 and 2 to grow many trees.
4. Aggregate the predictions of all trees: majority vote for classification, averaging for regression.
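A minimal sketch, assuming the randomForest package is installed, with iris for illustration:
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)   # mtry defaults to sqrt(#features)
rf$confusion                                                # out-of-bag confusion matrix
predict(rf, iris[1:3, ])                                    # class predictions by majority vote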
residual
The residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean)
error
The error of an observation is the deviation of the observed value from the true value of a quantity of interest (for example, a population mean).
residual sum of squares
The residual sum of squares is RSS = Σi (yi − ŷi)², the sum of squared differences between the observed and fitted values.
What is the difference between the Student’s t-test and the Welch test?
The Student’s t-test assumes equal variances in the two groups; the Welch test does not assume equal variances.
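A minimal sketch with made-up samples of unequal variance:
x <- rnorm(20); y <- rnorm(20, sd = 3)
t.test(x, y)                    # Welch test (default in R, var.equal = FALSE)
t.test(x, y, var.equal = TRUE)  # classical Student's t-test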
Wilcoxon rank-sum test: idea
The idea of the Wilcoxon rank-sum test is that, under the null hypothesis, the xi’s and yi’s should be well interleaved when all observations are pooled and ranked together.
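A minimal sketch with made-up samples:
x <- rnorm(20); y <- rnorm(20, mean = 1)
wilcox.test(x, y)   # rank-sum test (a.k.a. Mann-Whitney U test)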
It shows the residuals epsilon against some predicted values yˆ. N = 100. Which of the assumptions of the model doesn’t hold?
How would you solve this problem?
Variance is not constant: heteroscedasticity
# transformation of the response y
# - log transformation
# - square root transformation
# - variance stabilizing transformation
# Use a generalized linear model
Linearity
# investigate further terms e.g. x**2
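A minimal sketch of the residual-vs-fitted diagnostic and a log transformation to fix heteroscedasticity, with made-up data whose variance grows with the mean:
set.seed(1)
x <- runif(100, 1, 10)
y <- exp(0.3 * x + rnorm(100, sd = 0.3))   # multiplicative noise
fit <- lm(y ~ x)
plot(fitted(fit), resid(fit))              # funnel shape: heteroscedasticity
fit_log <- lm(log(y) ~ x)
plot(fitted(fit_log), resid(fit_log))      # roughly constant spread after log transform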
Why is a boxplot not suitable for some data? Name two reasons
Boxplots are not well suited for bimodal data, since they summarize the distribution by a single center (the median) and quartiles, hiding the two modes.
Boxplots are also not suited for categorical data and discrete data with very few distinct values, for which bar plots are preferred.