Permutation testing P-value
P = (r + 1) / (m + 1)
where m is the number of random (Monte Carlo) permutations
and r = #{T* ≥ T_obs} is the number of these random permutations that produce a test statistic greater than or equal to the one calculated for the actual data.
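A minimal sketch of a Monte Carlo permutation test in R (the data, group labels, and the mean-difference statistic are hypothetical illustrations, not from the lecture):

library(data.table)
set.seed(1)
dt <- data.table(value = c(rnorm(20, 0), rnorm(20, 0.5)),
                 group = rep(c("A", "B"), each = 20))
# observed test statistic: absolute difference in group means
T_obs <- abs(mean(dt$value[dt$group == "A"]) - mean(dt$value[dt$group == "B"]))
m <- 1000
T_perm <- replicate(m, {
  g <- sample(dt$group)   # permute the group labels
  abs(mean(dt$value[g == "A"]) - mean(dt$value[g == "B"]))
})
r <- sum(T_perm >= T_obs)   # permutations at least as extreme as observed
p_value <- (r + 1) / (m + 1)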
precision
TP / (TP + FP)
false discovery rate (FDR)
FP / (TP + FP)
specificity, true negative rate
TN / (TN + FP)
sensitivity, true positive rate, recall
TP / (TP + FN)
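A minimal sketch computing these quantities from hypothetical confusion-matrix counts (the numbers are made up for illustration):

tp <- 40; fp <- 10; tn <- 45; fn <- 5
precision   <- tp / (tp + fp)
fdr         <- fp / (tp + fp)   # false discovery rate = 1 - precision
specificity <- tn / (tn + fp)   # true negative rate
sensitivity <- tp / (tp + fn)   # true positive rate, recall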
When do we use a line plot for visualizing data?
a. To show a connection between a series of individual data points
b. To show a correlation between two quantitative variables
c. To highlight individual quantitative values per category
d. To compare distributions of quantitative values across categories
# The correct answer is A
# A line plot can be considered for connecting a series of individual data points
# or to display the trend of a series of data points.
# This can be particularly useful to show the shape of the data as
# it flows and changes from point to point.
What’s the result of the following command? ggplot(data = mpg)
The correct answer is B: a blank figure will be produced
What’s the result of the following command?
ggplot(data = mpg, aes(x = hwy, y = cty))
The correct answer is C: a blank figure with axes will be produced
For which type of data will boxplots produce meaningful visualizations? (2 possible answers)
a. For discrete data.
b. For bi-modal distributions.
c. For non-Gaussian, symmetric data.
d. For exponentially distributed data.
# Answers C and D are correct
# Answer B is incorrect since boxplots are not good for bimodal data:
# they only show the median and not both modes.
# Boxplots are OK for both symmetric and non-symmetric data,
# since the quartiles are not symmetric.
Match each chart type with the relationship it shows best.
1. shows distribution and quantiles, especially useful when comparing uni-modal distributions.
2. highlights individual values, supports comparisons, and can show rankings or deviations across categories and totals
3. shows overall changes and patterns, usually over intervals of time
4. shows relationship between two continuous variables.
Options: bar chart, line chart, scatterplot, boxplot
# 1. boxplot
# 2. bar chart
# 3. line chart
# 4. scatterplot
What could make k-means clustering fail?
1. When clusters are isotropically distributed.
2. When clusters have similar sizes.
3. When data are not normalized.
4. When clusters have similar variances.
# 3: k-means fails when the input variables are on very different scales,
# as it will skew the mean.
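A minimal sketch of the normalization point, using the built-in iris measurements as example data (an assumption, not from the lecture): scaling the variables before k-means prevents features with large ranges from dominating the Euclidean distances.

X <- iris[, 1:4]                                          # numeric features only
km_raw    <- kmeans(X, centers = 3, nstart = 20)          # unscaled features
km_scaled <- kmeans(scale(X), centers = 3, nstart = 20)   # z-score scaled features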
How is the Rand index calculated?
(0 + 4) / ((5 * 4) / 2) = 0.4
# A is not possible because the maximum variance that can be explained is 100%
# B is not possible because by definition PC1 will always explain the most variance
# C is in principle possible, but since our data is 4-dimensional, we will have 4
# principal components and it is very unlikely that PC3 and PC4 do not capture any variance
# D is possible
Provide one analytical test for 1 binary variable
Binomial test
classical analytical tests and their assumptions for testing the associations of pairs of variables
The number of series with the same number of heads T
The probability of the total number of heads for a fair coin
Generally, the total number of heads follows a binomial distribution
Binomial distribution in R
dbinom(x=t, size=n, prob=0.5)
R typically provides 4 functions per distribution starting with the letters
r for random draws (rbinom, rnorm, ...)
d for density or probability mass (dbinom, dnorm, ...)
p for cumulative distribution (pbinom, pnorm, ...)
q for quantile (qbinom, qnorm, ...)
Binomial test P-value
The two-sided p-value is the probability, under H0, of observing an outcome at least as extreme (in either direction) as the observed one.
Binomial test with R
binom.test(10, 12, p = 0.5, alternative = c("two.sided") )
The p-value can be extracted from the returned object:
tst <- binom.test(10, 12, p = 0.5, alternative = c("two.sided") )
tst$p.value
## [1] 0.03857422
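A sketch of where this number comes from, assuming the convention used by binom.test (sum the probabilities of all outcomes at most as likely as the observed one):

probs <- dbinom(0:12, size = 12, prob = 0.5)
sum(probs[probs <= dbinom(10, size = 12, prob = 0.5)])
## [1] 0.03857422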
Association between two binary variables
Hypergeometric distribution
Fisher’s exact test
Fisher’s exact test in R
The right-tailed one-sided test is obtained using alternative = "greater".
cont_tab <- data.table( smoker = c(10, 20), nonsmoker = c(10, 70) )
fisher.test(cont_tab, alternative = "greater")
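As a sketch of the link to the hypergeometric distribution: for this 2x2 table (cell [1,1] = 10, row totals 20 and 90, column totals 30 and 80), the right-tailed Fisher p-value should equal the hypergeometric tail probability P(X ≥ 10):

phyper(10 - 1, m = 30, n = 80, k = 20, lower.tail = FALSE)
# compare with fisher.test(cont_tab, alternative = "greater")$p.value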
t-statistic
T-Test
t-test in R
Assuming equal variance:
tab <- getMaltoseDt("mrk_5211")
t.test(growth_rate ~ genotype, data=tab, var.equal=TRUE)
Unequal variance (Welch’s test) in R
In practice, one does not assume equal variances. The degree of freedom is slightly different (Welch’s test).
This is the default mode in R:
t.test(growth_rate ~ genotype, data=tab)
Wilcoxon test
All observations are independent of each other.
The responses are ordinal.
Under the null hypothesis H0, the probability of an observation from the population X exceeding an observation from the second population Y equals the probability of an observation from Y exceeding an observation from X: P(X > Y ) = P(Y > X). A stronger null hypothesis commonly used is “The distributions of both populations are equal” which implies the previous hypothesis.
The alternative hypothesis H1 is “the probability of an observation from the population X exceeding an observation from the second population Y is different from the probability of an observation from Y exceeding an observation from X: P(X > Y) ≠ P(Y > X).” The alternative may also be stated in terms of a one-sided test, for example: P(X > Y) > P(Y > X).
Ranks are less sensitive to outliers
Wilcoxon rank-sum test: Mann-Whitney U statistic
Wilcoxon rank-sum test in R
wilcox.test(growth_rate ~ genotype, data=tab)
Pearson correlation coefficient
Null hypothesis H0: X and Y are a pair from an uncorrelated bivariate normal distribution, i.e., they follow a normal distribution and are independent of each other.
Under H0, the statistic t = r · sqrt(n − 2) / sqrt(1 − r²) follows a Student's t-distribution with n − 2 degrees of freedom.
Pearson correlation test in R
cor.test(anscombe[,1], anscombe[,5], method="pearson")
Spearman rank correlation test in R
Without distribution assumption, one can use rank correlation (Spearman’s ρ correlation). Reminder (rank correlation)
cor(anscombe[,1], anscombe[,5], method="spearman")
## [1] 0.8181818
cor(rank(anscombe[,1]), rank(anscombe[,5]), method="pearson")
## [1] 0.8181818
This uses tabulation for small n and an asymptotic t-distribution approximation for large n.
cor.test(anscombe[,1], anscombe[,5], method="spearman")
Spearman test is less powerful than Pearson when the data is actually Gaussian.
Spearman test is more robust to outliers.
In practice, Spearman test is often used.
Analytical Statistical Assessment
Permutation-based testing is general: one can test for any ad hoc statistics such as median difference, etc.
Permutation-based testing is computationally intensive.
Analytical tests presented in this chapter are usually much faster to compute.
In some cases (Fisher’s test, Wilcoxon test, Spearman test), non-parametric tests exist that make few assumptions on the distributions. These should be preferred.
Other tests are parametric (e.g. T-test).
Testing only accounts for one way of being misled - making conclusions from noise. It does not address other issues, such as confounding.
Testing does not establish causality. The association between smoking and respiratory disease, for example, may be due to smokers on average being older, or having other comorbidities.
One should always report, on top of the P-value:
a plot, to check that the data supports the analysis (the summary statistics could be inadequate)
the effect size
a confidence interval
Multiple testing can lead to misleading results if not controlled for
2 families of multiple testing corrections:
Family-Wise Error Rate
False Discovery Rate
Quantile-quantile plots: Motivation
Checking that a distribution fits the data is frequent and an important task in data analysis
Examples:
Check that the data is approximately normally distributed before running a t-test or testing coefficients of a linear regression (next week).
Check that p-values are uniformly distributed
Detect outliers in a dataset
Histograms are often not appropriate
Assuming the data is uniformly distributed we expect 10% of the data in the interval [0,0.1], 20% in the interval [0,0.2], etc.
We can compare the empirical quantiles with the values of the theoretical ones.
R:
dec <- quantile(x, seq(0,1,0.1))
dec
For a finite sample we can estimate the quantile for every data point.
One way is to use as estimator r/(N + 1), where r is the rank of the data point.
This estimator is unbiased for the uniform distribution
Using ggplot to make Q-Q plots
ggplot(data.table(x=x), aes(sample=x)) +
  geom_qq(distribution = stats::qunif) +
  geom_abline(slope=1, intercept=0) +
  mytheme
Effect Size
Assume a very small effect size. For instance the means of two Normal distributions with variance 1 differ by just 0.01. For small sample sizes, the difference is typically not statistically significant.
As N grows any difference can be detected.
This has implications for experimental design: one can estimate how large a sample must be to significantly detect effects of a certain amplitude (see the power.t.test sketch below).
This also means that with big data, even very small effect sizes can become statistically significant.
In quantitative genetics, risk factors of less than 1% for common diseases get detected. The cohorts involve 100,000s of donors.
It is important to report not only the p-value but also an estimate of the effect size.
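A minimal power-calculation sketch for the 0.01 effect mentioned above (sig.level = 0.05 and power = 0.80 are assumed values):

power.t.test(delta = 0.01, sd = 1, sig.level = 0.05, power = 0.8)
# the required n per group is on the order of 1.5e5, which is why
# tiny effects are only detectable in very large cohorts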
Multiple Testing
If we test enough hypotheses at a given significance level, say α = 0.05, we are bound to eventually get a significant result, even if the null hypothesis is always true
This happens for example when stratifying large datasets into subgroups (e.g. jelly-bean colors), and doing a test for each one
This allows P-hacking: testing hypotheses until we get a significant (= publishable!) result
Need to correct for this to ensure we have robust results
Nominal p-values
Nominal P-values can lead to many false rejections when testing multiple hypotheses.
Note an important fact about P-values: under the null hypothesis, we have that p(P < α|H0) = α. In other words, P-values are uniformly distributed under H0.
Equivalent: Under H0, the P-value follows a uniform distribution on the [0, 1] interval. This is approximately true for discrete statistics (step-function distribution).
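A quick simulation sketch of this fact (hypothetical null data, two-sample t-tests):

set.seed(1)
pvals <- replicate(10000, t.test(rnorm(20), rnorm(20))$p.value)
mean(pvals < 0.05)       # close to 0.05, as expected under H0
hist(pvals, breaks = 20) # approximately flat histogram on [0, 1]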
Family-wise error rate
Bonferroni correction
the Bonferroni correction is a very conservative method to address multiple testing
dt[, bonferroni := p.adjust(p.value, method="bonferroni")]
table(dt[, .(fair, rejected = bonferroni <= 0.05)])
Hence, the Queen wants to keep V as low as possible
The King does that by controlling so-called Family-wise error rate:
Family-wise error rate (FWER): p(V > 0), the probability of one or more false positives. Specifically, the King has opted to keep the FWER below α = 0.05
Under H0, if we do m tests and reject whenever P < α, then we falsely reject about mα times.
Hence, to control the FWER at level α we only want to reject if P < α/m.
More formally, suppose we conduct m hypothesis tests, one for each g = 1, ..., m, each producing a p-value Pg. Then the Bonferroni-adjusted P-values are:
P̃g = min(m · Pg, 1)
Selecting all tests with P̃g ≤ α controls the FWER at level α, i.e., p(V > 0) ≤ α. We can prove this using Boole's inequality.
This control does not require assumptions about dependence among the P-values or about how many of the null hypotheses are true
Bonferroni is very conservative. If we run 1,000 tests at α = 0.05, we will only reject if (unadjusted) P ≤ 0.00005.
In R we do:
p.adjust(p_values, method = "bonferroni")
Benjamini-Hochberg correction
A higher number of counterfeiters were discovered
The proportion of innocents among the subjects declared guilty (2 out of 46) remained lower than about 5%.
Controlling the Family-wise error rate ensures we have few false positives, but at the cost of many false negatives
The Benjamini-Hochberg correction controls the so-called False discovery rate (FDR).
False discovery rate (FDR): the expected fraction of false positives among all discoveries:
FDR = E[V / max(R, 1)]
where max(R, 1) ensures that the denominator is not 0.
We first order our (unadjusted) P-values. Let the ordered unadjusted p-values be: P1 ≤ P2 ≤ ... ≤ Pm.
To control the FDR at level α, we reject the hypotheses with the k smallest P-values, where k is the largest index such that Pk ≤ (k/m) · α.
Bonferroni vs Benjamini-Hochberg
As expected, we see that the nominal p-value cutoff is the most lenient, the FWER one (Bonferroni) is the most stringent, and the FDR one (Benjamini-Hochberg) is intermediate.
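A small simulation sketch of this ordering (900 true nulls and 100 true effects are assumed numbers):

set.seed(1)
p_null <- replicate(900, t.test(rnorm(20), rnorm(20))$p.value)
p_alt  <- replicate(100, t.test(rnorm(20), rnorm(20, mean = 1))$p.value)
p <- c(p_null, p_alt)
sum(p <= 0.05)                                    # nominal cutoff: most rejections
sum(p.adjust(p, method = "BH") <= 0.05)           # FDR (Benjamini-Hochberg): intermediate
sum(p.adjust(p, method = "bonferroni") <= 0.05)   # FWER (Bonferroni): fewest rejections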
Linear regression
Testing conditional dependence in general is hard
For any distribution: Gaussian, Binomial, Poisson, but also mixtures, or even distributions that are not functionally parameterized.
Condition on continuous variables. Binning? How?
Combinatorial explosion of strata (2^p strata for p binary variables). Loss of power, multiple testing.
Linear regression solves it in a particular case
modeling only conditional Gaussian distribution p(y|x1, ..., xp) whose variance is independent of x1, ..., xp.
assuming the conditional expectation of y is a simple linear combination of the explanatory variables, that is: E[y | x1, ..., xp] = β0 + β1 x1 + ... + βp xp,
and hence being able to deal with continuous explanatory variables.
coping with the explosion of explanatory variable combinations by assuming that the effects of the explanatory variables do not depend on the value of the other variables -> Effects do not change between strata.
Limitations of the assumptions
The Gaussian assumption is limiting. We cannot deal with binary response variables for instance. The next chapter will address such cases.
The assumption that the variance is independent of the strata can be a problem. -> Diagnostic plots
Additivity not as limiting as it seems. Often OK. If not, transform the data! E.g.:
E[weight|height] = β0 + β1height^3
Overall, in practice often reasonable. Easy and interpretable. The first model to try!
Applications
Linear regression can be used for various purposes:
to test conditional dependence. This is done by testing the null hypothesis:
H0 : βj = 0.
Examining if a drug significantly impacts blood pressure
To estimate the effects of one variable on the response variable. This is done by providing an estimate of the coefficient βj . Impact of an increase in drug dosage (βj ) on blood pressure
To predict the value of the response variable given values of the explanatory variable. The predicted value is then an estimate of the conditional expectation E[y|x]. Predicting household electricity consumption based on the number of residents
To quantify how much variation of the response variable can be explained by a set of explanatory variables. Example: assessing how well house features (rooms, location, ...) explain variability in house prices.
We will now see all of this, starting with univariate regression.
Maximum likelihood principle
Univariate regression
Fitted coefficients
Interpretation of the fitted coefficients
Linear regression
Use in R
fit <- lm(son ~ father, data = galton_heights)
coefficients(fit)
## (Intercept)      father
##   37.775451    0.454742
ggplot2 layer: geom_smooth(method = "lm")
galton_heights %>% ggplot(aes(father, son)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")
How well does the model represent our response?
R^2 in R
Testing the relationship between y and x
in R:
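The code is not shown here; a minimal sketch, assuming the Galton fit object defined above:

summary(fit)              # t-test of H0: beta_father = 0, plus R^2
summary(fit)$r.squared    # extract R^2 directly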
What can linear regression be used for?
Make predictions for future, unseen, data
Quantify explained variance
Model linear relationship between variables
Multivariate linear regression
Least squares estimates are unbiased
in R
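The code is not shown here; a minimal sketch using the built-in mtcars data (an assumption, not the lecture's dataset):

fit_multi <- lm(mpg ~ wt + hp, data = mtcars)   # two explanatory variables
summary(fit_multi)                              # coefficients, t-tests, R^2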
Multivariate linear regression
Nested models and hypothesis testing
Sometimes we are interested in testing an entire set of variables.
We then compare so-called nested models. We say that a model A is nested into a model B if the parameters in A are a subset of the parameters of model B.
We compare a full model Ω to a reduced model ω where model ω is a special case of the more general model Ω.
E.g. Consider the following example model: y = β0 + β1x1 + β2x2 + β3x3
We can test two predictors x1 and x2 with:
Full model: all β's can take any value.
Reduced model: β1 = β2 = 0 (only the intercept β0 and the third parameter β3 can take any value).
ANOVA and the F-test
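A minimal sketch of a nested-model comparison with an F-test, again assuming the mtcars example (disp plays the role of x3):

full    <- lm(mpg ~ wt + hp + disp, data = mtcars)   # all betas free
reduced <- lm(mpg ~ disp, data = mtcars)             # beta_wt = beta_hp = 0
anova(reduced, full)   # F-test of H0: the extra coefficients are all zero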
Diagnostics
Assumptions of a mathematical model never hold exactly in practice
Questions are: How badly are we violating them? What are the implications for our conclusions? How can we address these?
The assumptions of linear regression are:
The expected values of the response are a linear combination of the explanatory variables.
Errors are identically and independently distributed.
Errors follow a normal distribution.
Two diagnostic plots will help us:
the residual plot and q-q plot of the residuals.
residual plots
This plots the residuals against the predicted response values (ε̂ vs. ŷ).
Unlike plotting the observations against the predictions (y vs. ŷ), we do not have to tilt our heads to spot issues in the residuals.
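A minimal sketch of the two diagnostic plots for a fitted lm object fit:

plot(fit, which = 1)   # residuals vs fitted values (residual plot)
plot(fit, which = 2)   # normal Q-Q plot of the residuals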
Heteroscedasticity
It is not uncommon that the variance of the residuals is not constant across all data points.
This property is called heteroscedasticity.
This violates the i.i.d assumption of the errors.
The residual plots can help spotting when error variance depends on response mean
The fit can be suboptimal for small sample sizes because least squares gives too much importance to the points with high noise.
The statistical tests are flawed.
However, with large sample size the fit can still be correct
Heteroscedasticity: What to do?
Try to transform the response variable y, e.g.:
log transformation
square root transformation
variance stabilizing transformation
However, as before, the appropriate transformation is difficult to know in practice and finding it likely requires trying out several options. Alternatively, use methods with a different noise model.
weighted least squares if your data comes with variance estimates (eg experimental errors)
generalized linear models (next chapter)
Logistic regression
Logistic regression is an adaptation of linear regression for binary outcomes
It can be used as baseline classifier and often does the job
It is the final layer of neural network classifiers
Its objective function (cross-entropy) is widely used in machine learning
Generalized linear models (GLM)
Logistic regression with R
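The fitting code is not shown here; a minimal sketch with glm (the predictor column height is an assumption; y and mu_hat match the columns used in the ROC code below):

fit_lr <- glm(y ~ height, data = heights, family = "binomial")
heights[, mu_hat := predict(fit_lr, type = "response")]   # fitted probabilities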
Logistic regression and changes in class proportions
Precision-recall curves
Precision, TP / (TP + FP), plotted against the recall, TP / (TP + FN).
To plot a precision-recall curve we can use the package PRROC.
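A minimal sketch with PRROC, assuming binary labels y (1 = positive class) and the fitted scores mu_hat from the heights example:

library(PRROC)
pr <- pr.curve(scores.class0 = heights[y == 1, mu_hat],   # scores of positives
               scores.class1 = heights[y == 0, mu_hat],   # scores of negatives
               curve = TRUE)
plot(pr)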
ROC curve
Going through all possible cutoffs, a ROC curve plots: on the x axis the false positive rate (1-specificity)
on the y axis the true positive rate (sensitivity)
library(plotROC)
heights[, random_scores:=runif(.N)]
heights_melted <- heights[, .(y, mu_hat, random_scores)] %>% melt(id.vars="y", variable.name = "logistic_fit", value.name="response")
ggroc <- ggplot(heights_melted, aes(d=y, m=response, color=logistic_fit)) +
  geom_roc() +
  scale_color_discrete(name = "Logistic fit", labels = c("Univariate", "Random")) +
  geom_abline()
ggroc
Supervised learning vs. unsupervised learning
Supervised learning Goal:
build a powerful algorithm that takes feature values as input and returns a prediction for an outcome.
For this, we train an algorithm using a dataset for which we associate each set of feature values to a certain outcome
Then, we use this trained model to make predictions. Examples: logistic regression, linear regression, random forests.
Unsupervised learning
We do not associate feature values to an outcome variable that can supervise our analysis and model building.
Used to identify patterns in the distribution of data. Examples: clustering and dimensionality reduction.
Over and Underfitting
We want to analyze whether our built model is generalizing well and capturing the trend of the data. Two situations should be avoided:
Under-fitting: The trained model does not capture the trends of the data. High measured error between actual and predicted outcomes
Over-fitting: The model fits the data used for training the model too well. No generalization to other datasets
polynomial curve fitting
For a given model complexity (here: polynomial of order m = 9), the over-fitting problem becomes less severe as the size of the data set increases
In supervised learning, what guides our choice of the fit is data and only data and not theoretical considerations about the underlying process.
The challenge is not reducing the error on data at hand but the error on unseen data, called the generalization error
minimize error on data we have never seen
Cross-validation to control the generalization error
Cross-validation in R using the caret package
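The training call is not shown here; a minimal sketch of 5-fold cross-validation of a logistic model (the class factor and the height predictor are assumptions; caret needs a factor outcome for ROC-based metrics):

library(caret)
heights[, class := factor(ifelse(y == 1, "tall", "short"))]
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
lr_fit <- train(class ~ height, data = heights, method = "glm",
                metric = "ROC", trControl = ctrl)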
The lr_fit object reports the average sensitivity, specificity, and area under the ROC curve across the k = 5 trained models:
lr_fit
The performance measurements for each model can be obtained as follows:
lr_fit$resample
The final model can be obtained by accessing the attribute finalModel from the lr_fit object:
lr_fit$finalModel
Random forests:
Random forests are easy to understand and achieve state-of-the-art performance in many prediction tasks.
They can be applied to both regression and classification tasks
Random forests are an instance of tree-based ensemble learning, which is a strategy that is robust to over-fitting
Each random forest is an ensemble of several decision trees
Idea: partition (or segment) the training dataset into a number of simple regions with the help of splitting rules
Once the input space of the training dataset has been segmented, make a prediction for a new input sample by using the mean or the mode of the training observations in the region to which the new input sample is assigned
Decision trees for regression tasks: minimize RSS
Decision trees for regression tasks: splitting rules
It is computationally unfeasible to consider every possible partition of the training dataset. Instead, follow a top-down greedy approach to build the tree.
Top-down: begin at the top of the tree with the complete dataset and then successively split the dataset into two new branches
Greedy: at each step of the process, the best possible split is made, i.e. the split that results in the greatest possible reduction in RSS
Finding optimal values for thresholds and features can be computed quickly with current computational methods
The iterative process of finding features and thresholds that define splitting rules continues until a stopping criterion is met.
This prevents having to consider every possible partition of the training dataset.
Stopping criteria include setting a minimum number of samples in a region or defining a maximum number of resulting regions.
Decision trees for classification tasks
The construction process of classification decision trees is very similar to that for regression tasks
Difference in the predicted outcomes:
For a regression tree, the predicted outcome for an observation is given by the mean response of the training observations that belong to the region
For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
Random forest algorithm
The specific steps for building a random forest can be formulated as follows:
Build B decision trees using the training set. We refer to the fitted models as T1, T2, ..., TB.
For every observation in the test set, form a prediction yˆj using tree Tj
For continuous outcomes, form a final prediction with the average ŷ = (1/B) Σ_{j=1}^{B} ŷ_j. For categorical outcomes, predict ŷ with a majority vote (most frequent class among ŷ_1, ..., ŷ_B).
To create the trees T1, T2, ..., TB from the training set we do:
Create a bootstrap training set by sampling N observations from the training set with replacement. In this manner, each tree is trained on a different dataset.
Randomly select a subset of the features to be included in the building of each tree. Not all features are considered for each tree. A different random subset is selected for each tree. This reduces correlation between trees in the forest and prevents over-fitting.
Random Forests in R
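A minimal sketch with the randomForest package, using the built-in iris data (an assumption, not the lecture's dataset):

library(randomForest)
set.seed(1)
rf_fit <- randomForest(Species ~ ., data = iris,
                       ntree = 500, nodesize = 1)   # maxnodes left at its default
rf_fit                                   # prints OOB error estimate and confusion matrix
predict(rf_fit, newdata = iris[1:5, ])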
Hyper parameters of random forests: the number of trees (ntree), the minimum size of leaf nodes (nodesize) and the maximum number of leaf nodes (maxnodes) in each tree, among others
Hyper parameters should be optimized (or tuned) for each application to achieve a better performance of the models
Two possible approaches are grid search and random search (out of scope in this lecture)
Supervised Learning
Supervised machine learning:
Build and evaluate models that identify trends in the training dataset but are still able to generalize to (new) independent datasets
Cross-validation as a strategy to prevent over-fitting
Cross-validation pitfalls
Random forests show good performance on many application tasks
RF fitting for classification and for regression tasks
Decision trees
Bagging
“Random” with respect to features and samples
Why is this dataset messy?
## 1. Values are stored as columns (days)
## 2. A single entity is scattered across many cells (date)
## 3. Element column is not a variable.
How do you perform the data table merge pictured here?
#4. Right, all.y = TRUE
3. What would a tidy version of it look like? Do not give the code, only describe what the tidy table would look like.
Tidy version: id, date, tmin, tmax
Create a tidy version of the weather dataset.
dt <- melt(messy_dt, id.vars = c("id", "year", "month", "element"), variable.name = "day", value.name = 'temp')
dt[, day := as.integer(gsub("d", "", day))]
dt[, date := paste(year, month, day, sep = "-")] # option using paste
dt[, c("year", "month", "day") := NULL] # remove reduntant columns
# alternatively one can use unite from the tidyr package
# dt <- unite(dt, "date", c("year", "month", "day"), sep = "-", remove = TRUE)
dt[, element := tolower(element)] # TMAX -> tmax
dt <- dcast(dt, ... ~ element, value.var = "temp") # long -> wide
dt <- dt[!(is.na(tmax) & is.na(tmin))]
# remove entries with both NA values; na.omit(dt) would also do the job
head(dt)
Name 3 properties defining tidy data
Each variable must have its own column
Each observation must have its own row
Each value must have its own cell
Which summary statistics are shown by the R standard boxplot?
middle bar = median
box = 1st and 3rd quartiles
whiskers = most extreme data points within 1.5 × the interquartile range
ROC
x axis: false positive rate FP / (FP + TN)
y axis: sensitivity TP / (TP + FN)