Permutation testing P-value
P = (r + 1) / (m + 1)
where m is the number of random (Monte Carlo) permutations
and r = #{T* ≥ T_obs} is the number of these random permutations that produce a test statistic greater than or equal to the one calculated for the actual data.
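A minimal sketch of a Monte Carlo permutation test in R (the data, group labels, and the mean-difference statistic are hypothetical illustrations, not from the lecture):

library(data.table)
set.seed(1)
dt <- data.table(value = c(rnorm(20, 0), rnorm(20, 0.5)),
                 group = rep(c("A", "B"), each = 20))
# observed test statistic: absolute difference in group means
T_obs <- abs(mean(dt$value[dt$group == "A"]) - mean(dt$value[dt$group == "B"]))
m <- 1000
T_perm <- replicate(m, {
  g <- sample(dt$group)   # permute the group labels
  abs(mean(dt$value[g == "A"]) - mean(dt$value[g == "B"]))
})
r <- sum(T_perm >= T_obs)   # permutations at least as extreme as observed
p_value <- (r + 1) / (m + 1)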
precision
TP / (TP + FP)
false discovery rate (FDR)
FP / (TP + FP)
specificity, true negative rate
TN / (TN + FP)
sensitivity, true positive rate, recall
TP / (TP + FN)
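A minimal sketch computing these quantities from hypothetical confusion-matrix counts (the numbers are made up for illustration):

tp <- 40; fp <- 10; tn <- 45; fn <- 5
precision   <- tp / (tp + fp)
fdr         <- fp / (tp + fp)   # false discovery rate = 1 - precision
specificity <- tn / (tn + fp)   # true negative rate
sensitivity <- tp / (tp + fn)   # true positive rate, recall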
When do we use a line plot for visualizing data?
a. To show a connection between a series of individual data points
b. To show a correlation between two quantitative variables
c. To highlight individual quantitative values per category
d. To compare distributions of quantitative values across categories
# The correct answer is A
# A line plot can be considered for connecting a series of individual data points
# or to display the trend of a series of data points.
# This can be particularly useful to show the shape of the data as
# it flows and changes from point to point.
What’s the result of the following command? ggplot(data = mpg)
The correct answer is B: a blank figure will be produced
What’s the result of the following command?
ggplot(data = mpg, aes(x = hwy, y = cty))
The correct answer is C: a blank figure with axes will be produced
For which type of data will boxplots produce meaningful visualizations? (2 possible answers)
a. For discrete data.
b. For bi-modal distributions.
c. For non-Gaussian, symmetric data.
d. For exponentially distributed data.
# Answers C and D are correct
# Answer B is incorrect since boxplots are not good for bimodal data:
# they only show the median and not both modes.
# Boxplots are OK for both symmetric and non-symmetric data,
# since the quartiles are not symmetric.
Match each chart type with the relationship it shows best.
1. shows distribution and quantiles, especially useful when comparing uni-modal distributions.
2. highlights individual values, supports comparisons, and can show rankings or deviations across categories and totals
3. shows overall changes and patterns, usually over intervals of time
4. shows relationship between two continuous variables.
Options: bar chart, line chart, scatterplot, boxplot
# 1. boxplot
# 2. bar chart
# 3. line chart
# 4. scatterplot
What could make k-means clustering fail?
1. When clusters are isotropically distributed.
2. When clusters have similar sizes.
3. When data are not normalized.
4. When clusters have similar variances.
# 3: k-means fails when the input variables are on very different scales,
# as it will skew the mean.
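A minimal sketch of the normalization point, using the built-in iris measurements as example data (an assumption, not from the lecture): scaling the variables before k-means prevents features with large ranges from dominating the Euclidean distances.

X <- iris[, 1:4]                                          # numeric features only
km_raw    <- kmeans(X, centers = 3, nstart = 20)          # unscaled features
km_scaled <- kmeans(scale(X), centers = 3, nstart = 20)   # z-score scaled features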
How is the Rand index calculated?
(0 + 4) / ((5 * 4) / 2) = 0.4
# A is not possible because the maximum variance that can be explained is 100%
# B is not possible because by definition PC1 will always explain the most variance
# C is in principle possible, but since our data is 4-dimensional, we will have 4
# principal components and it is very unlikely that PC3 and PC4 do not capture any variance
# D is possible
Provide one analytical test for 1 binary variable
Binomial test
classical analytical tests and their assumptions for testing the associations of pairs of variables
The number of series with the same number of heads T
The probability of the total number of heads for a fair coin
Generally, the total number of heads follows a binomial distribution
Binomial distribution in R
dbinom(x=t, size=n, prob=0.5)
R typically provides 4 functions per distribution starting with the letters
r for random draws (rbinom, rnorm, ...)
d for density or probability mass (dbinom, dnorm, ...)
p for cumulative distribution (pbinom, pnorm, ...)
q for quantile (qbinom, qnorm, ...)
Binomial test P-value
The two-sided p-value is the probability, under H0, of observing an outcome at least as extreme (in either direction) as the observed one.
Binomial test with R
binom.test(10, 12, p = 0.5, alternative = c("two.sided") )
The p-value can be extracted from the returned object:
tst <- binom.test(10, 12, p = 0.5, alternative = c("two.sided") )
tst$p.value
## [1] 0.03857422
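A sketch of where this number comes from, assuming the convention used by binom.test (sum the probabilities of all outcomes at most as likely as the observed one):

probs <- dbinom(0:12, size = 12, prob = 0.5)
sum(probs[probs <= dbinom(10, size = 12, prob = 0.5)])
## [1] 0.03857422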
Association between two binary variables
Hypergeometric distribution
Fisher’s exact test
Fisher’s exact test in R
The right-tailed one-sided test is obtained using alternative = "greater".
cont_tab <- data.table( smoker = c(10, 20), nonsmoker = c(10, 70) )
fisher.test(cont_tab, alternative = "greater")
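As a sketch of the link to the hypergeometric distribution: for this 2x2 table (cell [1,1] = 10, row totals 20 and 90, column totals 30 and 80), the right-tailed Fisher p-value should equal the hypergeometric tail probability P(X ≥ 10):

phyper(10 - 1, m = 30, n = 80, k = 20, lower.tail = FALSE)
# compare with fisher.test(cont_tab, alternative = "greater")$p.value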
t-statistic
T-Test
t-test in R
Assuming equal variance:
tab <- getMaltoseDt("mrk_5211")
t.test(growth_rate ~ genotype, data=tab, var.equal=TRUE)
Unequal variance (Welch’s test) in R
In practice, one does not assume equal variances. The degree of freedom is slightly different (Welch’s test).
This is the default mode in R:
t.test(growth_rate ~ genotype, data=tab)
Wilcoxon test
All observations are independent of each other.
The responses are ordinal.
Under the null hypothesis H0, the probability of an observation from the population X exceeding an observation from the second population Y equals the probability of an observation from Y exceeding an observation from X: P(X > Y ) = P(Y > X). A stronger null hypothesis commonly used is “The distributions of both populations are equal” which implies the previous hypothesis.
The alternative hypothesis H1 is “the probability of an observation from the population X exceeding an observation from the second population Y is different from the probability of an observation from Y exceeding an observation from X: P(X > Y) ≠ P(Y > X).” The alternative may also be stated in terms of a one-sided test, for example: P(X > Y) > P(Y > X).
Ranks are less sensitive to outliers
Wilcoxon rank-sum test: Mann-Whitney U statistic
Wilcoxon rank-sum test in R
wilcox.test(growth_rate ~ genotype, data=tab)
Pearson correlation coefficient
Null hypothesis H0: X and Y are a pair from an uncorrelated bivariate normal distribution, i.e., they follow a normal distribution and are independent of each other.
Under H0, the statistic t = r · sqrt(n − 2) / sqrt(1 − r²) follows a Student's t-distribution with n − 2 degrees of freedom.
Pearson correlation test in R
cor.test(anscombe[,1], anscombe[,5], method="pearson")
Spearman rank correlation test in R
Without distribution assumption, one can use rank correlation (Spearman’s ρ correlation). Reminder (rank correlation)
cor(anscombe[,1], anscombe[,5], method="spearman")
## [1] 0.8181818
cor(rank(anscombe[,1]), rank(anscombe[,5]), method="pearson")
## [1] 0.8181818
This uses tabulation for small n and an asymptotic t-distribution approximation for large n.
cor.test(anscombe[,1], anscombe[,5], method="spearman")
Spearman test is less powerful than Pearson when the data is actually Gaussian.
Spearman test is more robust to outliers.
In practice, Spearman test is often used.
Analytical Statistical Assessment
Permutation-based testing is general: one can test for any ad hoc statistics such as median difference, etc.
Permutation-based testing is computationally intensive.
Analytical tests presented in this chapter are usually much faster to compute.
In some cases (Fisher’s test, Wilcoxon test, Spearman test), non-parametric tests exist that make few assumptions on the distributions. These should be preferred.
Other tests are parametric (e.g. T-test).
Testing only accounts for one way of being misled - making conclusions from noise. It does not address other issues, such as confounding.
Testing does not establish causality. The association between smoking and respiratory disease, for example, may be due to smokers on average being older, or having other comorbidities.
One should always report, on top of the P-value:
a plot, to check that the data supports the analysis (the summary statistics could be inadequate)
the effect size
a confidence interval
Multiple testing can lead to misleading results if not controlled for
2 families of multiple testing corrections:
Family-Wise Error Rate
False Discovery Rate
Quantile-quantile plots: Motivation
Checking that a distribution fits the data is frequent and an important task in data analysis
Examples:
Check that the data is approximately normally distributed before running a t-test or testing coefficients of a linear regression (next week).
Check that p-values are uniformly distributed
Detect outliers in a dataset
Histograms are often not appropriate
Assuming the data is uniformly distributed we expect 10% of the data in the interval [0,0.1], 20% in the interval [0,0.2], etc.
We can compare the empirical quantiles with the values of the theoretical ones.
R:
dec <- quantile(x, seq(0,1,0.1))
dec
For a finite sample we can estimate the quantile for every data point.
One way is to use as estimator r/(N + 1), where r is the rank of the data point.
This estimator is unbiased for the uniform distribution
Using ggplot to make Q-Q plots
ggplot(data.table(x=x), aes(sample=x)) +
  geom_qq(distribution = stats::qunif) +
  geom_abline(slope=1, intercept=0) +
  mytheme
Effect Size
Assume a very small effect size. For instance the means of two Normal distributions with variance 1 differ by just 0.01. For small sample sizes, the difference is typically not statistically significant.
As N grows any difference can be detected.
This has implications for experimental design: one can estimate how large a sample must be to significantly detect effects of a certain amplitude (see the power.t.test sketch below).
This also means that with big data, even very small effect sizes can become statistically significant.
In quantitative genetics, risk factors of less than 1% for common diseases get detected. The cohorts involve 100,000s of donors.
It is important to report not only the p-value but also an estimate of the effect size.
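A minimal power-calculation sketch for the 0.01 effect mentioned above (sig.level = 0.05 and power = 0.80 are assumed values):

power.t.test(delta = 0.01, sd = 1, sig.level = 0.05, power = 0.8)
# the required n per group is on the order of 1.5e5, which is why
# tiny effects are only detectable in very large cohorts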
Multiple Testing
If we test enough hypotheses at a given significance level, say α = 0.05, we are bound to eventually get a significant result, even if the null hypothesis is always true
This happens for example when stratifying large datasets into subgroups (e.g. jelly-bean colors), and doing a test for each one
This allows P-hacking: testing hypotheses until we get a significant (= publishable!) result
Need to correct for this to ensure we have robust results
Nominal p-values
Nominal P-values can lead to many false rejections when testing multiple hypotheses.
Note an important fact about P-values: under the null hypothesis, we have that p(P < α|H0) = α. In other words, P-values are uniformly distributed under H0.
Equivalent: Under H0, the P-value follows a uniform distribution on the [0, 1] interval. This is approximately true for discrete statistics (step-function distribution).
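A quick simulation sketch of this fact (hypothetical null data, two-sample t-tests):

set.seed(1)
pvals <- replicate(10000, t.test(rnorm(20), rnorm(20))$p.value)
mean(pvals < 0.05)       # close to 0.05, as expected under H0
hist(pvals, breaks = 20) # approximately flat histogram on [0, 1]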
Family-wise error rate
Bonferroni correction
the Bonferroni correction is a very conservative method to address multiple testing
dt[, bonferroni := p.adjust(p.value, method="bonferroni")]
table(dt[, .(fair, rejected = bonferroni <= 0.05)])
Hence, the Queen wants to keep V as low as possible
The King does that by controlling so-called Family-wise error rate:
Family-wise error rate (FWER): p(V > 0), the probability of one or more false positives. Specifically, the King has opted to keep the FWER below α = 0.05
Under H0, if we do m tests and reject whenever P < α, then we falsely reject about mα times.
Hence, to control the FWER at level α we only want to reject if P < α/m.
More formally, suppose we conduct m hypothesis tests, one for each g = 1, ..., m, each producing a p-value Pg. Then the Bonferroni-adjusted P-values are:
P̃g = min(m · Pg, 1)
Selecting all tests with P̃g ≤ α controls the FWER at level α, i.e., p(V > 0) ≤ α. We can prove this using Boole's inequality.
This control does not require assumptions about dependence among the P-values or about how many of the null hypotheses are true
Bonferroni is very conservative. If we run 1,000 tests at α = 0.05, we will only reject if (unadjusted) P ≤ 0.00005.
In R we do:
p.adjust(p_values, method = "bonferroni")
Benjamini-Hochberg correction
A higher number of counterfeiters were discovered
The proportion of innocents among the subjects declared guilty (2 out of 46) remained lower than about 5%.
Controlling the Family-wise error rate ensures we have few false positives, but at the cost of many false negatives
The Benjamini-Hochberg correction controls the so-called False discovery rate (FDR).
False discovery rate (FDR): the expected fraction of false positives among all discoveries:
FDR = E[V / max(R, 1)]
where max(R, 1) ensures that the denominator is not 0.
We first order our (unadjusted) P-values. Let the ordered unadjusted p-values be: P1 ≤ P2 ≤ ... ≤ Pm.
To control the FDR at level α, we reject the hypotheses with the k smallest P-values, where k is the largest index such that Pk ≤ (k/m) · α.
Bonferroni vs Benjamini-Hochberg
As expected, we see that the nominal p-value cutoff is the most lenient, the FWER one (Bonferroni) is the most stringent, and the FDR one (Benjamini-Hochberg) is intermediate.
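A small simulation sketch of this ordering (900 true nulls and 100 true effects are assumed numbers):

set.seed(1)
p_null <- replicate(900, t.test(rnorm(20), rnorm(20))$p.value)
p_alt  <- replicate(100, t.test(rnorm(20), rnorm(20, mean = 1))$p.value)
p <- c(p_null, p_alt)
sum(p <= 0.05)                                    # nominal cutoff: most rejections
sum(p.adjust(p, method = "BH") <= 0.05)           # FDR (Benjamini-Hochberg): intermediate
sum(p.adjust(p, method = "bonferroni") <= 0.05)   # FWER (Bonferroni): fewest rejections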
Linear regression
Testing conditional dependence in general is hard
For any distribution: Gaussian, Binomial, Poisson, but also mixtures, or even distributions that are not functionally parameterized.
Condition on continuous variables. Binning? How?
Combinatorial explosion of strata (2^p strata for p binary variables). Loss of power, multiple testing.
Linear regression solves it in a particular case
modeling only conditional Gaussian distribution p(y|x1, ..., xp) whose variance is independent of x1, ..., xp.
assuming the conditional expectation of y is a simple linear combination of the explanatory variables, that is: E[y | x1, ..., xp] = β0 + β1 x1 + ... + βp xp,
and hence being able to deal with continuous explanatory variables.
coping with the explosion of explanatory variable combinations by assuming that the effects of the explanatory variables do not depend on the value of the other variables -> Effects do not change between strata.
Limitations of the assumptions
The Gaussian assumption is limiting. We cannot deal with binary response variables for instance. The next chapter will address such cases.
The assumption that the variance is independent of the strata can be a problem. -> Diagnostic plots
Additivity not as limiting as it seems. Often OK. If not, transform the data! E.g.:
E[weight|height] = β0 + β1height^3
Overall, in practice often reasonable. Easy and interpretable. The first model to try!
Applications
Linear regression can be used for various purposes:
to test conditional dependence. This is done by testing the null hypothesis:
H0 : βj = 0.
Examining if a drug significantly impacts blood pressure
To estimate the effects of one variable on the response variable. This is done by providing an estimate of the coefficient βj . Impact of an increase in drug dosage (βj ) on blood pressure
To predict the value of the response variable given values of the explanatory variable. The predicted value is then an estimate of the conditional expectation E[y|x]. Predicting household electricity consumption based on the number of residents
To quantify how much variation of the response variable can be explained by a set of explanatory variables. Example: assessing how well house features (rooms, location, ...) explain variability in house prices.
We will now see all of this, starting with univariate regression.
Maximum likelihood principle
Univariate regression
Fitted coefficients
Interpretation of the fitted coefficients
Linear regression
Use in R
fit <- lm(son ~ father, data = galton_heights)
coefficients(fit)
## (Intercept)      father
##   37.775451    0.454742
ggplot2 layer: geom_smooth(method = "lm")
galton_heights %>% ggplot(aes(father, son)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")
How well does the model represent our response?
R^2 in R
Testing the relationship between y and x
in R:
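The code is not shown here; a minimal sketch, assuming the Galton fit object defined above:

summary(fit)              # t-test of H0: beta_father = 0, plus R^2
summary(fit)$r.squared    # extract R^2 directly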
What can linear regression be used for?
Make predictions for future, unseen, data
Quantify explained variance
Model linear relationship between variables
Multivariate linear regression
Least squares estimates are unbiased
in R
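The code is not shown here; a minimal sketch using the built-in mtcars data (an assumption, not the lecture's dataset):

fit_multi <- lm(mpg ~ wt + hp, data = mtcars)   # two explanatory variables
summary(fit_multi)                              # coefficients, t-tests, R^2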
Multivariate linear regression
Nested models and hypothesis testing
Sometimes we are interested in testing an entire set of variables.
We then compare so-called nested models. We say that a model A is nested into a model B if the parameters in A are a subset of the parameters of model B.
We compare a full model Ω to a reduced model ω where model ω is a special case of the more general model Ω.
E.g. Consider the following example model: y = β0 + β1x1 + β2x2 + β3x3
We can test two predictors x1 and x2 with:
Full model: all β's can take any value.
Reduced model: β1 = β2 = 0 (only the intercept β0 and the third parameter β3 can take any value).
ANOVA and the F-test
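A minimal sketch of a nested-model comparison with an F-test, again assuming the mtcars example (disp plays the role of x3):

full    <- lm(mpg ~ wt + hp + disp, data = mtcars)   # all betas free
reduced <- lm(mpg ~ disp, data = mtcars)             # beta_wt = beta_hp = 0
anova(reduced, full)   # F-test of H0: the extra coefficients are all zero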
Diagnostics
Assumptions of a mathematical model never hold exactly in practice
Questions are: How badly are we violating them? What are the implications for our conclusions? How can we address these?
The assumptions of linear regression are:
The expected values of the response are a linear combination of the explanatory variables.
Errors are identically and independently distributed.
Errors follow a normal distribution.
Two diagnostic plots will help us:
the residual plot and q-q plot of the residuals.
residual plots
This plots the residuals against the predicted response values (ε̂ vs. ŷ).
Unlike plotting the observations against the predictions (y vs. ŷ), we do not have to tilt our heads to spot issues in the residuals.
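A minimal sketch of the two diagnostic plots for a fitted lm object fit:

plot(fit, which = 1)   # residuals vs fitted values (residual plot)
plot(fit, which = 2)   # normal Q-Q plot of the residuals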
Heteroscedasticity
It is not uncommon that the variance of the residuals is not constant across all data points.
This property is called heteroscedasticity.
This violates the i.i.d assumption of the errors.
The residual plots can help spotting when error variance depends on response mean
The fit can be suboptimal for small sample sizes because least squares gives too much importance to the points with high noise.
The statistical tests are flawed.
However, with large sample size the fit can still be correct
Heteroscedasticity: What to do?
Try to transform the response variable y, e.g.:
log transformation
square root transformation
variance stabilizing transformation
However, as before, the appropriate transformation is difficult to know in practice and finding it likely requires trying out several options. Alternatively, use methods with a different noise model.
weighted least squares if your data comes with variance estimates (eg experimental errors)
generalized linear models (next chapter)
Logistic regression
Logistic regression is an adaptation of linear regression for binary outcomes
It can be used as baseline classifier and often does the job
It is the final layer of neural network classifiers
Its objective function (cross-entropy) is widely used in machine learning
Generalized linear models (GLM)
Logistic regression with R
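The fitting code is not shown here; a minimal sketch with glm (the predictor column height is an assumption; y and mu_hat match the columns used in the ROC code below):

fit_lr <- glm(y ~ height, data = heights, family = "binomial")
heights[, mu_hat := predict(fit_lr, type = "response")]   # fitted probabilities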
Logistic regression and changes in class proportions
Precision-recall curves
Precision, TP / (TP + FP), plotted against the recall, TP / (TP + FN).
To plot a precision-recall curve we can use the package PRROC.
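A minimal sketch with PRROC, assuming binary labels y (1 = positive class) and the fitted scores mu_hat from the heights example:

library(PRROC)
pr <- pr.curve(scores.class0 = heights[y == 1, mu_hat],   # scores of positives
               scores.class1 = heights[y == 0, mu_hat],   # scores of negatives
               curve = TRUE)
plot(pr)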
ROC curve
Going through all possible cutoffs, a ROC curve plots: on the x axis the false positive rate (1-specificity)
on the y axis the true positive rate (sensitivity)
library(plotROC)
heights[, random_scores:=runif(.N)]
heights_melted <- heights[, .(y, mu_hat, random_scores)] %>% melt(id.vars="y", variable.name = "logistic_fit", value.name="response")
ggroc <- ggplot(heights_melted, aes(d=y, m=response, color=logistic_fit)) +
  geom_roc() +
  scale_color_discrete(name = "Logistic fit", labels = c("Univariate", "Random")) +
  geom_abline()
ggroc
Supervised learning vs. unsupervised learning
Supervised learning Goal:
build a powerful algorithm that takes feature values as input and returns a prediction for an outcome.
For this, we train an algorithm using a dataset for which we associate each set of feature values to a certain outcome
Then, we use this trained model to make predictions. Examples: logistic regression, linear regression, random forests.
Unsupervised learning
We do not associate feature values to an outcome variable that can supervise our analysis and model building.
Used to identify patterns in the distribution of data. Examples: clustering and dimensionality reduction.
Over and Underfitting
We want to analyze whether our built model is generalizing well and capturing the trend of the data. Two situations should be avoided:
Under-fitting: The trained model does not capture the trends of the data. High measured error between actual and predicted outcomes
Over-fitting: The model fits the data used for training the model too well. No generalization to other datasets
polynomial curve fitting
For a given model complexity (here: polynomial of order m = 9), the over-fitting problem becomes less severe as the size of the data set increases
In supervised learning, what guides our choice of the fit is data and only data and not theoretical considerations about the underlying process.
The challenge is not reducing the error on data at hand but the error on unseen data, called the generalization error
minimize error on data we have never seen
Cross-validation to control the generalization error
Cross-validation in R using the caret package
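The training call is not shown here; a minimal sketch of 5-fold cross-validation of a logistic model (the class factor and the height predictor are assumptions; caret needs a factor outcome for ROC-based metrics):

library(caret)
heights[, class := factor(ifelse(y == 1, "tall", "short"))]
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
lr_fit <- train(class ~ height, data = heights, method = "glm",
                metric = "ROC", trControl = ctrl)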
The lr_fit object reports the average sensitivity, specificity, and area under the ROC curve across the k = 5 trained models:
lr_fit
The performance measurements for each model can be obtained as follows:
lr_fit$resample
The final model can be obtained by accessing the attribute finalModel from the lr_fit object:
lr_fit$finalModel
Random forests:
Random forests are easy to understand and achieve state-of-the-art performance in many prediction tasks.
They can be applied to both regression and classification tasks
Random forests are an instance of tree-based ensemble learning, which is a strategy that is robust to over-fitting
Each random forest is an ensemble of several decision trees
Idea: partition (or segment) the training dataset into a number of simple regions with the help of splitting rules
Once the input space of the training dataset has been segmented, make a prediction for a new input sample by using the mean or the mode of the training observations in the region to which the new input sample is assigned
Decision trees for regression tasks: minimize RSS
Decision trees for regression tasks: splitting rules
It is computationally unfeasible to consider every possible partition of the training dataset. Instead, follow a top-down greedy approach to build the tree.
Top-down: begin at the top of the tree with the complete dataset and then successively split the dataset into two new branches
Greedy: at each step of the process, the best possible split is made, i.e. the split that results in the greatest possible reduction in RSS
Finding optimal values for thresholds and features can be computed quickly with current computational methods
The iterative process of finding features and thresholds that define splitting rules continues until a stopping criterion is met.
This prevents having to consider every possible partition of the training dataset.
Stopping criteria include setting a minimum number of samples in a region or defining a maximum number of resulting regions.
Decision trees for classification tasks
The construction process of classification decision trees is very similar to that for regression tasks
Difference in the predicted outcomes:
For a regression tree, the predicted outcome for an observation is given by the mean response of the training observations that belong to the region
For a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
Random forest algorithm
The specific steps for building a random forest can be formulated as follows:
Build B decision trees using the training set. We refer to the fitted models as T1, T2, ..., TB.
For every observation in the test set, form a prediction yˆj using tree Tj
For continuous outcomes, form a final prediction with the average ŷ = (1/B) Σ_{j=1}^{B} ŷ_j. For categorical outcomes, predict ŷ with a majority vote (most frequent class among ŷ_1, ..., ŷ_B).
To create the trees T1, T2, ..., TB from the training set we do:
Create a bootstrap training set by sampling N observations from the training set with replacement. In this manner, each tree is trained on a different dataset.
Randomly select a subset of the features to be included in the building of each tree. Not all features are considered for each tree. A different random subset is selected for each tree. This reduces correlation between trees in the forest and prevents over-fitting.
Random Forests in R
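A minimal sketch with the randomForest package, using the built-in iris data (an assumption, not the lecture's dataset):

library(randomForest)
set.seed(1)
rf_fit <- randomForest(Species ~ ., data = iris,
                       ntree = 500, nodesize = 1)   # maxnodes left at its default
rf_fit                                   # prints OOB error estimate and confusion matrix
predict(rf_fit, newdata = iris[1:5, ])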
Hyper parameters of random forests: the number of trees (ntree), the minimum size of leaf nodes (nodesize) and the maximum number of leaf nodes (maxnodes) in each tree, among others
Hyper parameters should be optimized (or tuned) for each application to achieve a better performance of the models
Two possible approaches are grid search and random search (out of scope in this lecture)
Supervised Learning
Supervised machine learning:
Build and evaluate models that identify trends in the training dataset but are still able to generalize to (new) independent datasets
Cross-validation as a strategy to prevent over-fitting
Cross-validation pitfalls
Random forests show good performance on many application tasks
RF fitting for classification and for regression tasks
Decision trees
Bagging
“Random” with respect to features and samples
Why is this dataset messy?
## 1. Values are stored as columns (days)
## 2. A single entity is scattered across many cells (date)
## 3. Element column is not a variable.
How do you perform the data table merge pictured here?
#4. Right, all.y = TRUE
3. What would a tidy version of it look like? Do not give the code, only describe what the tidy table would look like.
Tidy version: id, date, tmin, tmax
Create a tidy version of the weather dataset.
dt <- melt(messy_dt, id.vars = c("id", "year", "month", "element"), variable.name = "day", value.name = 'temp')
dt[, day := as.integer(gsub("d", "", day))]
dt[, date := paste(year, month, day, sep = "-")] # option using paste
dt[, c("year", "month", "day") := NULL] # remove reduntant columns
# alternatively one can use unite from the tidyr package
# dt <- unite(dt, "date", c("year", "month", "day"), sep = "-", remove = TRUE)
dt[, element := tolower(element)] # TMAX -> tmax
dt <- dcast(dt, ... ~ element, value.var = "temp") # long -> wide
dt <- dt[!(is.na(tmax) & is.na(tmin))]
# remove entries with both NA values; na.omit(dt) would also do the job
head(dt)
Name 3 properties defining tidy data
Each variable must have its own column
Each observation must have its own row
Each value must have its own cell
Which summary statistics are shown by the R standard boxplot?
middle bar = median
box = 1st and 3rd quartiles
whiskers = most extreme data points within 1.5 × the interquartile range
ROC
x axis: false positive rate FP / (FP + TN)
y axis: sensitivity TP / (TP + FN)