Why is it better to leave missing values (NA) in an object than to replace them with zeros?

# Missing values indicate that the data point is actually not there and should be ignored in further computation. If we supply a zero instead, it is treated as an actual value and will affect the computed metric.
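
A minimal sketch of the effect, written in Python with `None` standing in for R's `NA`. The measurements are made up for illustration; in R the first computation corresponds to `mean(x, na.rm = TRUE)`.

```python
from statistics import mean

# Hypothetical measurements; None marks a missing value (the analogue of NA).
values = [4.0, None, 6.0, None, 5.0]

# Ignore the missing entries when computing the mean.
observed = [v for v in values if v is not None]
mean_ignoring_missing = mean(observed)

# Replace the missing entries with zeros instead.
zero_filled = [0.0 if v is None else v for v in values]
mean_zero_filled = mean(zero_filled)

print(mean_ignoring_missing)  # 5.0
print(mean_zero_filled)       # 3.0
```

The two zeros are counted as real observations and pull the mean from 5.0 down to 3.0.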

If is.matrix(x) is TRUE, what will is.array(x) return? Why?

# It will also return TRUE because a matrix is a special case of an array, namely one with two dimensions.

Let df be the data frame below. What would be the output of sapply(df, sum)?

# It would be a named vector where the elements are x=6 and y=30.

## B. Return a 3x2 data.table.

Look at the following data.tables and decide what kind of merging (merge(dt1, dt2, by = "id", ...)) was done.

## C. It's a left merge / join.
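
As an illustration of what a left join does, here is a small Python sketch. The two tables are hypothetical (they are not the ones from the exercise); in R the same result would come from `merge(dt1, dt2, by = "id", all.x = TRUE)`.

```python
# Hypothetical tables keyed by "id".
dt1 = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}, {"id": 3, "x": "c"}]
dt2 = [{"id": 1, "y": 10}, {"id": 3, "y": 30}, {"id": 4, "y": 40}]

# Index the right table by id (assumes ids are unique in dt2).
right_by_id = {row["id"]: row for row in dt2}

# Left join: keep every row of dt1; y is None where dt2 has no matching id.
left_join = [
    {**row, "y": right_by_id.get(row["id"], {}).get("y")}
    for row in dt1
]

for row in left_join:
    print(row)
```

Every `id` of the left table survives (1, 2, 3), `id` 2 gets a missing `y`, and `id` 4, which exists only in the right table, is dropped.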

Which transformations would you need to make in order to bring the previous dataset into tidy form? Describe the procedure in steps. Write a code example for each of the steps.

## - Variables are stored in columns (d1-d8)

## - The 'element' column is not a variable; it stores the names of variables

## - One variable (date) is scattered across many different columns/cells

## Tidy version has columns:

## - id, date, tmin, tmax
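
The two reshaping steps can be sketched as follows. This is a Python illustration on a made-up two-day excerpt, not the original dataset; with data.table in R the same steps would typically be `melt()` followed by `dcast()`.

```python
# Hypothetical wide table: one row per (id, element), one column per day.
wide = [
    {"id": "A", "element": "tmax", "d1": 30.1, "d2": 29.5},
    {"id": "A", "element": "tmin", "d1": 14.2, "d2": 13.8},
]
day_columns = ["d1", "d2"]

# Step 1 (melt): collect the day columns into a single 'date' variable.
long = [
    {"id": row["id"], "element": row["element"], "date": day, "value": row[day]}
    for row in wide
    for day in day_columns
]

# Step 2 (cast): spread the names stored in 'element' into their own columns.
tidy = {}
for row in long:
    key = (row["id"], row["date"])
    record = tidy.setdefault(key, {"id": row["id"], "date": row["date"]})
    record[row["element"]] = row["value"]

tidy_rows = list(tidy.values())
for row in tidy_rows:
    print(row)
```

Each resulting row is one observation (an id on one date) with `tmin` and `tmax` as separate variables.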

What’s the difference between a histogram and a barplot? Give an example of (sketch) each of them.

# A histogram shows the distribution of a single (mostly continuous) variable,

# while a barplot compares a value between groups.

# In a histogram, each bar represents a bin of quantitative data,

# while in a barplot, each bar represents a discrete category.

Explain two plots you would use to compare two continuous sample distributions.

# 1. Boxplot, with computed statistics (quantiles, median, outliers)

# 2. Violin plot

# 3. Histogram or density plot

# 4. ECDF (empirical cumulative distribution function) plot

What’s the difference between Spearman rank correlation and Pearson correlation? When to use which?

# Pearson correlation measures the linear relationship between two continuous variables,

# while Spearman rank correlation measures the monotonic relationship between two

# continuous or ordinal variables. Use Pearson correlation when the two variables are

# linearly related; otherwise use Spearman correlation.
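
A small Python sketch of the difference, using made-up data in which y increases monotonically but not linearly with x. The helper functions are illustrative; the rank helper assumes there are no ties.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Ranks starting at 1; assumes no ties, which holds for the data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up data: exponential growth, i.e. monotonic but far from linear.
x = list(range(1, 11))
y = [2 ** v for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Spearman correlation is 1 here because the ordering of x and y agrees perfectly, while Pearson correlation stays clearly below 1 because the points do not lie on a straight line.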

Name 2 advantages of Gaussian mixture models over the k-means algorithm.

# The probabilistic framework allows determining the number of clusters based on information criteria:

# * Akaike information criterion (AIC)

# * Bayesian information criterion (BIC)

# Clusters are allowed individual covariances

# Clusters have individual prior probabilities

What is the Rand index useful for?

# To evaluate and compare clustering results, possibly with different numbers of clusters.

What is the formula of the Rand index? Explain every term.
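
The Rand index is commonly defined as R = (a + b) / C(n, 2), where n is the number of items, C(n, 2) = n(n-1)/2 is the number of item pairs, a is the number of pairs placed in the same cluster by both clusterings, and b is the number of pairs placed in different clusters by both. A minimal Python sketch with made-up cluster labels:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # Agreement: the pair is together in both or separated in both.
        if same_a == same_b:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs

# Made-up clusterings of six items with different numbers of clusters.
clustering_1 = [0, 0, 0, 1, 1, 1]
clustering_2 = [0, 0, 1, 1, 2, 2]
print(rand_index(clustering_1, clustering_2))
```

For these labels a = 2 and b = 8 out of 15 pairs, so R = 10/15. The index only compares pairs of items, which is why the two clusterings may use different numbers of clusters and different label names.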

What is the definition of the p-value? Give the mathematical notation for two-tailed events.

# Probability that the test statistic T is the same as or more extreme than the actually

# observed value Tobs, under the null hypothesis.

# P = 2 * min{ p(T <= Tobs|H0), p(T >= Tobs|H0) }

What is the formula to estimate the p-value from Monte Carlo permutation schemes? Explain every term.

# P = (r+1) / (m+1)

# m is the number of random (Monte Carlo) permutations

# r = #{T* >= Tobs} is the number of these random permutations that produce a

# test statistic greater than or equal to that calculated for the actual data.
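
The estimator can be sketched in Python as follows. The choice of test statistic (difference of group means, one-sided), the function name, and the data are assumptions made for illustration.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, m=999, seed=0):
    """One-sided Monte Carlo permutation p-value for mean(a) - mean(b)."""
    rng = random.Random(seed)
    t_obs = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    r = 0
    for _ in range(m):
        # Shuffling the pooled values permutes the group labels.
        rng.shuffle(pooled)
        t_star = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if t_star >= t_obs:
            r += 1
    # The +1 in numerator and denominator counts the observed data itself.
    return (r + 1) / (m + 1)

# Made-up measurements for two groups.
group_a = [5.1, 4.9, 6.2, 5.8, 6.0]
group_b = [4.2, 4.0, 4.8, 3.9, 4.5]
p_value = permutation_p_value(group_a, group_b)
print(p_value)
```

Because of the +1 terms the estimate can never be exactly zero; its smallest possible value is 1 / (m + 1).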

Assume you are performing many statistical tests. Which of the multiple testing correction methods can you apply if you want to make sure, at a given significance level, that not a single test is falsely rejected (i.e. to control the family-wise error rate)?

A. Benjamini-Hochberg

B. Bonferroni

C. Hochberg

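# B and C are true. Both control the family-wise error rate, whereas Benjamini-Hochberg controls the false discovery rate.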
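
To make the contrast concrete, here is a Python sketch of the Bonferroni rule next to the Benjamini-Hochberg step-up rule. The p-values are made up and the function names are illustrative.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i if p_i <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= k * alpha / m; controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for index in order[:k]:
        reject[index] = True
    return reject

# Made-up p-values from five tests.
p_values = [0.001, 0.012, 0.025, 0.045, 0.2]
print(bonferroni_reject(p_values))          # [True, False, False, False, False]
print(benjamini_hochberg_reject(p_values))  # [True, True, True, False, False]
```

Bonferroni is the stricter rule: it guards against even a single false rejection, at the price of fewer discoveries.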

How can the practical issue of Monte Carlo permutation schemes be solved when many statistical tests are performed (e.g. testing genetic associations with 1 million genetic markers)?

# Using statistical tests with analytical solutions or approximations

# (Student's t-test, Welch's t-test, Fisher test)

When comparing the means of two samples of equal size, which of the following mutually exclusive statements is true? Given the group sample means and standard deviations ...

A. ..., neither the T-statistic nor the p-value depends on the sample size.

B. ..., the T-statistic, but not the p-value, depends on the sample size.

C. ..., the p-value, but not the T-statistic, depends on the sample size.

D. ..., both the p-value and the T-statistic depend on the sample size.

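# D. ..., both the p-value and the T-statistic depend on the sample size.

# With group size n, T = (mean1 - mean2) / sqrt((s1^2 + s2^2) / n) grows with sqrt(n),

# and the p-value additionally depends on n via the degrees of freedom.

# The larger the sample size, the more significant.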

Consider the simple linear model y = α + βx + ε

Name one test statistic that is used to test for a linear relation between x and y in the above model.

# t = \hat{\beta} / se(\hat{\beta})

# Likelihood ratio test

# F-test
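
For the first of these, the t-statistic of the slope can be computed by hand as sketched below in Python. The data and the function name are made up for illustration. Under H0: β = 0 and the usual model assumptions, t follows a t-distribution with n - 2 degrees of freedom.

```python
from math import sqrt

def slope_t_statistic(x, y):
    """Least-squares slope, its standard error, and t = slope / se."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = mean_y - beta_hat * mean_x
    # Residual sum of squares of the fitted line.
    rss = sum((b - alpha_hat - beta_hat * a) ** 2 for a, b in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    se_beta = sqrt(rss / (n - 2) / sxx)
    return beta_hat, se_beta, beta_hat / se_beta

# Made-up data that is close to a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta_hat, se_beta, t_value = slope_t_statistic(x, y)
print(beta_hat, se_beta, t_value)
```

A large absolute t-value, as for these nearly collinear points, is evidence against β = 0.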

This question and the next one refer to the following plot. It shows the residuals ε against some predicted values yˆ. N = 100.

How would you solve this problem?

# Variance is not constant: heteroscedasticity

# transformation of the response y

# - log transformation

# - square root transformation

# - variance stabilizing transformation

# Use a generalized linear model

Linearity

# investigate further terms e.g. x**2

Consider a multiple linear regression model for testing differences of means between three groups: y = β0 + β1x1 + β2x2. The groups G0, G1 and G2 are encoded such that xi = 1 if the data point is from group Gi and xi = 0 otherwise. State the null hypothesis that all three groups have the same mean value.

# H0: \beta_1 = \beta_2 = 0

Which four core values does the confusion matrix of a classifier contain? (Tip: think about what the classifier claims and what is really true.)

# true positives, TP: classifier claims it is a positive sample, and that is correct

# false positives, FP: classifier claims it is a positive sample, but it is actually negative

# false negatives, FN: classifier claims it is a negative sample, but it is actually positive

# true negatives, TN: classifier claims it is a negative sample, and that is correct
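
The four counts can be tallied as in this Python sketch. The labels are made up and the function name is illustrative; `True` stands for the positive class.

```python
def confusion_counts(truth, claimed):
    """Count TP, FP, FN and TN from true labels and classifier claims."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for is_positive, says_positive in zip(truth, claimed):
        if says_positive and is_positive:
            counts["TP"] += 1
        elif says_positive and not is_positive:
            counts["FP"] += 1
        elif not says_positive and is_positive:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Made-up example: what is really true vs. what the classifier claims.
truth = [True, True, False, False, False, True]
claimed = [True, False, False, True, False, True]
counts = confusion_counts(truth, claimed)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```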

Why can a classifier with seemingly high quality (e.g. sensitivity, specificity and AUC of almost 1) fail in real applications?

# This happens especially if the prevalence of one of the classes is rather low, e.g. for a

# rare disease (small positive class). In the disease example, even with high specificity

# values, one gets numerous false positives due to the very high number of negative cases.

# In other words, the PPV will be low.
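
A short Python calculation illustrates the effect. The sensitivity, specificity and prevalence values are made up for the example.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(actually positive | classifier claims positive), via Bayes' rule."""
    true_positive_share = sensitivity * prevalence
    false_positive_share = (1 - specificity) * (1 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)

# Hypothetical rare disease: 1 in 1000 people is affected.
ppv_rare = positive_predictive_value(0.99, 0.99, 0.001)
# Same classifier applied to a balanced population.
ppv_balanced = positive_predictive_value(0.99, 0.99, 0.5)
print(round(ppv_rare, 3), round(ppv_balanced, 3))
```

With 99% sensitivity and specificity, fewer than one in ten positive calls is correct at a prevalence of 0.1%, whereas the same classifier reaches a PPV of 99% when both classes are equally frequent.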
