Which of the following claims could be an example of reversing cause and effect?
1. Healthier diets increase blood pressure.
2. Low social status leads to a higher risk of schizophrenia.
3. The number of fire engines on a fire gives rise to higher damages.
4. Entering an intensive care unit increases your chances of dying.
#All previous statements could be examples of reversing cause and effect...
#There is no firm way to decide the direction of causality from a mere
#association. It is therefore important to consider and discuss both
#possibilities when interpreting a correlation.
A study conducted in Datavizland by Woman’s magazine analyzed the dating preference of women. For each previous or current partner, women were asked to evaluate men on two parameters from 0-10: How handsome and how fun they are.
The study concluded: "We report a negative correlation between how handsome and how fun a man is; therefore, we conclude that handsome men are boring."
In the same week a study conducted on the whole population of men by Men’s magazine reported no correlation between how handsome and how fun a man is.
How can you explain this apparent paradox? Which kind of causal diagram describes this scenario?
There are:
1. Common cause 2. Indirect cause 3. Common consequence
# Let's suppose women only tend to date men who are either fun or handsome
# (e.g. men whose fun + handsome score is >= 10). Then, when subsetting on
# the men who are dated, you see a negative correlation between the attributes
# handsome and fun, while there is none in the whole population. This is a
# common consequence scenario where Z = Man picked as partner, X = How fun,
# Y = How handsome (see Fig. 6.3 of the lecture script).
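A quick simulation illustrates this collider effect (a sketch only; the uniform scores and the >= 10 dating rule are assumptions for illustration, not from the study):
library(data.table)
library(ggplot2)
set.seed(1)
men <- data.table(fun = runif(1e4, 0, 10), handsome = runif(1e4, 0, 10))
men[, dated := fun + handsome >= 10]     # assume women only date men with fun + handsome >= 10
men[, cor(fun, handsome)]                # whole population: close to 0
men[dated == TRUE, cor(fun, handsome)]   # dated subset: clearly negative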
The ministry of health of Datavizland observed a positive correlation between the liters of water drunk per day and sunburns. Which kind of causal diagram may best describe this scenario?
# This is a common cause scenario where Z=Sunny day,
#X=Liters of water drunk per day, Y=Sunburns
#(See Fig. 6.3 of the lecture script)
Which item is not chart junk?
1. A bright red plot border
2. Light grey major grid lines
3. Bold labels and grid lines
4. Data labels in Batik Gangster font
# Correct: 2. Bright colors and bold text draw attention away from the data and
# decrease the data-ink ratio. Standard fonts like Arial or Helvetica are preferred.
# Grid lines help the reader look up data point values,
# and in light grey they do not draw attention away from the data.
What are best practices when using color for data visualizations?
1. Avoid having too many colors for categorical data.
2. Use color only when it actually adds meaning to the plot.
3. Use divergent color scales for categorical data types.
# Correct are 1 and 2. Diverging color scales are meant for quantitative data
# with a meaningful midpoint, not for categorical data.
The concept of reverse causality states that whenever A causes B, B also causes A.
T/F?
# False. Reverse causality states that if A and B correlate,
# one might falsely interpret that A causes B even though in reality B causes A.
If A causes B and A causes C, then B also causes C.
# False. If A causes B and A causes C, B and C can correlate (see lecture).
# However, from the fact that B and C correlate, one cannot conclude
# that B causes C (correlation does not imply causation).
If A and B correlate and A happens before B, then A causes B.
# 3 - False. Counterexample: Every day, the rooster crows just before sunrise.
# Therefore, the rooster crowing and the sunrise correlate. However,
# even though the rooster crows just before sunrise, the rooster crowing
# does not cause the sun to rise.
Causation implies linear association.
# This statement is false. If A causes B, then there is an association
# between A and B. However, this association does not need to be linear;
# it could, for example, be exponential or quadratic.
Suggest an appropriate visualization and implement it with ggplot2 to display a possible association between coffee consumption and “datavizitis” disease risk, measured in deaths per 1000 individuals. Does this plot by itself seem consistent with a causal effect of coffee on datavizitis?
Investigate the full dataset. Do you see evidence for a third variable influencing the association? Support your statement with an appropriate plot. Draw a graph with the potential causal relationships you find consistent with the data. Relate it to one of the situations from the lecture script's figure 6.3 or Simpson's paradox.
# Taken by itself, the plot seems consistent with a causal effect of coffee on datavizitis.
ggplot(coffee_dt, aes(coffee_cups_per_day, datavizitis_risk)) +
  geom_boxplot() +
  labs(x = "Cups of coffee per day", y = "Deaths per 1,000")
# This is the way it looks for smoking:
ggplot(coffee_dt, aes(packs_cigarettes_per_day, datavizitis_risk)) +
  geom_boxplot() +
  labs(x = "Packs of cigarettes per day", y = "Deaths per 1,000")
# And this is the proper way to look at it:
# coffee effects are always the same within each smoking group.
ggplot(coffee_dt,
       aes(packs_cigarettes_per_day, datavizitis_risk, fill = coffee_cups_per_day)) +
  geom_boxplot() +
  labs(x = "Packs of cigarettes per day", y = "Deaths per 1,000") +
  guides(fill = guide_legend(title = "Cups of coffee"))
# But the effect of smoking is not the same within each coffee consumption group.
ggplot(coffee_dt,
       aes(coffee_cups_per_day, datavizitis_risk, fill = packs_cigarettes_per_day)) +
  geom_boxplot() +
  labs(x = "Cups of coffee per day", y = "Deaths per 1,000") +
  guides(fill = guide_legend(title = "Packs of cigarettes"))
Discuss in groups what could be better representations.
Simpson’s paradox.
Visualize the relationship between the number of cigarettes smoked per day and datavizitis severity among hospitalized individuals
-> highlight general trend
Visualize the relationship between datavizitis severity and cigarettes smoked per day across the whole population.
Visualize the same relationship distinguishing between hospitalized and all individuals (see the sketch below).
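A possible sketch of the three plots, assuming a table smoking_dt with columns cigarettes_per_day, datavizitis_severity and hospitalized (all names are assumptions based on the prompts):
# Among hospitalized individuals only, highlighting the general trend
ggplot(smoking_dt[hospitalized == TRUE], aes(cigarettes_per_day, datavizitis_severity)) +
  geom_point() + geom_smooth(method = "lm")
# Across the whole population
ggplot(smoking_dt, aes(cigarettes_per_day, datavizitis_severity)) +
  geom_point() + geom_smooth(method = "lm")
# Both, distinguishing hospitalized from all individuals
ggplot(smoking_dt, aes(cigarettes_per_day, datavizitis_severity, color = hospitalized)) +
  geom_point() + geom_smooth(method = "lm")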
Recent studies have looked at hospitalized patients who tested positive for Covid19 and their smoking status. They propose that smoking may provide a lower risk of developing severe Covid19, based on a negative association between Covid19 severity and smoking status. Considering the previous results on datavizitis, can you come up with a different explanation? Draw a graph with the potential causal relationships you find consistent with the data. Relate it to one of the situations from the lecture script's figure 6.3 or Simpson's paradox.
Does age associate with survival? Make a plot showing the distribution of age per survival outcome.
Visualize the relationship between passenger class and survival rate.
How is age distributed in each passenger class?
Considering the passenger class, do age and survival outcome associate? Given the findings on question 4, comment on the results. Draw a graph with the potential causal relationships you find consistent with the data. Relate it to one of the situations from the lecture script’s figure 6.3 or Simpson’s paradox.
ggplot(titanic, aes(x = factor(pclass), fill = factor(survived))) +
  geom_bar(position = "fill")
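For the age questions, a possible sketch (assuming the titanic table has columns age, survived and pclass, as the prompts suggest):
# Age distribution per survival outcome
ggplot(titanic, aes(factor(survived), age)) +
  geom_boxplot() + labs(x = "Survived", y = "Age")
# Age distribution per passenger class
ggplot(titanic, aes(factor(pclass), age)) +
  geom_boxplot() + labs(x = "Passenger class", y = "Age")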
The null hypothesis of a study states ‘Both genders of US Olympic competitors are equally likely to win gold medals at the Olympics’. What are possible one-tailed alternative hypotheses? More than one answer can be correct.
a) There are gender differences in US Olympic competitors in the number of gold medals won at the Olympics.
b) Female US Olympic competitors win more gold medals at the Olympics.
c) Male US Olympic competitors win more gold medals at the Olympics.
# Answers b and c are correct. Answer a provides a two-sided hypothesis.
2. The p-value of a certain hypothesis test is 0.007. What can the researcher conclude? More than one answer can be correct.
a) If the study is repeated 1000 times, 7 times will fail to produce a significant result.
b) We can be only 0.7% confident that the null hypothesis is true.
c) The effect size of the study is large.
d) Assuming the null hypothesis is true, the probability to make this or a more extreme observation is 0.7%.
# Correct: d. This is the definition of a p-value.
Let X be a vector that contains some collected data. Which of the following lines of code produces one sample of a case resampling bootstrap? More than one answer can be correct.
a) sample(X, size = length(X), replace = T)
b) sample(X, size = length(X), replace = F)
c) sample(X, size = length(X), replace = T, prob = rep(1/length(X), length(X)))
d) sample(X, size = length(X), replace = T, prob = rep(c(0.2, 0.5), each = length(X) / 2))
# Correct are a and c. Both sample with replacement from all observations with equal probability.
1. We just concluded that both markers 5211 and 5091 are significantly associated with growth. However, this could be confounded. A common source of confounding in genomics is due to "linkage", which describes the phenomenon of markers being inherited together. A biological explanation for linkage is provided here: https://www.khanacademy.org/science/biology/classical-genetics/chromosomal-basis-of-genetics/a/linkage-mapping
To investigate the issue of linkage in our dataset, test if marker 5091 significantly associates with marker 5211. Define a null hypothesis, a statistic and use permutation testing to answer the question. Strengthen your answer with a relevant plot.
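A minimal permutation-testing sketch (assuming genotype is a data.table with one row per strain and one column per marker; H0: the genotype at mrk_5091 is independent of the genotype at mrk_5211; statistic: the fraction of strains with matching genotypes):
obs <- genotype[, mean(mrk_5091 == mrk_5211)]                           # observed statistic
perm <- replicate(1000, genotype[, mean(sample(mrk_5091) == mrk_5211)]) # null distribution
p_val <- (sum(perm >= obs) + 1) / (length(perm) + 1)                    # one-sided permutation P-value
ggplot(data.table(stat = perm), aes(stat)) + geom_histogram() +
  geom_vline(xintercept = obs, color = "red")                           # observed vs. null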
Now, we would like to know if marker 5091 still associates with growth in maltose (YPMalt) when conditioned on marker 5211. Define a null hypothesis, a statistic and use permutation testing to answer the question. Strengthen your answer with a relevant plot.
Now, test if marker 5211 associates with growth in maltose when conditioned on marker 5091. Are the results the same? Discuss.
p_val_condition_on(test_mrk = "mrk_5211", condition_mrk = "mrk_5091")
Estimate 95% equi-tailed confidence intervals for the difference of the medians of growth in maltose for each genotype at marker mrk_5211. Use the case resampling bootstrap scheme and report bootstrap percentile intervals. Propose a visualization of the results. Try it also with markers 5091 and 1653.
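A case-resampling sketch (assuming a merged table maltose_dt with the maltose growth column YPMalt and the marker column mrk_5211; these names are assumptions based on the prompt):
diff_medians <- function(dt) {
  gts <- unique(dt$mrk_5211)  # the two genotypes at the marker
  dt[mrk_5211 == gts[1], median(YPMalt, na.rm = TRUE)] -
    dt[mrk_5211 == gts[2], median(YPMalt, na.rm = TRUE)]
}
# Resample whole cases (rows) with replacement and recompute the statistic
boot <- replicate(1000, diff_medians(maltose_dt[sample(.N, .N, replace = TRUE)]))
ci <- quantile(boot, c(0.025, 0.975))  # 95% equi-tailed bootstrap percentile interval
ggplot(data.table(stat = boot), aes(stat)) + geom_histogram() +
  geom_vline(xintercept = ci, linetype = "dashed")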
# (a) is wrong. p is not random and thus it either is in the interval or it is not.
# (b) and (c) are wrong for the same reason as (a).
# Moreover, we did not even specify a null hypothesis;
# one is not necessary to compute a confidence interval.
# (d) is correct. In fact, this is a restatement of the definition of a 95% confidence interval
# (a) is correct.
# (b) is wrong. alternative = "greater" asks whether the upper-left cell
# is significantly bigger than we expect, which is not the case here.
# (c) is correct.
Look at the following plot, where y = x^4. Indicate which, if any, of the following statements about correlation are correct:
You are a data science consultant helping researchers pick the right tests to evaluate their hypotheses. For each hypothesis, indicate which test from the ones you have seen in the lecture would be most appropriate:
A researcher collects data on the height (measured in cm) and weight (measured in g) of Germans. She hypothesizes that there is a significant association between how tall Germans are and how much they weigh. She would like to test this hypothesis without making any distributional assumptions.
Both variables are quantitative and we do not assume a normal distribution.
# Thus a Spearman correlation test is the right choice.
A researcher collects data on the weight (measured in g) of Bavarians before and after the Oktoberfest. She would like to know whether there is a significant difference in average weight after the Oktoberfest as compared to before it. Prior research indicates that the weight of Bavarians is approximately normally distributed.
Weight is quantitative whereas before/after the Oktoberfest is binary.
# As we assume normality, a t-test is appropriate; since the same individuals
# are measured before and after, a paired t-test is the most appropriate choice.
A researcher is evaluating a rapid antigen test. The company manufacturing the test claims that if someone is infected, the test will correctly return a positive result 99% of the time. The researcher hypothesizes that, in practice, the test is often improperly administered and therefore significantly less sensitive. She asks 1000 individuals, which have all been confirmed to be infected by a PCR test, to self-administer the antigen test. She records how often the antigen test correctly returns a positive result.
Here we have one binary variable (is the test right or wrong?).
# We use a binomial test. We would set p = 0.99 as our null hypothesis
# and use a one-sided test (alternative = "less").
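For instance (the observed count of 970 correct positives is invented for illustration):
binom.test(x = 970, n = 1000, p = 0.99, alternative = "less")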
The company manufacturing the test has collected a bigger dataset, comprising both infected and non-infected individuals. For each individual, they record two datapoints: the result of a PCR test (infected/not-infected), which is taken as ground truth, and the result of a self-administered antigen test (positive/negative). They would like to show that, even if self-administered, the test still gives some information about infection status and thus is better than nothing.
We want to know if there is a significant association between two binary variables.
# We use a Fisher test.
# It makes sense to restrict to alternative = "greater" here
# since we expect a positive association between test and infection-status
# But: what if the test showed a strong negative association with infection status,
# i.e. it is negative whenever a person is infected and vice versa?
# Could this still be a good test?
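A sketch with a hypothetical 2x2 contingency table (all counts invented for illustration):
tab <- matrix(c(90, 10,    # PCR infected: antigen positive / negative
                 5, 95),   # PCR not infected: antigen positive / negative
              nrow = 2, byrow = TRUE,
              dimnames = list(PCR = c("infected", "not infected"),
                              antigen = c("positive", "negative")))
fisher.test(tab, alternative = "greater")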
Related tests and their pitfalls (see lecture): Wilcoxon rank-sum test, t-test, Pearson's product-moment correlation, Spearman correlation.
cor_value <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = "spearman")
cor_value
In dataviz land, we want to know whether there is correlation between attendance to the exercise sessions and the points achieved in the final exam. We provide simulated data below. Load the data from exam_correlation.tsv. Calculate the correlation between attendance and points using Pearson and Spearman methods and visualize it. Some students will drop out of the distribution since they were planning to take the retake exam and skipped the first exam, thus obtaining a grade of zero. Which correlation method should be preferred in this context and why?
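A possible solution sketch (the column names attendance and points are assumptions based on the prompt):
library(data.table)
library(ggplot2)
exam_dt <- fread("exam_correlation.tsv")
exam_dt[, .(pearson = cor(attendance, points, method = "pearson"),
            spearman = cor(attendance, points, method = "spearman"))]
ggplot(exam_dt, aes(attendance, points)) + geom_point()
# Spearman should be preferred here: the dropouts with zero points are outliers
# that distort the Pearson correlation, while rank-based Spearman is robust to them.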
Consider the dataset mtcars. Which statistical test that we studied do you suggest to test the association between the variable cylinder > 4 and the variable gear > 3? Justify the choice of the test and provide the two-sided p-value rounded to two significant digits using signif(...,digits=2)
Assume that α = 0.05 is our threshold of significance. What did we show in part (1) of this exercise?
If (1) had asked us to “test if there is a positive association between the variable cylinder > 4 and the variable gear > 3”, how would our answer change? What do we conclude?
If (1) had additionally specified “do not make any assumption of normality”, how would our answer change?
## We showed that we can reject the null hypothesis
## of no association between these two variables.
## To test a positive association, we would use alternative = "greater".
## Normality is irrelevant for a Fisher test, so we don't need to change anything.
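A possible sketch for part (1): both variables are binary, so a Fisher test is appropriate:
tab <- table(mtcars$cyl > 4, mtcars$gear > 3)  # 2x2 contingency table
ft <- fisher.test(tab)                         # two-sided by default
signif(ft$p.value, digits = 2)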
Test the association between markers
The data follows a uniform distribution on the interval [0, 1] with outliers at 0
The data follows a uniform distribution on the interval [−1, 1]
The data follows a uniform distribution on the interval [0, 1]
The data follows a normal distribution with μ = 0.5 and σ = 0.5
Say we perform a Spearman correlation test for two variables. The test returns P < 10^-26 (i.e. values as or more extreme than the ones observed are astronomically unlikely under the null hypothesis). Indicate which, if any, of the following statements is correct:
a) The two variables necessarily show a strong positive correlation (i.e. ρ > 0.5)
b) The two variables necessarily show a strong negative correlation (i.e. ρ < -0.5)
c) The two variables necessarily are strongly correlated (i.e. |ρ| > 0.5)
d) Surely something very important has been discovered
# All statements are wrong.
# Given a large enough sample size, even a tiny correlation can be significant.
# Also, even if the correlation is huge, it might not be important:
# not breathing correlates very strongly with being dead, but this is trivial.
A simple yet effective strategy to control for multiple testing is to only reject the null hypothesis when P = 0, i.e. when we are sure of the association
# This is wrong and not effective: P-values should never be exactly zero,
# so we would never reject the null hypothesis.
Assume the null hypothesis is always true. If we do 200 tests, we will expect to have around 10 false positives when using α = 0.05 as our threshold of significance
# This is correct: with α = 0.05, we expect 200 × 0.05 = 10 false positives.
Assume the null hypothesis is always false. If we do 800 tests, we will expect to have around 80 false positives when using α = 0.1 as our threshold of significance
# This is wrong. If the null hypothesis is always false,
# we cannot falsely reject it, so there are no false positives.
Assume the null hypothesis is sometimes false. Then the P-values will follow a normal distribution
# This is wrong. The P-values will deviate from the uniform distribution,
# but they won't follow a normal distribution.
Assume we are doing 1000 tests. Indicate which, if any, of the following statements concerning Bonferroni and Benjamini-Hochberg are correct:
Assume we are using a permutation-based approach. If we use a Bonferroni correction, we will in general need to do more permutations to be able to reject the null than if we used a Benjamini-Hochberg correction.
# This is correct. The Bonferroni correction is generally more conservative,
# thus it demands lower P-values to reject,
# and lower P-values require more permutations to resolve.
If we let α = 0.01 and use a Bonferroni correction, then the probability of one or more false positives (falsely rejecting the null) will be less than 1%
# This is correct; that is the definition of controlling the family-wise error rate.
If we let α = 0.01 and use a Benjamini-Hochberg correction, then in expectation 1% of the tests we perform will reject the null
# This is wrong. Benjamini-Hochberg controls the false discovery rate: we expect
# that less than 1% of the positives (rejected tests) will be false positives.
2. Now ensure that, on average, less than 5% of the significant associations you find are false positives (i.e. control the FDR, e.g. with Benjamini-Hochberg).
Now ensure that the probability of having 1 or more false positives is less than 5% (i.e. control the FWER, e.g. with Bonferroni).
If all tests are truly under the null hypothesis, the distribution of the P-values should be uniform by definition.
Please plot the P-values for sample_size = 50 with the provided function. Discuss.
Correct for multiple testing
Adjust P-values with the different methods seen in the class. Plot the results using the plot function. Do they behave as expected? Discuss.
pvals0_B <- p.adjust(pvals0, method = "bonferroni")
plot_pval(pvals0_B, title = "\nBonferroni adj p-values")
## The higher the number of observations, the lower the P-value gets.
## This means tiny differences can be found significant
## if one has enough observations.
Mixture of H0 and H1 adjusted for multiple testing
Adjust the p-values with Benjamini-Hochberg (FDR) in the mixture from the previous question. Make a contingency table of true positives, true negatives, false positives and false negatives. Try this with different sample sizes for FDR = 0.05. Discuss.
Do the same thing for the Bonferroni correction and compare the results.
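A minimal sketch, assuming pvals holds the mixture of P-values and is_h1 is a hypothetical logical vector marking which tests were simulated under H1 (both names are assumptions):
padj_bh  <- p.adjust(pvals, method = "BH")
padj_bon <- p.adjust(pvals, method = "bonferroni")
table(rejected = padj_bh  < 0.05, truly_H1 = is_h1)  # TP/FP/TN/FN for Benjamini-Hochberg
table(rejected = padj_bon < 0.05, truly_H1 = is_h1)  # TP/FP/TN/FN for Bonferroni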
What can linear regression be used for?
1 - Make predictions for future, unseen, data
2 - Quantify explained variance
3 - Model linear relationship between variables
What are the implications of heteroscedasticity on linear regression? Check all that are true:
1 - The fit can be suboptimal because the least squares errors give too much importance to the points with high noise
2 - The statistical tests are flawed
3 - The coefficient estimates are biased
# Statements 1 and 2 are correct. Statement 2 holds because heteroscedasticity
# violates the i.i.d. assumption of the errors and therefore the statistical
# tests will be flawed. Statement 3 is false: the estimated coefficients still
# converge to the true ones.
Predict each student’s height, given their sex and their parents heights.
Check the plot of the residual vs the predicted values and the Q-Q plot of the residuals. Do these plots provide evidence against the assumptions of linear regression?
m <- heights[, lm(height ~ sex + mother + father)]
summary(m)
prediction = data.table(prediction = predict(m), residuals = residuals(m))
ggplot(prediction, aes(prediction, residuals)) + geom_point() + geom_hline(yintercept = 0)
ggplot(prediction, aes(sample = residuals)) + geom_qq() + geom_qq_line()
Run a linear model predicting the growth given the genotypes of both markers and interpret the result. Call this model full.
table <- merge(growth, genotype)
full <- table[, lm(growth_rate ~ mrk_5211 + mrk_5091)]
summary(full)
Create a reduced model that only depends on the genotype of mrk_5211. Then run ANOVA to compare the full and the reduced model. Suppose that all the assumptions of linear regression hold. What do you conclude?
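A possible sketch, reusing table and full from above:
reduced <- table[, lm(growth_rate ~ mrk_5211)]
anova(reduced, full)
# A significant P-value would mean that mrk_5091 explains additional variance
# beyond mrk_5211, i.e. the full model fits significantly better.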
1. Fit three linear models predicting Sepal.Width from Sepal.Length: the base model that simply predicts sepal width from sepal length, one where you use the species as a covariate in linear regression (i.e., different intercept for different species) and one where you use separate slopes and intercepts for different species by using the * operator in lm: lm(y ~ Sepal.Length * Species).
What are the slopes and intercepts of each one of the species for the model with separate slopes and intercepts?
Overlay the resulting fits on the plot above
Which one is best?
Use anova to test if the second model is a better model than the base and also if the third model is better than the second. Suppose all the assumptions of linear regression hold.
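A possible sketch using the built-in iris dataset:
m1 <- lm(Sepal.Width ~ Sepal.Length, data = iris)            # base model
m2 <- lm(Sepal.Width ~ Sepal.Length + Species, data = iris)  # per-species intercepts
m3 <- lm(Sepal.Width ~ Sepal.Length * Species, data = iris)  # per-species slopes and intercepts
anova(m1, m2)  # is the species-intercept model better than the base?
anova(m2, m3)  # are separate slopes better than a common slope?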
## Note: in both cases, using a more complex model allowed us to fit the data better.
1. Which of the following is true for logistic regression?
a) It assigns classes to the datapoints.
b) There is an analytical solution for the estimation of the parameters.
c) It predicts probabilities for each of the two classes.
# Correct: c. Logistic regression predicts class probabilities; assigning classes
# requires choosing a threshold, and the parameters are estimated iteratively
# (there is no analytical solution).
Suppose you are given a fair coin, p(heads) = 0.5. Which of the following are true about odds and log-odds of head?
a) The odds are 0, and the log-odds are 1.
b) The odds are 0.5, and the log-odds are approximately -0.693.
c) The odds are 1, and the log-odds are 0.
d) The odds are 1, and the log-odds are 1.
# Correct: c. The odds are 0.5/0.5 = 1, and the log-odds are log(1) = 0.
Let sigm() denote the sigmoid function. Which of the following statements are possible for some value of x or y?
a) sigm(x) = 10
b) sigm(10) = y
c) sigm(x) = - 1
d) sigm(x) = 0.5
# Correct: b) and d). The sigmoid only takes values in (0, 1),
# so sigm(x) = 10 and sigm(x) = -1 are impossible, while sigm(x) = 0.5 holds at x = 0.
We are tasked with fitting a logistic regression to detect possible bank frauds that will be further investigated by the bank. Bank frauds are rare but when they occur can be very costly for the bank. Moreover, dealing with false alarms by manual inspection is not too costly. Which of the following is preferable?
High recall is preferable: missed frauds are very costly,
# while false alarms are cheap to inspect manually.
How balanced are the classes of the diabetes dataset?
diabetes_dt[, .N, by=Outcome] # absolute numbers for each class
diabetes_dt[, .N/nrow(diabetes_dt), by=Outcome] # class proportions
Create an appropriate plot to visualize the relationship between the Outcome variable and the feature variables Glucose, BloodPressure and Insulin. What do you conclude from your visualization?
Fit a logistic regression model for predicting Outcome only based on the feature Glucose. Inspect the coefficients of the model’s predictors. According to the model, how much do the odds of getting diabetes increase upon increasing the blood glucose level by 1 mg/dL?
# Fit a logistic regression model for Glucose
logreg_1 <- glm(Outcome ~ Glucose, data = diabetes_dt, family = "binomial")
# First look at the model; per 1 mg/dL of Glucose, the odds multiply by exp(coef)
summary(logreg_1)
exp(coef(logreg_1)["Glucose"])
# Store the predicted scores and inspect their distribution per class
diabetes_dt[, preds_model1 := predict(logreg_1)]
ggplot(diabetes_dt, aes(preds_model1, fill = as.factor(Outcome))) + geom_histogram(position = "dodge")
Now, create a function for computing the confusion matrix based on the predicted scores of a model and the actual outcome. The function takes as input a threshold, a data table, the name of a scores column and the name of column with the actual labels. Then, use the implemented function for computing the confusion matrix of the model for the thresholds -1, 0 and 1. Are there any differences? What is the amount of false positives for the last cutoff? You can use the following definition of the function:
confusion_matrix <- function(dt, score_column, labels_column, threshold){ }
Use the implemented function to create a second function for this time computing the TPR and FPR for a certain threshold of a classification model given the predicted scores of a model and the actual outcome. What is the TPR and the FPR of the first model for the thresholds -1, 0 and 1? Plot these values in a scatter plot. Your function should take the same parameters as before and return a data table as follows:
confusion_matrix <- function(dt, score_column, labels_column, threshold){
  # The table() function is very useful for computing the confusion matrix.
  # We have to use get() to fetch a column by its name as a string.
  return(dt[, table(get(labels_column), get(score_column) > threshold)])
}
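A possible sketch of the TPR/FPR helper, building on confusion_matrix() above (assuming labels are coded 0/1, so the second table row is the positive class):
tpr_fpr <- function(dt, score_column, labels_column, threshold){
  cm <- confusion_matrix(dt, score_column, labels_column, threshold)
  # rows: actual labels (negative, positive); columns: predicted FALSE, TRUE
  data.table(threshold = threshold,
             tpr = cm[2, "TRUE"] / sum(cm[2, ]),   # TP / (TP + FN)
             fpr = cm[1, "TRUE"] / sum(cm[1, ]))   # FP / (FP + TN)
}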
Create two further logistic regression models as in section 1.3 for predicting Outcome. For one model, use only the feature variable BloodPressure for building the model. For the other model, use only the feature variable Insulin. Which models have a significant feature?
Collect the predictions of each model for all samples in the dataset. Store the scores of each model in a separate column of the original dataset. Visualize the distributions of the scores with an appropriate plot. Which type of distribution would you ideally expect?
# Fit two further models with different features
logreg_2 <- glm(Outcome~BloodPressure, data = diabetes_dt, family = "binomial")
logreg_3 <- glm(Outcome~Insulin, data = diabetes_dt, family = "binomial")
summary(logreg_2)  # check which features are significant
summary(logreg_3)
diabetes_dt[, preds_model1 := predict(logreg_1)]
diabetes_dt[, preds_model2 := predict(logreg_2)]
diabetes_dt[, preds_model3 := predict(logreg_3)]
diabetes_dt
For a systematic comparison of the previously built three models, plot a ROC curve for each model into a single plot using the function geom_roc from the library plotROC. Add the area under the curve (AUC) to the plot. Which is the best model according to the AUC?
Now, fit a logistic regression model with all feature variables (stored in feature_vars). Visualize the distribution of the predicted scores for positive and negative classes. What can you conclude from this visualization regarding the separation of the two classes by the model? Plot once again the previous ROC curves and include the ROC curve of the full model for comparison.
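A possible sketch for the ROC comparison of the three single-feature models with plotROC (calc_auc() extracts the AUCs from the ggplot object):
library(plotROC)
roc_dt <- melt(diabetes_dt[, .(Outcome, preds_model1, preds_model2, preds_model3)],
               id.vars = "Outcome", variable.name = "model", value.name = "score")
p <- ggplot(roc_dt, aes(d = Outcome, m = score, color = model)) + geom_roc()
calc_auc(p)  # AUC per model; the model with the highest AUC wins
p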
a) The model overfits the data.
b) The model underfits the data.
c) The model fits the data appropriately.
# Correct: a) The model overfits the data.
We train a binary classification model that predicts for a given position (x, y) if a hole made by a fork in cake dough exists. Why does using cross-validation fail here?
a) Cross-validation only works for one-dimensional data points.
b) The position of each hole depends on the position of other holes.
c) There are not enough data points for using cross validation.
d) Cake dough (data) can only be eaten, not cross-validated.
# Correct: b). The holes are not independent,
# which violates the i.i.d. assumption behind cross-validation.
A random forest consists of many decision trees that are trained on the same data set. How is randomness used to prevent over fitting?
1. Randomness is introduced by using different subsets of the original data set for training each decision tree.
2. Randomness is introduced by using different feature sets of the original data set for training each decision tree.
3. Randomness is introduced by CPU concurrency - parallel threads allow for tiny differences in feature importance.
# Correct: 1 and 2 (bagging and random feature selection). Statement 3 is nonsense.
A supervised learning algorithm does not require data labeled with an outcome.
T/F
# a) False. Per definition, supervised learning requires features and an outcome.
When performing supervised training on two classes, the train set should only contain samples from one of these two classes. The test set should then only contain the class not used in the train set.
False. The train and test set then don't contain samples from the same distribution.
Models trained using cross validation do not over-fit.
False. Cross-validation does not necessarily prevent overfitting.
A regression model that fits the train set perfectly, i.e. the train error is practically 0, can be used without further caution to collect predictions on samples from any other dataset.
False. The model might be overfit on the train set.
Build a decision tree using the rpart function from the library rpart for predicting the Outcome given all feature variables. Use the following command for this:
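The command itself is not shown here; a minimal call consistent with the predict() line below might be (formula and settings are assumptions):
library(rpart)
dt_classifier <- rpart(reformulate(feature_vars, response = "Outcome"),
                       data = diabetes_dt, method = "class")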
Plot a ROC curve for the decision tree using the function geom_roc from the library plotROC. What do you conclude about the performance of the decision tree based on this plot?
# Save predictions
diabetes_dt[, preds_dt := predict(dt_classifier, type = "prob")[, 2]]
Build a second decision tree model this time using a train-test split strategy. This means that you will use 70% of the data for training and 30% of the data for testing. Plot the ROC curves for the performance on the training and on the test dataset. What do you conclude from this?
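A possible train-test split sketch (70/30 as stated; the seed value is arbitrary):
set.seed(13)
train_idx <- sample(nrow(diabetes_dt), size = 0.7 * nrow(diabetes_dt))
train_dt <- diabetes_dt[train_idx]
test_dt  <- diabetes_dt[-train_idx]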
In the lecture we learned that random forests are more robust to overfitting. Build a random forest using the randomForest function from the library randomForest for predicting the Outcome given all feature variables using the same train-test split strategy from before. Set the following values for the following hyper-parameters:
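The hyper-parameter values are not listed here; a generic call with purely illustrative values might look like:
library(randomForest)
rf <- randomForest(x = train_dt[, ..feature_vars],
                   y = as.factor(train_dt$Outcome),
                   ntree = 200, mtry = 3, nodesize = 10)  # illustrative values only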
Implement a 5-fold cross-validation on the diabetes dataset for building a logistic regression model using all feature variables. Obtain 5-fold cross-validated sensitivity, specificity and AUC using the caret package.
What is the fold with the highest AUC?
Create a box plot displaying the specificity and sensitivity of the logistic regression model over all folds. Add the individual points to the box plot.
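A possible sketch with caret that produces the metrics_dt used below (note: twoClassSummary requires class labels that are valid R names, hence the relabeling; feature_vars as in the earlier sections):
library(caret)
diabetes_dt[, Outcome_f := factor(Outcome, levels = c(0, 1), labels = c("neg", "pos"))]
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(13)
fit <- train(reformulate(feature_vars, response = "Outcome_f"), data = diabetes_dt,
             method = "glm", family = "binomial", trControl = ctrl, metric = "ROC")
metrics_dt <- as.data.table(fit$resample)  # columns: ROC (= AUC), Sens, Spec, Resample
metrics_dt[which.max(ROC)]                 # fold with the highest AUC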
metrics_dt_melt <- melt(metrics_dt, id.vars = "Resample", variable.name = "metric")
metrics_dt_melt <- metrics_dt_melt[metric=="Sens" | metric == "Spec"]
ggplot(metrics_dt_melt,aes(x=metric, y = value)) + geom_boxplot() + geom_jitter()
Try changing the hyper-parameters of the random forest selected in Section 01 with the aim of achieving a better performance on both train and test sets evaluated with the same ROC curve as before. Two possible approaches for searching for optimal hyper-parameters are random and grid search. A short description of these approaches can be found here: https://web.archive.org/web/20160701182750/http://blog.dato.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning
The aim of this section is to investigate the importance of each feature variable of the diabetes dataset for predicting the Outcome variable. Here we define the feature importance of a variable as the difference between the AUC of the full model and the AUC of a model trained without this variable. Compute the feature importance of every feature variable and visualize the computed quantities with a suitable plot.
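A possible sketch of this leave-one-feature-out importance (auc_for() is a hypothetical helper; it reuses plotROC's calc_auc() from above):
auc_for <- function(features) {
  fit <- glm(reformulate(features, response = "Outcome"),
             data = diabetes_dt, family = "binomial")
  p <- ggplot(data.table(d = diabetes_dt$Outcome, m = predict(fit)),
              aes(d = d, m = m)) + geom_roc()
  calc_auc(p)$AUC
}
full_auc <- auc_for(feature_vars)
importance <- sapply(feature_vars, function(v) full_auc - auc_for(setdiff(feature_vars, v)))
imp_dt <- data.table(feature = names(importance), auc_drop = importance)
ggplot(imp_dt, aes(reorder(feature, auc_drop), auc_drop)) +
  geom_col() + coord_flip() + labs(x = "Feature", y = "AUC drop when removed")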