When do we use a line plot for visualizing data?
To show a connection between a series of individual data points
To show a correlation between two quantitative variables
# Not correct. We use scatter plots to show correlations between two quantitative variables.
To highlight individual quantitative values per category
# Not correct. We use bar charts for this.
To compare distributions of quantitative values across categories
# Not correct. We use boxplots or violin plots to compare distributions across categories.
What’s the result of the following command?
ggplot(data = mpg)
A blank figure is produced: neither aesthetics nor layers have been specified, so there is nothing to draw.
What’s the result of the following command?
ggplot(data = mpg, aes(x = hwy, y = cty))
A blank figure with axes is produced: the aesthetics map hwy and cty to x and y, but no geom layer draws the data.
For which type of data will boxplots produce meaningful visualizations?
a. For discrete data.
b. For bi-modal distributions.
c. For non-Gaussian, symmetric data.
d. For exponentially distributed data.
# Answers C and D are correct.
# Answer B is incorrect: boxplots are not suited to bimodal data because they show only the median and quartiles, not the two modes.
# Boxplots work for both symmetric and non-symmetric data, since the quartiles need not be symmetric around the median.
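The reasoning behind answer B can be checked numerically. The following is a minimal sketch in base R, assuming a simulated bimodal sample (the two normal components and the seed are illustrative choices, not part of the exercise). It computes the five-number summary that a boxplot draws and compares it with simple bin counts.

```r
# Simulated bimodal sample (assumption: two well-separated normal components)
set.seed(1)
x <- c(rnorm(500, mean = 0, sd = 0.5), rnorm(500, mean = 6, sd = 0.5))

# The five numbers a boxplot is built from
box_stats <- fivenum(x)
names(box_stats) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(box_stats)

# Share of observations within 1 unit of the median
frac_near_median <- mean(abs(x - median(x)) < 1)
print(frac_near_median)

# Bin counts (what a histogram would show) reveal the two modes
counts <- table(cut(x, breaks = seq(-3, 9, by = 1)))
print(counts)
```

In this sketch the median falls between the two modes, where hardly any observations lie, while the bin counts show two separate peaks.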
Observe the following plot and select the correct answer.
# The correct answer is B.
# To make the plot more readable, use a log scale on the x-axis.
ggplot(gm_dt, aes(pop, lifeExp, color = continent)) + geom_point() + scale_x_log10()
boxplot
shows distribution and quantiles, especially useful when comparing uni-modal distributions.
bar chart
highlights individual values, supports comparison across categories, and can show rankings, deviations, and totals.
line chart
shows overall changes and patterns, usually over intervals of time
scatterplot
shows relationship between two continuous variables.
data(mpg)
mpg <- as.data.table(mpg)
ggplot(mpg, aes(cty, hwy, color = factor(year))) + geom_point() + geom_smooth(method = 'lm')
How are the lengths and widths of sepals and petals distributed? Make one plot of the distributions with multiple facets. Hint: You will need to reshape your data so that the different measurements (petal length, sepal length, etc.) are in one column and the values in another. Remember which is the best plot for visualizing distributions.
Visualize the lengths and widths of the sepals and petals from the iris data with boxplots.
Add individual data points as dots on the boxplots to visualize all points. Discuss: in this case, why is it not good to visualize the data with boxplots?
# Solution
iris_melt <- melt(iris, id.vars = c("Species"))
iris_melt %>%
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~variable)
ggplot(iris_melt, aes(variable, value)) + geom_boxplot()
# petal distributions are bimodal, boxplot cannot visualize this property.
p <- ggplot(iris_melt, aes(variable, value)) + geom_boxplot(outlier.shape = NA)
p + geom_jitter(width = 0.3, size = .5)
p + geom_dotplot(binaxis="y", stackdir="center", dotsize=0.3)
Vary the number of bins in the created histogram.
## With very few bins, we cannot show the bimodal distribution correctly.
iris_melt %>% ggplot(aes(value)) + geom_histogram(bins=5) + facet_wrap(~variable)
## With too many bins, the plot looks spiky
iris_melt %>% ggplot(aes(value)) + geom_histogram(bins=100) + facet_wrap(~variable)
Violin plots (geom_violin()) are an alternative to boxplots. Try combining a boxplot with a violin plot to show the lengths and widths of the sepals and petals from the iris data.
Which pattern shows up when moving from a boxplot to a violin plot? Investigate the dataset to explain this pattern and provide a visualization.
ggplot(iris_melt, aes(variable, value)) + geom_violin() + geom_boxplot(width=0.03)
# Overlay boxplot to visualize median and IQR.
# We see that petal length and petal width are bimodal.
# As the iris data set has 3 species, the different modes belong to the different species, so we can color the dots by Species.
ggplot(iris_melt, aes(variable, value, color = Species)) +
  geom_dotplot(binaxis = "y", stackdir = "centerwhole", dotsize = 0.3)
Are there any relationships/correlations between petal length and width? How would you visually show it?
Do petal lengths and widths correlate in every species? Show this with a plot.
# Yes, they correlate. We use a scatter plot for showing this:
ggplot(iris,aes(Petal.Length,Petal.Width)) + geom_point()
# They correlate in every species. Add color or facets with respect to `Species`.
## With coloring
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point() +
  labs(x = "Petal Length", y = "Petal Width", title = "Relationship between petal length and width") +
  theme(plot.title = element_text(hjust = 0.5))
## With facets
ggplot(iris, aes(Petal.Length, Petal.Width)) +
  geom_point() +
  facet_wrap(~Species, scales = 'free')
# scales = 'free' lets the axes of each panel adapt to its own data.
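The visual impression from the scatter plots can be backed up with numbers. The following is a small sketch in base R, assuming only the built-in iris data frame (the variable names `overall` and `by_species` are illustrative). It computes the Pearson correlation between petal length and petal width for the pooled data and separately within each species.

```r
# Pooled correlation over all 150 flowers
overall <- cor(iris$Petal.Length, iris$Petal.Width)

# Correlation within each species
by_species <- sapply(split(iris, iris$Species),
                     function(d) cor(d$Petal.Length, d$Petal.Width))

print(round(overall, 2))
print(round(by_species, 2))
```

The within-species correlations are positive but weaker than the pooled one, because part of the pooled correlation comes from differences between the species.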
log scaling
# Problem:
# There are two countries with much larger populations than the rest.
# This 'distorts' the plot somewhat, in that many of the remaining points are bunched together.
# Solution: log scaling
ggplot(medals_dt, aes(population, total)) + geom_point() + scale_x_log10() + scale_y_log10()
Add the country labels to the points in the scatter plot
# Overlapping labels
ggplot(medals_dt, aes(population, total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  geom_text(aes(label = code))
# Non-overlapping labels with ggrepel
library(ggrepel)
ggplot(medals_dt, aes(population, total)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  geom_text_repel(aes(label = code))
Compute the mean and standard deviation of each variable for each group
For each dataset, what is the Pearson correlation between x and y? Hint: see cor() and the Wikipedia article on the Pearson correlation.
Judging only by these statistics, we could conclude that all 4 datasets contain the same data. Now, plot x and y for each dataset and discuss.
# Use the functions `mean()` and `sd()` and create new columns
anscombe_reshaped[, .(x_mean = mean(x), y_mean = mean(y), x_sd = sd(x), y_sd = sd(y)),
                  by = "group"]
# Group by `group` and use the function `cor()`
anscombe_reshaped[, .(correlation = cor(x, y)), by = 'group']
# It’s always important to plot the raw data!
# Different distributions can have the same mean and sd.
ggplot(anscombe_reshaped, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)
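The cards use `anscombe_reshaped` without showing how it is built. The following is a minimal sketch in base R, assuming the built-in `anscombe` data frame with columns x1 to x4 and y1 to y4. The name `anscombe_long` and the use of `aggregate()` instead of data.table are choices made for this illustration only.

```r
# Stack the four (x_i, y_i) pairs into long format with a group column
anscombe_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(group = i,
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

# Mean and standard deviation of x and y per group
summary_stats <- aggregate(cbind(x, y) ~ group, data = anscombe_long,
                           FUN = function(v) c(mean = mean(v), sd = sd(v)))
print(summary_stats)

# Pearson correlation between x and y per group
correlations <- sapply(split(anscombe_long, anscombe_long$group),
                       function(d) cor(d$x, d$y))
print(round(correlations, 3))
```

The four groups have nearly identical summary statistics and correlations, which is why the scatter plots are needed to tell them apart.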
Structure of a boxplot
What could make k-means clustering fail?
k-means can fail when the input variables are on very different scales: the Euclidean distances, and therefore the cluster means, are then dominated by the variables with the largest ranges.
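A common remedy is to standardize the variables before clustering. The following is a minimal sketch in base R, assuming the iris measurements and an artificial change of units (one column multiplied by 1000, a purely illustrative choice) to create very different scales.

```r
X <- as.matrix(iris[, 1:4])

# Assumption for illustration: express one variable in much smaller units
X_mixed <- X
X_mixed[, "Sepal.Width"] <- X_mixed[, "Sepal.Width"] * 1000
print(apply(X_mixed, 2, sd))   # one column now dominates the distances

# Standardize: each column gets mean 0 and standard deviation 1
X_scaled <- scale(X_mixed)
print(apply(X_scaled, 2, sd))

set.seed(42)
km_mixed  <- kmeans(X_mixed,  centers = 3, nstart = 20)
km_scaled <- kmeans(X_scaled, centers = 3, nstart = 20)

# Compare both clusterings against the species labels
print(table(unscaled = km_mixed$cluster, species = iris$Species))
print(table(scaled = km_scaled$cluster, species = iris$Species))
```

Standardizing removes the dependence on units: `scale(X_mixed)` gives the same matrix as `scale(X)`, so the artificial rescaling no longer affects the clustering.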
$\frac{0 + 4}{(5 \cdot 4) / 2} = \frac{4}{10} = 0.4$
Which of the following vectors corresponds to PC1?
# Both 1 and 3 can be correct. 2 would correspond to PC2.
# A is not possible because the maximum variance that can be explained is 100%
# B is not possible because by definition PC1 will always explain the most variance
# C is in principle possible but since our data is 4-dimensional, we will have 4 principal components and it is very unlikely that PC3 and PC4 do not capture any variance
# D is possible
We are interested in the correlations between genes. Plot the pairwise correlations of the variables in the dataset. Which pair of genes has the highest correlation?
# Plot pairwise correlations
ggcorr(expr[, -"tumor_type"])
Visualize the raw data in a heatmap with pheatmap
Does the latter plot suggest some outliers? Could they have affected the correlations? Check by using an appropriate plot the impact of these outliers on the correlations in question 1. Substitute them with missing values (NA) and redo the previous questions 1 and 2.
pheatmap does not work well with data.tables; you should therefore convert the data to a matrix with as.matrix() before plotting.
expr_mat <- as.matrix(expr[, -"tumor_type"])
rownames(expr_mat) <- expr[, tumor_type]
pheatmap(expr_mat, cluster_rows = F, cluster_cols = F)
# We can see that all the values are now in a similar range.
Consider the full iris data set without the Species column for clustering. Create a pretty heatmap with the library pheatmap of the data without clustering.
Now, create a pretty heatmap using complete linkage clustering of the rows of the data set.
pheatmap(plot.data, show_rownames=F, scale='column', clustering_method = "complete")
Obtain the dendrogram of the row clustering using complete linkage clustering and partition the data into 3 clusters.
Annotate the rows of the heatmap with the Species column of the iris dataset and the three clusters from complete linkage clustering. What do you observe when you compare the clustering and the species labels?
## pheatmap() returns an object with dendrograms
h_complete <- pheatmap(plot.data, show_rownames=F, scale='column',
clustering_method = "complete", silent=T)
# silent=T prevents the heatmap from being displayed again
complete <- cutree(h_complete$tree_row, k = 3)
## label the row names to be able to annotate rows
rownames(plot.data) <- 1:nrow(plot.data)
## create a data.frame for the row annotations
row.ann <- data.table(Species = iris$Species)
row.ann[, complete := factor(complete)]
# the clusters need to be factors
## plot the heatmap with complete linkage clustering
pheatmap(plot.data, annotation_row = row.ann, show_rownames=F, scale='column',
clustering_method = "complete")
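The same row clustering can be sketched without pheatmap. The following base R example assumes Euclidean distances on column-scaled data, which is meant to mirror the `scale='column'` setting used above. It is an approximation for illustration, not a statement about pheatmap's internals, and the names `hc` and `clusters` are illustrative.

```r
# Column-scale the four measurements
X <- scale(as.matrix(iris[, 1:4]))

# Complete linkage hierarchical clustering on Euclidean distances
hc <- hclust(dist(X), method = "complete")

# Cut the dendrogram into 3 clusters
clusters <- cutree(hc, k = 3)

# Cross-tabulate clusters against the species labels
print(table(cluster = clusters, species = iris$Species))
```

The cross-table shows how well the three clusters line up with the three species, which is the same comparison the annotated heatmap supports visually.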
Compute the Rand index between the two following clustering results from two different clustering algorithms.
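The Rand index is the fraction of point pairs on which two clusterings agree, meaning pairs grouped together in both or separated in both. The following is a minimal sketch in base R. The function name `rand_index` and the two label vectors are hypothetical, since the clusterings from the card are not reproduced here. They were chosen so that no pair is together in both clusterings and four pairs are separated in both.

```r
# Rand index: share of pairs treated the same way by both clusterings
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  pairs <- combn(length(a), 2)                # all n * (n - 1) / 2 pairs
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]    # pair together in clustering a?
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]    # pair together in clustering b?
  mean(same_a == same_b)                      # agreements / number of pairs
}

# Hypothetical labels for 5 points
clustering_1 <- c(1, 1, 1, 2, 2)
clustering_2 <- c(1, 2, 3, 1, 2)

print(rand_index(clustering_1, clustering_2))  # (0 + 4) / 10
```

With 5 points there are (5 * 4) / 2 = 10 pairs, so 0 + 4 agreements give a Rand index of 0.4.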
This plot represents a random initialization of a k-means algorithm with k=2. X1, X2 are the randomly positioned centroids and A to E are the points of the 2-dimensional dataset. Calculate the new positions of the centroids after the first iteration using the Euclidean distance.
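One k-means iteration consists of assigning each point to its nearest centroid and then moving each centroid to the mean of its assigned points. The following is a minimal sketch in base R. All coordinates are hypothetical, since the plot from the card is not reproduced here.

```r
# Hypothetical coordinates for the points A to E and the centroids X1, X2
pts <- rbind(A = c(1, 1), B = c(2, 1), C = c(4, 3), D = c(5, 4), E = c(1, 2))
centroids <- rbind(X1 = c(1, 0), X2 = c(5, 5))

# Euclidean distance from every point (rows) to every centroid (columns)
dists <- sapply(rownames(centroids), function(k) {
  sqrt(rowSums(sweep(pts, 2, centroids[k, ])^2))
})

# Assignment step: nearest centroid for each point
assignment <- colnames(dists)[apply(dists, 1, which.min)]
names(assignment) <- rownames(pts)
print(assignment)

# Update step: each centroid moves to the mean of its assigned points
new_centroids <- t(sapply(rownames(centroids), function(k) {
  colMeans(pts[assignment == k, , drop = FALSE])
}))
print(new_centroids)
```

With these coordinates, A, B and E are assigned to X1 and C and D to X2, so X1 moves to (4/3, 4/3) and X2 to (4.5, 3.5). The sketch does not handle a centroid that ends up with no assigned points.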
Perform k-means clustering on the iris data set with k = 3.
Create a pretty heatmap using complete clustering of the rows annotated with the species and both clustering results - complete linkage clustering and the k-means clustering. What do you observe when you compare the two different clustering algorithms and the species labels?
Compute the pairwise Rand indices between the clustering results from the previous sections (complete, average and k-means) and species label.
Visualize the pairwise Rand indices with a pretty heatmap. What is the best clustering in this scenario according to the computed Rand indices?
pheatmap(rand, cluster_cols = F, cluster_rows = F)
Let X be the iris data set without the Species column and only for the species setosa. Perform PCA on X. Make sure that you scale and center the data before performing PCA.
Which proportion of the variance is explained by each principal component?
Compute the projection of X from the PCA result and plot the projection on the first two principal components.
# we can look at the explained variance with the summary on the results
summary(pca)
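The cards call summary(pca) without showing how `pca` was created. The following is a minimal sketch in base R, assuming `prcomp()` on the built-in iris data frame. The names `pca` and `proj` follow the cards, while `prop_var` is an illustrative name.

```r
# X: setosa flowers only, without the Species column
X <- iris[iris$Species == "setosa", 1:4]

# PCA with centering and scaling
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(prop_var, 3))

# Projection of X onto the principal components (one column per PC)
proj <- pca$x
print(head(proj[, c("PC1", "PC2")]))
```

The values in `prop_var` correspond to the "Proportion of Variance" row printed by summary(pca). The first two columns of `proj` can then be plotted against each other, for example with ggplot(as.data.frame(proj), aes(PC1, PC2)) + geom_point().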
Plot the first principal component against the other variables in the dataset and discuss whether this supports your previously stated interpretation.
pc_iris <- cbind(iris_dt[Species == "setosa"], proj)
pc_iris <- melt(pc_iris,
                id.vars = c("Species", 'PC1', 'PC2', 'PC3', 'PC4'))
ggplot(pc_iris, aes(value, PC1)) + geom_point() + facet_wrap(~variable, scales = 'free')