Buffl

DataVis

by Lea H.

When do we use a line plot for visualizing data?

To show a connection between a series of individual data points

To show a correlation between two quantitative variables

is not correct. We use scatter plots to show correlations between two quantitative variables.

To highlight individual quantitative values per category

is not correct. We use bar charts for this.

To compare distributions of quantitative values across categories

We use boxplots or violin plots to compare distributions across categories.

What’s the result of the following command?

ggplot(data = mpg)

a blank figure will be produced

What’s the result of the following command?

ggplot(data = mpg, aes(x = hwy, y = cty))

a blank figure with axes will be produced
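Both answers follow from the fact that `ggplot()` only registers the data and the aesthetic mappings; nothing is drawn until a geom layer is added. The following is a minimal sketch to check this, assuming ggplot2 is installed (the `mpg` data set ships with it). The object names `p_blank`, `p_axes` and `p_points` are only illustrative.

```r
library(ggplot2)

# Only data: no mappings, no layers, so the figure is blank
p_blank <- ggplot(data = mpg)

# Data plus mappings: axes can be drawn, but there is still no layer
p_axes <- ggplot(data = mpg, aes(x = hwy, y = cty))

# Count the layers attached to each plot object
length(p_blank$layers)
length(p_axes$layers)

# Adding a geom attaches the first layer, and points appear
p_points <- p_axes + geom_point()
length(p_points$layers)
```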

For which type of data will boxplots produce meaningful visualizations?

a. For discrete data.

b. For bi-modal distributions.

c. For non-Gaussian, symmetric data.

d. For exponentially distributed data.

# Answers C and D are correct

# Answer B is incorrect: boxplots are not suitable for bimodal data, since they show only the median and not the two modes.

# Boxplots work for both symmetric and non-symmetric data, since the quartiles need not be symmetric around the median.

Observe the following plot and select the correct answer.

# The correct answer is B.

# To make the plot more readable, use a log scale on the x-axis.

ggplot(gm_dt, aes(pop, lifeExp, color=continent)) + geom_point() + scale_x_log10()

boxplot

shows distribution and quantiles, especially useful when comparing uni-modal distributions.

bar chart

highlights individual values, supports comparisons, and can show rankings or deviations across categories and totals

line chart

shows overall changes and patterns, usually over intervals of time

scatterplot

shows relationship between two continuous variables.

library(ggplot2)

library(data.table)

data(mpg)

mpg <- as.data.table(mpg)

ggplot(mpg, aes(cty, hwy, color=factor(year))) + geom_point() + geom_smooth(method='lm')

How are the lengths and widths of sepals and petals distributed? Make one plot of the distributions with multiple facets. Hint: You will need to reshape your data so that the different measurements (petal length, sepal length, etc.) are in one column and the values in another. Remember which is the best plot for visualizing distributions.

Visualize the lengths and widths of the sepals and petals from the iris data with boxplots.

Add individual data points as dots on the boxplots to visualize all points. Discuss: in this case, why is it not good to visualize the data with boxplots?

# Solution

# data.table's melt() expects a data.table, so convert iris first

iris_melt <- melt(as.data.table(iris), id.vars = "Species")

iris_melt %>%

ggplot(aes(value)) +

geom_histogram() +

facet_wrap(~variable)

ggplot(iris_melt, aes(variable, value)) + geom_boxplot()

# petal distributions are bimodal, boxplot cannot visualize this property.

p <- ggplot(iris_melt, aes(variable, value)) + geom_boxplot(outlier.shape = NA)

p + geom_jitter(width = 0.3, size = .5)

p + geom_dotplot(binaxis="y", stackdir="center", dotsize=0.3)

Vary the number of bins in the created histogram.

## With very few bins, we cannot show the bimodal distribution correctly.

iris_melt %>% ggplot(aes(value)) + geom_histogram(bins=5) + facet_wrap(~variable)

## With too many bins, the plot looks spiky

iris_melt %>% ggplot(aes(value)) + geom_histogram(bins=100) + facet_wrap(~variable)

Alternatives to boxplots are violin plots (geom_violin()). Try combining a boxplot with a violin plot to show the lengths and widths of the sepals and petals from the iris data.

Which pattern shows up when moving from a boxplot to a violin plot? Investigate the dataset to explain this pattern and support your explanation with a visualization.

ggplot(iris_melt, aes(variable, value)) + geom_violin() + geom_boxplot(width=0.03)

# Overlay boxplot to visualize median and IQR.

# We see that petal length and petal width are bimodal.

# As the iris data set has 3 species, the different modes belong to the different species, so we can color the dots by Species.

ggplot(iris_melt, aes(variable, value, color = Species)) +

geom_dotplot(binaxis="y", stackdir="centerwhole", dotsize=0.3)

Are there any relationships/correlations between petal length and width? How would you visually show it?

Do petal lengths and widths correlate in every species? Show this with a plot.

# Yes, they correlate. We use a scatter plot for showing this:

ggplot(iris,aes(Petal.Length,Petal.Width)) + geom_point()

# They correlate in every species. Add color or facets with respect to `Species`.

## With coloring

ggplot(iris,aes(Petal.Length,Petal.Width, color=Species)) +

geom_point() +

labs(x = "Petal Length", y = "Petal Width", title = "Relationship between petal length and width") +

theme(plot.title = element_text(hjust=0.5))

# With facets

ggplot(iris,aes(Petal.Length,Petal.Width)) +

geom_point() +

facet_wrap(~Species, scales = 'free')

# scales = 'free' lets the axes of each panel adapt to its own data.

log scaling

# Problem: two countries have much larger populations than the rest.

# This distorts the plot, because most of the remaining points are bunched together.

# Solution: log scaling

ggplot(medals_dt, aes(population, total)) + geom_point() + scale_x_log10() + scale_y_log10()

Add the country labels to the points in the scatter plot

# Overlapping labels

ggplot(medals_dt, aes(population, total)) +

geom_point() +

scale_x_log10() +

scale_y_log10() +

geom_text(aes(label=code))

# Non-overlapping labels with ggrepel

library(ggrepel)

ggplot(medals_dt, aes(population, total)) +

geom_point() +

scale_x_log10() +

scale_y_log10() +

geom_text_repel(aes(label=code))

Compute the mean and standard deviation of each variable for each group

For each dataset, what is the Pearson correlation between x and y? Hint: see cor() and the Wikipedia article on the Pearson correlation.

Based on the summary statistics alone, we might conclude that all 4 datasets contain the same data. Now, plot x and y for each dataset and discuss.

# Use the functions `mean()` and `sd()` and create new columns

anscombe_reshaped[, .(x_mean = mean(x), y_mean = mean(y), x_sd = sd(x), y_sd = sd(y)),

by = "group"]

# Group by `group` and use the function `cor()`

anscombe_reshaped[, .(correlation = cor(x, y)), by = 'group']

# It’s always important to plot the raw data!

# Different distributions can have the same mean and sd.

ggplot(anscombe_reshaped, aes(x, y)) +

geom_point() +

facet_wrap(~ group)
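The solution above uses `anscombe_reshaped` without showing how it is built. Here is one possible construction, offered as a sketch rather than as the course's own code: it assumes the table comes from R's built-in `anscombe` data set (columns x1..x4, y1..y4) and reshapes it with data.table's `melt()` and `patterns()`. The name `stats` is only illustrative.

```r
library(data.table)

# Assumption: start from the built-in wide table with columns x1..x4 and y1..y4
anscombe_dt <- as.data.table(datasets::anscombe)

# Reshape to long format: one row per observation with columns group, x, y
anscombe_reshaped <- melt(
  anscombe_dt,
  measure.vars = patterns("^x", "^y"),
  value.name = c("x", "y"),
  variable.name = "group"
)

# Summary statistics and Pearson correlation per group
stats <- anscombe_reshaped[, .(
  x_mean = mean(x), y_mean = mean(y),
  x_sd = sd(x), y_sd = sd(y),
  correlation = cor(x, y)
), by = group]

print(stats)
```

The four rows of `stats` should be nearly identical, which is exactly why the scatter plots are needed to tell the datasets apart.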

Structure of a boxplot

What could make k-means clustering fail?

k-means can fail when the input variables are on very different scales: the variables with the largest scale dominate the Euclidean distances and therefore skew the cluster means.
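The effect of scale can be illustrated with a small, entirely hypothetical data set (the values below are invented for this sketch). Two variables, `a1` and `a2`, carry the group structure, while `b` is unrelated to the groups but has a much larger range. The helper `same_partition()` is also just an illustrative name.

```r
# Hypothetical toy data: a1 and a2 separate the two groups,
# b is unrelated to the groups but measured on a much larger scale
toy <- data.frame(
  a1 = c(0, 0, 0, 1, 1, 1),
  a2 = c(0.1, 0, 0.1, 1, 0.9, 1),
  b  = c(100, 500, 900, 120, 480, 910)
)
truth <- c(1, 1, 1, 2, 2, 2)

# TRUE if two label vectors describe the same partition
# (cluster labels are allowed to be permuted)
same_partition <- function(u, v) {
  all(outer(u, u, "==") == outer(v, v, "=="))
}

set.seed(1)

# Unscaled input: distances are dominated by b
km_raw <- kmeans(toy, centers = 2, nstart = 25)

# Centered and scaled input: all variables contribute comparably
km_scaled <- kmeans(scale(toy), centers = 2, nstart = 25)

same_partition(truth, km_raw$cluster)
same_partition(truth, km_scaled$cluster)
```

On the raw data the clusters should follow `b` and miss the true groups, whereas after `scale()` the true groups should be recovered.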

Rand index: $R = \frac{0 + 4}{(5 \cdot 4) / 2} = \frac{4}{10} = 0.4$

Which of the following vectors corresponds to PC1?

# Both 1 and 3 can be correct. 2 would correspond to PC2.

# A is not possible because the maximum variance that can be explained is 100%

# B is not possible because by definition PC1 will always explain the most variance

# C is in principle possible but since our data is 4-dimensional, we will have 4 principal components and it is very unlikely that PC3 and PC4 do not capture any variance

# D is possible

We are interested in the correlations between genes. Plot the pairwise correlations of the variables in the dataset. Which pair of genes has the highest correlation?

# Plot pairwise correlations

ggcorr(expr[, -"tumor_type"])

Visualize the raw data in a heatmap with pheatmap

Does the latter plot suggest some outliers? Could they have affected the correlations? Using an appropriate plot, check the impact of these outliers on the correlations from question 1. Substitute them with missing values (NA) and redo questions 1 and 2.

pheatmap does not work well with data.tables. Convert the data to a matrix with as.matrix() before plotting.

expr_mat <- as.matrix(expr[, -"tumor_type"])

rownames(expr_mat) <- expr[, tumor_type]

pheatmap(expr_mat, cluster_rows = F, cluster_cols = F)

# We can see that all the values are now in a similar range.

Consider the full iris data set without the Species column for clustering. Create a pretty heatmap with the library pheatmap of the data without clustering.

Now, create a pretty heatmap using complete linkage clustering of the rows of the data set.

pheatmap(plot.data, show_rownames=F, scale='column', clustering_method = "complete")

Obtain the dendrogram of the row clustering using complete linkage clustering and partition the data into 3 clusters.

Annotate the rows of the heatmap with the Species column of the iris dataset and the three clusters from complete linkage clustering. What do you observe when you compare the clustering and the species labels?

## pheatmap() returns an object with dendrograms

h_complete <- pheatmap(plot.data, show_rownames=F, scale='column',

clustering_method = "complete", silent=T)

# silent=T prevents the heatmap from being displayed again

complete <- cutree(h_complete$tree_row, k = 3)

## label the row names to be able to annotate rows

rownames(plot.data) <- 1:nrow(plot.data)

## create a data.frame for the row annotations

row.ann <- data.table(Species = iris$Species)

row.ann[, complete:=factor(complete)]

# the clusters need to be factors

## plot the heatmap with complete linkage clustering

pheatmap(plot.data, annotation_row = row.ann, show_rownames=F, scale='column',

clustering_method = "complete")
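The same partition can be approximated without pheatmap, which helps when comparing clusters and species as a table. This is a sketch under the assumption that `plot.data` holds the four numeric iris columns and that pheatmap clusters the column-scaled values with Euclidean distance; the names `scaled`, `h_base` and `clusters_base` are illustrative.

```r
# Assumption: the clustering input is the four numeric iris columns, scaled per column
scaled <- scale(iris[, 1:4])

# Complete linkage hierarchical clustering on Euclidean distances
h_base <- hclust(dist(scaled), method = "complete")

# Cut the dendrogram into 3 clusters
clusters_base <- cutree(h_base, k = 3)

# Cross-tabulate clusters against the species labels
table(cluster = clusters_base, species = iris$Species)
```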

Compute the Rand index between the two following clustering results from two different clustering algorithms
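The Rand index is the fraction of point pairs on which two clusterings agree: pairs placed together in both, plus pairs separated in both, divided by the total number of pairs. A minimal sketch follows. The function name `rand_index` and the two label vectors are hypothetical, since the clusterings in the exercise figure are not reproduced here.

```r
# Rand index: share of point pairs treated the same way by both clusterings
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")   # TRUE where two points share a cluster in a
  same_b <- outer(b, b, "==")   # TRUE where two points share a cluster in b
  pairs <- upper.tri(same_a)    # count each unordered pair once
  sum(same_a[pairs] == same_b[pairs]) / choose(length(a), 2)
}

# Hypothetical labels for 5 points from two algorithms
clustering_1 <- c(1, 1, 1, 2, 2)
clustering_2 <- c(1, 2, 3, 1, 2)

rand_index(clustering_1, clustering_2)
```

With these labels, no pair is grouped together by both clusterings and 4 of the 10 pairs are separated by both, so the result is (0 + 4) / 10.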

This plot represents a random initialization of a k-means algorithm with k=2. X1, X2 are the randomly positioned centroids and A to E are the points of the 2-dimensional dataset. Calculate the new positions of the centroids after the first iteration using the euclidean distance.
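One k-means iteration consists of two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below uses made-up coordinates for A to E and for X1, X2, because the coordinates from the figure are not available in the text. Only the procedure carries over.

```r
# Hypothetical coordinates (not the ones from the exercise figure)
points <- matrix(
  c(1, 1,  2, 1,  1, 2,  5, 5,  6, 5),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D", "E"), c("x", "y"))
)
centroids <- matrix(
  c(0, 0,  6, 6),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("X1", "X2"), c("x", "y"))
)

# Step 1: squared Euclidean distance of every point to every centroid
d <- sapply(rownames(centroids), function(k) {
  rowSums(sweep(points, 2, centroids[k, ])^2)
})

# Assign each point to the closest centroid
assignment <- colnames(d)[apply(d, 1, which.min)]

# Step 2: each new centroid is the mean of its assigned points
new_centroids <- t(sapply(rownames(centroids), function(k) {
  colMeans(points[assignment == k, , drop = FALSE])
}))

new_centroids
```

Here A, B and C are assigned to X1 and D and E to X2, so X1 moves to (4/3, 4/3) and X2 to (5.5, 5).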

Perform k-means clustering on the iris data set with k = 3.
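A possible way to run this with base R's `kmeans()` is sketched below. Scaling the columns first and using `nstart = 20` are assumptions made here to keep the result comparable with the column-scaled heatmaps and less dependent on the random start; the name `km` is illustrative.

```r
# Fix the seed because k-means starts from random centroids
set.seed(42)

# Assumption: cluster the four numeric columns after centering and scaling
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)

# Compare the k-means clusters with the species labels
table(cluster = km$cluster, species = iris$Species)
```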

Create a pretty heatmap using complete linkage clustering of the rows, annotated with the species and both clustering results (complete linkage clustering and k-means clustering). What do you observe when you compare the two different clustering algorithms and the species labels?

Compute the pairwise Rand indices between the clustering results from the previous sections (complete, average and k-means) and species label.

Visualize the pairwise Rand indices with a pretty heatmap. What is the best clustering in this scenario according to the computed Rand indices?

pheatmap(rand, cluster_cols = F, cluster_rows = F)

Let X be the iris data set without the Species column and only for the species setosa. Perform PCA on X. Make sure that you scale and center the data before performing PCA.

Which proportion of the variance is explained by each principal component?

Compute the projection of X from the PCA result and plot the projection on the first two principal components.

# we can look at the explained variance with the summary on the results

summary(pca)
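The objects `pca` and `proj` are used in this solution without being defined. One way to obtain them is sketched here, assuming base R's `prcomp()` is the intended function and that `iris_dt` is the iris data as a data.table; the name `explained` is illustrative.

```r
library(data.table)
library(ggplot2)

iris_dt <- as.data.table(iris)

# X: setosa rows only, without the Species column
X <- iris_dt[Species == "setosa", !"Species"]

# Center and scale before PCA, as the exercise requires
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
explained <- pca$sdev^2 / sum(pca$sdev^2)

# Projection of X onto the principal components (columns PC1..PC4)
proj <- as.data.table(pca$x)

ggplot(proj, aes(PC1, PC2)) + geom_point()
```

The values in `explained` should match the "Proportion of Variance" row printed by `summary(pca)`.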

Plot the first principal component against the other variables in the dataset and discuss whether this supports your previously stated interpretation.

pc_iris <- cbind(iris_dt[Species == "setosa"], proj)

pc_iris <- melt(pc_iris,

id.vars = c("Species", 'PC1', 'PC2', 'PC3', 'PC4'))

ggplot(pc_iris, aes(value, PC1)) + geom_point() + facet_wrap(~variable, scales = 'free')

Last changed
2 years ago