Lecture 5 Normalization

Analysis High-dim Biodata

by Benjamin K.

List the steps in the high-throughput data analysis pipeline

Normalize
Unsupervised clustering
Modeling raw counts for each gene
Shrinking log2 fold changes
Testing for differential expression

Why do we need to normalize the data?

There are systematic biases in the data (sequencing machine / lab / process)
Different gene lengths
Different sequencing depths
Compositionality of the data (RNA composition)

What's the simplest normalization?

Total sum normalization (feature count / total feature sum) per sample
Normalizes for different sequencing depths
Relative abundances can be compared across samples

Pitfall: if one gene is very highly expressed in a sample, all other genes in that sample will look underexpressed compared to other samples
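
The pitfall can be seen in a small sketch. The counts below are made up for illustration, and `total_sum_normalize` is a hypothetical helper name, not a library function. Genes g1 to g3 have identical raw counts in both samples, yet their relative abundances drop tenfold in the second sample because g4 dominates its total.

```python
def total_sum_normalize(counts):
    """Divide each feature count by the sample's total count."""
    total = sum(counts)
    return [c / total for c in counts]

# Made-up counts for genes g1..g4 in two samples.
sample_a = [10, 20, 30, 40]
# Same counts for g1..g3, but g4 is very highly expressed.
sample_b = [10, 20, 30, 940]

rel_a = total_sum_normalize(sample_a)
rel_b = total_sum_normalize(sample_b)

# g1..g3 look ten times less abundant in sample_b,
# although their raw counts did not change.
for gene, a, b in zip(["g1", "g2", "g3", "g4"], rel_a, rel_b):
    print(gene, round(a, 3), round(b, 3))
```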

What's common-scale normalization?

We want equal total counts in each sample
Total sum normalization, then multiply by the minimum sequencing depth across samples
Benefit: we are left with counts rather than proportional data
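
A minimal sketch of this scaling, assuming each sample is a plain list of counts (the data and the function name `common_scale_normalize` are illustrative). Note that the scaled values are generally not integers.

```python
def common_scale_normalize(samples):
    """Scale every sample to the smallest sequencing depth in the data set."""
    min_depth = min(sum(s) for s in samples)
    return [[c / sum(s) * min_depth for c in s] for s in samples]

# Made-up counts: depths are 100 and 1000.
samples = [[10, 20, 30, 40], [100, 300, 200, 400]]
scaled = common_scale_normalize(samples)

for s in scaled:
    print([round(c, 2) for c in s], "total:", round(sum(s), 2))
```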

What's a popular but suboptimal normalization?

Rarefying
Goal: equal total counts in each sample
Random subsampling of each sample's reads, in proportion to relative abundances
Every sample ends up with the minimum sequencing depth

Drawback: a lot of counts are thrown away
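
The loss of counts can be made concrete with a small sketch. It assumes rarefying means subsampling reads without replacement down to the minimum depth; `rarefy` is a hypothetical helper and the counts are made up.

```python
import random

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample."""
    # Expand the counts into one entry per read, labelled by feature index.
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    kept = rng.sample(reads, depth)
    out = [0] * len(counts)
    for i in kept:
        out[i] += 1
    return out

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [[10, 20, 30, 40], [100, 300, 200, 400]]
min_depth = min(sum(s) for s in samples)

rarefied = [rarefy(s, min_depth, rng) for s in samples]
discarded = sum(sum(s) for s in samples) - sum(sum(r) for r in rarefied)

print(rarefied)
print("reads discarded:", discarded)
```

In this toy data set the second sample loses 900 of its 1000 reads, which is the main argument against rarefying.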

What's a sequencing library?

The collection of DNA molecules used as input for the sequencing machine

Name common normalization methods

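Describe the steps of the DESeq2 median of ratios method.

Compute a pseudo-reference sample: the geometric mean of each feature across all samples

PS_fA = (sA_fA x sB_fA x sC_fA)^(1/3)

Get the ratio of each feature to the pseudo-reference for each sample

ratio_sA_fA = sA_fA / PS_fA

Compute the median of the ratios within each sample (the sample's size factor)
Divide all counts of a sample by that sample's median ratio

Accounts for sequencing depth / RNA compositionality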
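
The steps above can be sketched in plain Python. This is an illustration of the idea, not DESeq2's actual code: it assumes all counts are strictly positive (zeros would break the logarithm), and `median_of_ratios` is a hypothetical name. In the made-up data, sample B is sequenced twice as deep as sample A and additionally has one strongly up-regulated gene.

```python
import math
from statistics import median

def median_of_ratios(samples):
    """Return (size_factors, normalized_counts) for a list of samples."""
    n_features = len(samples[0])
    # Step 1: pseudo-reference = geometric mean of each feature across samples.
    pseudo_ref = [
        math.exp(sum(math.log(s[f]) for s in samples) / len(samples))
        for f in range(n_features)
    ]
    # Steps 2 and 3: per-sample ratios to the reference, then their median.
    size_factors = [
        median(s[f] / pseudo_ref[f] for f in range(n_features))
        for s in samples
    ]
    # Step 4: divide each sample's counts by its size factor.
    normalized = [[c / sf for c in s] for s, sf in zip(samples, size_factors)]
    return size_factors, normalized

sample_a = [10, 20, 30, 40]
sample_b = [20, 40, 60, 800]  # 2x depth, last gene strongly up-regulated

size_factors, normalized = median_of_ratios([sample_a, sample_b])
print([round(sf, 3) for sf in size_factors])
print([[round(c, 2) for c in s] for s in normalized])
```

Because the median ignores the single outlier gene, the size factors differ by exactly the factor of two in depth, and the three unchanged genes end up with equal normalized counts in both samples.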

What other kind of normalization is beneficial for hypothesis testing?

Variance-stabilizing transformations

Name some important distance measures

Euclidean
Weighted Euclidean
Manhattan / L1 distance
Maximum (Chebyshev)
Minkowski
Jaccard distance (focuses on co-occurrence)
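
For reference, the measures listed above can be written out in a few lines each. This is a sketch on made-up vectors. The Jaccard variant here treats a feature as "present" when its value is greater than zero, which is an assumption of this example.

```python
def minkowski(x, y, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)

def manhattan(x, y):
    return minkowski(x, y, 1)

def maximum(x, y):
    """Largest coordinate-wise difference (limit of Minkowski for p -> infinity)."""
    return max(abs(a - b) for a, b in zip(x, y))

def weighted_euclidean(x, y, w):
    return sum(wi * (a - b) ** 2 for a, b, wi in zip(x, y, w)) ** 0.5

def jaccard(x, y):
    """1 - |shared features| / |features present in either sample|."""
    present_x = {i for i, v in enumerate(x) if v > 0}
    present_y = {i for i, v in enumerate(y) if v > 0}
    union = present_x | present_y
    if not union:
        return 0.0
    return 1 - len(present_x & present_y) / len(union)

x = [1, 0, 4]
y = [4, 0, 0]
print(euclidean(x, y), manhattan(x, y), maximum(x, y), jaccard(x, y))
```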

What's the proper distance for compositional data?

Aitchison distance
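
The Aitchison distance is the Euclidean distance between the centered log-ratio (CLR) transforms of two compositions. A minimal sketch, assuming strictly positive values (zeros would need a pseudocount or similar treatment first):

```python
import math

def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def aitchison(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

# Same composition at 10x the sequencing depth: distance is (numerically) zero.
print(aitchison([1, 2, 3], [10, 20, 30]))
# Different composition: distance is positive.
print(aitchison([1, 2, 3], [3, 2, 1]))
```

The first call shows why the distance suits compositional data: it depends only on the ratios between parts, not on the total count.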

What are special distances for biological strings?

Levenshtein distance: the minimum number of insertions, deletions and substitutions needed to turn one string into the other
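
A sketch of the standard dynamic-programming computation of the Levenshtein distance, keeping only two rows of the table at a time. The example sequences are made up.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions from s to t."""
    # prev[j] = distance between the first i-1 characters of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

print(levenshtein("GATTACA", "GACTATA"))  # two substitutions
print(levenshtein("ACGT", "AGT"))         # one deletion
```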

What's the main idea behind multidimensional scaling (MDS)?

Find a 2D representation in which the pairwise distances resemble the distances in the high-dimensional space
Stress measures how well the two distance matrices agree (lower stress = better match)
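
To make stress concrete, here is a sketch that compares the pairwise distances of a high-dimensional point set with those of two candidate 2D layouts. Several normalizations of stress are in use; the one below (squared differences divided by the squared original distances) is an assumption of this example, and the points are made up.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distances for all unordered pairs, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

def stress(high, low):
    """Normalized mismatch between original and embedded distances."""
    d_high = pairwise_distances(high)
    d_low = pairwise_distances(low)
    num = sum((a - b) ** 2 for a, b in zip(d_high, d_low))
    den = sum(a ** 2 for a in d_high)
    return math.sqrt(num / den)

# 3D points that all lie in the plane z = 0.
high = [(0, 0, 0), (3, 0, 0), (0, 4, 0), (3, 4, 0)]
low_good = [(x, y) for x, y, z in high]  # drops the uninformative axis
low_bad = [(x, z) for x, y, z in high]   # drops an informative axis

print(stress(high, low_good))  # distances fully preserved
print(stress(high, low_bad))   # distances distorted, stress > 0
```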

What's the difference between metric and non-metric MDS?

Metric methods preserve the actual distances
Non-metric methods preserve only the rank order of the distances

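Give an example of why manifold learning is necessary

Swiss roll dataset: points that are close in straight-line (Euclidean) distance can be far apart along the rolled-up surface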
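
The problem can be sketched with a 2D spiral, which is a cross-section of a Swiss roll. Two points one full turn apart are close in straight-line distance but far apart when measured along the curve. The spiral parametrization below is chosen for illustration only.

```python
import math

def spiral_point(t):
    """Point on the spiral with radius t at angle t."""
    return (t * math.cos(t), t * math.sin(t))

# Sample one full turn of the spiral, from angle 2*pi to 4*pi.
n_steps = 1000
ts = [2 * math.pi + k * 2 * math.pi / n_steps for k in range(n_steps + 1)]
points = [spiral_point(t) for t in ts]

# Straight-line distance between the two ends of the turn.
straight = math.dist(points[0], points[-1])
# Distance along the curve, approximated by summing short segments.
along = sum(math.dist(p, q) for p, q in zip(points, points[1:]))

print(round(straight, 2), round(along, 2))
```

A method that relies on straight-line distances would place the two end points next to each other, while a manifold learning method tries to respect the much larger distance along the curve.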

What does TPM (transcripts per million) normalize for?

Sequencing depth
Gene length
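
Both corrections show up in a short sketch of the TPM calculation: counts are first divided by gene length, then the resulting rates are rescaled to sum to one million. The counts and lengths are made up, and lengths are assumed to be given in kilobases.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million for one sample."""
    # Gene-length correction: reads per kilobase.
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    # Sequencing-depth correction: rescale so the values sum to one million.
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

lengths_kb = [1.0, 2.0, 3.0]
shallow = [100, 200, 300]     # longer genes collect more reads
deep = [1000, 2000, 3000]     # same sample at 10x depth

print([round(v) for v in tpm(shallow, lengths_kb)])
print([round(v) for v in tpm(deep, lengths_kb)])
```

All three genes receive the same TPM value, since their counts differ only because of gene length, and the result does not change with depth.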

What is the Bray-Curtis distance?

BC_ij = 1 - 2 * C_ij / (S_i + S_j)
C_ij: sum of the lesser counts for each feature across both samples
S_i: total count of sample i
S_j: total count of sample j
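
With these quantities the Bray-Curtis dissimilarity is 1 - 2 * C_ij / (S_i + S_j). A minimal sketch on made-up counts:

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two count vectors."""
    # C_ij: for each feature, take the smaller of the two counts and sum them.
    c_ij = sum(min(a, b) for a, b in zip(x, y))
    # S_i and S_j: total counts of each sample.
    return 1 - 2 * c_ij / (sum(x) + sum(y))

sample_i = [6, 7, 4]
sample_j = [10, 0, 6]

print(round(bray_curtis(sample_i, sample_j), 4))
print(bray_curtis(sample_i, sample_i))    # identical samples
print(bray_curtis([5, 0, 0], [0, 3, 2]))  # no shared features
```

The value ranges from 0 for identical samples to 1 for samples that share no features.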

Information

Last changed
a year ago