List the steps in the high-throughput data analysis pipeline
Normalize
Unsupervised clustering
Modeling raw counts for each gene
Shrinking log2 fold changes
Testing for differential expression
Why do we need to normalize the data ?
There are systematic biases in the data (sequencing machine / lab / process)
different gene lengths
different sequencing depths
gene compositionality
What's the simplest normalization ?
Total sum normalization (feature count/total feature sum) per sample
Normalizes different sequencing depth
Relative abundances can be compared across samples
Pitfall: If one gene is very highly expressed in a sample, its other genes will look underexpressed compared to other samples
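A minimal sketch of total sum normalization and its pitfall, assuming each sample is a plain list of feature counts. The function name and the toy counts are hypothetical, chosen only for illustration.

```python
def total_sum_normalize(counts):
    """Convert one sample's raw counts to relative abundances."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return [c / total for c in counts]


# Hypothetical toy data: three genes (g1, g2, g3) in two samples.
sample_a = [10, 20, 70]
sample_b = [10, 20, 970]  # same g1 and g2 counts, but g3 is highly expressed

rel_a = total_sum_normalize(sample_a)
rel_b = total_sum_normalize(sample_b)

# g1 has the same raw count in both samples, yet its relative
# abundance in sample_b is ten times lower than in sample_a.
print(rel_a[0], rel_b[0])
```

The raw count of g1 is identical in both samples; only the dominant g3 makes it look underexpressed after normalization.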
What's common-scale normalization ?
We want equal total counts in each sample
Total sum normalization, then multiply by the min sequencing depth across samples
Benefit of being left with counts rather than proportional data
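As a rough sketch, common-scale normalization can be written as total sum normalization followed by a rescaling to the smallest sequencing depth. The function name and the example table are hypothetical, and the sketch assumes no sample has a total count of zero.

```python
def common_scale_normalize(samples):
    """Scale every sample to the smallest sequencing depth in the set."""
    depths = [sum(s) for s in samples]
    min_depth = min(depths)
    # Relative abundance times min depth, written as one expression.
    return [[c * min_depth / d for c in s] for s, d in zip(samples, depths)]


# Hypothetical toy table: two samples with depths 100 and 1000.
samples = [[10, 20, 70], [100, 300, 600]]
scaled = common_scale_normalize(samples)

for row in scaled:
    print(row, sum(row))
```

The scaled values are count-like but generally not integers, which is one difference from rarefying.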
What's a popular but suboptimal normalization ?
Rarefying
goal of having equal counts
random subsampling of reads from each sample, so features are drawn according to their relative abundances
every sample ends up with the min seq depth
Loss of a lot of counts
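A small sketch of rarefying, assuming reads are subsampled without replacement (some tools sample with replacement instead). The function name, the seed parameter and the toy table are hypothetical.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from one sample."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the sample's total count")
    # Expand the count vector into one entry per read, labelled by feature index.
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    kept = random.Random(seed).sample(reads, depth)
    rarefied = [0] * len(counts)
    for i in kept:
        rarefied[i] += 1
    return rarefied


# Hypothetical toy table: depths 100 and 1000.
samples = [[10, 20, 70], [100, 300, 600]]
min_depth = min(sum(s) for s in samples)
rarefied = [rarefy(s, min_depth) for s in samples]

# Every sample now has min_depth reads; the deeper sample lost the rest.
lost = sum(sum(s) for s in samples) - sum(sum(r) for r in rarefied)
print(rarefied, lost)
```

In this toy table 900 of the 1100 reads are discarded, which illustrates why rarefying is considered wasteful.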
What's a sequencing library ?
The collection of DNA molecules used as input for the sequencing machine
Name common normalization methods
Describe the steps of DESeq2 median of ratios.
compute a pseudo reference sample
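Pseudo feature A = (sA fA x sB fA x sC fA)^(1/3), i.e. the geometric mean of feature A across the samples
Get ratio of each feature for each sample
ratio sA fA = sA fA / PS fA
Compute the median of the ratios within each sample (the sample's size factor)
Divide each sample's counts by its median ratio
accounts for sequencing depth / RNA compositionality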
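The steps above can be sketched in plain Python. This is an illustration of the idea, not DESeq2's own code: the function name and toy counts are hypothetical, and skipping features that have a zero count in any sample is an assumption made here because their geometric mean would be zero.

```python
import math
from statistics import median


def median_of_ratios(samples):
    """Return (size_factors, normalized) for a samples x features count table."""
    n_features = len(samples[0])
    # Assumption: only features with non-zero counts in every sample are used.
    usable = [j for j in range(n_features) if all(s[j] > 0 for s in samples)]
    # Pseudo-reference: log of the geometric mean of each feature across samples.
    log_ref = {
        j: sum(math.log(s[j]) for s in samples) / len(samples) for j in usable
    }
    size_factors = []
    for s in samples:
        # Ratio of each feature to the pseudo-reference, on the log scale.
        log_ratios = [math.log(s[j]) - log_ref[j] for j in usable]
        size_factors.append(math.exp(median(log_ratios)))
    normalized = [[c / sf for c in s] for s, sf in zip(samples, size_factors)]
    return size_factors, normalized


# Hypothetical toy data: sample_b is sample_a sequenced twice as deep,
# plus one feature that is absent from sample_a.
sample_a = [10, 20, 30, 0]
sample_b = [20, 40, 60, 5]
size_factors, normalized = median_of_ratios([sample_a, sample_b])
print(size_factors)
print(normalized)
```

Because the median is taken over features, a few extreme features shift the size factor much less than they shift a total-sum scaling.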
What other kind of normalization is beneficial for hypothesis testing ?
Variance stabilizing transformations
Name some important distance measures
Euclidean
Weighted euclidean
Manhattan/L1 distances
Maximum
Minkowski
Jaccard distance (focuses on co-occurrence)
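A compact sketch of the distances listed above for two count vectors of equal length. The function names and example vectors are hypothetical; the Jaccard version here treats a feature as present when its count is greater than zero, which is an assumption.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)


def manhattan(x, y):
    return minkowski(x, y, 1)


def maximum(x, y):
    """Largest absolute difference over all features."""
    return max(abs(a - b) for a, b in zip(x, y))


def weighted_euclidean(x, y, w):
    """Euclidean distance with one non-negative weight per feature."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))


def jaccard(x, y):
    """1 - |shared features| / |features present in either sample|."""
    px = {i for i, v in enumerate(x) if v > 0}
    py = {i for i, v in enumerate(y) if v > 0}
    union = px | py
    if not union:
        return 0.0
    return 1 - len(px & py) / len(union)


x = [1, 0, 3, 4]
y = [1, 2, 0, 0]
print(euclidean(x, y), manhattan(x, y), maximum(x, y), jaccard(x, y))
```

Jaccard ignores how large the counts are and only looks at which features co-occur.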
What's the proper distance for compositional data ?
Aitchison distance
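A minimal sketch of this distance, computed as the Euclidean distance between centred log-ratio (clr) transformed samples. The function names and the `pseudocount` parameter are hypothetical; zeros must be replaced somehow before taking logs, and adding a pseudocount is only one possible choice.

```python
import math


def clr(counts, pseudocount=0.0):
    """Centred log-ratio transform of one sample (all values must be > 0)."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison(x, y, pseudocount=0.0):
    """Euclidean distance between the clr-transformed samples."""
    cx, cy = clr(x, pseudocount), clr(y, pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


# Same composition at a ten times higher depth: distance is (numerically) zero.
print(aitchison([1, 2, 3], [10, 20, 30]))
# Different composition: distance is positive.
print(aitchison([1, 2, 3], [3, 2, 1]))
```

The distance depends only on the ratios between features, so sequencing depth drops out.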
What are special distances for biological strings ?
Levenshtein distance: the minimum number of insertions, deletions and substitutions needed to turn one string into the other
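A short dynamic-programming sketch of this edit distance, with unit cost for every insertion, deletion and substitution. The function name and the example strings are illustrative only.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion from a
                curr[j - 1] + 1,     # insertion into a
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("ACGT", "AGT"))
print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept, so memory grows with the length of the second string rather than with the product of both lengths.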
What's the main idea behind multidimensional scaling ?
Have a 2D representation where the 2D distances resemble the distances in the high dimensional space
Stress measures how well the 2D distance matrix matches the original high-dimensional distance matrix (lower is better)
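What's the difference between metric and non-metric MDS ?
metric methods preserve the actual distances
non-metric methods preserve the rank of the distances
Give an example why manifold learning is necessary
Swiss roll dataset: the points lie on a rolled-up 2D sheet in 3D, so straight-line distances can be small between points that are far apart along the sheet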
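To make stress and the metric versus non-metric distinction concrete, here is a sketch that scores a given 2D layout against the original points. It does not fit an MDS embedding; the stress formula is one common normalized form, and all names and coordinates are hypothetical.

```python
import math
from itertools import combinations


def pairwise(points):
    """Euclidean distances for all point pairs, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def stress(high, low):
    """Normalized mismatch between original and embedded distances."""
    d_high, d_low = pairwise(high), pairwise(low)
    num = sum((h - l) ** 2 for h, l in zip(d_high, d_low))
    den = sum(h ** 2 for h in d_high)
    return math.sqrt(num / den)


def same_rank_order(high, low):
    """True if both layouts order the point pairs by distance identically."""
    d_high, d_low = pairwise(high), pairwise(low)
    order_high = sorted(range(len(d_high)), key=d_high.__getitem__)
    order_low = sorted(range(len(d_low)), key=d_low.__getitem__)
    return order_high == order_low


# Hypothetical 3D points that all lie on the plane z = 5.
high = [(0, 0, 5), (1, 0, 5), (0, 3, 5), (4, 4, 5)]
exact = [(0, 0), (1, 0), (0, 3), (4, 4)]   # keeps the actual distances
scaled = [(0, 0), (2, 0), (0, 6), (8, 8)]  # keeps only their ranking

print(stress(high, exact), same_rank_order(high, exact))
print(stress(high, scaled), same_rank_order(high, scaled))
```

The `scaled` layout would satisfy a rank-based (non-metric) criterion while scoring poorly on the distance-based stress used here.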
What does TPM normalize for ?
seq depth
gene length
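A minimal sketch of TPM (transcripts per million) for one sample: counts are first divided by gene length, then rescaled so the values sum to one million. The function name, the gene lengths and the counts are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample."""
    # Step 1: correct for gene length (reads per kilobase).
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    # Step 2: correct for sequencing depth (scale to one million).
    scale = sum(rpk) / 1_000_000
    return [r / scale for r in rpk]


# Hypothetical genes whose counts grow only because the genes are longer.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 3000]
values = tpm(counts, lengths_bp)
print(values)
```

After length correction the three genes get the same TPM value, although their raw counts differ by a factor of three.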
What is the Bray-Curtis distance ?
BC ij = 1 - 2 Cij / (Si + Sj)
Cij: sum of the lesser counts of each feature in the two samples
Si: total count of sample i
Sj: total count of sample j
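A small sketch using the standard Bray-Curtis formula, 1 - 2 Cij / (Si + Sj), with Cij, Si and Sj as defined above. The function name and the example counts are hypothetical, and the sketch assumes at least one sample has a non-zero total.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two count vectors."""
    c_ij = sum(min(a, b) for a, b in zip(x, y))  # sum of the lesser counts
    s_i, s_j = sum(x), sum(y)                    # sample totals
    return 1 - 2 * c_ij / (s_i + s_j)


sample_i = [6, 7, 4]
sample_j = [10, 0, 6]
print(bray_curtis(sample_i, sample_j))
```

The value is 0 for identical samples and 1 for samples that share no features.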