Analysis High-dim Biodata

by Benjamin K.

For what kind of data is the Dirichlet distribution used?

Compositional data

What is overdispersion?

When the empirical variance of the data is larger than the theoretical variance implied by the binomial or Poisson model
The model needs to be adapted to account for this overdispersion
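
A quick check for overdispersion is to compare the sample variance with the sample mean, since a Poisson model implies that the two are equal. The sketch below uses made-up counts, and the helper name `dispersion_index` is hypothetical.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Ratio of sample variance to sample mean (about 1 under a Poisson model)."""
    return variance(counts) / mean(counts)

# Hypothetical counts of one taxon across ten samples (illustrative only).
counts = [0, 0, 1, 0, 12, 0, 7, 0, 0, 20]

ratio = dispersion_index(counts)
print(f"variance / mean = {ratio:.2f}")  # values well above 1 suggest overdispersion
```

A ratio far above 1 is a hint that a model with an extra dispersion parameter, such as the negative binomial, may fit better.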

What is compositional data?

Quantitative data described in relative terms (parts of a whole, e.g. proportions)

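What does 16S sequencing sequence?

The bacterial 16S ribosomal RNA (rRNA) gene
Highly conserved regions lie next to variable (unconserved) regions
Ideal for primer design: primers bind in the conserved regions, and the variable regions in between distinguish taxa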

Why do I need the Dirichlet distribution rather than the negative binomial?

From the negative binomial we can sample absolute counts for a given sample
- But no relative data
- For relative abundances we would have to sample counts for all taxa and divide by the total count
From the Dirichlet distribution relative abundances can be sampled directly
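
The difference can be sketched with NumPy. All parameter values below (`n`, `p`, `alpha`, the number of samples and taxa) are illustrative assumptions, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Negative binomial: absolute counts per taxon (parameters are illustrative).
counts = rng.negative_binomial(n=5, p=0.1, size=(3, 4))  # 3 samples x 4 taxa
rel_from_counts = counts / counts.sum(axis=1, keepdims=True)

# Dirichlet: each draw is already a vector of relative abundances.
alpha = np.array([2.0, 1.0, 1.0, 0.5])  # hypothetical concentration parameters
rel_direct = rng.dirichlet(alpha, size=3)

print(rel_direct.sum(axis=1))  # each row sums to 1
```

The negative binomial route needs the extra normalization step, while every Dirichlet draw already lies on the simplex.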

What is the new most common approach in the biological sciences?

Data-driven approach
- hypotheses are derived from complex data analysis

What is the hypothesis-driven approach and what are its pros and cons?

The hypothesis is defined based on deep biological understanding
Pro:
- Only relevant data / covariates are collected
- The researcher spends less time gathering (and analyzing) the data
Con:
- A solid hypothesis is needed

What is EDA?

Exploratory data analysis

Give some examples of high-dimensional biological problems

Regression analysis (BMI prediction from gut bacteria)
- the regression is only well defined if the predictors are linearly independent, which fails when there are more features than samples (p > n)
- => penalized and sparse regression
Differential abundance testing
- multiple testing problem
- a lot of univariate hypothesis tests
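
For the regression example above, a small simulation shows why ordinary least squares breaks down when p > n and how a penalty repairs it. This is a sketch on random data. Ridge regression is used here only because it has a closed form; it is penalized but not sparse, so a sparse fit would need LASSO instead.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 10, 50           # far more features than samples (p >> n)
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

gram = X.T @ X
rank = np.linalg.matrix_rank(gram)       # at most n_samples, so gram is singular

lam = 1.0                                # penalty strength, chosen arbitrarily here
beta_ridge = np.linalg.solve(gram + lam * np.eye(n_features), X.T @ y)

print(rank, beta_ridge.shape)
```

Without the penalty term the 50 x 50 system has no unique solution, because its rank cannot exceed the number of samples.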

When can the Poisson distribution approximate the binomial?

The probability of an event is low
The number of trials is large
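
The approximation can be checked numerically by comparing both probability mass functions with the Poisson rate set to n * p. The values of `n` and `p` below are illustrative assumptions.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson rate lam."""
    return exp(-lam) * lam**k / factorial(k)

n, p = 1000, 0.003          # many trials, rare event (illustrative values)
lam = n * p                 # matching Poisson rate

max_diff = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))
print(f"largest pointwise difference for k = 0..20: {max_diff:.2e}")
```

Repeating the comparison with few trials and a large probability (for example n = 10, p = 0.3) gives a visibly worse match.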

What are the key assumptions of the Hardy-Weinberg equilibrium?

No selection
No mutation
No migration
Large population
Random mating
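
Under these assumptions the expected genotype frequencies for two alleles follow p^2, 2pq and q^2 with q = 1 - p. A minimal sketch; the function name and the allele frequency 0.7 are made up for illustration.

```python
def hwe_genotype_frequencies(p):
    """Expected frequencies of AA, Aa and aa for allele frequency p of allele A."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q

freq_AA, freq_Aa, freq_aa = hwe_genotype_frequencies(0.7)
print(freq_AA, freq_Aa, freq_aa)
```

Observed genotype counts can then be compared against these expectations with a goodness-of-fit test.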

What is the MLE for the Poisson distribution?

The sample mean (average) of the observations
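
This can be checked numerically by evaluating the Poisson log-likelihood on a grid of candidate rates and confirming that the maximum sits at the sample mean. The observations and the grid are illustrative assumptions.

```python
from math import lgamma, log

def poisson_log_likelihood(lam, counts):
    """Log-likelihood of independent Poisson observations with rate lam."""
    return sum(k * log(lam) - lam - lgamma(k + 1) for k in counts)

counts = [2, 4, 3, 5, 1, 3]                      # hypothetical observations
sample_mean = sum(counts) / len(counts)

grid = [i / 10 for i in range(1, 101)]           # candidate rates 0.1 ... 10.0
best_lam = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts))

print(sample_mean, best_lam)
```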

Why would I need a goodness-of-fit test?

We know that MLE gives us the best parameters for our model, but we don’t know how well we actually approximated the data
It lets us compare different fitting approaches (e.g. different base distributions)

Chi-square statistic: sum over i of (Oi - Ei)^2 / Ei
Oi = observed
Ei = expected
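
With observed and expected counts per category, the chi-square goodness-of-fit statistic is the sum of (Oi - Ei)^2 / Ei. A minimal sketch with made-up counts; computing a p-value would additionally require the chi-square distribution with the appropriate degrees of freedom, which is left out here.

```python
def chi_square_statistic(observed, expected):
    """Sum over categories of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20, 40]   # hypothetical observed counts per category
expected = [25, 25, 25, 25]   # counts expected under the fitted model

stat = chi_square_statistic(observed, expected)
print(f"chi-square statistic: {stat:.2f}")
```

The larger the statistic, the worse the fitted model reproduces the observed counts.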

What is the null hypothesis of the chi-square test?

Null hypothesis: as with the t-test, that there is no difference, i.e. the observed and the expected distribution are the same

Do the variance and the average read count depend on each other?

Yes, we need to normalize for that
MA plots can help to assess this

What is the difference between DA/DE MA plots and MA plots?

DA / DE
- compare the value of a gene against the mean of all other samples
- Fold change of the gene against the mean of all others

MA plot
- Log fold change between the means of 2 groups
- x-axis: overall average (mean over both groups)
- A smaller log fold change can already be significant if the mean is high
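
The two coordinates of an MA plot can be computed as follows. This sketch uses the classical definition, M = log2 fold change and A = average of the two log2 means. The pseudocount and the function name `ma_values` are assumptions; some tools put the mean of normalized counts on the x-axis instead.

```python
import numpy as np

def ma_values(mean_group1, mean_group2, pseudocount=1.0):
    """Return (M, A): log2 fold change and average log2 abundance per gene."""
    log_g1 = np.log2(np.asarray(mean_group1, dtype=float) + pseudocount)
    log_g2 = np.log2(np.asarray(mean_group2, dtype=float) + pseudocount)
    return log_g1 - log_g2, 0.5 * (log_g1 + log_g2)

# Hypothetical group means for two genes.
M, A = ma_values([3, 7], [1, 7])
print(M, A)
```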

How does the median of ratios work?

Step 1: create a pseudo-reference sample (row-wise geometric mean)
- for two samples: sqrt(gene_count_1 x gene_count_2) per gene
Step 2: calculate the ratio of each sample to the reference
- gene_count / pseudo-reference
Step 3: calculate the normalization factor for each sample (size factor)
- median of all gene ratios within the sample
Step 4: calculate the normalized count values using the normalization factor
- raw count data / size factor
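
The four steps above can be sketched in NumPy. This is a minimal illustration rather than the DESeq2 implementation. The function name is hypothetical, and dropping genes that contain a zero count before taking the geometric mean is an assumption made here to keep the logarithm defined.

```python
import numpy as np

def median_of_ratios(counts):
    """counts: genes x samples. Returns (size_factors, normalized_counts)."""
    counts = np.asarray(counts, dtype=float)
    keep = np.all(counts > 0, axis=1)                  # genes without zeros
    log_counts = np.log(counts[keep])
    log_ref = log_counts.mean(axis=1, keepdims=True)   # step 1: log geometric mean per gene
    log_ratios = log_counts - log_ref                  # step 2: log(count / reference)
    size_factors = np.exp(np.median(log_ratios, axis=0))  # step 3: median per sample
    return size_factors, counts / size_factors         # step 4: normalize

# Hypothetical table: sample 2 was sequenced exactly twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [50, 100]]
size_factors, normalized = median_of_ratios(counts)
print(size_factors)
```

After normalization both columns of this toy table are identical, which is what a pure depth difference should produce.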

What does the median of ratios normalize and what doesn’t it normalize?

Median of ratios normalizes for sequencing depth
It does not do variance stabilization

What are the most common and best variance-stabilizing transformations?

CLR (centered log ratio)
- good for compositional data
- log of counts divided by the sample geometric mean

VST (DESeq2)
rlog
- computationally costly
- good if we have highly varying size factors => very different sequencing depths

Plots of rank(mean) against sd can be used to evaluate the performance of a normalization. The desired result is a flat trend, i.e. an sd that does not depend on the mean
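
The CLR transformation is short enough to write out. In this sketch the pseudocount that handles zeros is an assumption (other zero treatments exist), and rows are taken to be samples.

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform; rows are samples, columns are taxa."""
    log_counts = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Subtracting the mean of the logs equals dividing by the geometric mean.
    return log_counts - log_counts.mean(axis=1, keepdims=True)

counts = np.array([[10, 0, 90], [5, 5, 40]])   # hypothetical count table
transformed = clr(counts)
print(transformed.sum(axis=1))  # rows are centered, so they sum to about 0
```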

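What is this plot and how do we choose lambda?

Model selection plot
MSE for different lambda values for LASSO
We want the highest possible lambda that still gives the best performance
Look at the lambda with the lowest MSE, then take the largest lambda (farthest to the right) whose MSE is still comparable, typically within one standard error of the minimum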
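
One common way to formalize this choice is the one-standard-error rule. The sketch below assumes that the mean MSE and its standard error per lambda are already available from cross-validation; all numbers are made up.

```python
import numpy as np

# Hypothetical cross-validation results, one entry per lambda.
lambdas = np.array([0.01, 0.03, 0.10, 0.30, 1.00])
mean_mse = np.array([1.20, 1.05, 1.00, 1.08, 1.60])
se_mse = np.array([0.10, 0.10, 0.10, 0.10, 0.10])

best = np.argmin(mean_mse)
lambda_min = lambdas[best]                        # lambda with the lowest MSE
threshold = mean_mse[best] + se_mse[best]         # lowest MSE plus one standard error
lambda_1se = lambdas[mean_mse <= threshold].max() # largest lambda still under it

print(lambda_min, lambda_1se)
```

The larger lambda gives a sparser model at a performance that is statistically hard to distinguish from the best one.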

How can I test how good my model is without new data?

Robustness testing
Do I get the same results for different subsamples of my data?

(This answer is weak and needs revision.)

What can the hdi package do?

For an unstable model (e.g. different numbers of predictive genes after different cross-validation runs)
hdi can identify stable genes (the most commonly selected ones)

What are some hypothesis tests on hdi gene results?

H0: All genes are selected equally often by our model

Is Pearson’s correlation coefficient a proper measure of microbial association?

Spurious correlations due to compositionality of the data.
Spurious correlations due to high dimensionality (p>>n).
(Correlations can exist between features without a direct relationship -> Depending on the research questions, one may still be interested in correlations).
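
The compositionality effect can be reproduced in a small simulation: taxa whose absolute abundances are independent become negatively correlated once each sample is converted to proportions. The distribution and its parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Independent absolute abundances of three taxa in 2000 samples (simulated).
absolute = rng.gamma(shape=2.0, scale=100.0, size=(2000, 3))
relative = absolute / absolute.sum(axis=1, keepdims=True)

corr_absolute = np.corrcoef(absolute[:, 0], absolute[:, 1])[0, 1]
corr_relative = np.corrcoef(relative[:, 0], relative[:, 1])[0, 1]
print(f"absolute: {corr_absolute:.2f}, relative: {corr_relative:.2f}")
```

The negative correlation in the relative data is produced purely by the sum-to-one constraint, not by any interaction between the taxa.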

Outline the SPIEC-EASI pipeline

Data transformation -> Make counts comparable and account for compositionality
Sparse inverse covariance estimation -> Associations between taxa
Model selection -> Get a sparse network

Name some methods to infer a sparse network

MB and GLASSO: Stability-based approach (StARS)
SparCC: Threshold (Correlations with a magnitude below the threshold are set to zero)

What is a scale-free graph?

The distribution of node degrees follows a power law

What is the workflow for network analysis?

Data preparation
1. Taxa filtering (taxa that are present in too few samples or are otherwise of low quality)
2. Zero treatment
3. Normalization (median of ratios / variance-stabilizing transformations)
Network construction
Network analysis
Visualization
Network comparison

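What is the difference between covariance and correlation?

Correlation is the covariance divided by the product of the two standard deviations, so it is unit-free and bounded between -1 and 1; covariance depends on the scale of the variables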

Name Seurat's (single-cell) QC metrics

nFeature_RNA
- number of unique genes detected in a cell
- Low-quality cells or empty droplets often have a very low gene count
- Cell doublets or multiplets may show an aberrantly high gene count
nCount_RNA
- total number of molecules detected in a cell
percent.mt
- percentage of reads that map to mitochondrial genes

What does DADA try to achieve?

It tries to distinguish sequencing errors from real mutations (true sequence variants)
The probability that a sequencing error is misidentified as a mutation depends on the base and the mutation.

How do I model zero-inflated data?

Mixture model
Dirac delta (point mass at zero)
plus a second distribution for the non-zero part, e.g. a gamma distribution
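
Sampling from such a mixture makes the structure concrete: with probability `pi_zero` a value is exactly zero, otherwise it is drawn from a gamma distribution. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np

def sample_zero_inflated_gamma(n, pi_zero, shape, scale, rng):
    """Draw n values from a point mass at zero mixed with a gamma distribution."""
    is_zero = rng.random(n) < pi_zero            # which draws come from the point mass
    positive = rng.gamma(shape, scale, size=n)   # draws from the continuous part
    return np.where(is_zero, 0.0, positive)

rng = np.random.default_rng(7)
values = sample_zero_inflated_gamma(10_000, pi_zero=0.3, shape=2.0, scale=1.5, rng=rng)
zero_fraction = np.mean(values == 0)
print(f"fraction of zeros: {zero_fraction:.3f}")
```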

What does MOFA stand for?

Multi-Omics Factor Analysis

To which kind of MDS is PCA similar?

Similar to metric MDS with Euclidean distances

What is the gap statistic and what is it used for?

Used to identify the ideal number of clusters
Simulation-based validation
Monte Carlo sampled reference data
l_k is the average of log(WSS) over the random reference datasets for k clusters
s_(k+1) is the standard deviation belonging to the next gap, k+1 (computed on the random reference datasets)

What is the idea behind the permutation test?

Get the background (null) distribution of the test statistic by computing it on many random permutations of the data
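
A minimal sketch of a two-sample permutation test on the difference in means. The function name, the number of permutations and the example measurements are assumptions for illustration.

```python
import numpy as np

def permutation_test(x, y, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    extreme = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)        # random reassignment of labels
        stat = shuffled[: len(x)].mean() - shuffled[len(x):].mean()
        if abs(stat) >= abs(observed) - 1e-12:    # small tolerance for float rounding
            extreme += 1
    return (extreme + 1) / (n_perm + 1)           # add-one correction avoids p = 0

group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5]   # hypothetical measurements
group_b = [3.9, 4.1, 4.4, 3.8, 4.0, 4.2]
p_value = permutation_test(group_a, group_b)
print(p_value)
```

The p-value is the share of permuted statistics that are at least as extreme as the observed one.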

How does StARS work?

Set a lambda
Create multiple networks with that lambda on subsamples of the data
Get the per-edge probability of occurring in a network
Get the variance of each edge from its probability and average over all edges
1. if an edge is selected in (almost) all or (almost) no subsamples => very low variance
2. if the selection of an edge differs strongly between subsamples => high variance
Repeat for a range of lambda values and select a model based on a threshold β on the edge-probability variance curve
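
The instability computation for a single lambda can be sketched as follows, using the per-edge variance 2θ(1 - θ) of the selection frequency θ averaged over all possible edges. The network estimation itself is left out; the adjacency matrices are made up and the function name is hypothetical.

```python
import numpy as np

def stars_instability(adjacency_stack):
    """Average edge instability over (n_subsamples, p, p) 0/1 adjacency matrices."""
    adjacency_stack = np.asarray(adjacency_stack, dtype=float)
    theta = adjacency_stack.mean(axis=0)          # selection frequency per edge
    xi = 2.0 * theta * (1.0 - theta)              # per-edge instability
    upper = np.triu_indices(theta.shape[0], k=1)  # count each undirected edge once
    return xi[upper].mean()

# Hypothetical networks on 3 nodes estimated from 4 subsamples.
adj = np.zeros((4, 3, 3))
adj[:, 0, 1] = adj[:, 1, 0] = 1     # edge 0-1 selected in all four subsamples
adj[:2, 0, 2] = adj[:2, 2, 0] = 1   # edge 0-2 selected in two of four subsamples
instability = stars_instability(adj)
print(instability)
```

In the full procedure this value would be computed for every lambda on a grid, and the densest network whose instability stays below β would be kept.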

Last changed
2 years ago
