For what kind of data is the Dirichlet distribution used?
Compositional data
What is overdispersion?
The empirical variance of the data is larger than the theoretical variance implied by the binomial or Poisson model
The model needs to be adapted to account for this overdispersion
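A quick check for overdispersion is to compare the sample variance with the sample mean, since a Poisson model implies that the two are equal. The sketch below is only illustrative: the counts and the helper name `dispersion_ratio` are made up for this example.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean (close to 1 under a Poisson model)."""
    return variance(counts) / mean(counts)

# Hypothetical read counts for one taxon across eight samples.
counts = [0, 2, 1, 15, 0, 3, 40, 1]
ratio = dispersion_ratio(counts)
print(f"mean={mean(counts):.2f} variance={variance(counts):.2f} ratio={ratio:.2f}")
# A ratio far above 1 hints at overdispersion, so a plain Poisson model is too narrow.
```

A ratio well above 1 is the usual signal to switch to a model with an extra dispersion parameter, such as the negative binomial.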
What is compositional data?
Quantitative data described in relative terms (parts of a whole, e.g. relative abundances)
What does 16S sequencing sequence?
The bacterial 16S ribosomal RNA (rRNA) gene
Highly conserved regions lie next to variable regions
Ideal for primer design: primers bind the conserved regions, while the variable regions distinguish taxa
Why do I need the Dirichlet distribution rather than the negative binomial?
From the negative binomial we can sample absolute counts for a given sample
But no relative data
For relative abundances we would have to sample counts for all features (taxa) and divide by the total count
From the Dirichlet distribution, relative abundances can be sampled directly
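To make "sampling relative abundances" concrete, here is a minimal sketch that draws one Dirichlet sample by normalizing independent gamma draws, which is a standard construction. The concentration parameters `alphas` and the function name `sample_dirichlet` are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)      # fixed seed, only for reproducibility
alphas = [2.0, 5.0, 1.0]    # hypothetical concentration parameters for three taxa
proportions = sample_dirichlet(alphas, rng)
print(proportions)          # three non-negative values that sum to 1
```

Every draw is already a valid composition, so no division by a total count is needed afterwards.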
What is the new most common approach in the biological sciences?
Data-driven approach
Hypotheses are derived from complex data analysis
What is the hypothesis-driven approach, and what are its pros and cons?
The hypothesis is defined based on deep biological understanding
Pro:
Only relevant data / covariates collected
Researcher spends less time gathering (and analyzing) the data
Con:
solid hypothesis needed
What is EDA?
Exploratory data analysis
Give some examples of high-dimensional biological problems.
Regression analysis (BMI prediction from gut bacteria)
Regression is well defined only if all predictors are linearly independent, which fails when p > n
=> penalized and sparse regression
Differential abundance testing
multiple testing problem
A lot of univariate hypothesis tests
When can the Poisson distribution approximate the binomial?
The probability of an event is low
number of trials is large
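The approximation can be checked numerically. The sketch below compares the two probability mass functions for a rare event with many trials; the values of `n` and `p` are arbitrary illustrative choices.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson rate lam."""
    return exp(-lam) * lam**k / factorial(k)

n, p = 1000, 0.003   # many trials, rare event (illustrative values)
lam = n * p          # matching Poisson rate
max_abs_diff = max(
    abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21)
)
print(max_abs_diff)  # small: the two distributions nearly coincide
```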
What are the key assumptions of the Hardy-Weinberg equilibrium?
No selection
No mutation
No migration
Large population
Random mating
What is the MLE for the Poisson distribution?
The sample mean (average) of the observed counts, as the estimate of the rate lambda
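As a sanity check, the Poisson log-likelihood can be evaluated at the sample mean and at nearby values. This is a small sketch with made-up counts; `poisson_log_likelihood` is a helper written for this example.

```python
from math import lgamma, log
from statistics import mean

def poisson_log_likelihood(lam, counts):
    """Sum over observations of log(lam^k * exp(-lam) / k!)."""
    return sum(k * log(lam) - lam - lgamma(k + 1) for k in counts)

counts = [3, 0, 2, 5, 1, 4]   # hypothetical observed counts
lam_hat = mean(counts)        # MLE of the Poisson rate
candidates = [lam_hat - 0.5, lam_hat, lam_hat + 0.5]
best = max(candidates, key=lambda lam: poisson_log_likelihood(lam, counts))
print(lam_hat, best)          # the sample mean wins among the candidates
```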
Why would I need a goodness-of-fit test?
MLE gives us the best parameters for our model, but it does not tell us how well the model actually approximates the data
Compare different fitting approaches (e.g. different base distributions)
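O_i = observed count in category i
E_i = expected count in category i under the fitted model
What is the null hypothesis of the chi-square test?
Null hypothesis: the observed and expected distributions do not differ (analogous to the "no difference" null of the t-test)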
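The observed and expected counts enter Pearson's chi-square statistic, the sum over categories of (O_i - E_i)^2 / E_i. A minimal sketch with invented counts:

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-square statistic: sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20, 40]            # hypothetical observed counts per category
expected = [25.0, 25.0, 25.0, 25.0]    # counts expected under the fitted model
stat = chi_square_statistic(observed, expected)
print(stat)
```

A large statistic means the fitted model describes the data poorly. To obtain a p-value it would be compared against a chi-square distribution with the appropriate degrees of freedom.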
Do the variance and the average read count depend on each other?
Yes, we need to normalize for that
MA plots can help to assess this
What is the difference between DA/DE MA plots and MA plots?
DA / DE
compare gene value against mean of all other samples
Fold change of gene against mean of all others
MA plot
y-axis: log fold change between the means of 2 groups
x-axis: overall average (mean across both groups)
A smaller log fold change can already be significant if the mean is high
How does the median of ratios method work?
Step 1: create a pseudo-reference sample (row-wise geometric mean)
e.g. for two samples: sqrt(gene_count_1 x gene_count_2)
Step 2: calculates ratio of each sample to the reference
gene_count / pseudo-reference
Step 3: calculate the normalization factor for each sample (size factor)
median of all gene ratios within that sample
Step 4: calculate the normalized count values using the normalization factor
raw count data / size factor
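The four steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the DESeq2 implementation: the layout `counts[gene][sample]`, the function names, and the choice to skip genes with a zero count in any sample when building the pseudo-reference are all choices made for this example.

```python
from math import exp, log
from statistics import median

def size_factors(counts):
    """counts[g][s] = raw count of gene g in sample s; returns one size factor per sample."""
    n_samples = len(counts[0])
    # Step 1: pseudo-reference = row-wise geometric mean (genes containing a zero are skipped).
    usable = [row for row in counts if all(c > 0 for c in row)]
    reference = [exp(sum(log(c) for c in row) / n_samples) for row in usable]
    # Steps 2 and 3: per sample, the median over genes of count / pseudo-reference.
    return [
        median(row[s] / ref for row, ref in zip(usable, reference))
        for s in range(n_samples)
    ]

def normalize(counts):
    """Step 4: divide every raw count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]

# Toy data: the second sample is sequenced twice as deep as the first.
counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
factors = size_factors(counts)
normalized = normalize(counts)
print(factors)
print(normalized[0])  # both samples now show about the same value for gene 0
```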
What does the median of ratios normalize, and what does it not normalize?
The median of ratios normalizes for sequencing depth
It does not do variance stabilization
What are the most common and best variance-stabilizing transformations?
CLR (centered log-ratio)
good for compositional data
log of counts divided by the sample geometric mean
VST (DESeq2)
rlog
computationally costly
good if we have highly varying size factors => very different sequencing depths
The performance of a normalization can be evaluated from plots of rank(mean) against sd. A uniform (flat) distribution is desired.
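The CLR transformation listed above is short enough to write out. In this sketch the `pseudocount` that avoids log(0) is an assumption (one of several possible zero treatments), and the sample counts are invented.

```python
from math import log

def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log of each value minus the mean log of the sample.

    Subtracting the mean log is the same as dividing by the geometric mean
    before taking the log.
    """
    logs = [log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

sample = [10, 0, 90, 400]   # hypothetical taxon counts for one sample
transformed = clr(sample)
print(transformed)          # the values of one sample sum to zero
```

Without a pseudocount the result does not change when all counts of a sample are multiplied by the same factor, which is why CLR suits compositional data.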
What is this plot and how do we choose lambda?
Model selection plot
MSE for different lambda values for LASSO
We want the highest possible lambda that still gives the best performance
Look at the lambda with the lowest MSE, then take the lambda farthest to the right whose MSE is still comparable (commonly: within one standard error of the minimum)
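One common convention for "farthest to the right" is the one-standard-error rule (reported as `lambda.1se` by glmnet's cross-validation). The sketch below assumes this rule; the per-fold MSE values are invented, and `select_lambda_1se` is a helper written for this example.

```python
from statistics import mean, stdev

def select_lambda_1se(lambdas, fold_mse):
    """fold_mse[i] holds the cross-validation MSE of every fold for lambdas[i].

    Returns (lambda with the lowest mean MSE,
             largest lambda whose mean MSE is within one SE of that minimum).
    """
    means = [mean(m) for m in fold_mse]
    ses = [stdev(m) / len(m) ** 0.5 for m in fold_mse]
    best = min(range(len(lambdas)), key=lambda i: means[i])
    threshold = means[best] + ses[best]
    eligible = [lam for lam, m in zip(lambdas, means) if m <= threshold]
    return lambdas[best], max(eligible)

lambdas = [0.01, 0.1, 0.5, 1.0]
fold_mse = [
    [1.30, 1.10, 1.20],  # mean 1.20
    [1.00, 0.90, 1.10],  # mean 1.00, the minimum
    [1.05, 0.95, 1.12],  # mean 1.04, still within one SE of the minimum
    [1.50, 1.40, 1.60],  # mean 1.50
]
lambda_min, lambda_1se = select_lambda_1se(lambdas, fold_mse)
print(lambda_min, lambda_1se)
```

The larger lambda gives a sparser model at a cost in MSE that is within the noise of the cross-validation estimate.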
How can I test how good my model is without new data?
Robustness testing
Do I get the same results for different subsamples of my data?
This is shit
What can the hdi package do?
For an unstable model (e.g. a different number of predictive genes after different cross-validation runs)
hdi can identify stable genes (the most commonly selected ones)
What are some hypothesis tests on hdi gene results?
H0: All genes are equally often selected by our model
Is Pearson’s correlation coefficient a proper measure of microbial association?
Spurious correlations due to compositionality of the data.
Spurious correlations due to high dimensionality (p>>n).
(Correlations can exist between features without a direct relationship -> Depending on the research questions, one may still be interested in correlations).
Outline the SPIEC-EASI pipeline
Data transformation -> Make counts comparable and account for compositionality
Sparse inverse covariance estimation -> Associations between taxa
Model selection -> Get a sparse network
Name some methods to infer a sparse network
MB and GLASSO: Stability-based approach (StARS)
SparCC: Threshold (Correlations with a magnitude below the threshold are set to zero)
What is a scale-free graph?
The distribution of node degrees follows a power law
What is the workflow for network analysis?
Data preparation
Taxa filtering (taxa that are present in too few samples or are of poor quality)
Zero treatment
Normalization (median of ratios / variance-stabilizing transformations)
Network construction
Network analysis
Visualization
Network comparison
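What is the difference between covariance and correlation?
Correlation is the covariance divided by the product of the two standard deviations, so it is unitless and bounded between -1 and 1; covariance depends on the scale of the variables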
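The difference shows up as soon as the units of one variable change. In this small sketch the data are invented and `covariance` and `correlation` are helpers written for the example.

```python
from statistics import mean, stdev

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / (stdev(x) * stdev(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.2]
x_rescaled = [100 * v for v in x]   # same measurements in a 100x smaller unit

print(covariance(x, y), covariance(x_rescaled, y))     # covariance scales by 100
print(correlation(x, y), correlation(x_rescaled, y))   # correlation is unchanged
```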
Name Seurat's (single-cell) QC metrics
nFeature_RNA
number of unique genes detected in a cell
Low-quality cells / empty droplets: very low gene count
Cell doublets or multiplets: aberrantly high gene count
nCount_RNA
total number of molecules detected in a cell
percent.mt
percentage of reads that map to the mitochondrial genome
What does DADA try to achieve?
Distinguish sequencing errors from real mutations (true sequence variants)
The probability that a sequencing error is misidentified as a mutation depends on the base and the substitution.
How do I model zero-inflated data?
Mixture model
Dirac delta (point mass at zero)
plus a second distribution for the nonzero values, e.g. a gamma distribution
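Sampling from such a mixture makes the two components explicit. This is a sketch under assumed parameters: the zero probability `pi_zero` and the gamma `shape` and `scale` are arbitrary illustrative values.

```python
import random

def sample_zero_inflated_gamma(pi_zero, shape, scale, rng):
    """With probability pi_zero return exactly 0 (the point mass), else a gamma draw."""
    if rng.random() < pi_zero:
        return 0.0
    return rng.gammavariate(shape, scale)

rng = random.Random(1)   # fixed seed, only for reproducibility
draws = [sample_zero_inflated_gamma(0.6, 2.0, 3.0, rng) for _ in range(1000)]
zero_fraction = sum(d == 0.0 for d in draws) / len(draws)
print(zero_fraction)     # close to the chosen zero probability of 0.6
```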
What does MOFA stand for?
Multi-Omics Factor Analysis
To which type of MDS is PCA similar?
Similar to metric MDS with Euclidean distances
What is the gap statistic and what is it used for?
Used to identify the ideal number of clusters
Simulation-based validation
Monte Carlo sampled reference data
l_k is the average of log(WSS) over the random reference data sets for k clusters
s_(k+1) is the standard deviation for k+1 clusters (computed on the random reference data sets)
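With l_k and s_(k+1) defined as above, the usual selection rule is to take the smallest k for which Gap(k) >= Gap(k+1) - s_(k+1), where Gap(k) = l_k - log(WSS_k). The sketch below applies that rule to precomputed values. All numbers are invented, and the clustering and reference sampling are assumed to have happened elsewhere.

```python
from math import sqrt
from statistics import mean, pstdev

def choose_k(log_wss, ref_log_wss):
    """log_wss[i]: log(WSS) of the real data clustered into k = i + 1 clusters.
    ref_log_wss[i]: log(WSS) of each reference data set for the same k.
    Returns (chosen k, list of gap values)."""
    gaps = [mean(ref) - obs for obs, ref in zip(log_wss, ref_log_wss)]
    # s_k: spread over the reference sets, inflated to account for simulation error.
    s = [pstdev(ref) * sqrt(1 + 1 / len(ref)) for ref in ref_log_wss]
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - s[i + 1]:
            return i + 1, gaps
    return len(gaps), gaps

log_wss = [5.0, 3.5, 3.3, 3.2]   # hypothetical values for k = 1..4
ref_log_wss = [
    [5.1, 5.2, 5.0],
    [4.6, 4.7, 4.5],
    [4.3, 4.4, 4.2],
    [4.1, 4.2, 4.0],
]
best_k, gaps = choose_k(log_wss, ref_log_wss)
print(best_k, gaps)
```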
What is the idea behind the permutation test?
Get the background (null) distribution of the test statistic by computing it on a large number of permutations of the data
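A minimal sketch of the idea for a two-group comparison, using the absolute difference in means as the test statistic. The measurements, the number of permutations, and the function name `permutation_test` are assumptions made for illustration.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations, rng):
    """Two-sided permutation p-value for the difference in group means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)   # random reassignment of the group labels
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    # The +1 counts the observed labelling itself, so the p-value is never 0.
    return (hits + 1) / (n_permutations + 1)

rng = random.Random(42)   # fixed seed, only for reproducibility
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]   # hypothetical measurements
group_b = [7.9, 8.1, 7.4, 8.6, 7.7, 8.3]
p_value = permutation_test(group_a, group_b, 2000, rng)
print(p_value)
```

The p-value is the share of permuted statistics that are at least as extreme as the observed one, so no distributional assumption about the statistic is needed.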
How does StARS work?
Set a lambda
Create multiple networks with this lambda on subsamples of the data
Get the per-edge probability of occurring in a network
Get the variance of each edge's occurrence and average it over all edges
if every edge is (almost) always or (almost) never selected => very low variance
if many edges are selected in only about half of the subsamples => high variance
Repeat for a range of lambdas and select a model based on a threshold β on the resulting edge-variance curve
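The instability computation can be sketched as follows. This is one simplified reading of StARS, not the reference implementation: networks are represented as sets of edges, the per-edge variance is taken as 2 * theta * (1 - theta) for selection frequency theta, and lambdas are walked from strongest to weakest regularization until the instability exceeds β. All names and numbers are hypothetical.

```python
def edge_instability(networks, all_edges):
    """networks: one set of edges per subsample, all fitted with the same lambda."""
    n = len(networks)
    total = 0.0
    for edge in all_edges:
        theta = sum(edge in net for net in networks) / n   # selection frequency
        total += 2 * theta * (1 - theta)                   # 0 if always or never selected
    return total / len(all_edges)

def select_lambda(instability_by_lambda, beta):
    """Walk from the largest lambda down; keep the last one with instability <= beta."""
    chosen = None
    for lam in sorted(instability_by_lambda, reverse=True):
        if instability_by_lambda[lam] > beta:
            break
        chosen = lam
    return chosen

all_edges = [("A", "B"), ("A", "C"), ("B", "C")]
stable = [{("A", "B")} for _ in range(4)]
unstable = [{("A", "B")}, {("A", "C")}, {("A", "B"), ("B", "C")}, {("B", "C")}]
print(edge_instability(stable, all_edges))     # identical networks: 0.0
print(edge_instability(unstable, all_edges))   # disagreeing networks: clearly above 0

curve = {1.0: 0.0, 0.5: 0.02, 0.2: 0.04, 0.1: 0.20}   # made-up instability per lambda
print(select_lambda(curve, beta=0.05))
```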