Which type of assembly algorithm would you recommend for these data (OLC vs. de Bruijn graph)? Briefly describe how it works.
Algorithm: For long-read data, overlap-layout-consensus (OLC) is preferred.
How it works:
O (Overlap): find overlaps between all reads and build an overlap graph (nodes = reads, edges = overlaps)
L (Layout): determine the order and orientation of the reads from the graph
C (Consensus): merge the laid-out reads and derive a consensus sequence as the final assembly (see the toy sketch below)
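A minimal toy sketch of the three OLC steps, assuming tiny error-free reads and a greedy merge strategy; real long-read assemblers (e.g. Canu, hifiasm) are far more sophisticated.

```python
# Toy OLC sketch: O = suffix-prefix overlaps, L = greedy ordering, C = merging.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_olc(reads, min_len=3):
    reads = list(reads)
    while len(reads) > 1:
        # O/L: pick the read pair with the best overlap
        k, i, j = max(((overlap(a, b, min_len), i, j)
                       for i, a in enumerate(reads)
                       for j, b in enumerate(reads) if i != j),
                      key=lambda t: t[0])
        if k == 0:
            break  # no overlaps left -> remaining reads stay separate contigs
        # C: merge the two reads into one consensus sequence
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads

print(greedy_olc(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG"]))
# -> ['ATTAGACCTGCCGGAA']
```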
The assembly had a contig N50 value of 18.2 Mb. Define what N50 means and explain how it relates to assembly quality.
Definition: N50 is the length of the shortest contig/scaffold in the set of largest contigs whose accumulated lengths make up ≥50% of the total assembly length. (Note: the contigs are first sorted by length in descending order.)
Quality: A higher N50 is better, as it indicates a more contiguous, less fragmented assembly (a small calculation sketch follows below).
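A short sketch of the N50 calculation following the definition above (toy contig lengths, not real data):

```python
# Compute N50 from a list of contig lengths (in bp).

def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)  # sort descending by length
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running >= total / 2:   # cumulative sum reaches >= 50% of assembly length
            return length          # -> this contig's length is the N50
    return 0

# Toy example: total = 100, cumulative sum passes 50 at the second contig
print(n50([40, 25, 15, 10, 10]))  # 40 + 25 = 65 >= 50 -> N50 = 25
```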
...what decision(s) can the researchers make to minimize problems caused by repeats in genome assembly? Give one example and explain briefly why.
Example: Use long reads instead of short reads.
Why: Long reads can span entire repeat regions, so repetitive regions do not fragment the assembly and can be resolved more easily.
Name two problems that can be caused by repeats in genome assemblies.
With short reads it is harder to resolve repeats, as they cause complex artifacts (e.g. bubbles in the de Bruijn graph).
Fragmentation / misassemblies / incomplete assemblies.
How can RNA-seq data contribute to genome annotation?
It helps to resolve complex structures such as alternative splicing events and provides direct evidence for expressed genes, instead of relying only on intrinsic signals such as codon patterns (which can also flag pseudogenes).
Why are RNA samples from multiple tissues and developmental stages used instead of just one?
Genes are expressed differently in different tissues and developmental stages; capturing more conditions increases the chance of detecting all genes and producing a complete annotation.
Apart from RNA-seq, name two other types of evidence that can be used in genome annotation.
intrinsic evidence: e.g. codon patterns, splice site motifs
extrinsic evidence: protein homology, i.e. information from related species
Explain what this BUSCO score tells you about the quality of the genome assembly. What do the terms "single copy," "duplicated," "fragmented," and "missing" mean in this context?
(chat) The score of 92.1% "complete" means that over 92% of the expected, evolutionarily conserved single-copy genes were found completely in the assembly, which indicates high gene completeness.
single copy: the expected gene was found exactly once and in full.
duplicated: the expected gene was found in full more than once, which can indicate an assembly error.
fragmented: the gene was found only partially.
missing: the expected gene could not be found in the assembly.
Would you consider 92.1% a good BUSCO score for a modern plant genome assembly? Justify your answer briefly...
Yes, quite good, because plants have a very broad diversity. (chat) In addition, plant genomes are often large, polyploid, and rich in repetitive elements, which makes assembly harder and makes a score above 90% a good result.
What are biological and technical replicates, and why are they important?
technical: multiple runs from the same sample (same individual) -> accounts for experimental biases (e.g. technical artifacts)
biological: samples from different individuals -> captures biological variability and yields generalizable results
Why is it important to normalize RNA-seq data before comparing gene expression between samples? Give one example of a normalization method...
Why: to make samples comparable by accounting for differences in sequencing depth and gene length.
Example: e.g. TPM (transcripts per million); a small calculation sketch is shown below.
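A hedged sketch of the TPM calculation for a single sample, using made-up counts and gene lengths:

```python
# TPM_i = (counts_i / length_kb_i) / sum_j(counts_j / length_kb_j) * 1e6

import numpy as np

def tpm(counts, lengths_bp):
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1e3
    rpk = counts / lengths_kb          # reads per kilobase (corrects for gene length)
    return rpk / rpk.sum() * 1e6       # scale so every sample sums to one million

# Hypothetical example: 3 genes in one sample
print(tpm([100, 200, 300], [1000, 2000, 1500]))
# -> [250000. 250000. 500000.]; TPM values sum to 1e6, making samples comparable
```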
They consider two aligners: STAR and Kallisto. Name two differences in how these tools process RNA-seq reads.
STAR: full splice-aware alignment to exact base positions, produces BAM files, handles gaps with more complex algorithms (can detect novel transcripts)
Kallisto: pseudoalignment, only quantifies transcript abundance, uses k-mers and a de Bruijn graph of the transcriptome
Which would you recommend and why, given they are interested in differential gene expression analyses?
Kallisto, because transcript abundance is what matters here and there is no need for exact mappings or novel transcript discovery; moreover, it is much faster and less resource-demanding.
Kallisto uses "equivalence classes" in its pseudoalignment algorithm. What is an equivalence class and why is it useful?
An equivalence class is the set of transcripts to which one or more reads map equally well. It makes quantification and abundance estimation more efficient, especially when reads map to multiple isoforms of a gene (see the sketch below).
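A hedged illustration of the equivalence-class idea (not Kallisto's actual implementation); the isoform names and read compatibilities are hypothetical:

```python
# Reads are grouped by the *set* of transcripts they are compatible with, so the
# quantifier only works with a few class counts instead of millions of reads.

from collections import Counter

# Hypothetical compatibility of each read with the isoforms of one gene
read_compatibilities = [
    {"isoform_A", "isoform_B"},   # read maps equally well to A and B
    {"isoform_A"},                # read unique to A
    {"isoform_A", "isoform_B"},
    {"isoform_B", "isoform_C"},
    {"isoform_A"},
]

# Collapse reads into equivalence classes with counts
equivalence_classes = Counter(frozenset(s) for s in read_compatibilities)
for transcripts, count in equivalence_classes.items():
    print(sorted(transcripts), "->", count, "reads")
# An EM step then distributes these class counts among the transcripts
# to estimate isoform abundances.
```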
After mapping, you plot a sample distance heatmap. How are the distances between samples estimated, and what does the heatmap show?
(chat) Distance estimation: the distances are computed from the gene expression profiles of all samples, e.g. using the Euclidean distance or the Pearson correlation on normalized counts.
(chat) What the heatmap shows: it visualizes the similarity between samples. Samples with similar expression patterns (e.g. all samples from the "control" group) should cluster close together, while dissimilar samples lie further apart.
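A hedged sketch of such a sample-distance heatmap: Euclidean distances on log-transformed counts, shown as a clustered heatmap. The count matrix and sample names are made up for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.spatial.distance import pdist, squareform

# Hypothetical normalized count matrix: genes x samples
counts = pd.DataFrame(
    np.random.poisson(50, size=(1000, 4)),
    columns=["control_1", "control_2", "treated_1", "treated_2"],
)

log_counts = np.log2(counts + 1)                            # simple log transform
dist = squareform(pdist(log_counts.T, metric="euclidean"))  # pairwise sample distances
dist_df = pd.DataFrame(dist, index=counts.columns, columns=counts.columns)

# Replicates with similar expression profiles should cluster together
sns.clustermap(dist_df)
```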
Why are sample distance measures useful in RNA-seq experiments?
To get a global view of expression differences between samples and to detect outliers or technical/biological artifacts (e.g. whether replicates cluster together as expected).
What do the adjusted p-value and log2 fold change (LFC) tell you about a gene?
adjusted p-value: statistical significance of the expression difference, corrected for multiple testing (FDR)
LFC: how large (and in which direction) the difference in expression is
What is the multiple hypothesis correction (MHC) problem in RNA-seq, and how does DESeq2 address it?
Problem: we test thousands of genes -> high risk of false positives
Solution: e.g. Benjamini-Hochberg or Bonferroni correction (chat: DESeq2 uses the Benjamini-Hochberg method by default to control the false discovery rate (FDR)); a hand-rolled sketch of the procedure follows below.
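A hedged, hand-rolled sketch of the Benjamini-Hochberg procedure for illustration (DESeq2 applies this internally; the p-values here are made up):

```python
import numpy as np

def benjamini_hochberg(pvalues):
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)                          # sort p-values ascending
    ranked = p[order] * n / np.arange(1, n + 1)    # p_(i) * n / i
    # enforce monotonicity from the largest rank downwards, cap at 1
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1)
    out = np.empty(n)
    out[order] = adjusted                          # restore original gene order
    return out

pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
print(benjamini_hochberg(pvals))
# -> [0.005   0.02    0.05125 0.05125 0.6    ]
# adjusted p-values (q-values); genes with q < 0.05 are called significant
```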
Why do we usually apply an LFC threshold (e.g., |LFC|>1) in addition to the adjusted p-value when calling DEGs?
So that very small expression differences (chat: which may be statistically significant but not necessarily biologically relevant) do not end up being called as DEGs.
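A small sketch of applying both thresholds on a hypothetical DESeq2-style results table (gene names and values are invented):

```python
import pandas as pd

results = pd.DataFrame({
    "gene":           ["geneA", "geneB", "geneC", "geneD"],
    "log2FoldChange": [2.3, 0.4, -1.8, -0.9],
    "padj":           [1e-6, 1e-4, 0.003, 0.8],
})

# geneB is highly significant but its effect size is tiny -> filtered out
degs = results[(results["padj"] < 0.05) & (results["log2FoldChange"].abs() > 1)]
print(degs)   # only geneA and geneC pass both thresholds
```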
Why might these low-count genes lead to unreliable differential expression results? How can you deal with them?
Problem: (chat) genes with very few counts have a high variance, which makes the statistical tests unreliable and can lead to false-positive results. They can also distort the normalization factors.
Solution: pre-filter during preprocessing (leave low-count genes out of the downstream analysis), as sketched below.
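A hedged sketch of such a pre-filter on an invented raw count matrix; the threshold of 10 total reads is just an example choice:

```python
import pandas as pd

# Hypothetical raw count matrix: genes x samples
counts = pd.DataFrame(
    {"s1": [0, 500, 2, 120], "s2": [1, 430, 0, 95], "s3": [0, 610, 3, 140]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

keep = counts.sum(axis=1) >= 10   # e.g. require at least 10 reads in total
filtered = counts[keep]
print(filtered.index.tolist())    # geneA and geneC are removed -> ['geneB', 'geneD']
```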