What are the advantages and disadvantages of scRNAseq compared to Bulk-RNAseq?
+ characterization of cellular heterogeneity within a population. (identify rare cell types, distinguish different cell states, capture cell-to-cell variability in gene expression)
+ discovery of novel cell types or subpopulations
+ can capture gene expression changes over time
- Higher cost and lower throughput
- more complex data analysis
- Increased technical noise
Given a count matrix, what do we have to normalize for before any further analysis, depending on the comparison we want to make?
sequencing depth -> Between sample comparison
gene length -> Within sample comparison
What normalization techniques can be used to normalize for sequencing depth and gene length?
TPM: divide by length of gene first, then by the sample total, times constant 10^6
Use: within or between sample groups
CPM: divide by total number of reads, times constant 10^6
Use: between sample groups
RPKM: divide by total number of reads, times constant 10^6, then by length of gene (in kb)
Use: within sample groups
FPKM (paired-end data): counts fragments instead of reads, roughly all RPKM values divided by 2
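A minimal sketch of depth and length normalization, assuming the usual names CPM, RPKM and TPM for these schemes. The count matrix and gene lengths are made-up toy values, and the function names are only illustrative.

```python
import numpy as np

# Hypothetical toy data: rows = genes, columns = samples.
# Sample 2 was sequenced twice as deeply as sample 1.
counts = np.array([[10.0, 20.0],
                   [20.0, 40.0],
                   [70.0, 140.0]])
gene_length_kb = np.array([1.0, 2.0, 7.0])  # gene lengths in kilobases

def cpm(counts):
    # Counts per million: corrects for sequencing depth only.
    return counts / counts.sum(axis=0) * 1e6

def rpkm(counts, length_kb):
    # Depth first, then gene length.
    return cpm(counts) / length_kb[:, None]

def tpm(counts, length_kb):
    # Gene length first, then rescale each sample to sum to one million.
    rate = counts / length_kb[:, None]
    return rate / rate.sum(axis=0) * 1e6

print(cpm(counts))
print(rpkm(counts, gene_length_kb))
print(tpm(counts, gene_length_kb))
```

In this toy example the two samples differ only in depth, so all three measures give identical columns. TPM columns always sum to 10^6, RPKM columns in general do not.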
What is the reason for using a negative binomial rather than a poisson distribution for modeling read counts?
Why are TPM and FPKM considered within-sample measures of gene expression? Name two between-sample biases they do not capture.
normalize each sample independently
RNA Composition
Batch effects
What are barcodes in the context of scRNAseq?
Cell barcodes:
label and differentiate RNA molecules originating from different cells within a mixed population.
distinguish and assign RNA reads to specific cells, enabling cell-level analysis in scRNA-seq.
UMIs (unique molecular identifiers):
used to distinguish individual RNA molecules within a cell.
utilized to identify and quantify individual RNA molecules within a cell, improving the accuracy of gene expression quantification.
Both barcode types are essential components of scRNA-seq library preparation and play critical roles in demultiplexing and data analysis.
What is the Droplet Method and how does it work?
The droplet method, also known as droplet-based scRNAseq, is a technique used in single-cell genomics to analyze the gene expression profiles of individual cells.
Steps
Cell suspension & barcoding
Droplet Generation
Emulsion
Reaction in Droplets
Reaction after Demulsification
Downstream applications, etc.
What is the basic workflow to analyze scRNAseq?
Preprocessing
quality control, normalization, and feature selection
Visualization
PCA, t-SNE, UMAP
Analysis
Cluster Analysis, Trajectory Analysis
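The preprocessing and visualization steps can be sketched with plain NumPy. This is only an illustration on simulated counts: the thresholds (100 detected genes, 100 selected genes, 10 components) and the scaling constant 10^4 are arbitrary assumptions, and real analyses typically use a dedicated single-cell toolkit instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated count matrix (assumption): 200 cells x 500 genes.
counts = rng.poisson(2.0, size=(200, 500)).astype(float)

# Quality control: keep cells with enough detected genes.
detected_genes = (counts > 0).sum(axis=1)
keep_cells = detected_genes >= 100
counts = counts[keep_cells]

# Normalization: scale each cell to the same depth, then log-transform.
depth = counts.sum(axis=1, keepdims=True)
log_norm = np.log1p(counts / depth * 1e4)

# Feature selection: keep the 100 most variable genes.
top_genes = np.argsort(log_norm.var(axis=0))[-100:]
selected = log_norm[:, top_genes]

# Dimensionality reduction: PCA via SVD of the centered matrix.
centered = selected - selected.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :10] * s[:10]  # cell coordinates on the first 10 components

print(pcs.shape)
```

The resulting components would be the input for t-SNE/UMAP and for cluster or trajectory analysis.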
scRNAseq-Analysis: What is clustering analysis?
cell clusters -> first result of any single-cell analysis
=> infer the identity of member cells
Clustering groups cells based on the similarity of their gene expression profiles
Expression profile similarity - determined by distance metrics (takes dimensionality reduced representations as input)
Two approaches to generate cell clusters from similarity scores:
clustering algorithms
community detection methods
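As an illustration of the "clustering algorithm" route, here is a small k-means sketch on a simulated 2D embedding with Euclidean distance. The data, the choice of k-means, and the fixed initial centers are assumptions for the example. Community detection methods would instead work on a cell-cell neighbor graph.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated reduced representation (assumption): two groups of 50 cells each.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
group_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
embedding = np.vstack([group_a, group_b])

def kmeans(points, init_centers, n_iter=10):
    centers = init_centers.copy()
    for _ in range(n_iter):
        # Euclidean distance of every cell to every cluster center.
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its assigned cells
        # (this toy version assumes no cluster ever becomes empty).
        centers = np.array([points[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels, centers

# Start from one cell of each group to keep the example deterministic.
labels, centers = kmeans(embedding, embedding[[0, -1]])
print(labels)
```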
What are doublets in context of scRNAseq?
Two cells captured under the same barcode; they show up as "cells" with…
unexpectedly high counts
a large number of detected genes
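A crude way to flag doublet candidates from these two signals, assuming per-cell totals are already available. The numbers and the median + 3 MAD cutoff are hypothetical. This is a simple outlier heuristic, not a dedicated doublet detection method.

```python
import numpy as np

# Hypothetical per-cell QC metrics for five barcodes.
total_counts = np.array([1000, 1100, 950, 1050, 2100])
detected_genes = np.array([500, 520, 480, 510, 950])

def upper_outliers(values, n_mads=3.0):
    # Flag values far above the median, measured in median absolute deviations.
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return values > median + n_mads * mad

# A candidate must be an outlier in both metrics.
doublet_candidates = upper_outliers(total_counts) & upper_outliers(detected_genes)
print(doublet_candidates)
```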
What is the advantage of UMIs?
help prevent PCR-bias
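A small sketch of why UMIs help: reads with the same cell barcode, gene, and UMI are treated as PCR copies of one original molecule and counted once. The read tuples below are made up for illustration, and sequencing errors in UMIs are ignored.

```python
from collections import Counter, defaultdict

# Hypothetical aligned reads: (cell barcode, gene, UMI).
reads = [
    ("CELL1", "GeneA", "AAT"),
    ("CELL1", "GeneA", "AAT"),  # PCR duplicate
    ("CELL1", "GeneA", "AAT"),  # PCR duplicate
    ("CELL1", "GeneA", "CGG"),
    ("CELL1", "GeneB", "TTA"),
    ("CELL1", "GeneB", "TTA"),  # PCR duplicate
    ("CELL2", "GeneA", "AAT"),
]

# Naive read counting is inflated by amplification.
read_counts = Counter((cell, gene) for cell, gene, _ in reads)

# UMI counting: collect distinct UMIs per (cell, gene) pair.
umis = defaultdict(set)
for cell, gene, umi in reads:
    umis[(cell, gene)].add(umi)
umi_counts = {key: len(value) for key, value in umis.items()}

print(read_counts)
print(umi_counts)
```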
Differential expression analysis: What are the main causes of bias, and how can they be corrected?
Cause of Bias?
sequencing depth
library efficiency
amplification bias
How to correct?
TBD
Explain how DESeq2 borrows information across genes.
Model expression mean vs. variance over all genes -> genes w/ similar mean are estimated together for their variance
empirical Bayes estimation:
Step 1: creates a pseudo-reference sample (row-wise geometric mean)
Step 2: calculates ratio of each sample to the reference
Step 3: calculate the normalization factor for each sample (size factor = median of its ratios)
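Steps 1 to 3 correspond to median-of-ratios normalization and can be sketched as follows. The count matrix is a toy example without zeros. Genes containing a zero count would have to be left out of the geometric mean, which this sketch does not handle.

```python
import numpy as np

# Hypothetical counts: rows = genes, columns = samples.
# Sample 2 has exactly twice the depth of sample 1.
counts = np.array([[100.0, 200.0],
                   [50.0, 100.0],
                   [30.0, 60.0]])

log_counts = np.log(counts)

# Step 1: pseudo-reference sample = row-wise geometric mean (on the log scale).
log_reference = log_counts.mean(axis=1)

# Step 2: ratio of each sample to the reference (log ratios).
log_ratios = log_counts - log_reference[:, None]

# Step 3: size factor = median ratio per sample.
size_factors = np.exp(np.median(log_ratios, axis=0))

normalized = counts / size_factors
print(size_factors)
print(normalized)
```

After dividing by the size factors, the two samples have identical normalized counts, since they differed only in depth.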
What distribution do RNA-seq counts follow, and why?
Negative binomial distribution
-> overdispersion
Differential expression analysis: Negative binomial distribution vs. Poisson distribution
Read counts are often described by a Poisson distribution (mean and variance are equal)
-> Not the case for RNA-seq counts (overdispersion: variance exceeds the mean)
=> negative binomial distribution - takes this into account through a dispersion parameter alpha
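A quick simulation of the difference, assuming the common parameterization variance = mean + alpha * mean^2. The negative binomial is drawn as a gamma-Poisson mixture: the expression level itself varies between samples, and sequencing adds Poisson noise on top. The values mean = 100 and alpha = 0.1 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mean, alpha, n = 100.0, 0.1, 100_000

# Poisson: variance equals the mean.
poisson_counts = rng.poisson(mean, size=n)

# Negative binomial as gamma-Poisson mixture:
# the underlying rate varies between samples with dispersion alpha.
rates = rng.gamma(shape=1.0 / alpha, scale=alpha * mean, size=n)
nb_counts = rng.poisson(rates)

print("Poisson mean/variance:", poisson_counts.mean(), poisson_counts.var())
print("NB mean/variance:     ", nb_counts.mean(), nb_counts.var())
print("Expected NB variance: ", mean + alpha * mean ** 2)
```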
Differential expression analysis - What is the source of overdispersion?
transcript is present at slightly different levels in each sample
What is STAR and why is it so efficient?
STAR is an alignment tool for RNAseq data.
Usage of uncompressed suffix array to search for MMPs
How does STAR basically work?
maximal mappable prefix (MMP) is determined for each aligned read
Seed: search again for the unmapped suffix, which can map to another exon
Extend: no exact match for seed found -> extend previous seed; if no good alignment -> remove suffix from read (soft clipping)
Stitch seeds together based on (a) proximity to a set of anchor seeds or (b) the best alignment of the read
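A toy version of the seed search, to show the idea of repeated MMP lookups in a suffix array. It is not how STAR is implemented: the suffix array is built naively, suffixes are materialized as strings, and extension, scoring, and stitching are left out. The sequences and function names are made up.

```python
import bisect

def build_suffix_array(genome):
    # Naive construction: sort all suffix start positions by the suffix text.
    return sorted(range(len(genome)), key=lambda i: genome[i:])

def common_prefix_length(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def maximal_mappable_prefix(genome, suffix_array, query):
    # The suffix sharing the longest prefix with the query is adjacent to
    # the query's insertion point in the sorted list of suffixes.
    suffixes = [genome[i:] for i in suffix_array]
    pos = bisect.bisect_left(suffixes, query)
    best_start, best_len = -1, 0
    for idx in (pos - 1, pos):
        if 0 <= idx < len(suffixes):
            length = common_prefix_length(query, suffixes[idx])
            if length > best_len:
                best_start, best_len = suffix_array[idx], length
    return best_start, best_len

def find_seeds(genome, read):
    suffix_array = build_suffix_array(genome)
    seeds, offset = [], 0
    while offset < len(read):
        start, length = maximal_mappable_prefix(genome, suffix_array, read[offset:])
        if length == 0:
            offset += 1  # base not found in the genome: skip it in this toy version
            continue
        seeds.append((offset, start, length))  # (read offset, genome position, length)
        offset += length
    return seeds

exon1, intron, exon2 = "GATTACAGG", "TTTTTTTT", "CCATGCGA"
genome = exon1 + intron + exon2
read = "TACAGG" + "CCATGC"  # spliced read spanning the exon junction
seeds = find_seeds(genome, read)
print(seeds)
```

The spliced read is split into two seeds, one ending at the end of the first exon and one starting at the beginning of the second exon.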
Why RNA-Seq Pseudo alignment?
Goal: Estimate transcript abundances
==> Classify diseases, understand expression changes, track cancer progression
Only an estimate, but an accurate one
faster and more efficient than alignment based approaches
Tools: Kallisto and Salmon
How does SPONGE work?
What are the main ideas behind the steps in the analysis?
Identify likely miRNA-gene pairs
Identify ceRNA pairs (<-> shared miRNAs) and calculate sensitivity correlation
Use the SPONGE null model to infer p-values for the significance of the interaction
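The sensitivity correlation is the gene-gene correlation minus the partial correlation given the shared miRNA(s). A minimal sketch for a single miRNA on simulated data follows. The simulation setup and variable names are assumptions, and the null model for p-values is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated expression (assumption): one miRNA represses both genes.
mirna = rng.normal(size=n)
gene1 = -mirna + rng.normal(size=n)
gene2 = -mirna + rng.normal(size=n)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def partial_corr(x, y, z):
    # Correlation of x and y after removing the linear effect of z.
    r_xy, r_xz, r_yz = corr(x, y), corr(x, z), corr(y, z)
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

cor_12 = corr(gene1, gene2)
pcor_12 = partial_corr(gene1, gene2, mirna)
sensitivity_correlation = cor_12 - pcor_12

print(cor_12, pcor_12, sensitivity_correlation)
```

Here the two genes are correlated only through the miRNA, so the partial correlation is close to zero and the sensitivity correlation is close to the full correlation.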
Name three biases that are overcome by the null model of the SPONGE method in comparison to previous correlation-based approaches (3 points):
Gene-gene correlation
sample size
several miRNAs regulate many transcripts
What are the advantages of the SPONGE method?
How does KALLISTO work, what are the inputs/outputs?
Reference transcriptome
RNA-Seq reads from experiment
Indexing / Hashing of k-mers
Construction of hash table of k-mers to contigs and their position within
Skipping of redundant k-mers in same k-compatibility class
Intersection of constituent k-mers => k-compatibility class of read
Pseudoalignment
lookup of k-compatibility class for each k-mer in kallisto index, intersecting the k-compatibility classes
K-mer hashing is strand agnostic
Optimization: All k-mers in a contig of the de Bruijn graph have the same k-compatibility class
=> For each k-mer lookup, find distance to junctions at the end of contig ==> skip k-mers up to that distance
Quantification
kallisto index, quantification of RNA-Seq samples (transcript abundances from pseudoalignment)
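The indexing and pseudoalignment steps can be sketched with a plain hash table from k-mers to transcript sets. This leaves out the de Bruijn graph, k-mer skipping, strand handling, and the quantification step. Ignoring k-mers that are missing from the index is an assumption of this sketch, and the transcripts are made up.

```python
from collections import defaultdict

def build_index(transcripts, k):
    # Hash table: k-mer -> set of transcripts containing it (k-compatibility class).
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def pseudoalign(read, index, k):
    # Intersect the compatibility classes of all k-mers in the read.
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # k-mer not in the index: ignored in this sketch
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()

# Hypothetical reference transcriptome: two isoforms sharing their 5' end.
transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACCTT"}
index = build_index(transcripts, k=5)

unique_read = pseudoalign("CGTACGG", index, k=5)
shared_read = pseudoalign("ACGTAC", index, k=5)
print(unique_read, shared_read)
```

No base-level alignment is computed: the result is only the set of transcripts a read is compatible with.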
Pseudoalignment, as implemented in methods such as Kallisto, is considerably faster than classical alignment-based approaches. Which of the following steps are part of the Kallisto method? (2 points) (multiple correct answers possible)
What makes pseudoalignment so much faster than mapping approaches?
What approaches for gene quantification from RNA-seq can be used
Which of the following terms is NOT used in pseudoalignment?
Give some examples for small and large non-coding RNAs
lncRNA
eRNA
circRNA
miRNA
siRNA
What are miRNA?
miRNAs are 19-22 nucleotide long molecules
key regulators of gene expression -> each miRNA can potentially regulate hundreds of genes
Name a hypothesis about miRNAs. What does it state?
Competing endogenous RNA hypothesis == SPONGE hypothesis
RNA can compete for binding to miRNAs through shared MREs (miRNA response elements) present in their sequences
=> influence availability of miRNAs and affect the expression of target messenger RNAs (mRNAs)
suggests that when non-coding RNAs and coding RNAs contain similar MREs, they can act as "sponges" for miRNAs
this interaction among RNA molecules => complex regulatory network
=> abundance of one RNA => influence the expression of other RNAs (by competing for shared miRNAs)
ceRNA <=> Key for hidden RNA language and found everywhere
In contrast to microarrays, RNA-seq can
What methods exist to measure gene expression?
Microarray
Nanopore & -drop
Illumina HiSeq
Which of the following methods can be used for sequence assembly?
What approaches for gene quantification from RNA-seq can be used?
Which statement is true?
What are desirable characteristics of single cell technologies?
What is the Barnyard plot used for?
What are UMIs and what are they used for?
What highlights issues in quality control?
What is NOT part of downstream analysis?
Which of the following proteins is NOT involved in miRNA synthesis?
Which factor is NOT considered by miRNA target prediction tools?
Which statement is NOT true about the competing endogenous RNA hypothesis?
What are applications of RNAseq?
transcript discovery
gene quantification
expression profiling
Name the three types of RNAseq reads.
Single-end
Paired-end
Multiplexing
Why is it not a good idea to only look at the log fold change when comparing gene expression between two conditions?
p-value: the LFC alone does not show whether a change is statistically significant
Name two models to do differential expression analysis
DESeq2
edgeR
What is the idea behind LFC shrinkage in DESeq2?
Correct fold change for variance
Why is it difficult to estimate dispersion in practice?
Often too few samples per group
Explain DESeq2’s dispersion shrinkage
treat each gene separately -> estimate dispersion
fit a smoothing curve
shrink gene-specific dispersion towards expected dispersion
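A strongly simplified sketch of the three steps on simulated counts. It is not DESeq2's actual procedure: the gene-wise estimate uses the method of moments, the trend is an ordinary least-squares fit of alpha = a1/mean + a0, and the shrinkage uses a fixed, arbitrary weight on the log scale instead of an empirical Bayes prior.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 500, 3

# Simulated negative binomial counts (assumption) with a mean-dependent dispersion.
true_mean = rng.uniform(20, 2000, size=n_genes)
true_alpha = 0.05 + 2.0 / true_mean
rates = rng.gamma(shape=1.0 / true_alpha[:, None],
                  scale=(true_alpha * true_mean)[:, None],
                  size=(n_genes, n_samples))
counts = rng.poisson(rates)

# Step 1: gene-wise dispersion from variance = mean + alpha * mean^2.
mean = np.maximum(counts.mean(axis=1), 1.0)  # guard against zero means
var = counts.var(axis=1, ddof=1)
gene_disp = np.clip((var - mean) / mean ** 2, 1e-4, None)

# Step 2: smooth trend of dispersion over the mean.
design = np.column_stack([1.0 / mean, np.ones(n_genes)])
coef, *_ = np.linalg.lstsq(design, gene_disp, rcond=None)
trend = np.clip(design @ coef, 1e-4, None)

# Step 3: shrink each gene-wise estimate towards the trend (log scale).
weight = 0.5  # hypothetical fixed weight on the gene-wise estimate
shrunk = np.exp(weight * np.log(gene_disp) + (1 - weight) * np.log(trend))

print(gene_disp[:5], trend[:5], shrunk[:5])
```

With only three samples per gene the gene-wise estimates scatter widely. The shrunken values always lie between the gene-wise estimate and the trend.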
What is dimensionality reduction and why is it useful?
Dimensionality reduction, UMAP vs tSNE.