Why RNA Seq? What is the goal?
How does it work?
Why?:
sequence the transcriptome to learn more about gene expression in cells
Goal:
gene/transcript quantification
transcript discovery
expression profiling
How?:
Design Experiment
RNA Preparation
isolate and purify RNA
Prepare Libraries
convert RNA -> cDNA, while adding sequencing adapters
Sequence
Analysis
analyze short read seqs
What are the RNA-Seq read types?
single-end
each fragment is sequenced from one end only
paired-end
each fragment is sequenced from both ends, giving one read from each side of the fragment
multiplexing
adding an individual barcode sequence to each sample, so that several samples can be pooled and sequenced in one run
Explain the following Expression Units:
CPM
RPKM
FPKM
TPM
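CPM = Counts per million
read count of a gene divided by the total number of mapped reads in the sample, times one million
RPKM = Reads per kilobase per million mapped reads
CPM/RPM (reads per million) divided by gene length in kilobases
used for single-end
FPKM = Fragments per kilobase per million mapped fragments
same as RPKM, but counts fragments instead of reads, so the two reads of a pair are counted as one fragment
used for paired-end
TPM = Transcripts per million
read count divided by gene length in kilobases, then divided by the sum of these length-normalized values over all genes, times one million
-> All of them account for sequencing depth; every unit except CPM also accounts for gene length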
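The unit definitions above can be checked with a small worked example. This is a minimal sketch in plain Python; the function names, read counts and gene lengths are made up for illustration. FPKM is omitted because it uses the same formula as RPKM with fragment counts in place of read counts.

```python
def cpm(counts):
    """Counts per million: scale each gene's count by sequencing depth."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: CPM divided by gene length in kb."""
    return [v / (length / 1000) for v, length in zip(cpm(counts), lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    scale = sum(rpk)
    return [r / scale * 1e6 for r in rpk]


counts = [100, 300, 600]          # hypothetical read counts for three genes
lengths_bp = [1000, 2000, 4000]   # hypothetical gene lengths in base pairs

cpm_values = cpm(counts)
rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

TPM values always sum to one million within a sample, which RPKM/FPKM values do not; this is why TPM is easier to compare between samples.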
What is differential expression analysis?
= taking the normalized read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups
(follows after RNA-Seq)
Why not t-test on read count data?
What is the alternative?
bias in data
Normalization such as TPM does not correct for differences in efficiency, protocol or tissue
sample size often too low to estimate variance of an individual gene’s expression with any confidence
t-test assumes normally distributed data, but RNA-seq read counts are discrete and overdispersed (commonly modelled with a negative binomial distribution)
Alternative:
Borrow information across genes
Option 1: transform data and then use linear model
Option 2: model the mean/variance relationship and shrink dispersion (e.g. DESeq2)
Use negative binomial distribution, which takes the inequality of mean and variance into account by using a dispersion parameter
What is overdispersion?
= the variance of the RNA counts is higher than their mean (a Poisson model would assume that both are equal)
Reason: transcript is present at slightly different levels in each sample
-> Instead of poisson, negative binomial distribution
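A small numeric illustration of overdispersion, assuming the usual negative binomial parametrization variance = mean + dispersion * mean^2. The replicate counts are made up, and the method-of-moments estimate (`moment_dispersion`, a hypothetical helper) is only a rough illustration, not the estimator used by DESeq2.

```python
from statistics import mean, variance


def negative_binomial_variance(mu, dispersion):
    """Variance of a negative binomial count: mu + dispersion * mu^2."""
    return mu + dispersion * mu ** 2


def moment_dispersion(counts):
    """Rough method-of-moments dispersion estimate (illustration only)."""
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 in the denominator)
    return max((var - mu) / mu ** 2, 0.0)


# Hypothetical counts of one gene in five replicates
replicate_counts = [90, 110, 150, 60, 190]

mu = mean(replicate_counts)
var = variance(replicate_counts)
dispersion = moment_dispersion(replicate_counts)

# Poisson would predict variance == mean; here the variance is far larger
is_overdispersed = var > mu
```

With dispersion = 0 the negative binomial variance reduces to the Poisson case (variance = mean).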
What are the main components of DESeq2?
Model read count with normalization
Median of ratio method
Dispersion shrinkage
Log Fold Change
How does DESeq2 model read counts with normalization?
Uses a negative binomial generalized linear model
Takes size factor into account -> normalization factor
Typically one constant size factor per sample (shared by all genes), but gene- and sample-specific normalization factors are possible
Estimated with median of ratios method
Models read counts for gene x in sample y
How does DESeq2's median of ratios method work?
Purpose: normalization
1. Create a pseudo-reference sample (row-wise geometric mean across all samples)
2. Calculate the ratio of each sample to the reference, gene by gene
3. Calculate the normalization factor (size factor) for each sample as the median of its ratios
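The three steps can be sketched in plain Python. This is a minimal illustration, not DESeq2's code: `size_factors` is a hypothetical name, the count matrix is made up, and genes with a zero count in any sample are skipped because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """count_matrix: list of gene rows, each a list of per-sample counts."""
    n_samples = len(count_matrix[0])
    # Step 1: pseudo-reference = geometric mean per gene (computed in log space)
    usable_rows, log_reference = [], []
    for row in count_matrix:
        if all(c > 0 for c in row):  # skip genes with a zero count
            usable_rows.append(row)
            log_reference.append(sum(math.log(c) for c in row) / len(row))
    factors = []
    for j in range(n_samples):
        # Step 2: ratio of sample j to the reference for every usable gene
        log_ratios = [math.log(row[j]) - ref
                      for row, ref in zip(usable_rows, log_reference)]
        # Step 3: size factor = median ratio
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1
counts = [
    [10, 20],
    [100, 200],
    [30, 60],
    [0, 5],   # ignored: zero count in sample 1
]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
```

Dividing each count by its sample's size factor makes the two samples comparable: after normalization the first three genes have equal values in both samples.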
How does the dispersion shrinkage work in DESeq2?
1. Estimate dispersion per gene (black)
2. Fit smoothing curve (red)
3. Shrink gene-specific dispersion towards expected dispersion (blue) using empirical Bayes approach
Strength of shrinkage depends on
How close dispersion values are to the fit
Degrees of freedom (shrinkage is reduced for larger sample sizes)
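To make the two influences on the shrinkage strength concrete, here is a toy sketch. It is not DESeq2's actual estimator: it assumes a simple normal-normal model on the log scale, where `prior_var` (hypothetical name) stands for the spread of gene-wise dispersions around the fitted curve and the sampling variance of a gene's own estimate is crudely approximated as 2 / degrees of freedom.

```python
import math


def shrink_dispersion(gene_disp, fitted_disp, prior_var, dof):
    """Shrink a gene-wise dispersion towards the fitted value (toy model)."""
    sampling_var = 2.0 / dof  # assumption: rough variance of the log estimate
    # Weight of the gene's own estimate: high when the prior is wide
    # (genes scatter a lot around the fit) or when dof is large.
    weight_gene = prior_var / (prior_var + sampling_var)
    log_shrunk = (weight_gene * math.log(gene_disp)
                  + (1.0 - weight_gene) * math.log(fitted_disp))
    return math.exp(log_shrunk)


# Same gene-wise estimate (0.8) and fitted value (0.1), different sample sizes
few_samples = shrink_dispersion(0.8, 0.1, prior_var=0.25, dof=4)
many_samples = shrink_dispersion(0.8, 0.1, prior_var=0.25, dof=40)
```

With few degrees of freedom the estimate is pulled strongly towards the fit; with many it stays close to the gene's own value.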
What is the log fold change (LFC) shrinkage in DESeq2 for?
Problem with data: LFC estimates have a high variance for genes with low read counts
A: before shrinking of LFC
B: after shrinking
C: counts show low dispersion for green and high for purple
D: density plot of likelihoods (solid lines scaled to one) and posteriors (dashed) for green and purple genes and the prior (black)
-> Correction removes false positives
What is alternative splicing?
What is the problem with this and how can it be solved?
= different possible combinations of the exons of one gene, resulting in different transcripts and proteins
Problem:
-> splicing makes mapping more difficult, since reads that span an exon-exon junction do not match one contiguous stretch of the genome
Solution:
-> splice aware mapping
Exact match to the canonical splice-site motifs (GT-AG, GC-AG, AT-AC)
-> Examples: TopHat2
The list of known splice sites
-> Examples: HISAT2
Models (HMM, SVM)
-> Examples: HMMSplicer, PALMapper, OLEgo
Unbiased from RNA-Seq
-> Examples: CRAC, MapSplice2, STAR
What is STAR (Spliced Transcripts Alignment to a Reference)?
What are the steps?
= a fast RNA-seq read mapper, with support for splice-junction and fusion read detection
Seed searching:
Align reads by searching for the Maximal Mappable Prefix (MMP = longest prefix of the read that exactly matches one or more locations on the reference genome)
The separately mapped parts of a read are called seeds
The first MMP that maps to the genome is seed 1; the remaining unmapped part of the read is then searched against the reference, and its MMP becomes seed 2
Extend MMPs:
If no exact match is found for a part of the read, the previous MMP is extended
-> Reasons for no match: mismatch or indels
Soft Clipping:
-> If extension does not give a good alignment, the poor-quality or adapter sequence is soft clipped
Cluster, Stitching and Scoring:
-> Create the complete read alignment by
first clustering the seeds based on proximity to a set of anchor seeds (seeds that are not multi-mapping)
then stitching the seeds together based on the best alignment for the read
scoring the alignments based on mismatches, indels, gaps, etc. and keeping the best one
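The seed search can be illustrated with a deliberately naive sketch. STAR itself searches a suffix array; the code below uses plain string search instead, ignores extension, soft clipping and scoring, and all names and sequences are made up.

```python
def maximal_mappable_prefix(read, genome):
    """Return (length, positions) of the longest read prefix found exactly."""
    best_len, best_positions = 0, []
    for length in range(1, len(read) + 1):
        prefix = read[:length]
        positions = []
        start = genome.find(prefix)
        while start != -1:
            positions.append(start)
            start = genome.find(prefix, start + 1)
        if not positions:
            break  # a longer prefix cannot match either
        best_len, best_positions = length, positions
    return best_len, best_positions


def seed_search(read, genome):
    """Split a read into consecutive seeds (offset in read, length, positions)."""
    seeds, offset = [], 0
    while offset < len(read):
        length, positions = maximal_mappable_prefix(read[offset:], genome)
        if length == 0:
            offset += 1  # unmappable base; STAR would extend or soft clip here
            continue
        seeds.append((offset, length, positions))
        offset += length  # continue with the unmapped rest of the read only
    return seeds


# Toy genome: two exons separated by an intron with GT...AG boundaries
exon1 = "TTGACCTAGC"
intron = "GTAAG" + "T" * 7 + "CCAG"
exon2 = "CATGGATCCA"
genome = exon1 + intron + exon2

# Spliced read: last 6 bases of exon 1 joined to the first 6 bases of exon 2
read = exon1[-6:] + exon2[:6]
seeds = seed_search(read, genome)
```

The first seed ends exactly at the exon boundary and the second seed starts after the intron, so the gap between the two genome positions marks the splice junction.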
Why is STAR fast?
efficient because only the unmapped portions of reads are searched sequentially
uses an uncompressed suffix array (SA) for efficient MMP search
allows quick searching against large reference genomes
other tools: often try to align the entire read sequence first, then split the reads and perform iterative rounds of mapping -> slow
What is a splicing graph?
= directed graph with exons as nodes and edges connecting exons that are consecutive in at least one transcript
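A minimal sketch of building such a graph, assuming exons are the nodes and an edge joins two exons that follow each other in at least one transcript. The exon and transcript names are made up.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """transcripts: dict mapping transcript name -> ordered list of exon names."""
    graph = defaultdict(set)
    for exons in transcripts.values():
        for exon in exons:
            graph[exon]  # touch the key so every exon becomes a node
        for upstream, downstream in zip(exons, exons[1:]):
            graph[upstream].add(downstream)  # directed edge along the transcript
    return {node: sorted(targets) for node, targets in graph.items()}


# Two hypothetical isoforms of one gene; t2 skips exon E2
transcripts = {
    "t1": ["E1", "E2", "E3"],
    "t2": ["E1", "E3"],
}
splicing_graph = build_splicing_graph(transcripts)
```

A node with more than one outgoing edge (here E1) indicates an alternative splicing event, in this toy case an exon skip.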
What is the workflow of SplAdder?
What graphs can it create?
= tool for alternative splicing analysis based on RNA-Seq alignment data
Possible graphs from alternative splicing events:
What are the goal and the problems of RNA-seq pseudo-alignment?
Estimate transcript abundances
Classify diseases, understand expression changes, track cancer progression
Problems:
Only an estimate, as a huge number of sequences is considered
Nonetheless accurate enough to be used for diagnostics
Traditional: Alignment based approaches
But: Slow, substantial computational resources
What is KALLISTO for?
What is the in- and output of KALLISTO?
= used for fast and accurate quantification of gene expression from RNA-seq data
-> get transcript abundances for gene expression analysis
Input:
Reference transcriptome
RNA-Seq reads from experiment
Output:
Kallisto
Index, Quantification of RNA-Seq samples (transcript abundances)
Steps to solve this problem:
Create a transcript de Bruijn graph (T-DBG) from the x given transcript sequences, based on k-mers with k ∈ ℕ
Create a node for each distinct k-mer (sequence snippet of length k) and connect consecutive k-mers with an edge
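A minimal sketch of the graph construction, with made-up transcript sequences and k = 3. Storing the set of transcripts per k-mer is an illustrative choice here, not a description of kallisto's internal data structures.

```python
def build_tdbg(transcripts, k):
    """Return (nodes, edges) of a transcript de Bruijn graph.

    nodes: dict mapping each distinct k-mer to the transcripts containing it
    edges: set of (k-mer, next k-mer) pairs that are consecutive in a transcript
    """
    nodes, edges = {}, set()
    for name, sequence in transcripts.items():
        kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
        for kmer in kmers:
            nodes.setdefault(kmer, set()).add(name)
        edges.update(zip(kmers, kmers[1:]))
    return nodes, edges


# Two hypothetical transcripts that share the k-mers CGT and GTA
transcripts = {"t1": "ACGTA", "t2": "CGTAC"}
nodes, edges = build_tdbg(transcripts, k=3)
```

A read whose k-mers all fall on nodes shared by t1 and t2 is compatible with both transcripts, which is what the equivalence classes capture.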
A)
Apply expectation maximization (EM) to the RNA reads. Given x transcript sequences and y reads, create a table with reads as rows and transcripts as columns
AND
B)
Get equivalence classes from the read/transcript table
For each read, fill the cell with 1 if the read is compatible with the transcript, else 0
Rows with identical values belong to the same equivalence class
Counting the rows with identical values gives the count of the corresponding equivalence class
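Both steps can be sketched together on a toy table. This is a simplified illustration: the compatibility table is made up, the function names are hypothetical, and the EM ignores transcript lengths, which a real quantifier would take into account.

```python
from collections import Counter


def equivalence_classes(compatibility):
    """compatibility: one row per read with 0/1 flags, one per transcript."""
    return Counter(tuple(row) for row in compatibility)


def em_abundances(classes, n_transcripts, n_iter=100):
    """Plain EM on equivalence-class counts (transcript length ignored)."""
    alpha = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        assigned = [0.0] * n_transcripts
        for pattern, count in classes.items():
            denom = sum(a for a, flag in zip(alpha, pattern) if flag)
            if denom == 0:
                continue  # reads compatible with no transcript are skipped
            # E-step: split the class count proportionally to current abundances
            for t, flag in enumerate(pattern):
                if flag:
                    assigned[t] += count * alpha[t] / denom
        # M-step: renormalize the expected read counts to proportions
        total = sum(assigned)
        alpha = [x / total for x in assigned]
    return alpha


# Hypothetical table: 5 reads (rows) x 3 transcripts (columns)
compatibility = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
classes = equivalence_classes(compatibility)
abundances = em_abundances(classes, n_transcripts=3)
```

Transcript 2 is supported only by reads it shares with transcript 1, which also has unique reads, so the EM gradually assigns the ambiguous reads to transcript 1.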
Explain the cell quantification technique:
Flow Cytometry
What are the limitations?
Characterizes and defines cell types in a heterogeneous cell population by analysing the expression of cell surface and intracellular molecules
Used to measure fluorescence produced by fluorescent antibodies detecting proteins
Limitations:
Slow data acquisition
Cell transmissions efficiency
Not all measured cells are alive
Immunohistochemistry staining (IHC staining)
Uses antibodies to bind proteins in tissues
Visualize location of specific cell types by using layers of coloured complexes
High dependency on antibody quality
No unique reference for the stainings
High set up costs
What is the deconvolution problem?
= challenge of estimating the cell type or tissue composition of a heterogeneous sample based on its gene expression profile
-> Gene expression in a heterogeneous sample can be modelled as the weighted sum of the expression values of each cell type present in the mixture.
Matrix notation:
D = C x F
D = measured expression values
C = cell type specific expression values
F = relative cell type proportion
-> Reference-based deconvolution: D and C are given, F is estimated
What is the Constrained optimization problem?
What is another approach for it?
Goal: minimize sum of squares between fitted (C x F) and observed values D
Constraints:
Each proportion in F lies in [0, 1]
Sum of proportions per sample = 1
Another approach:
latent variable problem, where each cell type constitutes a latent factor
extract latent variables from matrix with Nonnegative Matrix Factorization (NMF)
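The constrained least-squares formulation can be sketched for a single sample. Projected gradient descent onto the probability simplex is used here as one possible solver, chosen for brevity and not taken from any particular deconvolution tool; the signature matrix C, the function names and the step-size rule are assumptions for this toy example.

```python
def project_to_simplex(v):
    """Euclidean projection onto {f : f >= 0, sum(f) = 1}."""
    cumulative, theta = 0.0, 0.0
    for i, u in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u
        t = (cumulative - 1.0) / i
        if u - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def deconvolve(C, d, n_iter=500):
    """Minimize ||C f - d||^2 subject to f >= 0 and sum(f) = 1."""
    n_genes, n_types = len(C), len(C[0])
    # Conservative step size based on the squared Frobenius norm of C
    step = 1.0 / (2.0 * sum(c * c for row in C for c in row))
    f = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [sum(C[g][k] * f[k] for k in range(n_types)) - d[g]
                    for g in range(n_genes)]
        gradient = [2.0 * sum(C[g][k] * residual[g] for g in range(n_genes))
                    for k in range(n_types)]
        f = project_to_simplex([f[k] - step * gradient[k] for k in range(n_types)])
    return f


# Hypothetical signature matrix C: 4 genes (rows) x 2 cell types (columns)
C = [[10.0, 1.0], [1.0, 8.0], [5.0, 5.0], [2.0, 7.0]]
true_f = [0.3, 0.7]
# Simulated bulk expression of one sample: d = C x f
d = [sum(c * f for c, f in zip(row, true_f)) for row in C]

estimated = deconvolve(C, d)
```

In this noise-free toy case the estimate recovers the proportions used to simulate the mixture; with real data the residual will not vanish.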
What is Nonnegative Matrix Factorization (NMF)?
unsupervised learning algorithm
projects data into a lower-dimensional space
reduces the number of features, while retaining the basic information
decomposes a matrix containing only non-negative coefficients into two non-negative matrices with reduced rank
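A minimal sketch of NMF using the classic multiplicative update rules for the squared error, written in plain Python for a tiny matrix. The matrix V, the rank, the iteration count and the helper names are made up for illustration; real analyses would use a numerical library.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Squared reconstruction error ||V - W H||^2."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # Update H: H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Hypothetical non-negative 4 x 3 matrix of rank 2
V = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [1.0, 3.0, 3.0], [2.0, 5.0, 3.0]]
W, H = nmf(V, rank=2)
error = frobenius_error(V, W, H)
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what makes the factors interpretable as additive parts.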
Name reference based tools for in silico deconvolution of bulk RNA-Seq data
Marker genes
= list of enriched genes for each cell type
Deconvolution
= ‘inverse’ matrix multiplication with reference profiles