Lecture 5: Advanced RNA-Seq Methods

AdvBinfo

by Korbinian P.

Why RNA Seq? What is the goal?

How does it work?

Why?:

sequence the transcriptome to learn more about gene expression in cells

Goal:

gene/transcript quantification
transcript discovery
expression profiling

How?:

Design Experiment
RNA Preparation
- isolate and purify RNA
Prepare Libraries
- convert RNA -> cDNA, while adding sequencing adapters
Sequence
Analysis
- analyze short read seqs

What are the RNA-Seq read types?

single-end
- each fragment is sequenced from one end only
paired-end
- each fragment is sequenced from both ends, giving two reads per fragment
multiplexing
- adding an individual barcode sequence to each sample, so that several samples can be sequenced together

Explain the following Expression Units:

CPM
RPKM
FPKM
TPM

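CPM = Counts per million
- read count of a gene divided by the total number of mapped reads, times one million
RPKM = Reads per kilobase million
- CPM/RPM (reads per million) divided by gene length in kilobases
- used for single-end
FPKM = Fragments per kilobase million
- same as RPKM, but counts fragments (read pairs) instead of single reads, so a pair is not counted twice
- used for paired-end
TPM = Transcripts per million
- read count divided by gene length in kilobases first, then divided by the sum of these length-normalized values over all genes, times one million

-> All of them account for sequencing depth, and every unit except CPM accounts for gene length as well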
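
The units can be written down as a small sketch. The gene counts and lengths below are invented toy values, and the function names are only illustrative. FPKM is not shown separately because it uses the RPKM formula with fragment counts instead of read counts.

```python
def cpm(counts):
    """Counts per million: scale each gene's count by the sequencing depth."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def rpkm(counts, lengths_bp):
    """Reads per kilobase million: CPM divided by the gene length in kilobases."""
    return [x / (length / 1000) for x, length in zip(cpm(counts), lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: normalize by length first, then scale to one million."""
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total_rpk = sum(rpk)
    return [x / total_rpk * 1e6 for x in rpk]


counts = [100, 300, 600]           # toy read counts for three genes
lengths_bp = [1000, 2000, 3000]    # toy gene lengths in base pairs

cpm_values = cpm(counts)
rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

Because TPM normalizes by length before scaling, the TPM values of a sample always sum to one million, which is not true for RPKM/FPKM.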

What is differential expression analysis?

= taking the normalized read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups

(follows after RNA-Seq)

Why not t-test on read count data?

What is the alternative?

bias in data
TPM does not address differences in efficiency, protocol or tissue
sample size is often too low to estimate the variance of an individual gene’s expression with any confidence
t-test assumes normally distributed data, but RNA-seq read counts are better described by a negative binomial distribution

Alternative:

Borrow information across genes
- Option 1: transform the data and then use a linear model
- Option 2: model the mean/variance relationship and shrink the dispersion (e.g. DESeq2)
Use a negative binomial distribution, which accounts for the inequality of mean and variance through a dispersion parameter

What is overdispersion?

= the variance of the RNA read counts is higher than their mean, whereas a Poisson model assumes that both are equal

Reason: a transcript is present at slightly different levels in each sample

-> Use a negative binomial distribution instead of Poisson
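
A small numeric sketch of the difference, assuming the parameterization variance = mean + dispersion * mean^2 (the one used by DESeq2). The mean and dispersion values are made up for illustration.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Negative binomial: the dispersion adds a term that grows with mean squared."""
    return mean + dispersion * mean ** 2


mean_count = 100.0
dispersion = 0.1   # hypothetical value, chosen only for this example

poisson_var = poisson_variance(mean_count)
nb_var = negative_binomial_variance(mean_count, dispersion)
```

With a dispersion of 0 the negative binomial variance reduces to the Poisson variance; any positive dispersion makes the variance exceed the mean.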

What are the options of DESeq2?

Model read count with normalization
Median of ratios method
Dispersion shrinkage
Log fold change shrinkage

How does DESeq2 model read counts with normalization?

Uses a negative binomial generalized linear model
- Models the read count for gene x in sample y
Takes a size factor into account -> normalization factor
- Typically one constant factor per sample, but gene- and sample-specific factors are possible
- Estimated with the median of ratios method

How does DESeq2’s median of ratios method work?

Purpose: normalization
1. Create a pseudo-reference sample (row-wise geometric mean, i.e. per gene across all samples)
2. Calculate the ratio of each sample to the reference for every gene
3. Calculate the normalization factor (size factor) for each sample as the median of its ratios
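
The three steps can be sketched in plain Python. This is a simplified illustration, not DESeq2's own code: the count table is invented, and genes containing a zero count are skipped because their geometric mean would be zero.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: one row per gene, one column per sample. Returns one factor per sample."""
    n_samples = len(counts[0])
    # Genes with a zero in any sample cannot contribute a meaningful ratio
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Step 1: pseudo-reference = geometric mean per gene (computed on the log scale)
    log_reference = [sum(math.log(c) for c in row) / len(row) for row in usable]
    factors = []
    for j in range(n_samples):
        # Step 2: ratio of sample j to the reference, per gene (as log ratio)
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_reference)]
        # Step 3: the size factor is the median ratio
        factors.append(math.exp(median(log_ratios)))
    return factors


counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],      # skipped: contains a zero
]
factors = size_factors(counts)
```

In this toy table the second sample was sequenced twice as deeply as the first, so its size factor is twice as large. Dividing each column by its factor puts the samples on a common scale.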

How does dispersion shrinkage work in DESeq2?

1. Estimate the dispersion per gene (black)
2. Fit a smoothing curve (red)
3. Shrink the gene-specific dispersions towards the expected dispersion (blue) using an empirical Bayes approach

Strength of shrinkage depends on
- How close the dispersion values are to the fit
- Degrees of freedom (shrinkage is reduced for larger sample sizes)

What is log fold change (LFC) shrinkage for in DESeq2?

Problem with data: LFC estimates have a high variance for genes with low read counts

Figure panels:
A: before shrinking of LFC
B: after shrinking
C: counts show low dispersion for the green gene and high dispersion for the purple gene
D: density plot of likelihoods (solid lines, scaled to one) and posteriors (dashed) for the green and purple genes, and the prior (black)

-> Correction removes false positives

What is alternative splicing?

What is the problem with this and how can it be solved?

= the exons of a gene can be combined in different ways, resulting in different transcripts and therefore different proteins

Problem:

-> splicing makes mapping more difficult, since reads that span exon-exon junctions do not match the genome contiguously

Solution:

-> splice aware mapping

Exact match to the splice-site motifs GT-AG, GC-AG, AT-AC
-> Examples: TopHat2
List of known splice sites
-> Examples: HISAT2
Models (HMM, SVM)
-> Examples: HMMSplicer, PALMapper, OLego
Unbiased detection from the RNA-Seq reads themselves
-> Examples: CRAC, MapSplice2, STAR
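
The first strategy, checking a candidate intron against the known motifs, can be sketched as follows. The function names and the toy genome are invented for illustration, and only the forward strand is considered.

```python
# Donor/acceptor dinucleotide pairs listed above
KNOWN_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def intron_motif(genome, intron_start, intron_end):
    """Return the first and last two bases of the intron genome[intron_start:intron_end]."""
    donor = genome[intron_start:intron_start + 2]
    acceptor = genome[intron_end - 2:intron_end]
    return donor, acceptor


def has_known_motif(genome, intron_start, intron_end):
    """True if the candidate intron starts and ends with a known motif pair."""
    return intron_motif(genome, intron_start, intron_end) in KNOWN_MOTIFS


genome = "AAACCGTTTTTTAGGGCC"                    # toy sequence with an intron at 5..14
candidate_ok = has_known_motif(genome, 5, 14)    # intron "GTTTTTTAG"
candidate_bad = has_known_motif(genome, 3, 14)   # intron would start with "CC"
```

A mapper using this strategy would only accept a gapped alignment if the gap is flanked by one of these motif pairs.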

What is STAR (Spliced Transcripts Alignment to a Reference)?

What are the steps?

= a fast RNA-seq read mapper, with support for splice-junction and fusion read detection

Seed searching:
- Align reads by searching for the Maximal Mappable Prefix (MMP = longest prefix of the read that matches exactly at one or more locations on the reference genome)
- The parts of a read that are mapped separately are called seeds
  - The first MMP that was mapped to the genome is seed 1; the rest of the read is then searched in the reference, and its MMP becomes seed 2
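
The seed search can be illustrated with a naive sketch. It uses plain substring search instead of STAR's suffix array, ignores mismatches and the reverse strand, and the genome and read are invented toy sequences, so this only shows the idea of consecutive MMPs.

```python
def maximal_mappable_prefix(read, genome):
    """Longest prefix of `read` that occurs exactly in `genome`, with all its positions."""
    best_len = 0
    for length in range(1, len(read) + 1):
        if read[:length] in genome:
            best_len = length
        else:
            break   # a longer prefix cannot match if this one does not
    prefix = read[:best_len]
    positions = []
    if best_len:
        start = genome.find(prefix)
        while start != -1:
            positions.append(start)
            start = genome.find(prefix, start + 1)
    return prefix, positions


def seed_search(read, genome):
    """Split the read into consecutive MMPs (seeds)."""
    seeds = []
    offset = 0
    while offset < len(read):
        prefix, positions = maximal_mappable_prefix(read[offset:], genome)
        if not prefix:
            break   # the real tool would try extension or soft clipping here
        seeds.append((prefix, positions))
        offset += len(prefix)
    return seeds


# Toy genome: exon "ACGTAC", intron "GTTTTAG", exon "CCATGG"
genome = "ACGTAC" + "GTTTTAG" + "CCATGG"
read = "GTAC" + "CCATG"   # spliced read spanning the exon-exon junction
seeds = seed_search(read, genome)
```

The first seed ends where the read stops matching the genome, which is exactly the exon boundary. The second seed maps behind the intron, so the gap between both seeds marks the splice junction.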

Extend MMPs:
- If no exact match is found for a part of the read, the previous MMP is extended
  -> Reasons for no match: mismatches or indels

Soft Clipping:
-> If the extension does not return a good alignment, the poor-quality or adapter sequence is soft clipped

Cluster, Stitching and Scoring:
-> Create the complete read alignment by
- first clustering the seeds based on proximity to a set of anchor seeds, i.e. seeds that are not multi-mapping
- then stitching them together based on the best alignment for the read
- scoring the alignment based on mismatches, indels, gaps, etc.

How is the runtime of STAR?

efficient due to sequential search of only the unmapped portions of reads
uses an uncompressed suffix array (SA) for efficient MMP search
- allows quick searching against large reference genomes
other tools: often search for the entire read sequence before splitting reads and performing iterative rounds of mapping -> slow

What is a splicing graph?

= directed graph with exons as nodes and edges connecting exons that are consecutive in at least one transcript
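
A minimal sketch of such a graph, assuming the common convention in which exons are the nodes and an edge links two exons that follow each other in some transcript. The isoforms and exon names are made up.

```python
def build_splicing_graph(transcripts):
    """transcripts: dict mapping a transcript name to its ordered list of exon ids."""
    nodes = set()
    edges = set()
    for exons in transcripts.values():
        nodes.update(exons)
        # Each pair of neighbouring exons in a transcript becomes a directed edge
        edges.update(zip(exons, exons[1:]))
    return nodes, edges


transcripts = {
    "isoform_1": ["E1", "E2", "E3"],
    "isoform_2": ["E1", "E3"],   # skips exon E2
}
nodes, edges = build_splicing_graph(transcripts)
```

The edge from E1 to E3 that bypasses E2 is how an exon-skipping event shows up in the graph.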

What is the workflow of SplAdder?

What graphs can it create?

= tool for alternative splicing analysis based on RNA-Seq alignment data

Possible graphs from alternative splicing events:

What is the goal and problems of RNA-seq pseudo-alignments?

Goal:

Estimate transcript abundances
- Classify diseases, understand expression changes, track cancer progression

Problems:

Only an estimate, as a huge number of sequences is considered
- Nonetheless accurate enough to be used for diagnostics
Traditional: alignment-based approaches
- But: slow and need substantial computational resources

What is kallisto for?

What is the in- and output of kallisto?

= used for fast and accurate quantification of gene expression from RNA-seq data

-> get transcript abundances for gene expression analysis

Input:

Reference transcriptome
RNA-Seq reads from experiment

Output:

kallisto
- Index, quantification of RNA-Seq samples (transcript abundances)

Steps to solve this problem:

Create a transcriptome de Bruijn graph (T-DBG) based on k-mers with k ∈ ℕ, given x transcript sequences

For each distinct sequence snippet of length k (k-mer), create a node and connect the nodes of consecutive k-mers with an edge
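
The construction can be sketched like this. It is a simplified illustration rather than kallisto's implementation: the transcripts are toy sequences, and each node additionally remembers which transcripts contain its k-mer.

```python
def build_tdbg(transcripts, k):
    """Nodes are distinct k-mers; edges link k-mers that are consecutive in a transcript."""
    kmer_transcripts = {}   # k-mer -> set of transcripts that contain it
    edges = set()
    for name, seq in transcripts.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            kmer_transcripts.setdefault(kmer, set()).add(name)
        edges.update(zip(kmers, kmers[1:]))
    return kmer_transcripts, edges


transcripts = {"t1": "ACGTG", "t2": "CGTGA"}
kmer_transcripts, edges = build_tdbg(transcripts, k=3)
```

Shared k-mers such as CGT appear only once as a node, so every transcript corresponds to a path through the graph.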

Steps to solve this problem:

Apply EM (expectation maximization) to the RNA reads. Given x transcript sequences and y reads, create a table with reads as rows and transcripts as columns

AND

Get the equivalence classes from this table

For each read, fill the cell with 1 if the read is compatible with the transcript, else 0

All rows with the same values form one equivalence class
Counting the rows with the same values gives the corresponding equivalence class count
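
Both steps can be sketched on a toy table. The reads and the 0/1 entries are invented, and the EM shown here is deliberately simplified: it ignores transcript lengths, which a real quantifier takes into account.

```python
from collections import Counter


def equivalence_classes(table):
    """table: dict read -> list of 0/1 flags (one per transcript). Counts identical rows."""
    return Counter(tuple(row) for row in table.values())


def em_abundances(classes, n_transcripts, n_iter=100):
    """Simplified EM over equivalence classes (no transcript length correction)."""
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        assigned = [0.0] * n_transcripts
        for pattern, count in classes.items():
            # E-step: split the reads of a class by the current abundances
            weight = sum(theta[t] for t in range(n_transcripts) if pattern[t])
            if weight == 0:
                continue
            for t in range(n_transcripts):
                if pattern[t]:
                    assigned[t] += count * theta[t] / weight
        # M-step: new abundances are the normalized assigned read counts
        total = sum(assigned)
        theta = [a / total for a in assigned]
    return theta


table = {
    "read1": [1, 0, 0],
    "read2": [1, 1, 0],
    "read3": [1, 1, 0],
    "read4": [0, 0, 1],
    "read5": [1, 0, 0],
}
classes = equivalence_classes(table)
theta = em_abundances(classes, n_transcripts=3)
```

Transcript 2 is only ever seen together with transcript 1, while transcript 1 also has unique reads, so the EM iterations move the ambiguous reads towards transcript 1. Working on the three classes instead of the five reads is what keeps this step cheap for millions of reads.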

Explain the cell quantification technique:

Flow Cytometry

What are the limitations?

Characterizes and defines cell types in a heterogeneous cell population by analysing the expression of cell surface and intracellular molecules
Used to measure the fluorescence produced by fluorescent antibodies that detect proteins
Limitations:
- Slow data acquisition
- Cell transmission efficiency
- Not all measured cells are alive

Explain the cell quantification technique:

Immunohistochemistry staining (IHC staining)

Uses antibodies to bind proteins in tissues
Visualize location of specific cell types by using layers of coloured complexes
Limitations:
- High dependency on antibody quality
- No unique reference for the stainings
- High set up costs

What is the deconvolution problem?

= challenge of estimating the cell type or tissue composition of a heterogeneous sample based on its gene expression profile

-> Gene expression in a heterogeneous sample can be modelled as the weighted sum of the expression values of each cell type present in the mixture.

Matrix notation:

D = C x F

D = measured expression values

C = cell type specific expression values

F = relative cell type proportion

-> Reference-based deconvolution: D and C are given, F is estimated
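
The weighted sum behind D = C x F can be written out for one sample. The signature values and proportions are invented toy numbers, and `mix` is just an illustrative helper name.

```python
def mix(signature, proportions):
    """signature: rows = genes, columns = cell types; proportions: one value per cell type."""
    return [sum(e * f for e, f in zip(row, proportions)) for row in signature]


# C: cell-type-specific expression of three genes in two cell types
C = [
    [10.0, 0.0],
    [2.0, 8.0],
    [0.0, 4.0],
]
F = [0.25, 0.75]   # the sample consists of 25 % cell type 1 and 75 % cell type 2
D = mix(C, F)      # expression that would be measured in the bulk sample
```

Deconvolution runs this in the opposite direction: D is measured, C comes from a reference, and F is the unknown.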

What is the Constrained optimization problem?

What is another approach for it?

Goal: minimize sum of squares between fitted (C x F) and observed values D
Constraints:
- Proportions in F ∈ [0,1]
- Sum of proportion per sample = 1
Another approach:
- latent variable problem, where each cell type constitutes a latent factor
- extract latent variables from matrix with Nonnegative Matrix Factorization (NMF)
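
For two cell types the constrained problem can be solved by brute force, which makes the constraints easy to see: every candidate is a pair of proportions between 0 and 1 that sums to 1. This grid search is only a teaching sketch with invented numbers; real tools use proper constrained least-squares solvers.

```python
def mix(signature, proportions):
    """Fitted expression C x F for one sample."""
    return [sum(e * f for e, f in zip(row, proportions)) for row in signature]


def sum_of_squares(observed, fitted):
    return sum((o - p) ** 2 for o, p in zip(observed, fitted))


def fit_two_cell_types(signature, observed, steps=100):
    """Try f = 0, 1/steps, ..., 1 and keep the proportions with the smallest error."""
    best_f, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        f = i / steps                  # 0 <= f <= 1
        candidate = [f, 1.0 - f]       # proportions sum to 1
        err = sum_of_squares(observed, mix(signature, candidate))
        if err < best_err:
            best_f, best_err = f, err
    return [best_f, 1.0 - best_f]


C = [
    [10.0, 0.0],
    [2.0, 8.0],
    [0.0, 4.0],
]
observed = mix(C, [0.25, 0.75])   # simulated bulk measurement with known truth
estimate = fit_two_cell_types(C, observed)
```

Because the toy measurement is noise-free, the search recovers the true proportions exactly; with real data the minimum is only the best compromise over all genes.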

What is Nonnegative Matrix Factorization (NMF)?

unsupervised learning algorithm
projects data into lower-dimensional spaces
reduces the number of features while retaining the basic information
decomposes a matrix containing only non-negative coefficients into two non-negative matrices of reduced rank
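
One way to compute such a factorization is the multiplicative update rule of Lee and Seung, sketched here in plain Python. The choice of algorithm, the helper names and the toy matrix are assumptions made for illustration; libraries offer faster and more robust implementations.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W x H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h


v = [[1, 2, 3], [2, 4, 6], [1, 0, 1], [3, 2, 5]]   # toy matrix of rank 2
w_1, h_1 = nmf(v, rank=2, n_iter=1)
w, h = nmf(v, rank=2, n_iter=200)
error_after_1 = frobenius_error(v, w_1, h_1)
error_after_200 = frobenius_error(v, w, h)
```

Since the updates only multiply by non-negative ratios, W and H stay non-negative throughout. In the deconvolution setting the two factors play the roles of the cell type profiles and the proportions.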

Name reference based tools for in silico deconvolution of bulk RNA-Seq data

Marker genes
= list of enriched genes for each cell type
Deconvolution
= ‘inverse’ matrix multiplication with reference profiles
