Briefly describe the process of Sanger sequencing.
DNA polymerase incorporates fluorescently labeled chain terminators (ddNTPs)
results in fragments terminated at random positions, each marked by its terminator
sort fragments by length (gel or capillary electrophoresis)
computer reads light signals -> color gives the base at each position
-> But:
reads limited to about 1 kb
low throughput (per-base accuracy is high, but only one template is read per reaction)
laborious and slow
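The chain-termination idea above can be sketched as a toy simulation. This is a minimal sketch under stated assumptions: `simulate_sanger`, `stop_prob` and `n_fragments` are hypothetical names, the input is treated as the newly synthesized strand (complementarity to the template is ignored), and real base calling from fluorescence traces is far more involved.

```python
import random

def simulate_sanger(strand, n_fragments=2000, stop_prob=0.05, seed=0):
    """Toy chain-termination model: return the sequence read from sorted fragments."""
    rng = random.Random(seed)
    label_by_length = {}
    for _ in range(n_fragments):
        # Extend base by base; a labeled terminator is incorporated at random.
        for length, base in enumerate(strand, start=1):
            if rng.random() < stop_prob:
                label_by_length[length] = base  # fragment length -> dye color
                break
    # "Electrophoresis": order fragments by length and read the label of each.
    return "".join(label_by_length[n] for n in sorted(label_by_length))

print(simulate_sanger("ATGCGTACCTGA"))
```

With enough fragments every length is represented, so reading the labels in order of fragment length recovers the sequence.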
Name some Next-generation-sequencing approaches.
2nd Generation (collection of molecules)
Illumina
Roche/454
SOLiD
Ion Torrent
3rd Generation (single molecules)
Oxford Nanopore
Pacific Biosciences
Give an overview of the experimental steps in an RNA-sequencing protocol.
sample extraction
target enrichment
cDNA synthesis (add adapters for PCR & amplify)
Sequencing
Mapping
Data analysis
What is the general workflow of Illumina sequencing?
Library preparation
Fragment
add adapters
amplify
Cluster generation
attach to flow cell
bridge amplification -> build double strand from fragment
split bridge into two single strands
generate clusters
Sequencing (by synthesis)
add fluorescent nucleotides
laser -> image -> determine base
Name different general 2nd Gen NGS approaches.
sequence-by-synthesis
sequence-by-ligation
pyrosequencing
454
What is a measure of base quality in NGS reads?
PHRED quality score
embedded into FASTQ file
encodes the estimated error probability P of a base call: Q = -10 * log10(P)
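The relation between PHRED score and error probability can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not taken from any library.

```python
import math

def phred_to_error_prob(q):
    """Convert a PHRED score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a PHRED score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    # Q=10 -> 1 error in 10 bases, Q=20 -> 1 in 100, Q=30 -> 1 in 1000, ...
    print(q, phred_to_error_prob(q))
```

Each step of 10 in Q corresponds to a tenfold drop in the estimated error probability.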
Name a tool used to assess the quality of NGS reads. How is quality improved/checked?
FastQC
quality control:
trimming (Trimmomatic)
fixed length
adaptive (based on quality score)
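The two trimming modes can be sketched as follows. This is a simplified illustration, not Trimmomatic's actual algorithm: `trim_fixed`, `trim_adaptive` and the threshold `min_quality=20` are assumptions chosen for the example.

```python
def trim_fixed(seq, quals, length):
    """Fixed-length trimming: keep only the first `length` bases."""
    return seq[:length], quals[:length]

def trim_adaptive(seq, quals, min_quality=20):
    """Adaptive trimming: drop 3' bases while their PHRED score is below min_quality."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]

seq = "ACGTACGT"
quals = [38, 37, 36, 30, 25, 12, 8, 2]
print(trim_fixed(seq, quals, 4))   # ('ACGT', [38, 37, 36, 30])
print(trim_adaptive(seq, quals))   # ('ACGTA', [38, 37, 36, 30, 25])
```

Fixed-length trimming ignores the scores entirely, while the adaptive variant keeps as much of each read as its quality allows.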
Explain the FASTQ format.
extension of FASTA
includes PHRED quality scores (from 0 to 93)
encoded in ASCII 33–126
range of error probability
from 1 (wrong base)
to 10^-9.3 (extremely accurate)
Format:
sequence header = @ID name length=n
sequence
quality header = +ID name length=n
PHRED scores
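A four-line record in this format can be decoded as sketched below. The sketch assumes the Phred+33 encoding described above (ASCII code minus 33) and one record per four lines; `parse_fastq_record` and the example read are made up for illustration.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (read_id, sequence, phred_scores)."""
    header, sequence, plus, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Phred+33: the ASCII code of each character minus 33 is the quality score.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores

record = ["@read1 example length=4", "ACGT", "+", "!5I~"]
print(parse_fastq_record(record))
# ('read1 example length=4', 'ACGT', [0, 20, 40, 93])
```

The characters `!` and `~` are the two ends of the ASCII range 33-126, so they decode to the minimum and maximum scores 0 and 93.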
What are some common sequence artifacts in NGS data?
read errors
base call errors
small insertions/deletions
poor quality reads
primer/adapter contamination —> trim adapters
What is a contig?
contig = contiguous sequence
built by merging reads along their longest overlaps into one consensus sequence
Describe how a genome is assembled.
Fragment DNA
Find overlaps
merge overlaps into contigs
merge contigs into scaffolds
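The "find overlaps, merge into contigs" steps can be sketched with a greedy strategy. This is a toy illustration only: real assemblers use graph-based methods and tolerate sequencing errors, and `overlap`, `greedy_assemble` and `min_len` are hypothetical names.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap into a contig."""
    reads = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return reads  # no overlaps left: the remaining strings are the contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]

print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"]))  # ['ATGGCGTGCAATC']
```

Reads without any sufficient overlap stay separate contigs; joining those would be the scaffolding step, which needs extra information such as paired-end distances.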
Why are genomes hard to assemble?
Accuracy
no ground truth
Sequencing errors
Computationally expensive
Differentiating real biological variation from sequencing errors
How does high read coverage help in sequencing? Which problem cannot be overcome with it?
decreases the error rate of the consensus (random read errors are outvoted)
problem: repeats
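Why coverage reduces errors can be illustrated with a per-position majority vote. This is a minimal sketch assuming the reads are already aligned to the same coordinates and have equal length; `consensus` is an illustrative name.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority vote per column over reads aligned to the same coordinates."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )

reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACCTACGT",  # one base-call error at position 3
    "ACGTACGA",  # one base-call error at position 8
    "ACGTACGT",
]
print(consensus(reads))  # ACGTACGT
```

Voting only corrects random errors. It does not tell two copies of a repeat apart, because reads from either copy look identical.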
What graph-based solution for assembly would you use for long and short reads? How are repeat sequences represented in the graphs?
Long reads -> overlap graph
Short reads -> de Bruijn graph (built from k-mers, e.g. 3-mers)
-> Repeat sequences create a fork in graphs
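The fork caused by a repeat can be made visible in a small de Bruijn graph. This is a minimal sketch assuming error-free reads and k = 3; `de_bruijn` is an illustrative name, and the example sequence was chosen so that the 3-mer TGC occurs twice.

```python
from collections import defaultdict

def de_bruijn(reads, k=3):
    """Build a de Bruijn graph: nodes are (k-1)-mers, every k-mer adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)

graph = de_bruijn(["ATGCGTGCA"], k=3)
# Nodes with more than one successor are forks.
forks = {node: sorted(nxt) for node, nxt in graph.items() if len(nxt) > 1}
print(forks)  # {'GC': ['CA', 'CG']}
```

After the repeated TGC the path can continue with either CG or CA, so the graph alone cannot decide the order in which the branches are traversed.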
Describe a typical RNA-seq workflow.
Quality control
Read alignment to reference genome
Transcriptome assembly
Differential expression
Name the three different transcriptome assembly strategies.
reference-based
de novo
combined
Name some splice-aware aligners. What tool can be used to create and traverse overlap graphs?
aligners:
TopHat
Blat
graph:
Cufflinks
How do you normalize read counts for genes based on read numbers and length? What is the benefit?
How? -> raw read count / (gene length x total number of mapped reads) = normalized read count
Why? -> compare genes of different length and different conditions/experiments
Explain some measures to normalize transcript expression.
RPKM: reads per kilobase of transcript per million mapped reads
normalizes by:
transcript length
total number of mapped reads
FPKM: fragments per kilobase of transcript per million mapped reads
used for paired-end data
like RPKM, but a read pair counts as one fragment
TPM: transcripts per million
first: gene length
then: sequencing depth
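The difference in the order of normalization between RPKM and TPM can be sketched with made-up counts. This is a minimal sketch; the gene names, counts and lengths are invented for illustration, and the total of mapped reads is assumed to be the sum over the listed genes.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {g: counts[g] / (lengths_bp[g] / 1e3) / (total / 1e6) for g in counts}

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to one million."""
    rate = {g: counts[g] / (lengths_bp[g] / 1e3) for g in counts}
    scale = sum(rate.values()) / 1e6
    return {g: r / scale for g, r in rate.items()}

counts = {"geneA": 100, "geneB": 200, "geneC": 700}       # hypothetical read counts
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 7000}   # hypothetical lengths in bp

print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

All three genes have the same reads-per-kilobase rate here, so both measures assign them equal expression despite the different raw counts. TPM values always sum to one million per sample, which makes samples easier to compare.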
Describe the VCF file format.
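-> Variant Call Format
-> describes sequence variants (e.g. SNPs, indels) relative to a reference, per sample
FORMAT column:
-> defines the per-sample fields, e.g. GT (genotype)
-> homozygous = both alleles the same (e.g. 0/0 or 1/1)
-> heterozygous = alleles differ (e.g. 0/1)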
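Reading the genotype from a VCF data line can be sketched as follows. The sketch assumes the standard tab-separated column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample); `genotype_zygosity` and the example lines are made up for illustration.

```python
def genotype_zygosity(vcf_line, sample_index=0):
    """Classify the GT field of one sample in a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")                  # FORMAT column, e.g. GT:DP
    sample_values = fields[9 + sample_index].split(":")
    gt = dict(zip(format_keys, sample_values))["GT"]
    alleles = gt.replace("|", "/").split("/")           # phased (|) or unphased (/)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"
    return "heterozygous"

line = "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:35"
print(genotype_zygosity(line))  # heterozygous
```

In the GT field, 0 stands for the reference allele and 1 for the first alternate allele, so 0/1 carries one copy of each.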
What is single-cell sequencing? What are the main differences to bulk RNA-seq?
single-cell:
use barcodes to identify RNA of individual cells
differences:
bulk RNA-seq:
comparative transcriptomics
disease biomarkers
homogeneous systems
~20,000 mRNA transcripts
scRNA-seq:
identify rare cell populations
cell population dynamics
define heterogeneity
200-10,000 mRNA transcripts/cell
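How barcodes assign RNA to individual cells can be sketched as a simple grouping step. This is a minimal sketch that assumes each read has already been reduced to a (cell barcode, gene) pair; the barcodes and gene names are invented, and further steps of real pipelines are omitted.

```python
from collections import defaultdict

def count_by_cell(reads):
    """Group (cell_barcode, gene) pairs into a per-cell gene count table."""
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, gene in reads:
        counts[barcode][gene] += 1
    return {barcode: dict(genes) for barcode, genes in counts.items()}

reads = [
    ("AAAC", "GeneX"),
    ("AAAC", "GeneX"),
    ("AAAC", "GeneY"),
    ("TTTG", "GeneY"),
]
print(count_by_cell(reads))
# {'AAAC': {'GeneX': 2, 'GeneY': 1}, 'TTTG': {'GeneY': 1}}
```

Bulk RNA-seq would collapse this table into a single column of totals per gene, which is why it cannot resolve rare cell populations.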
What are common applications for single-cell RNA sequencing?
Deconvolution -> cluster cell populations
Trajectory -> cell differentiation
Networks -> Gene Regulatory Networks (GRNs)
Describe the steps of Drop-seq.