Genome Analysis

by Mista F.

Briefly describe the process of Sanger sequencing.

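DNA polymerase extends a primer and incorporates fluorescently labelled chain terminators (ddNTPs)
results in fragments of random length, each ending in a labelled terminator
sort fragments by length (capillary gel electrophoresis)
computer reads light signals -> the color identifies the terminal base at each position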

-> But:

read length limited to roughly 1 kb
low throughput and high cost per base (per-base accuracy itself is high)
laborious and slow

Name some Next-generation-sequencing approaches.

2nd Generation (clonally amplified collections of molecules)
- Illumina
- Roche/454
- SOLiD
- Ion Torrent
3rd Generation (single molecules)
- Oxford Nanopore
- Pacific Biosciences

Give an overview of the experimental steps in an RNA-sequencing protocol.

sample extraction
target enrichment
cDNA synthesis (reverse transcription, then add adapters for PCR & amplify)
Sequencing
Mapping
Data analysis

What is the general workflow of Illumina sequencing?

Library preparation
- Fragment
- add adaptors
- amplify
Cluster generation
- attach to flow cell
- bridge amplification -> build double strand from fragment
- split bridge into two single strands
- generate clusters
Sequencing (by synthesis)
- add fluorescent nucleotides
- laser -> image -> determine base

Name different general 2nd Gen NGS approaches.

sequence-by-synthesis
- Illumina
sequence-by-ligation
- SOLiD
pyrosequencing
- 454

What is a measure of base quality in NGS reads?

PHRED quality score
embedded into FASTQ file
defined from the estimated error probability P of a base call: Q = -10 * log10(P)
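
The relation between PHRED score and error probability can be sketched in a few lines. The function names below are illustrative, not taken from any particular library.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a PHRED quality score Q into the estimated error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an estimated error probability P into a PHRED quality score Q."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: estimated error probability {phred_to_error_prob(q):.4f}")
```

A score of 20 therefore corresponds to an estimated 1 in 100 chance that the base is wrong, and a score of 30 to 1 in 1000.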

Name a tool used to assess the quality of NGS reads. How is quality improved/checked?

FastQC
quality control:
- trimming (Trimmomatic)
  - fixed length
  - adaptive (based on quality score)
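
The two trimming modes can be illustrated with a minimal sketch. This is not Trimmomatic's implementation; the function names and the threshold of 20 are assumptions chosen for the example, and the adaptive variant shown here only removes low-quality bases from the 3' end.

```python
def trim_fixed(seq, quals, length):
    """Fixed-length trimming: keep only the first `length` bases."""
    return seq[:length], quals[:length]


def trim_trailing(seq, quals, min_q=20):
    """Adaptive trimming: drop bases from the 3' end while their
    quality score is below `min_q` (hypothetical default of 20)."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


if __name__ == "__main__":
    seq = "ACGTAC"
    quals = [30, 30, 30, 30, 10, 5]
    print(trim_fixed(seq, quals, 3))
    print(trim_trailing(seq, quals))
```

Fixed-length trimming ignores the scores entirely, while the adaptive version keeps as much of each read as its quality allows.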

Explain the FASTQ format.

extension of the FASTA format
includes PHRED quality scores (from 0 to 93)
- encoded in ASCII 33–126
- range of error probability
  - from 1 (wrong base)
  - to 10^-9.3 (extremely accurate)
Format:
- sequence header = @ID name length=n
- sequence
- quality header = +ID name length=n
- PHRED scores
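
The four-line record layout and the ASCII encoding can be made concrete with a small parser. It is a sketch that assumes every record occupies exactly four lines and that qualities use the offset of 33 described above; the example read is made up.

```python
def decode_qualities(quality_line, offset=33):
    """Turn the ASCII quality string into a list of PHRED scores."""
    return [ord(ch) - offset for ch in quality_line]


def parse_fastq(text):
    """Parse FASTQ text into (header, sequence, scores) tuples.

    Assumes strictly four lines per record (no wrapped sequences).
    """
    lines = [ln.strip() for ln in text.strip().splitlines()]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], seq, decode_qualities(qual)))
    return records


# Made-up example record: '!' is ASCII 33 (score 0), '~' is ASCII 126 (score 93).
example = "@read1 length=4\nACGT\n+read1 length=4\n!5I~\n"

if __name__ == "__main__":
    print(parse_fastq(example))
```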

What are some common sequence artifacts in NGS data?

read errors
base call errors
small insertions/deletions
poor quality reads
primer/adapter contamination —> trim adapters

What is a contig?

contig = contiguous sequence
- consensus sequence built by merging reads that overlap

Describe how a genome is assembled.

Fragment DNA and sequence the fragments (reads)
Find overlaps between reads
merge overlaps into contigs
merge contigs into scaffolds
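
The "find overlaps, merge into contigs" steps can be sketched with a naive greedy merge. This is a toy illustration under strong assumptions (error-free reads, same strand, no repeats), not how production assemblers work; the function names and the minimum overlap of 3 are made up for the example.

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if it is shorter than `min_len`)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_length(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more: the remaining sequences are separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTA"]))
```

The all-against-all comparison in the inner loops also hints at why assembly is computationally expensive.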

Why are genomes hard to assemble?

Accuracy
- no ground truth
Sequencing errors
Computationally expensive
Distinguishing real biological variation (e.g. repeats, heterozygosity) from sequencing errors

How does high read coverage help in sequencing? Which problem cannot be overcome with it?

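decreases the error rate of the consensus (random read errors are outvoted)
problem: repeats (repeats longer than the reads stay ambiguous at any coverage)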
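
Why coverage lowers the error rate can be sketched with a per-position majority vote. The example assumes the reads are already aligned to each other and have equal length, which real data would not guarantee.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote per column over equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


if __name__ == "__main__":
    # Three reads covering the same region; the second has one base-call error.
    print(consensus(["ACGT", "ACCT", "ACGT"]))
```

A vote like this can only fix random errors. It does not tell the assembler where a repeated read belongs, which is why repeats stay a problem.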

What graph-based solution for assembly would you use for long and short reads? How are repeat sequences represented in the graphs?

Long reads -> overlap graph
Short reads -> de Bruijn graph (built from k-mers, e.g. 3-mers)

-> Repeat sequences create a fork in graphs
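
How a repeat produces a fork can be seen in a small de Bruijn graph built from 3-mers. In this sketch the nodes are (k-1)-mers and every k-mer adds an edge from its prefix to its suffix; the function names and the example sequence are made up.

```python
from collections import defaultdict


def de_bruijn(reads, k=3):
    """Build a de Bruijn graph: node = (k-1)-mer, edge = k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


def forks(graph):
    """Nodes with more than one outgoing edge, i.e. ambiguous continuations."""
    return {node: nxt for node, nxt in graph.items() if len(nxt) > 1}


if __name__ == "__main__":
    # "ATG" occurs twice and is followed by a different base each time.
    graph = de_bruijn(["ATGCATGT"], k=3)
    print(graph)
    print(forks(graph))
```

After the repeated "ATG", the node "TG" can continue to either "GC" or "GT", and the graph alone cannot say which path was taken first.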

Describe a typical RNA-seq workflow.

Quality control
Read alignment to reference genome
Transcriptome assembly
Differential expression

Name the three different transcriptome assembly strategies.

reference-based
de novo
combined

Name some splice-aware aligners. What tool can be used to create and traverse overlap graphs?

aligners:

TopHat
BLAT

graph:

Cufflinks

How do you normalize read counts for genes based on read numbers and length? What is the benefit?

How? -> raw read count / gene length (and / total number of mapped reads) = normalized read count
Why? -> compare genes of different length and different conditions/experiments

Explain some measures to normalize transcript expression.

RPKM: reads per kilobase of transcript per million mapped reads
- norms by:
  - transcript length
  - total number of mapped reads
FPKM: fragments per kilobase of transcript per million mapped reads
- for paired-end data
- like RPKM, but the two reads of a pair count as one fragment
TPM: transcripts per million
- norms by:
  - first: gene length
  - then: sequencing depth
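
The difference in the order of normalization between RPKM and TPM can be sketched as follows. The gene names, counts and lengths are made up, and the sum of the counted reads stands in for the total number of mapped reads.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())  # assumption: all mapped reads are counted here
    return {
        gene: counts[gene] * 1e9 / (total_reads * lengths_bp[gene])
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length first, then sequencing depth."""
    per_kb = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    scale = sum(per_kb.values()) / 1e6
    return {gene: value / scale for gene, value in per_kb.items()}


if __name__ == "__main__":
    counts = {"geneA": 100, "geneB": 300}        # made-up read counts
    lengths_bp = {"geneA": 1000, "geneB": 3000}  # made-up transcript lengths
    print(rpkm(counts, lengths_bp))
    print(tpm(counts, lengths_bp))
```

Because TPM rescales after the length correction, the values of one sample always sum to one million, which makes proportions comparable between samples.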

Describe the VCF file format.

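-> Variant Call Format

-> describes sequence variants (e.g. SNPs, indels) relative to a reference, for one or more samples

FORMAT column (defines the per-sample fields, e.g. GT = genotype):

-> homozygous = both copies carry the same allele (e.g. 0/0 or 1/1)

-> heterozygous = the two copies carry different alleles (e.g. 0/1)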
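
Reading the genotype from a VCF data line can be sketched like this. The example line is made up, and the code assumes a tab-separated line with the FORMAT column in position 9, followed by one column per sample, for a diploid organism.

```python
def sample_genotype(vcf_line, sample_index=0):
    """Return the GT value of one sample from a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")            # FORMAT column, e.g. GT:DP
    sample_values = fields[9 + sample_index].split(":")
    return dict(zip(format_keys, sample_values))["GT"]


def classify_genotype(gt):
    """Classify a diploid genotype string such as 0/1 or 1|1."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    return "homozygous" if len(set(alleles)) == 1 else "heterozygous"


# Made-up record: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
line = "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:35"

if __name__ == "__main__":
    gt = sample_genotype(line)
    print(gt, classify_genotype(gt))
```

In the GT field, 0 stands for the reference allele and 1 for the first alternative allele, so 0/1 means the sample carries one copy of each.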

What is single-cell sequencing? What are the main differences to bulk RNA-seq?

single-cell:
- use barcodes to identify RNA of individual cells
differences:
- bulk RNA-seq:
  - comparative transcriptomics
  - disease biomarkers
  - homogeneous systems
  - ~20,000 mRNA transcripts
- scRNA-seq:
  - identify rare cell populations
  - cell population dynamics
  - define heterogeneity
  - 200-10,000 mRNA transcripts/cell
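
The role of the barcodes can be illustrated with a small counting sketch. It assumes every read has already been reduced to a (cell barcode, gene) pair, which in practice takes demultiplexing and alignment; the barcodes and gene names are made up.

```python
from collections import Counter, defaultdict


def count_per_cell(tagged_reads):
    """Build a per-cell gene count table from (barcode, gene) pairs."""
    counts = defaultdict(Counter)
    for barcode, gene in tagged_reads:
        counts[barcode][gene] += 1
    return {barcode: dict(genes) for barcode, genes in counts.items()}


if __name__ == "__main__":
    reads = [
        ("AAAC", "geneA"),
        ("AAAC", "geneA"),
        ("AAAC", "geneB"),
        ("GGTT", "geneB"),
    ]
    print(count_per_cell(reads))
```

Bulk RNA-seq would collapse the same reads into a single count per gene, losing the information about which cell each transcript came from.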

What are common applications for single-cell RNA sequencing?

Deconvolution -> cluster cell populations
Trajectory -> cell differentiation
Networks -> Gene Regulatory Networks (GRNs)

Describe the steps of Drop-seq.
