Genome of an organism - definition (1)
the whole hereditary information of an organism that is encoded in the DNA (or, for some viruses, RNA)
includes both the genes and the non-coding sequences
a complete DNA sequence of one set of chromosomes, for example, one of the two sets that a diploid individual carries in every cell
The term genome can be applied specifically to mean the complete set of nuclear DNA (i.e. the nuclear genome) but can also be applied to organelles that contain their own DNA, as with the mitochondrial genome or the chloroplast genome
Tree of life (1)
Based on rRNA sequences, because rRNA genes are present in every genome
~3000 species
Animals, Plants, Protists, Bacteria, Archaea, Fungi
Prokaryotic cell (1)
1 micrometer
flagellum, capsule, nucleoid, chromosome, plasmid
Eukaryotic cell (1)
10 micrometers
golgi apparatus, nucleus, chromosomes, endoplasmic reticulum, mitochondria
3 domains (1)
Archaea
Bacteria
Eucarya
Chronology of genome sequencing projects (1)
1976: first viral genome, ~5000 bp
1981: Human mitochondrial genome, ~16,500 bp
1986: Chloroplast genome, ~156,000 bp
1995: first genome of a free-living organism
1996: First eukaryotic genome (S. cerevisiae)
1998: first multicellular organism
1999: first human chromosome
2000: D. melanogaster
2001: first draft sequence of human genome
the nuclear genome (1)
3,200,000,000 nucleotides of DNA
24 linear molecules, the shortest 50,000,000 nucleotides in length and the longest 260,000,000 nucleotides, each contained in a different chromosome
These 24 chromosomes consist of 22 autosomes and the two sex chromosomes, X and Y
the mitochondrial genome (1)
a circular DNA molecule of 16,569 nucleotides, multiple copies of which are located in the energy-generating organelles called mitochondria
how to sequence DNA (1)
DNA polymerase copies a strand of DNA
The insertion of a terminator base into the growing strand halts the copying process. This is a random event that results in a series of fragments of different lengths, depending on the base at which the copying stopped. The fragments are separated by size by running them through a gel matrix, with the shortest fragments at the bottom and the largest at the top.
The terminators are labelled with different fluorescent dyes, so each fragment will fluoresce a particular colour depending on whether it ends with an A, C, G or T base
The sequence is “read” by a computer. It generates a “sequence trace” as shown here, with the colored peaks corresponding to fluorescent bands read from the bottom to the top of one lane of the gel
The computer translates these fluorescent signals to DNA sequence
Clone-by-clone shotgun sequencing - General (1)
Construction of clone-based physical maps produces overlapping series of clones (that is, contigs), each of which spans a large, contiguous region of the source genome.
Individual mapped clones are subcloned into smaller insert libraries, from which sequence reads are randomly derived -> thousands of sequence reads per clone
The resulting sequence data set is then used to assemble the complete sequence of that clone
Clone-by-clone sequencing - Main steps (1)
an individual clone, such as a bacterial artificial chromosome (BAC), is selected
Large amount of BAC DNA is purified and fragmented
The random DNA fragments (typically 2-5 kb in size) are subcloned
sequence reads are generated from one or both ends of randomly selected subclones
the random reads are then assembled on the basis of sequence overlaps, yielding preliminary sequence assemblies (prefinished sequence)
such sequence is imperfect
gaps (breaks between the horizontal lines)
areas of poor sequence quality (thinner horizontal lines)
often, the order and orientation of some of the sequence contigs is not known
Subsequent customized sequence finishing, involving the generation of additional sequence data for closing gaps and bolstering areas of poor sequence quality, yields finished, highly accurate sequence across the entire clone
Whole-genome shotgun sequencing (1)
shotgun sequencing is a method used for sequencing random DNA strands
whole-genome sequencing is the process of determining the entirety, or nearly the entirety, of the DNA sequence of an organism's genome at a single time
Whole-genome shotgun sequencing:
the mapping phase is skipped
shotgun sequencing proceeds using subclone libraries prepared from the entire genome
typically, tens of millions of sequence reads are generated
computer-based assembly used to generate contiguous sequences of various sizes
shotgun-sequence assembly (1)
computational assembly of redundant collections of sequence reads, from which an accurate consensus sequence can be deduced
problems with tandemly repeated DNA or genome-wide repeats: can result in incorrect overlap
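The overlap detection at the heart of shotgun assembly can be sketched as follows. This is a minimal illustration, not a real assembler: the function names `overlap` and `merge` and the `min_len` threshold are assumptions made for the example, and only exact suffix/prefix matches are considered.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of read a that equals a prefix of read b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b, min_len=3):
    """Join two reads along their overlap, or return None if they do not overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None


print(merge("ACGTTGCA", "TGCAGGT"))   # reads sharing the 4-base overlap TGCA
print(overlap("ACACAC", "ACACGG"))    # a tandem repeat also produces an overlap
```

The second call shows why repeats are troublesome: a repeated unit yields an overlap even when the two reads may come from different places in the genome.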
Clone contig approach vs. Whole-genome shotgun approach (1)
Long-range sequence assembly in whole-genome shotgun sequencing (1)
Individual sequence reads are initially assembled into sequence contigs
Groups of sequence contigs are then organized into scaffolds on the basis of linking information provided by read pairs (in each case, with one sequence read from a pair assembling into one contig and the other read into another contig)
The scaffolds can be aligned relative to the source genome by the identification of already mapped, sequence-based landmarks (for example, STSs, genetic markers and genes; depicted as red circles) in the sequence contigs, thereby associating them with a known location on the genome map
hybrid shotgun-sequencing approach (1)
Generate BAC shotgun reads and generate whole-genome shotgun reads -> Combine overlapping whole-genome and BAC-derived reads -> assemble and finish
Main components of eukaryotic genomes (1)
Introns: 25.9%
LINEs (long interspersed nuclear elements): 20.4%
SINEs (short interspersed nuclear elements): 13.1%
Miscellaneous unique sequences: 11.6%
LTR retrotransposons
DNA transposons
etc.
SINEs (1)
short interspersed nuclear elements
LINEs (1)
long interspersed nuclear elements
Analysis of the proteome encoded by genomes - approaches (1)
database similarity search
All-against-all comparison within proteome
Protein comparison between proteomes
Search of clusters
What are DNA motifs? (2)
Short, recurring patterns in DNA that are presumed to have a biological function
often they indicate sequence specific binding sites for proteins such as nucleases and transcription factors (TF)
Others are involved in important processes at the RNA level
ribosome binding
mRNA processing (splicing, editing, polyadenylation)
transcription termination
The gene control region of a typical eukaryotic gene (2)
Prokaryotes vs. Eukaryotes (2)
Prokaryotes:
fewer TFs
long motifs
affinity depends on match
immediate upstream regulation
Eukaryotes:
more TFs per gene
shorter motifs
Much more noncoding seq
regulatory modules
long range regulation
Transcription factors (2)
often dimers or tetramers: palindromic binding site (cf. the German palindrome “Reliefpfeiler”)
binding
stochastic
affinity = structural / sequence match
high affinity not always desirable (!)
combinatorial regulation (esp. eukaryotes)
order important
site spacing important
Motif finding overview - Methods (2)
1 genome: sequence overrepresentation
Functional Genomics: predict regulons
N genomes: phylogenetic footprinting
N genomes + Functional Genomics: Phylocon, new ideas…
Regulatory elements in DNA sequences (2)
Binding sites for proteins
short substrings (5-25 nucleotides)
up to 1000 nucleotides (or farther) from gene
Inexactly repeating patterns (“motifs”)
Sequence logo (2)
showing the frequencies scaled relative to the information content (measure of conservation) at each position
idea: scale each stack of letters with some measure of conservation at each base
positions that are perfectly conserved contain 2 bits of information
those where two of the four bases occur 50% of the time each contain 1 bit
those where all four bases occur equally often contain no information
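The three cases above can be checked with a small calculation. This is a sketch that assumes a uniform background (0.25 per base) and ignores any small-sample correction; the function names `column_information` and `letter_heights` are chosen for illustration only.

```python
import math


def column_information(counts):
    """Information content in bits of one DNA alignment column.

    counts maps each base to its observed count. With a uniform background
    the maximum is log2(4) = 2 bits.
    """
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy


def letter_heights(counts):
    """Height of each letter in the logo stack: frequency times column information."""
    total = sum(counts.values())
    info = column_information(counts)
    return {base: (n / total) * info for base, n in counts.items()}


print(column_information({"A": 10, "C": 0, "G": 0, "T": 0}))  # perfectly conserved
print(column_information({"A": 5, "C": 0, "G": 5, "T": 0}))   # two bases, 50% each
print(column_information({"A": 3, "C": 3, "G": 3, "T": 3}))   # all four equally often
```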
Correcting Sequence logo for background frequencies (2)
The assumption that all 4 bases occur equally often in the background genomic DNA is not always true
Energy normalized logo: using relative entropy to adjust for low GC content
Simple way to overcome the lack of data (2)
Assume that each base type occurs at least once in each alignment position. If we treat all bases and all alignment columns the same way: p_i,b = (n_i,b + 1) / (N_seq + 4) (i.e. add artificial counts, as if we had more data)
Pseudocounts (2)
additional data to overcome lack of data
A simple way to overcome the lack of data: assume that each base type occurs at least once in each alignment position
This is equivalent to increasing the amount of data in each column by four bases
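The add-one rule p_i,b = (n_i,b + 1) / (N_seq + 4) can be written out for a single alignment column. A minimal sketch; the function name `column_probs_plus_one` is an assumption for the example.

```python
from collections import Counter

BASES = "ACGT"


def column_probs_plus_one(column):
    """Base probabilities for one alignment column with one pseudocount per base."""
    counts = Counter(column)
    n_seq = len(column)
    return {b: (counts[b] + 1) / (n_seq + 4) for b in BASES}


# Four aligned sequences that all show A at this position:
# no base gets probability zero even though C, G and T were never observed.
print(column_probs_plus_one("AAAA"))
```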
More sophisticated way of adding pseudocounts (2)
take advantage of the knowledge we have of the properties of sequences
Pb - frequency of occurrence of base of type b
beta: a simple scaling parameter that determines the total number of pseudocounts in an alignment column. The advantage is that we can easily adjust the relative weighting of the pseudocounts and the real data. When there are a lot of data (Nseq large) there is little need for pseudocounts, and beta should be small relative to Nseq, whereas when there are fewer data, beta should be larger relative to Nseq.
A popular solution: beta = sqrt(Nseq). At large values of NSeq we get p_i,b = (n_i,b / N_seq)
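The notes do not spell out the formula for this scheme. The sketch below assumes the commonly used form p_i,b = (n_i,b + beta * P_b) / (N_seq + beta), which reduces to n_i,b / N_seq when N_seq is large, as stated above. The function name `column_probs_beta` and the uniform background in the example are illustrative assumptions.

```python
import math
from collections import Counter


def column_probs_beta(column, background, beta=None):
    """Base probabilities with background-weighted pseudocounts.

    background maps each base to its overall frequency P_b.
    If beta is not given, beta = sqrt(N_seq) is used.
    """
    counts = Counter(column)
    n_seq = len(column)
    if beta is None:
        beta = math.sqrt(n_seq)
    return {b: (counts[b] + beta * background[b]) / (n_seq + beta)
            for b in background}


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(column_probs_beta("AAAA", uniform))        # few data: pseudocounts matter
print(column_probs_beta("A" * 10000, uniform))   # many data: close to n / N_seq
```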
Regular expression (2)
Regular expression: /[AT][GC][AC][ACGT]*A[TG][CG]/
The regular expression finds all possible sequences, but does not distinguish between the consensus sequence (ACAC--ATC) and a highly unlikely sequence (TGCT--AGG)
Weight matrices can be used to score the sequence but do not deal with insertions and deletions
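The limitation of the regular expression is easy to see with Python's `re` module: a match is a plain yes or no, with no score attached. The helper name `matches_motif` is an assumption for the example; gap characters are left out of the test sequences.

```python
import re

# The regular expression from this card.
MOTIF = re.compile(r"[AT][GC][AC][ACGT]*A[TG][CG]")


def matches_motif(seq):
    """True if the whole sequence is described by the regular expression."""
    return MOTIF.fullmatch(seq) is not None


# A consensus-like sequence and an unlikely one are both simply accepted.
print(matches_motif("ACACATC"))
print(matches_motif("TGCTAGG"))
```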
PROSITE database (2)
Collection of motifs and patterns that were derived from protein sequences and refined by reviewing protein families
Database of protein domains, families and functional sites
Log-Odds score (2)
the probability of a sequence x
depends on sequence length
becomes vanishingly small as x becomes longer
Solution: NULL (background) model
Treats sequences as random strings of nucleotides
For example a random distribution: A=T=C=G=0.25, or overall nucleotide frequencies in the organism
The probability of sequence of length L is 0.25^L
A high log-odds score corresponds to a sequence that looks more like a gene model than the background model
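A log-odds score against the null model can be sketched as follows. For brevity the "model" here is assumed to be a simple position-independent base composition rather than a full gene model; the names `log_probability`, `log_odds` and `UNIFORM` are illustrative.

```python
import math

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_probability(seq, freqs):
    """log2 P(seq) when every base is drawn independently from freqs."""
    return sum(math.log2(freqs[base]) for base in seq)


def log_odds(seq, model, background=UNIFORM):
    """log2 of P(seq | model) / P(seq | background)."""
    return log_probability(seq, model) - log_probability(seq, background)


# Under the uniform null model, log2 P of a sequence of length L is -2 * L.
print(log_probability("ACGT", UNIFORM))

# An A-rich model scores an A-rich sequence above the background.
a_rich = {"A": 0.5, "C": 1 / 6, "G": 1 / 6, "T": 1 / 6}
print(log_odds("AA", a_rich))
```

Because both probabilities shrink with sequence length, their ratio stays comparable across sequences of different lengths.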
Structure of the HMM - Example: a sequence is given that begins with an exon, contains a 5’ splice site and ends with an intron. Identify where the switch from exon to intron occurred, i.e. where the 5’ splice site is (2)
Sequences of exons, introns, and splice sites must have different statistical properties, e.g.
Exons: uniform base composition on average (25% each base)
Introns: A/T rich (40% each for A/T, 10% each for C/G)
The 5’ SS consensus nucleotide is almost always a G (95% G, 5% A)
HMM invokes three states, one for each of the three labels E (exon), 5 (5’ SS), and I (intron)
Each state has its own
emission probabilities: model base composition
transition probabilities: describe the linear order in which we expect states to occur
Highest-scoring state path: log probability = -41.22 -> this path gives the most likely position of the 5’ SS
What is hidden in a Hidden Markov Model (HMM)? (2)
HMM generates a sequence
When we visit a state, we emit a residue from the state’s emission probability distribution
Then we choose which state to visit next according to the state’s transition probability distribution
The model thus generates two strings of information:
the underlying state path (the labels), as we transition from state to state
the observed sequence, each residue being emitted from one state in the state path
The state path is a Markov chain: what state we go next depends only on what state we are in
Since we are only given the observed sequence, this underlying state path is hidden: these are residue labels we’d like to infer
The state path is a hidden Markov chain
Viterbi algorithm (2)
finds the most probable state path given a sequence and an HMM using dynamic programming
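A minimal Viterbi sketch for the three-state exon / 5’ SS / intron example. The emission probabilities are the ones given in these notes; the transition probabilities (0.9 to stay in E or I, 0.1 to leave, start in E, end from I) are assumptions made for illustration, and the test sequence is made up.

```python
import math

STATES = ["E", "5", "I"]
START = {"E": 1.0, "5": 0.0, "I": 0.0}    # assumed: the path starts in an exon
END = {"E": 0.0, "5": 0.0, "I": 0.1}      # assumed: the path ends from the intron
TRANS = {                                  # assumed transition probabilities
    "E": {"E": 0.9, "5": 0.1, "I": 0.0},
    "5": {"E": 0.0, "5": 0.0, "I": 1.0},
    "I": {"E": 0.0, "5": 0.0, "I": 0.9},
}
EMIT = {                                   # emission probabilities from the notes
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def log(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for seq and its log probability."""
    # v[i][s]: best log probability of a path that emits seq[:i+1] and ends in s
    v = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for x in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: v[-1][r] + log(TRANS[r][s]))
            row[s] = v[-1][prev] + log(TRANS[prev][s]) + log(EMIT[s][x])
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    last = max(STATES, key=lambda s: v[-1][s] + log(END[s]))
    score = v[-1][last] + log(END[last])
    # Trace the back-pointers from the last state to recover the path.
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), score


path, score = viterbi("CCGCCGTTTT")
print(path, score)
```

In this toy sequence there are two G's that could be the splice site; the path that places the 5’ SS at the second G wins because the C-rich stretch before it is far more probable as exon than as intron.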
Binding energy and searching for novel sites (2)
Affinity of a DNA binding protein to a specific binding site is typically correlated with how well the site matches the consensus sequence
But:
not all positions in a binding site are equally forgiving of mismatches
not all mismatches at a given position have the same effect
Identifying Motifs (2)
Genes are turned on or off by regulatory proteins
These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase
Regulatory protein X binds to a short DNA sequence called a motif
So finding the same motif in multiple genes’ regulatory regions suggests a regulatory relationship amongst those genes
Identifying Motifs: Complications (2)
We do not know the motif sequence
We do not know where it is located relative to the gene’s start
Motifs can differ slightly from one gene to the next
How to discern it from “random” motifs?
The Motif Finding Problem (2)
Given a random sample of DNA sequences
Find the pattern that is implanted in each of the individual arrays, namely, the motif
Additional information:
The hidden sequence is of length 8
The pattern is not exactly the same in each array because random point mutations may occur in the sequences
The problem: Can we still find the motif, now that we have 2 mutations? What is the consensus sequence?
Defining motifs (2)
to define a motif, let’s say we know where the motif starts in the sequence
the motif start positions in their sequences can be represented as s = (s1, s2, s3, …, st)
Line up the patterns by their start indexes
Construct matrix profile with frequencies of each nucleotide in columns
Consensus nucleotide in each position has the highest score in column
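The steps above (line up the l-mers, build the profile, read off the consensus) can be sketched as follows. Start positions are 0-based here, unlike the 1-based notation of the notes; the function names and the toy sequences are assumptions for the example.

```python
def profile(dna, starts, l):
    """Count matrix: for each motif column, how often each base occurs."""
    counts = {b: [0] * l for b in "ACGT"}
    for seq, s in zip(dna, starts):
        for j, base in enumerate(seq[s:s + l]):
            counts[base][j] += 1
    return counts


def consensus(counts):
    """Most frequent base in each column (ties go to the first of A, C, G, T)."""
    l = len(counts["A"])
    return "".join(max("ACGT", key=lambda b: counts[b][j]) for j in range(l))


dna = ["TTACGTAA", "ACGTCCCC", "GGGACCTG"]
starts = [2, 0, 3]                 # the aligned 4-mers are ACGT, ACGT, ACCT
counts = profile(dna, starts, 4)
print(counts)
print(consensus(counts))
```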
Consensus (2)
Consensus sequences help in finding motifs
Think of consensus as an “ancestor” motif, from which mutated motifs emerged
The distance between a real motif and the consensus sequence is generally less than that for two real motifs
Evaluating Motifs (2)
we have a guess about the motif starting points and consensus sequence, but how “good” is this consensus?
Need to introduce a scoring function to compare different guesses and choose the “best” one
The Motif Finding Problem: Formulation (2)
Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score
Input: A t x n matrix of DNA, and l, the length of the pattern to find (with t= number of sample DNA sequences, n= length of each DNA sequence)
Output: An array of t starting positions s = (s1, s2, …, st) maximizing Score(s, DNA)
The Motif Finding Problem: Brute Force Solution (2)
Compute the scores for each possible combination of starting positions s
the best score will determine the best profile and the consensus pattern in DNA
The goal is to maximize Score(s, DNA) by varying the starting positions si, where si = [1, …, n-l+1] for i = [1, …, t]
Run time = O(l · n^t), i.e. exponential in the number of sequences t (not practical)
Pseudocode for Brute Force Motif Search (2)
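A runnable sketch of the brute-force search, assuming all sequences have the same length n and using 0-based start positions. The consensus score is taken as the sum over columns of the count of the most frequent base; the function names and toy data are illustrative.

```python
from itertools import product


def score(dna, starts, l):
    """Consensus score: sum over motif columns of the highest base count."""
    total = 0
    for j in range(l):
        column = [seq[s + j] for seq, s in zip(dna, starts)]
        total += max(column.count(b) for b in "ACGT")
    return total


def brute_force_motif_search(dna, l):
    """Try every combination of start positions; return the best one and its score."""
    n = len(dna[0])
    best_score, best_starts = -1, None
    for starts in product(range(n - l + 1), repeat=len(dna)):
        current = score(dna, starts, l)
        if current > best_score:
            best_score, best_starts = current, starts
    return best_starts, best_score


dna = ["TTACGTAA", "ACGTCCCC", "GGGACGTG"]
print(brute_force_motif_search(dna, 4))
```

Even this toy input needs (n - l + 1)^t = 5^3 = 125 score evaluations, which illustrates why the approach does not scale.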
The Motif Finding Problem: Deterministic optimization by Expectation Maximization (EM) (2)
the weight matrix for the motif is initialized with a single n-mer subsequence, plus a small amount of background nucleotide frequencies
for each n-mer in the target sequences, we calculate the probability that it was generated by the motif, rather than by the background sequence distribution
expectation maximization then takes a weighted average across these probabilities to generate a more refined motif model
the algorithm iterates between calculating the probability of each site based on the current motif model, and calculating a new motif model based on the probabilities
procedure converging to a maximum of the log likelihood of the resulting model
Given: set of sequences that are expected to have a common sequence pattern
initial guess: location and size in each sequence: these parts are aligned
the alignment provides an estimate of the base composition of each column
Expectation Maximization: two step procedure (2)
Step 1: expectation
column-by-column composition of the site already available is used to estimate the probability of finding the site at any position of the seqs
these probabilities are used in turn to provide new information as to the expected base distribution for each column
Step 2: Maximization
new counts of bases for each position in the site found in step 1 are substituted for the previous set
Repeat until convergence, no more changes
Result:
best location of the site in each seq
best estimate of the base composition of each column in the site
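The two-step procedure can be sketched in a few lines. This simplified version assumes exactly one site per sequence and a uniform background; the name `em_motif` and the parameters `pseudo` and `bg` are illustrative choices, not part of any published implementation.

```python
BASES = "ACGT"


def em_motif(seqs, w, init_site, iterations=50, pseudo=0.1, bg=0.25):
    """EM for one motif of width w, assuming one site per sequence."""
    # Initial weight matrix: a single w-mer softened by a small pseudocount.
    pwm = [{b: (pseudo + (1.0 if b == c else 0.0)) / (1.0 + 4 * pseudo)
            for b in BASES} for c in init_site]
    posteriors = []
    for _ in range(iterations):
        counts = [{b: pseudo for b in BASES} for _ in range(w)]
        posteriors = []
        for seq in seqs:
            # Expectation: probability of the site starting at each position.
            ratios = []
            for s in range(len(seq) - w + 1):
                r = 1.0
                for j in range(w):
                    r *= pwm[j][seq[s + j]] / bg
                ratios.append(r)
            z = sum(ratios)
            post = [r / z for r in ratios]
            posteriors.append(post)
            # Accumulate expected base counts for each motif column.
            for s, p in enumerate(post):
                for j in range(w):
                    counts[j][seq[s + j]] += p
        # Maximization: the new column composition replaces the previous one.
        pwm = [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]
    return pwm, posteriors


seqs = ["TTACGTAA", "ACGTCCCC", "GGGACGTG"]
pwm, posteriors = em_motif(seqs, 4, "ACGT")
print([max(range(len(p)), key=p.__getitem__) for p in posteriors])
```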
Multiple EM for Motif Elucidation (MEME) (2)
Locates one or more ungapped patterns in a single sequence or a series of sequences
Three motif models:
one expected occurrence of motif per sequence
zero or one occurrence
any number of occurrences
Can use prior knowledge
motif present in all or only some sequences
motif length
palindrome or not
expected patterns in individual motif positions (which bases are interchangeable)
Probabilistic optimization: Gibbs sampler (2)
Stochastic implementation of expectation maximization
Takes a weighted sample of subsequences (not ALL sequences)
The motif model is initialized with a randomly selected set of sites, and every site in the target sequences is scored against this initial motif model
At each iteration, the algorithm probabilistically decides whether to add a new site and / or remove an old site from the motif model, weighted by the binding probability for those sites
The resulting motif model is then updated, and the binding probabilities recalculated
Given sufficient iterations, the algorithm will efficiently sample the joint probability distribution of motif models and sites assigned to the motif, focusing in on the best fitting combinations
The basic Gibbs sampler algorithm (2)
Set of N sequences S_1, …, S_N
Task: find in each sequence mutually similar segments of specified width W
Algorithm maintains two evolving data structures
Pattern description
Probabilistic model of residue frequencies for each position i from 1 to W, consisting of probabilities qi,1, …, qi,20
Background frequencies p1, …, p20 with which residues occur in sites not described by the pattern
Set of positions a_k (k=1,N) for the common pattern
Task: find the “best” (most probable) pattern
Locate the alignment that maximizes the ratio of the pattern probability to background probability
Initialization: random starting positions within various sequences
The basic Gibbs sampler algorithm: Predictive update step (2)
Exclude sequence z at random
Calculate pattern description q_ij and background frequencies p_i from the current positions a_k in all sequences excluding z
The basic Gibbs sampler algorithm: Sampling step (2)
every possible segment of width W within sequence z is considered as a possible instance of the pattern
calculate:
the probabilities Q_x of generating each segment x according to the current pattern probabilities q_i,j
the probabilities P_x of generating these segments by the background probabilities p_i
The weight A_x = Q_x / P_x is assigned to each segment x
Select one segment at random with probability proportional to its weight A_x; its position becomes the new a_z
The basic Gibbs sampler algorithm: Iterative procedure (2)
the more accurate pattern description in step 1, the more accurate the determination of its location in step 2, and vice versa
Given random positions a_k, in step 2 the pattern description q_i,j will tend to favor no particular segment
Once some correct a_k have been selected by chance, the q_i,j begin to reflect, albeit imperfectly, a pattern extant within other sequences
This process tends to recruit further correct a_k, which in turn improve the discriminating power of the evolving pattern
Practical guideline for motif discovery tools (2)
Phylogenetic Footprinting (2)
Functional sequences evolve slower than nonfunctional ones
Consider a set of orthologous sequences from different species
Identify unusually well conserved regions
-> capable of identifying regulatory elements specific even to a single gene, as long as they are sufficiently conserved across many of the species considered
Multigene approach (2)
-> requires a reliable method for assembling the requisite collection of coregulated genes
General strategies of various comparative genomics methods to discover transcription binding sites and regulons (2)
Prediction of cis-regulatory modules (CRMs) (2)
TFBS motif instances in one species
Amount of evolutionary constraint
ChIP-seq signal for TF A
(Cis-regulatory modules = regions of non-coding DNA which regulate the transcription of neighboring genes)
Prokarya (3)
Cytoplasm
Ribosomes
Nucleoid
Plasma membrane
Cell wall (Peptidoglycan, Outer membrane)
Capsule
Circular genome map (3)
shows the position and orientation of known genes, pseudogenes and repetitive sequences
Gene prediction flowchart (3)
Obtain new genomic DNA sequence -> 1. Translate in all six reading frames and compare to protein sequence database; 2. Perform database similarity search of expressed sequence tag (EST) database of same organism, or cDNA sequences if available -> Use gene prediction program to locate genes -> Analyze regulatory sequences in the gene
In bacteria - Central Dogma (3)
DNA transcribed to RNA translated to Protein
Ribosome (3)
Translation is handled by a molecular complex, the ribosome, which consists of both proteins & ribosomal RNA (rRNA)
Ribosome reads mRNA & the translation starts at a start codon (the translation start site)
With help of tRNA, each codon is translated to an amino acid
Translation stops once ribosome reads a stop codon (the translation stop site)
Promoters (3)
DNA segments upstream of transcripts that initiate transcription
Promoter attracts RNA Polymerase to the transcription start site
Promoter Structure on Prokaryotes (E. coli) (3)
Transcription starts at offset 0
Pribnow Box (-10)
Gilbert Box (-35)
Ribosomal Binding Site (+10)
The Shine-Dalgarno sequence (3)
The ribosome binds to the messenger RNA through base pairing between the mRNA and the 16S rRNA of the 30S ribosomal subunit
The binding site is the Shine-Dalgarno sequence (SD)
The SD is a purine-rich sequence (consensus sequence: AGGAG) at the 5’ end of most prokaryotic mRNAs
The SD is found 5-10 basepairs upstream from the start codon
Bacterial gene finding: relatively easy (3)
Dense Genomes
Short intergenic regions
Uninterrupted ORFs
Conserved signals
Abundant comparative information
Complete genomes
Indicators of protein coding regions in bacterial DNA (3)
intrinsic evidence
sufficient ORF length. Long ORFs rarely occur by chance
Specific patterns of codon usage that are different from triplet frequencies in non-coding regions (“coding potential”)
The presence of ribosome binding sites (RBS) in the (-20)…(-1) region upstream of the start codon that help to direct ribosomes to the correct translation start positions. A part of the RBS is formed by the purine-rich Shine-Dalgarno (SD) sequence, which is complementary to the 3’ end of the 16S rRNA
extrinsic evidence
Similarity to known, especially experimentally characterized, gene products
Gene-Finding Strategies (3)
Genomic Sequence -> Content-Based / Site-Based / Comparative
Content-Based: Bulk properties of sequence: Open reading frames, Codon usage, Repeat periodicity, Compositional complexity
Site-Based: Absolute properties of sequence: Consensus sequences, Donor and acceptor splice sites, Transcription factor binding sites, Polyadenylation signals, “Right” ATG start, Stop codons out-of-context
Comparative: Inferences based on sequence homology: Protein sequence with similarity to translated product of query, Modular structure of proteins usually precludes finding complete gene
Intrinsic Approaches (to find Genes?) (3)
GeneMark
GLIMMER
EcoParse
…
GeneMark (3)
non-homogeneous Markov models for DNA regions that code for proteins or are complementary to them
homogeneous Markov models for non-coding regions
coding capacity of sliding windows is deduced through a Bayesian decision rule
GLIMMER (3)
interpolated Markov Models
takes into account DNA oligomers of varying length dependent on the local composition of the sequence
EcoParse (3)
hidden Markov models
finds the maximum likelihood parse of a DNA sequence into coding and non-coding regions
no sliding windows used
Open Reading Frames (ORFs) (3)
Detect potential coding regions by looking at ORFs
A genome of length n is comprised of n/3 codons
Stop codons break genome into segments between consecutive Stop codons
the subsegments of these that start from the start codon (ATG) are ORFs
ORFs in different frames may overlap
Long open reading frames may be a gene:
at random, we should expect one stop codon every (64/3) ~= 21 codons (3 of the 64 codons are stop codons)
However, genes are usually much longer than this
A basic approach is to scan for ORFs whose length exceeds certain threshold
this is naive because some genes (e.g. some neural and immune system genes) are relatively short
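The basic length-threshold scan can be sketched as follows. It looks only at ATG starts in the three forward frames; the opposite strand can be scanned by passing the reverse complement. The function names and the `min_codons` default are assumptions for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for ORFs on the given strand.

    start is the index of the ATG, end is exclusive and includes the stop
    codon. Only ORFs with at least min_codons codons before the stop are kept.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def reverse_complement(seq):
    """Reverse complement, used to scan the opposite strand."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


toy = "CCATGAAATTTGGGTAACC"
print(find_orfs(toy, min_codons=3))
print(find_orfs(reverse_complement(toy), min_codons=3))
```

A real threshold would be far higher than in this toy call, which is exactly why short genes are missed.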
Statistical properties of protein coding regions (3)
Factors that contribute to the unequal usage of codons in a coding sequence:
Unequal use of amino acids
Unequal number of codons for different amino acids
Codon preference
If reading frame 1 encodes a protein, it will influence the following factors:
The amino acid composition in both the coding frame and the other two frames (frames 2 and 3)
The codon composition of all three frames
Positional base frequency: the frequency with which each of the four bases occupies each of the three positions within codons
TESTCODE (Fickett, 1982) (3)
In protein coding regions every third base tends to be the same one much more often than by chance alone due to non-random use of codons
Method:
Count the number of each base at every third position starting at positions 1, 2, and 3, and going to the end of the sequence window
The asymmetry statistic for each base is calculated as the ratio of the maximum count of the three possible reading frames divided by the minimum count for the same base plus 1
The frequency of each base in the window is also calculated
The resulting asymmetry and frequency scores are converted to probabilities of being found in a coding region
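The two raw statistics of the method can be computed as below. The final conversion to coding probabilities relies on empirical lookup tables that are not reproduced here; the function names are assumptions for the example.

```python
def position_asymmetry(window):
    """Asymmetry per base: max count over the three codon positions / (min count + 1)."""
    result = {}
    for base in "ACGT":
        counts = [window[offset::3].count(base) for offset in range(3)]
        result[base] = max(counts) / (min(counts) + 1)
    return result


def base_frequencies(window):
    """Fraction of the window made up by each base."""
    return {b: window.count(b) / len(window) for b in "ACGT"}


window = "ATGATGATGATG"   # strongly periodic toy window
print(position_asymmetry(window))
print(base_frequencies(window))
```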
HMM for gene finding (3)
HMMs are able to model grammar
Many problems in biological sequence analysis have a grammatical structure
Example: the grammar of eukaryotic gene structure (simplified)
Exons and introns are “words”
Sentences: exon-intron-exon-…-intron-exon
Sentences can never end with an intron
Exon can never follow an exon without an intron in between
Other constraints…
HMM architecture for a parser for E. coli DNA with a simple intergenic model & How HMM generates a sequence of nucleotides (3)
How HMM generates a sequence of nucleotides:
Process: random walk starting in the middle of any of the HMMs
begin at any state (e.g. central) and enter any of the rings
each such state transition has an associated probability
transitions out of the central state are chosen at random according to these probabilities (they sum to one)
subsequently, a new transition out of the central state is selected randomly and independently of the previous transition
Choosing one of the 61 codon models repeatedly results in a ‘random gene’
Gene termination
entry into one of the rings below the central state
low probability
one stop codon HMM generates both TAA and TGA, each according to its frequency of occurrence in E. coli, and the other generates TAG
Intergenic region
produced independently and at random by looping in the state labelled ‘Intergene model’
Start codon HMM
generates either ATG, GTG or TTG, each with the appropriate probability (TTG is very rare in E. coli)
A transition is made back to the central state and the whole process repeated
Result: a sequence of nucleotides that is statistically similar to a contig of E. coli DNA consisting of a collection of genes interspersed with intergenic regions
Finding the most likely random walk (Viterbi algorithm) (3)
A dynamic programming method known as the Viterbi algorithm
Generates a parse of the contig: labels genes in the DNA by identifying portions of the path that begin with the start codon at the end of the intergenic ring, pass through several amino acid codon HMMs, and return to one of the stop codons at the beginning of the intergenic ring
The model parses a gene in one direction only and thus finds all genes on the direct strand
To locate genes on the opposite strand, the reverse complement is parsed
A parser with a complex intergenic model (3)
Methods for modeling inter-codon dependencies (3)
Glimmer
Modeling inter-codon dependencies: GeneMark (3)
Fifth-order Markov model
uses sequence information from the previous five bases
the frequency of hexamers is used to differentiate between coding and noncoding sequences
limitation: there must be many representatives of each hexameric sequence in genes
uses a Markov chain model to represent the statistics of coding and noncoding reading frames
Uses dicodon statistics to identify coding regions
fifth-order Markov chains consist of terms P(a|x1x2x3x4x5) which represent the probability of the sixth base of sequence x being a given that the previous five bases were x1x2x3x4x5
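The terms P(a|x1x2x3x4x5) can be estimated from hexamer counts. The sketch below is a homogeneous fifth-order chain with one pseudocount per base, not GeneMark itself, which uses separate periodic models for coding frames; the function names and the tiny training sequence are illustrative.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=5, pseudo=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: {b: pseudo for b in "ACGT"})
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return {ctx: {b: n / sum(col.values()) for b, n in col.items()}
            for ctx, col in counts.items()}


def log_likelihood(seq, model, order=5):
    """Sum of log P(base | context); unseen contexts fall back to 0.25."""
    total = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        total += math.log(probs[seq[i]]) if probs else math.log(0.25)
    return total


model = train_markov(["ACGTACGTACGT"])
print(model["ACGTA"])
print(log_likelihood("ACGTAC", model))
```

Training one such model on coding and one on noncoding sequences and comparing the two log likelihoods gives a simple coding score. The need for many examples of every hexamer is visible here: most contexts are never observed in a small training set.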
GeneMark text output:
GeneMark graphical output:
Different types of markov models that output nucleotide sequence (3)
Homogeneous fifth-order Markov model, with the five states i-5 to i-1 generating state i. Each state corresponds to a nucleotide.
Three-periodic fifth-order Markov models, each modeling a different DNA reading frame. The probabilities are dependent on the position of the base within the codon. Each state is labeled with the codon position of the represented base.
Gene classes in E. coli (3)
Three classes: class I, II, III
differ not only in the statistical but in the biological sense
Class I genes (in E. coli) (3)
intermediate codon usage bias
maintain a low or intermediate level of expression
some genes may occasionally be expressed at a very high level in environmentally triggered (rare) conditions
Class II genes (in E. coli) (3)
high codon usage bias
highly expressed under exponential growth conditions
Class III genes (in E. coli) (3)
low codon usage bias
mainly belong to plasmids and insertion sequences
also includes genes coding for fimbriae, major pili, many membrane proteins, restriction endonucleases and lambdoid phage lysogeny control proteins
can be expressed at a fairly high level
Dealing with different gene pools (3)
Protein-coding sequences in a bacterial genome may not be homogeneous in their compositional features
Program using models trained on the bulk set of protein-coding sequences will be insensitive in finding genes of minor inhomogeneity classes
If preliminary information on gene classes is available, class-specific models of protein-coding regions improve the performance of the gene-finding method
possible solution:
use a set of “long” ORFs to obtain parameters of Markov models of protein-coding and noncoding regions
initial models used to score the putative sequences and to form the cluster seeds for the class-specific training sets
run clusterization procedure until convergence
obtain several sets of presumably coding sequences with more homogeneous compositional features
Modeling inter-codon dependencies: Glimmer (3)
Interpolated Markov Model
finds a sufficient number of patterns by searching for the longest possible patterns that are represented in the known gene sequences up to a length of eight bases
if there are not enough hexameric sequences, then pentamers or smaller may be more highly represented
in other cases many representative patterns even longer than six bases may be found
the longer the patterns, the more accurate the prediction
combines probability estimates from the different-sized patterns, giving emphasis to longer patterns and weighting more heavily the patterns that are well represented in the training sequences
ORPHEUS (3)
Gene prediction in bacterial genomes
incorporation of extrinsic and intrinsic information
Consideration of coding potential and ribosome-binding sites
high accuracy
precise identification of gene starts
completely automatic prediction
Eukaryotic Gene Structure (4)
Eukaryotic gene prediction (4)
Promoter prediction
Splice site prediction
Coding potential
Exon-intron structure
The start and stop signals of eukaryotic transcription (4)
All the signals are short sequences that bind enzymes involved in this complex process
Core promoter (4)
a binding site for RNA polymerase and general transcription factors
Eukaryotes have three different RNA polymerases that are responsible for transcribing different subsets of genes (4)
RNA polymerase I transcribes genes encoding ribosomal RNA
RNA polymerase II transcribes genes encoding mRNA and certain small nuclear RNAs
RNA polymerase III transcribes genes encoding tRNAs and other small RNAs
Eukaryotic Polymerase II Promoter (4)
TBP - TATA binding protein
DPE - downstream promoter element
BRE - TFIIB recognition element
Sequences at the intron-exon boundary (4)
5’ Splicing Site = GU
Branch Point = A
3’ Splicing Site = AG
Splicing of pre-mRNA (4)
The splicing reaction proceeds in two steps
Step 1:
cleavage at the 5’ splice site (SS)
joining of the 5’ end of the intron to an A within the intron (the branch point)
this reaction yields a lariat-like intermediate in which the intron forms a loop
Step 2:
cleavage at the 3’ splice site and simultaneous ligation of the exons
result: excision of the intron as a lariat-like structure
Spliceosome (4)
large assembly of 5 RNA and over 150 proteins that performs pre-mRNA splicing in eukaryotic cells (50-60S)
Splice-site prediction (4)
most introns start with a GT dinucleotide (donor splice site)
the 3’ end of introns is mostly AG dinucleotide (acceptor splice site)
Locating occurrences of AG and GT would identify all possible splice sites, but in addition there would be about 30 to 100 falsely predicted sites for every true one
Key issue: the properties of the surrounding sequence
There are extensive species specific sequence signals available to help detect splice sites
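A minimal sketch of the naive dinucleotide scan described above. It lists every GT and AG position and so illustrates why the surrounding sequence is needed to filter the many false positives. The function name is hypothetical.

```python
import re

def candidate_splice_sites(seq):
    """Return 0-based positions of all GT (donor) and AG (acceptor) dinucleotides."""
    seq = seq.upper()
    # Lookahead patterns so that overlapping matches are not skipped.
    donors = [m.start() for m in re.finditer(r"(?=GT)", seq)]
    acceptors = [m.start() for m in re.finditer(r"(?=AG)", seq)]
    return donors, acceptors
```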
Outline of eukaryotic gene prediction (4)
Exon-intron structure of genes
Models of gene grammar
Models of exon-intron sequence
Integrating intrinsic, extrinsic information
the RNA splicing code
Gene Finding Challenges (4)
Need the correct reading frame
introns can interrupt an exon in mid-codon
There is no hard and fast rule for identifying donor and acceptor splice sites
signals are very weak
Overpredicting Genes (4)
Easy to predict all exons
Report all sequences flanked by ..AG and GT.. as exons
Sensitivity = 100%
Specificity ~ 0%
Gene Prediction (4)
“isolated” methods
predict individual features
e.g. splice sites, coding regions
NetGene (Neural network)
“integrated” methods
predict genes in context
“grammar” of genes
Certain elements in specific order are required
HMMgene
GenScan (HMM-based)
Gene Features modeled by GenScan (4)
Semi-Markov HMM Model of human gene structure and composition
Features modeled:
Hexamer composition of exons / introns
Extended 5’ and 3’ splice signals
Reading frame consistency of exons
Exon / intron length distributions
Promoter and polyA signals
Isochore differences
Isochores (4)
very long stretches of DNA that are homogeneous in base composition
are compositionally correlated with the coding sequences that they embed
isochore organization reflects some fundamental aspects of genome organization and evolution
the nuclear genomes of vertebrates are mosaics of isochores
isochores can be partitioned into a small number of families that cover a range of GC levels (30-60%)
determining the underlying mechanism driving the evolution of isochores is a major issue in understanding the organization of genomes
GC content of isochores correlated with… (4)
gene density
intron length
replication timing
recombination
methylation pattern
distribution of transposable elements
Semi-Markov and Hidden Semi-Markov (4)
GenScan (4)
Designed to predict complete gene structures
introns and exons, promoter sites, polyadenylation signals
Incorporates:
Descriptions of transcriptional, translational and splicing signal
Length distributions (Explicit State Duration HMMs)
Compositional features of exons, introns, intergenic, C+G regions
Larger predictive scope
Deal with partial and complete genes
Multiple genes separated by intergenic DNA in a seq
Consistent sets of genes on either / both DNA strands
Based on a general probabilistic model of genomic sequence composition and gene structure
Assigns probability to every possible gene structure compatible with the sequence
uses dynamic programming to determine the most probable gene structure
GenScan Architecture (4)
It is based on Generalized HMM (GHMM)
Model both strands at once
other models: predict on one strand first, then on the other strand
avoids prediction of overlapping genes on the two strands (rare)
Each state may output a string of symbols (according to some probability distribution)
Explicit intron / exon length modeling
Special sensors for Cap-site and TATA-box
Advanced splice site sensors
Parallel unsupervised training and gene prediction (4)
all parameters of the model with reduced architecture are initialized
GeneMark is run to determine a genomic sequence parse into “coding” and “non-coding” regions and the input genomic sequence is labeled according to this parse
the subsets of uniformly labeled fragments are used for re-estimation of parameters
Repeat until convergence
Initial choice of model structure and parameters (4)
Donor (acceptor): just two canonic GT (AG) dinucleotides
Initiation (termination) sites: canonic (ATG, TGA, TAG, TAA)
Sequences emitted by non-site states: uniform length distributions
Non-coding: zero-order Markov model, parameters estimated based on nucleotide frequencies in the genome
Coding: different approaches, e.g. trained on long ORFs
Similarity based methods for gene finding (4)
Comparison to EST databases
Comparison of genomic sequence translated to protein database
Spliced alignment
Comparison of translated genomic sequence to translated genome / cDNA sequence
Comparison of genomic sequence with homologous genomic sequence from close organisms
Cis alignment (4)
the alignment of a cDNA sequence to the locus that matches it best in its source genome - the presumed template for its transcription
Trans alignment (4)
The alignment of a cDNA or protein sequence to a homologous locus other than the one from which it was transcribed
use of cDNA and EST sequences for genome annotation (4)
cDNA is created by reverse transcription of RNA to DNA, a process that often terminates before reaching the 5’ end of the RNA
Currently, the most abundant type of cDNA sequence in databases is obtained by creating a cDNA library and selecting clones at random for sequencing. This often results in high-copy-number mRNAs being overrepresented whereas low-copy-number mRNAs are missed entirely
Within this category of ‘random clone’ cDNA, most of the available sequences are ESTs - single sequencing reads of typically 500-700 nucleotides that are taken from one end of the clone insert. Clones can be sequenced from both ends, but even two end-reads might not cover the entire insert
GenomeScan Objectives (4)
Combine probabilistic ‘extrinsic’ information (BLAST hits) with a probabilistic model of gene structure / composition
Make method efficient and reliable enough to run on an entire vertebrate genome without human supervision
Focus on ‘typical case’ when homologous but not identical proteins are available
Similarity-Based approach to gene prediction (4)
genes in different organisms are similar
the similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome
Given a known gene (or a protein) and a genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene
e.g. known frog gene is aligned to different locations in the human genome -> find “best” path to reveal the exon structure of human gene
use local alignments to find all islands of similarity and then look for a maximum chain of substrings (chain = a set of non-overlapping nonadjacent intervals)
Chaining local alignments (4)
find substrings that match a given gene sequence (candidate exons)
define structure of candidate exons as (l, r, w) (left, right, weight defined as score of local alignment)
look for a maximum chain of substrings
chain: a set of non-overlapping nonadjacent intervals
Exon chaining problem (4)
Exon Chaining Problem: Given a set of putative exons, find a maximum set of non-overlapping putative exons
Locate the beginning and end of each interval (2n points)
find the “best” path
Input: a set of weighted intervals (putative exons)
Output: a maximum chain of intervals from this set
this problem can be solved with dynamic programming in O(n) time once the 2n interval endpoints are sorted (O(n log n) including the sort)
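A sketch of the exon chaining dynamic program, under the assumption that two intervals may follow each other in a chain only if the first ends strictly before the second starts. This variant sorts by right end and uses binary search, so it runs in O(n log n). The function name is hypothetical.

```python
from bisect import bisect_left

def max_exon_chain(exons):
    """exons: list of (l, r, w) with l <= r. Returns (best_weight, chain)."""
    exons = sorted(exons, key=lambda e: e[1])
    rights = [e[1] for e in exons]
    # best[i] = maximum chain weight using only the first i exons (by right end)
    best = [0.0] * (len(exons) + 1)
    take = [False] * len(exons)
    prev = [0] * len(exons)
    for i, (l, r, w) in enumerate(exons):
        p = bisect_left(rights, l)  # number of exons ending strictly before l
        prev[i] = p
        if best[p] + w > best[i]:
            best[i + 1] = best[p] + w
            take[i] = True
        else:
            best[i + 1] = best[i]
    # Trace back which exons were taken.
    chain = []
    i = len(exons)
    while i > 0:
        if take[i - 1]:
            chain.append(exons[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chain.reverse()
    return best[-1], chain
```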
Dual-genome de novo gene prediction (4)
Idea: combine information from mouse-human alignments with models of the DNA sequences that characterize splice donors and acceptors, start and stop codons and other biological features
the first programs to outperform GENSCAN by using mouse-human comparison were TWINSCAN and SGP2
their success resulted, in part, from using genome alignments to modify the scoring schemes of successful single-genome de novo gene predictors
TWINSCAN (4)
included models of conservation in splice sites and start and stop codons
the primary effect of mouse-human alignments was to eliminate many of the false-positive genes and exons predicted by GenScan
SGP2 (4)
considered only the conservation in protein-coding regions
Choice of informant genome (4)
Substitutions per synonymous site might not be a good predictor of the usefulness of informant genomes
the number of substitutions per synonymous site is calculated using only proteins that can be aligned
does not account for the loss of alignability at greater evolutionary distances
for flies, a good predictor is the total number of mismatches in the whole-genome alignment divided by the length of the target genome
when two genomes are too diverged to be useful, the number of mismatches is low because most of the sequence cannot be aligned; when they are too close to be useful, the number of mismatches is low because most of the sequence is unchanged. However, better models for the dependence of informant utility on divergence are needed
Multiple-genome de novo gene prediction (4)
When multiple mammalian genomes became available, using them to improve on the state-of-the-art in de novo gene prediction proved more difficult than anticipated
The first program that could make use of multiple informant genomes and predict entire ORFs more accurately than TWINSCAN was N-SCAN
Recently, a program called CONTRAST has extracted bigger gains in human gene prediction from multi-genome alignments. This work suggests that using both the mouse and the opossum, which is slightly more diverged, will give the best improvement over using the mouse alone
Sequence test set (4)
Source: GenBank, vertebrate divisions
extract all genomic DNA sequences encoding at least one complete protein coding gene
Discard sequences:
encoding at least one incomplete protein product
ambiguous location of protein coding regions
encoding protein coding genes in the complementary strand
encoding genes defined in other entries
pseudogenes
more than one alternatively spliced forms
encoding genes without introns
protein coding sequences not starting with ATG
no stop codon
gene length != 3 * protein length
… etc
Result: 570 genes (set ALLSEQ)
Additionally: NEWSEQ (no similarity to sequences deposited before 1993)
Measuring success (4)
by nucleotide
Sensitivity / Specificity (Sn/Sp)
by exon
Sn/Sp
Missed exons (ME), wrong exons (WE)
By gene
Missed genes (MG), wrong genes (WG)
Average overlap statistics
Sensitivity (4)
Sn = TP / (TP + FN)
Specificity (4)
Sp = TP / (TP + FP)
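A small sketch of the nucleotide-level measures defined above, assuming that actual and predicted coding regions are given as sets of nucleotide positions (a representation chosen here for illustration).

```python
def sn_sp(actual, predicted):
    """Nucleotide-level sensitivity and specificity from two position sets."""
    tp = len(actual & predicted)   # coding and predicted coding
    fn = len(actual - predicted)   # coding but missed
    fp = len(predicted - actual)   # predicted coding but not coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp
```

Predicting every position as coding drives Sn to 1 while Sp collapses, which is the overprediction problem in miniature.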
“Joined” genes (4)
JG = #Actual genes that overlap predicted genes / #Predicted genes that overlap one or more actual genes
JG > 1, tendency to join multiple actual genes into one prediction
“split” genes (4)
SG = #Predicted genes that overlap actual genes / #Actual genes that overlap one or more predicted genes
SG > 1, tendency to split actual genes into separate gene predictions
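The JG and SG ratios can be sketched as follows, assuming genes are given as closed (start, end) intervals on one sequence and that "overlap" means sharing at least one position. Names are hypothetical.

```python
def overlaps(a, b):
    """True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def joined_split(actual, predicted):
    """Return (JG, SG) for lists of actual and predicted gene intervals."""
    actual_hit = [a for a in actual if any(overlaps(a, p) for p in predicted)]
    pred_hit = [p for p in predicted if any(overlaps(a, p) for a in actual)]
    jg = len(actual_hit) / len(pred_hit) if pred_hit else float("nan")
    sg = len(pred_hit) / len(actual_hit) if actual_hit else float("nan")
    return jg, sg
```

One prediction spanning two actual genes gives JG = 2 and SG = 0.5.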
GFF (General Feature Format) Specifications Document: Fields (4)
seqname: the name of sequence
source: source of this feature
feature: the feature type name
start, end: Integers, start must be less than or equal to end
score: float, "." when there is no score
strand: +, -, "." or empty when strand is not relevant, e.g. for dinucleotide repeats
frame: 0, 1, 2 or "."
0 = the specified region is in frame (its first base is the first base of a codon)
1 = one extra base (the second base of the region is the first base of a codon)
2 = the third base of the region is the first base of a codon
. = the frame is not relevant
(attribute, comments)
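A minimal sketch of parsing one GFF line into the fields listed above, assuming tab-separated columns and an optional ninth attribute column. The record type and function name are hypothetical.

```python
from typing import NamedTuple, Optional

class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the field is "."
    strand: str
    frame: Optional[int]     # None when the field is "."
    attributes: str

def parse_gff_line(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    start, end = int(start), int(end)
    if start > end:
        raise ValueError("start must be less than or equal to end")
    return GffRecord(
        seqname, source, feature, start, end,
        None if score == "." else float(score),
        strand,
        None if frame == "." else int(frame),
        fields[8] if len(fields) > 8 else "",
    )
```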
Pseudogenes (5)
Nonfunctional sequences of genomic DNA that are originally derived from functional genes, but exhibit such degenerative features as premature stop codons and frameshift mutations that prevent their expression
Characterized by close similarity to one or more paralogous genes while being non-functional (failure of transcription or translation)
A fundamental feature of pseudogenes is that their nucleotide sequences differ from those of the paralogous functional genes at crucial points
Two types of pseudogenes:
Conventional
Processed
Conventional pseudogenes (5)
Gene that has been inactivated because its nucleotide sequence has changed by mutation
many mutations have only minor effects on the activity of a gene but some are more important and it is quite possible for a single nucleotide change to result in a gene becoming completely non-functional
Once a pseudogene has become non-functional it will degrade through accumulation of more mutations and eventually will no longer be recognizable as a gene relic
Nonprocessed (conventional) pseudogenes (5)
Duplication of gene A results in two equivalent gene copies
Selection pressure need be applied to only one gene copy (top) to maintain the presence of the original functional gene product
the other copy (bottom), will continue to be expressed but, in the absence of selection pressure to conserve its sequence, will accumulate mutations (vertical bars) relatively rapidly
It may acquire deleterious mutations and become a nonfunctional pseudogene which may continue to be expressed at the RNA level for some time, but which will eventually be transcriptionally silent (ψA)
In some cases, however, the mutational differences may lead to a different expression pattern or other property that is selectively advantageous (A2)
In the case of tandem gene duplication, subsequent sequence exchanges between the two copies (by mechanisms such as unequal crossover) will act as a brake on the rate of sequence divergence between the two gene copies
Processed pseudogene (5)
Arises not by evolutionary decay, but by an abnormal adjunct to gene expression
Derived from the mRNA copy of a gene by synthesis of a DNA copy which subsequently re-inserts into the genome
Because a processed pseudogene is a copy of an mRNA molecule, it does not contain any introns that were present in its parent gene
It also lacks the nucleotide sequences immediately upstream of the 5’-UTR of the parent gene, which is the region in which the signals used to switch on expression of the parent gene are located
The absence of these signals means that a processed pseudogene is inactive
Origin of processed pseudogene (5)
A processed pseudogene is thought to arise by integration into the genome of a copy of the mRNA transcribed from a functional gene. The process by which mRNA is copied into DNA is called reverse transcription and the product is called complementary DNA (cDNA). The cDNA may integrate into the same chromosome as its functional parent, or possibly into a different chromosome
Origin of pseudogenes (5)
The majority of vertebrate pseudogenes are probably derived from functional genes
Mechanism of pseudogene formation (5)
The majority of vertebrate pseudogenes are a result of retrotransposition of transcripts derived from genes that encode functional proteins
Pseudogene location (5)
Pseudogenes persist in parts of the genome where they do not have a deleterious effect on fitness of the organism
Pseudogene fate (5)
Most pseudogenes undergo genetic drift and are never transcribed. By contrast, in some instances there appear to be selective pressures that prevent major changes to the pseudogene sequence. A few pseudogenes are involved in gene conversion and a few can be transcribed. Accordingly, not all pseudogenes are unequivocally functionless.
Why analyse pseudogenes? (5)
Because of their high sequence similarity to the corresponding functional genes, pseudogenes can often interfere with PCR or in situ hybridization experiments intended for the functional genes
Pseudogenes provide a molecular record on the dynamics and evolution of genomes
rate of nucleotide substitutions
the rate of DNA loss
Improvement of the gene prediction and annotation efforts
Types of pseudogenes (5)
“True” processed pseudogene (high confidence)
shares high sequence similarity with a known human protein from SWISS-PROT or TrEMBL
when aligned with the functional human protein sequence, the alignment does not contain gaps longer than 60 bp
covers >70% of the protein-coding sequence (CDS)
contains frame disruptions such as frameshifts or in-frame stop codons
“Putative” processed pseudogenes
conditions 1-3 fulfilled
condition 4 not fulfilled
probably young processed pseudogenes that were inserted into the genome so recently that they have not accumulated frame disruptions yet
“Disrupted” processed pseudogenes
conditions 1, 3, 4 fulfilled
condition 2 not fulfilled
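The four conditions can be read as a small decision rule. The sketch below assumes the similarity search, gap measurement and coverage calculation have already been done elsewhere; the function and parameter names are hypothetical.

```python
def classify_processed_pseudogene(similar_to_known_protein, longest_gap_bp,
                                  cds_coverage, has_frame_disruption):
    """Classify a candidate according to conditions 1-4 from the notes."""
    c1 = similar_to_known_protein      # 1: high similarity to a known protein
    c2 = longest_gap_bp <= 60          # 2: no alignment gap longer than 60 bp
    c3 = cds_coverage > 0.70           # 3: covers >70% of the CDS
    c4 = has_frame_disruption          # 4: frameshift or in-frame stop codon
    if c1 and c2 and c3 and c4:
        return "true"
    if c1 and c2 and c3:
        return "putative"
    if c1 and c3 and c4:
        return "disrupted"
    return "unclassified"
```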
Number of processed pseudogenes vs. chromosome length (5)
the number of processed pseudogenes on each chromosome is proportional to the chromosome length (correlation coefficient 0.92)
Consistent with the random nature of the retrotransposition process that gave rise to the processed pseudogenes
As a comparison, the correlation between the numbers of the functional genes and chromosome length is much lower at 0.69
Sequence completeness (5)
defined as the ratio between the length of the predicted protein sequence from the pseudogene and the length of the closest matching protein sequence from SWISS-PROT or TrEMBL
Although 70% was used as the sequence completeness threshold to separate the processed pseudogenes from the pseudogenic fragments, the majority of the processed pseudogenes are practically full length
correlation between the sequence completeness and the nucleotide sequence identities (5)
There is a very significant correlation between the sequence completeness and the nucleotide sequence identities
this is because the most recent pseudogenes should be more complete and have higher sequence identities than the older ones
Positive selection vs. no selection, Ka / Ks ratio (5)
Ratio between the nonsynonymous rate of substitution (Ka) and the synonymous rate of substitution (Ks), commonly referred to as the Ka/Ks ratio, to test for natural selection on genes or proteins
The majority of human genes undergo “purifying selection”: the evolutionary process disfavors nucleotide mutations that cause detrimental amino acid substitutions in the protein and thus keeps the protein as it is.
For these genes, Ka is usually much smaller than Ks, that is, Ka/Ks « 1.
In rare cases, genes have Ka much greater than Ks (Ka/Ks » 1), that is, it is to the advantage of the organism to change or diversify the protein product of the genes (positive selection)
Example of genes under positive selection
genes involved in the host immune defense system that often coevolve with the proteins of invading pathogens
Processed pseudogenes are generally nonfunctional and presumably were released from selection pressure after being retrotransposed. Thus, they are expected to have similar values for Ka and Ks, that is, Ka/Ks ~ 1
The majority of the genes have a Ka/Ks ratio between 0.4 and 0.7
Splicing of pre-mRNA (6)
step
this reaction yields a lariat-like intermediate, in which the intron forms a loop
result: excision of the intron as a lariat-like structure
Spliceosome (6)
The splicing machine (6)
negative control of alternative RNA splicing (6)
Negative control, in which a repressor protein binds to the primary RNA transcript in tissue 2, thereby preventing the splicing machinery from removing an intron sequence
positive control of alternative RNA splicing (6)
Positive control, in which the splicing machinery is unable to efficiently remove a particular intron sequence without assistance from an activator protein
Alternative splicing (6)
the assumption that each pre-mRNA follows a single splicing pathway was shown to be incorrect when alternative splicing was discovered
there is also sex-specific alternative splicing, e.g. of Sxl (Sex-lethal) pre-mRNA in Drosophila
is common in many eukaryotes
the primary transcripts of some genes can follow two or more alternative splicing pathways, enabling a single transcript to be processed into related but different mRNAs and hence to direct synthesis of a range of proteins
it is now believed that at least 35% of the genes in the human genome undergo alternative splicing
Types of alternate splicing (6)
How does alternative splicing occur? (6)
some splice sites are used only some of the time
Types of AS:
constitutive: more than one product is always made from the transcribed gene
regulated: different forms are generated at different times, under different conditions, or in different cell or tissue types
Known Roles of Alternative Splicing (6)
add new protein parts: 75% of alternative splicing involves the protein coding region, in addition to truncations you can change overall protein sequence
Influence RNA function: alternative splicing does occur to alter 5’ and 3’ UTR regions - proposed roles in subcellular localization and RNA stability
Coordinated Regulation of Biological Events
Neuron development (Dscam)
Channel activity associated with hearing (slo)
Muscle contraction
Neurite growth
Cell differentiation
Apoptosis
consequences of new protein parts due to alternative splicing (6)
Alter protein binding properties, e.g. receptor / ligand
Alter intracellular localization, e.g. membrane insertion
Alter extracellular localization, e.g. secretion
Alter enzymatic or signaling activities
Alter protein stability, e.g. inclusion of cleavage sites
Insertion of post-translation modification domains
Change ion channel properties
Computational identification of alternative splicing (6)
Insertion and deletion in ESTs relative to mRNA are identified as potential alternative splices
Splices are identified and intronic splice junction donor and acceptor sites are checked
Alternative splices are detected when two splices are mutually exclusive (intron inclusions are not identified as alternative splices). As intronic sequences at splice junctions are highly conserved, they can be used to verify candidate splices
Experimental analysis of alternative splicing (6)
Alternative splicing can be verified by RT–PCR using primers that flank the alternatively spliced region. The relative abundance of different isoforms in various tissue samples can be assessed from the gel
High-throughput identification of alternative splicing can be carried out by using microarrays. The microarray probes would consist of exon–exon junction sequences, as different alternative splice forms will have different exon–exon junctions. By analyzing the tissue distribution of various splice forms, clues regarding the regulation of alternative splicing can be obtained.
A measure of dissimilarity between mRNA isoforms (6)
Computation of splice junction difference ratio (SJD).
The SJD value for a pair of transcripts is computed as the number of splice junctions in each transcript that are not represented in the other transcript, divided by the total number of splice junctions in the two transcripts, in both cases considering only those splice junctions that occur in portions of the two transcripts that overlap.
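A sketch of the SJD computation, assuming each transcript is given as a list of exon (start, end) genomic coordinates and that a junction counts as "in the overlap" when both its donor and acceptor positions lie inside the region spanned by both transcripts. This representation and the function names are assumptions for illustration.

```python
def junctions(exons):
    """Splice junctions as (end of exon i, start of exon i+1) pairs."""
    exons = sorted(exons)
    return {(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)}

def sjd(exons_a, exons_b):
    """Splice junction difference ratio for two transcripts."""
    # Region covered by both transcripts.
    lo = max(min(s for s, _ in exons_a), min(s for s, _ in exons_b))
    hi = min(max(e for _, e in exons_a), max(e for _, e in exons_b))
    ja = {j for j in junctions(exons_a) if lo <= j[0] and j[1] <= hi}
    jb = {j for j in junctions(exons_b) if lo <= j[0] and j[1] <= hi}
    total = len(ja) + len(jb)
    if total == 0:
        return 0.0
    return (len(ja - jb) + len(jb - ja)) / total
```

For a skipped exon, none of the junctions in the affected region are shared, so the ratio reaches 1 for a three-exon versus two-exon pair.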
Comparison of alternative mRNA isoforms across human tissues (6)
Specific human tissues such as the brain, testis and liver, make more extensive use of AS in gene regulation
These tissues have also diverged most from other tissues in the set of spliced isoforms they express
Functional annotation of inferred protein isoforms (6)
Generate mRNA isoform sequences
Generate protein isoform sequences
find annotations from databases like Swissprot, by homology
Search protein domain databases like SMART, PFAM
Increase of functional diversity by alternative splicing (6)
Goals:
quantify the effect of alternative splicing on protein domains
map alternative splicing regions on protein functional and structural units
Approach:
extract all alternatively spliced protein isoforms in higher organisms with fully sequenced genomes (H. sapiens, M. musculus, D. melanogaster, C. elegans)
4804 splicing variants of 1780 proteins
map the alternatively spliced regions onto protein domain annotations of the InterPro resource
Types of alternative splicing annotated in the splicing graph gallery (6)
Alternative Splicing Gallery (ASG): a lot of data in a compressed representation: the thicker the path, the more ...
Exon inclusion level (6)
defined as the fraction of the gene’s transcripts that include this exon
Constitutive exons:
Exons included in every transcript of a gene
are almost always found to be conserved in the other genome
major-form exons:
included in the majority of transcripts
minor-form exons:
included in only a minority of transcripts
are mostly not conserved in the other genome
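A sketch of computing exon inclusion levels from a set of transcripts, assuming each transcript is a list of exon identifiers. The cut-off of exactly 0.5 is assigned to the minor form here, which is an assumption not stated in the notes.

```python
from collections import Counter

def exon_inclusion_levels(transcripts):
    """Fraction of a gene's transcripts that include each exon."""
    n = len(transcripts)
    counts = Counter(e for t in transcripts for e in set(t))
    return {e: c / n for e, c in counts.items()}

def classify_exon(level):
    if level == 1.0:
        return "constitutive"
    if level > 0.5:
        return "major-form"
    return "minor-form"
```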
DNA Sequence Variations (7)
There are different categories of DNA sequence variations
Single nucleotide polymorphisms (SNPs)
agcttctatct
agcttctctct
Short Tandem Repeat Polymorphisms (STRs)
agtctctctctctctctctctctctatacg : (CT)_11
agtctctctctctctctctatacg : (CT)_8
Insertions / Deletions (Indels)
Single Nucleotide Polymorphism (7)
SNP occurs when a single nucleotide replaces one of the other three nucleotide letters
e.g. the alteration of the DNA segment AAGGTTA to ATGGTTA
Most SNPs are found outside of “coding seqs”
=> SNPs found in a coding seq are of great interest as they are more likely to alter function of a protein
SNPs are the most common type of genetic variation in humans. They account for 90% of the variation between individuals
Most are neutral polymorphisms. Some cause disease
The density of SNPs is about 1 every 100 to 300 bases
SNPs may occur anywhere
in coding regions (cSNPs)
in introns
in regulatory regions of genes
in intergenic regions
In coding regions, changes may be synonymous or nonsynonymous
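Whether a coding SNP is synonymous or nonsynonymous can be checked by translating the codon before and after the change. The sketch below uses the standard genetic code and splits nonsynonymous changes into missense and nonsense. The function name and the "stop-lost" label are choices made for this example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_coding_snp(codon, pos, alt):
    """Classify replacing codon[pos] (0-based) by the base `alt`."""
    codon, alt = codon.upper(), alt.upper()
    mutated = codon[:pos] + alt + codon[pos + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-lost"
    return "missense"
```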
SNPs and disease (7)
SNPs may be informative with respect to disease:
Functional variation. A SNP associated with a nonsynonymous substitution in a coding region will change the amino acid sequence of a protein
Regulatory variation. A SNP in a noncoding region can influence gene expression
Association. SNPs can be used in whole-genome association studies. SNP frequency is compared between affected and control populations
Allele (7)
One of the forms of a variant that occurs at a given locus
Coding (7)
In a region of the genome that is transcribed
Haplotype (7)
The organisation of variation across a chromosome (total number of mutations?)
Missense mutation (7)
A variant alters a codon to substitute one amino acid for another
Nonsense mutation (7)
A mutation that introduces a stop codon
Rare variant (7)
A variation where the least common allele occurs at a frequency of less than 1 per cent in the population
SNP (single nucleotide polymorphism) (7)
An inherited single nucleotide substitution between individuals of a species. Commonly defined as having the least frequent allele occur at a rate greater than 1 per cent in a population. The most common form of human variation.
SNP functional classes (7)
Coding SNPs (cSNP): Positions that fall within the coding regions of genes
Regulatory SNPs (rSNP): Positions that fall in regulatory regions of genes
Synonymous SNPs (sSNP): Positions in exons that do not change the codon to substitute an amino acid
Non-synonymous SNPs (nsSNP): Positions that incur an amino acid substitution
Intronic SNPs (iSNP): Positions that fall within introns
Primary discovery method for polymorphisms (7)
by sequencing DNA and comparing the sequences
OMIM (7)
Online Mendelian Inheritance in Man
An online catalog of Human genes and genetic disorders
Nonsynonymous single nucleotide polymorphisms (7)
a nonsynonymous or missense variant is a single base change in a coding region that causes an amino acid change in the corresponding protein
If a nonsynonymous variant alters protein function, the change can have drastic phenotypic consequences
Most alterations are deleterious and so are eventually eliminated through purifying selection
However, beneficial mutations can sweep through the population and become fixed, thus contributing to species differentiation
Two databases contain disease-causing variants & importance of nonsynonymous substitutions in humans (7)
Online Mendelian Inheritance in Man (OMIM)
Human Gene Mutation Database (HGMD)
In both databases, nonsynonymous changes account for approximately half of the genetic changes known to cause disease
Although these databases contain information primarily concerning disorders caused by single Mendelian lesions, it is likely that nonsynonymous changes will play a similarly important role in complex diseases because of their potentially large impact
Prediction of substitution effects — Possible approaches (7)
Disease-causing mutations are more likely to occur at positions that are conserved throughout evolution
prediction could be based on sequence homology
Disease-causing AASs have common structural features that distinguish them from neutral substitutions
structure could also be used for prediction
Calculation of mutation spectra (7)
Use the matrix of neighbor-dependent nucleotide mutation rates
calculated on the basis of substitutions in aligned human gene sequences
the relative mutation rates were calculated for the four nucleotides in all 16 possible 5’ and 3’ neighborhoods
Obtaining the expected amino-acid mutation frequencies for a given collection of genes
simulate all possible single nucleotide mutations with appropriate rates
record the corresponding amino-acid changes
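The simulation step can be sketched as below. The neighbor-dependent rate matrix itself is not given in these notes, so `rate` is a placeholder that defaults to a uniform rate of 1.0. That default, the "N" neighbor used at sequence ends, and the function names are assumptions.

```python
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def uniform_rate(five, ref, three, alt):
    """Placeholder for the neighbor-dependent rate of ref -> alt."""
    return 1.0

def expected_aa_changes(cds, rate=uniform_rate):
    """Simulate every single-nucleotide mutation in a coding sequence and
    accumulate rate-weighted amino-acid changes as {(from_aa, to_aa): weight}."""
    cds = cds.upper()
    spectrum = Counter()
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        for pos in range(3):
            idx = start + pos
            five = cds[idx - 1] if idx > 0 else "N"
            three = cds[idx + 1] if idx + 1 < len(cds) else "N"
            for alt in "ACGT":
                if alt == cds[idx]:
                    continue
                mutated = codon[:pos] + alt + codon[pos + 1:]
                ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
                if ref_aa != alt_aa:
                    spectrum[(ref_aa, alt_aa)] += rate(five, cds[idx], three, alt)
    return spectrum
```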
Relative Entropy (7)
RE = Σ_n P(n) · log( P(n) / Q(n) )
where the summation is over all amino-acid types n in the alignment; P(n) is the probability of the amino acid n in the column corresponding to mutation; Q(n) is the probability of the amino acid n in all columns of the multiple sequence alignment
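A sketch of the relative entropy of one alignment column against the whole alignment, assuming the standard definition (sum over n of P(n) log(P(n)/Q(n))), frequencies estimated by simple counting without pseudocounts, and log base 2. The base and the column-as-string representation are assumptions.

```python
import math
from collections import Counter

def relative_entropy(column, alignment_columns):
    """column: residues at the mutation site, one per aligned sequence.
    alignment_columns: all columns of the alignment (must include `column`,
    so that Q(n) > 0 whenever P(n) > 0)."""
    p = Counter(column)
    q = Counter(aa for col in alignment_columns for aa in col)
    n_p, n_q = sum(p.values()), sum(q.values())
    return sum(
        (c / n_p) * math.log2((c / n_p) / (q[aa] / n_q))
        for aa, c in p.items()
    )
```

A fully conserved column scores high, while a column that mirrors the background composition scores 0.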
Grantham score (GR) (7)
a measure of dissimilarity between a human amino acid and the residues seen at the same site in homologs
where D(A,B) is the Grantham measure of chemical dissimilarity between amino-acid residues A and B, Human_RES is the human residue at the mutation site, RES(i) is the amino acid from the i-th aligned sequence homolog at the mutation site, and n is the number of aligned sequences
Relative mutation probability (conditional probability) (7)
‘descriptor‘: solvent accessibility or evolutionary conservation of the mutation site
P(descriptor|disease): the probability that a disease mutation has a given descriptor value
P(descriptor): the probability that a random mutation (disease or non-disease) has a given descriptor value
P(disease): the probability that a random mutation will cause a genetic disease
Importantly, because P(disease) is unknown, we can only estimate P(disease|descriptor) up to a constant (assuming certain P(disease) value). Consequently, P(disease|descriptor) is a relative mutation probability. The probability that a random mutation has a given descriptor value P(descriptor) can be estimated by simulating random single-nucleotide mutations using the expected amino-acid mutation frequencies
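By Bayes' rule, P(disease|descriptor) = P(descriptor|disease) · P(disease) / P(descriptor). The sketch below estimates this from counts per descriptor bin, with P(disease) left as an arbitrary constant (default 1.0), so the result is only a relative probability. The count dictionaries and names are hypothetical.

```python
def relative_mutation_probabilities(disease_counts, random_counts, p_disease=1.0):
    """disease_counts: descriptor value -> number of disease mutations.
    random_counts: descriptor value -> number of simulated random mutations.
    Returns descriptor value -> P(disease | descriptor), up to a constant."""
    n_disease = sum(disease_counts.values())
    n_random = sum(random_counts.values())
    result = {}
    for value, count in random_counts.items():
        if count == 0:
            continue  # P(descriptor) = 0, ratio undefined
        p_desc_given_disease = disease_counts.get(value, 0) / n_disease
        p_desc = count / n_random
        result[value] = p_desc_given_disease * p_disease / p_desc
    return result
```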
Tertiary Templates for proteins - use of packing criteria in the enumeration of allowed sequences for different structural classes (7)
Assumptions:
each class of protein has a core structure that is defined by internal residues
external, solvent-contacting residues contribute to the stability of the structure, are of primary importance to function, but do not determine the architecture of the core portions of the polypeptide chain
Goal: supply a list of permitted sequences of internal residues compatible with a known core structure
Algorithm (of packing criteria?) (7)
the template is derived using the fixed positions for the main-chain and beta-carbon atoms in the test structure and selected stereochemical rules
use of two packing criteria:
avoidance of steric overlap
complete filling of available space
Additional criteria
potential polar group interactions
disulfide bonds
possible burial of charges
Side-chain rotamer library
Templates help in deciding whether a sequence of unknown tertiary structure fits any of the known core classes
Summary (of mutations?) (7)
Size changes in the hydrophobic core
Introduction of buried charged residues
Disruption of protein-protein interactions
Disruption of a hydrogen-bonding network
Interference with DNA binding
Breaking disulphide covalent bonds
Mutation of catalytic residues
Mutation of metal-binding residues
Disrupting quaternary structure
Flowchart for Amino acid substitution (AAS) prediction (7)
SIFT (7)
Input: Protein sequence and AAS, Protein sequence alignment and AAS, dbSNP id, or protein id
Output: Score ranges from 0 to 1, where 0 is damaging and 1 is neutral
Algorithm: Using sequence homology, scores are calculated using position-specific scoring matrices with Dirichlet priors
SIFT: sort intolerant from tolerant substitutions
takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence
Procedure:
search for similar sequences (BLAST)
choose closely related sequences that may share similar function
obtain a multiple alignment of these sequences
calculate normalized probabilities for all possible substitutions at each position
Substitutions at each position with normalized probabilities less than a chosen cutoff are predicted to be deleterious; those greater than or equal to the cutoff are predicted to be tolerated
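The final classification step can be sketched as follows. This is not SIFT's code: the function name is hypothetical, and the default cutoff of 0.05 is an assumption based on the commonly used setting.

```python
def classify_substitutions(normalized_probs, cutoff=0.05):
    """Label each substitution at one position by its normalized probability.

    normalized_probs: amino acid -> normalized probability at this position
    """
    return {
        aa: "deleterious" if p < cutoff else "tolerated"
        for aa, p in normalized_probs.items()
    }
```

Note that a probability exactly equal to the cutoff counts as tolerated, matching the rule stated above.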
SIFT — Application (7)
Because SIFT uses sequence homology rather than protein structure, it could potentially analyze a larger number of nonsynonymous SNPs than studies based on protein structure alone
Because SIFT is an automated, relatively quick procedure, it can be used to predict which missense variants are likely to be deleterious and thus home in on which ones are likely candidates for disease and which proteins should be subjected to further investigation
SIFT can also be applied to large-scale, reverse genetic projects in which mutations are introduced randomly in the genome of an experimental organism, altered genes are identified, and then the phenotype for the resulting mutants ascertained
PolyPhen (7)
PolyPhen: prediction of functional effect of human nsSNPs
Input: Protein sequence and AAS, dbSNP id, or protein id
Output: Score ranges from 0 to a positive number, where 0 is neutral, and a high positive number is damaging
Algorithm: Uses sequence conservation, structure to model the position of the amino acid substitution, and Swiss-Prot annotation
PolyPhen: Structural consequences of the respective non-synonymous mutations in proteins
Map known disease mutations onto known three-dimensional structures of proteins
Compare results with a control set of substitutions observed between these proteins and their closely related homologs from other species that are unlikely to cause severe effects on the phenotype
Map a large number of non-synonymous SNPs onto protein structures
thought to be neutral
or to be the cause of only minor phenotypic effects
Goal: to obtain a lower limit estimate for the quantity of non-synonymous SNPs that might have phenotypic effects
How many SNPs are associated with multifactorial human disorders?
Properties of disease-causing mutations (7)
Disease-causing mutations are much more likely to occur at sites with low solvent accessibility
disease-causing mutations often affect intrinsic structural features of proteins
~70% of the disease-causing mutations are located in sites likely to be structurally and functionally important, namely sites with
<5% solvent accessibility
beta-strands
active sites
sites involved in disulphide bonds
evolutionarily conserved sites
Impact of amino acid variants (7)
Folding
Interaction sites
Solubility
Stability
Amino acid variant approach (7)
Check whether the substitution
is in an annotated active or binding site
affects interaction with ligands present in the crystallographic structure
leads to a hydrophobicity or electrostatic charge change in a buried site
destroys a disulphide bond
affects the protein’s solubility
inserts proline in an alpha-helix
is incompatible with the profile of amino acid substitutions observed at this site in the set of homologous proteins
(Empirical rules, work well!)
VarSite (7)
Disease variants and protein structure database
Types of variation effects and predictors (7)
Genetic tolerance predictors
amino acid substitutions
Synonymous variations
Insertions and / or deletions
Noncoding variations
Specific (tolerance) predictors for
Genes
Proteins
Families
Functional complexes
Diseases
Mechanism / effect predictors
DNA
Transcription factor binding sites
RNA
Splicing
miRNA targets
Protein
Aggregation
Localization
Post-translational modification
Electrostatics
Tolerance predictors (7)
PolyPhen, PolyPhen-2, SIFT, …
Protein stability predictors (7)
Rosetta, FoldX, …
Splicing predictors (7)
Human Splicing Finder, Splice Site Finder, …
Cancer variation predictors (7)
CHASM, FATHMM, PolyPhen-2, SIFT, …
fRNA (8)
functional RNA
essentially synonymous with non-coding RNA
96% of total RNA
subgroups: rRNA, tRNA, snRNA, snoRNA, miRNA, siRNA
miRNA (8)
MicroRNA
putative translational regulatory gene family
ncRNA (8)
Non-coding RNA
all RNAs other than mRNA
rRNA (8)
ribosomal RNA
siRNA (8)
small interfering RNA
active molecules in RNA interference
snRNA (8)
small nuclear RNA
includes spliceosomal RNAs
snmRNA (8)
small non-mRNA
essentially synonymous with small ncRNAs
snoRNA (8)
small nucleolar RNA
most known snoRNAs are involved in rRNA modification
stRNA (8)
small temporal RNA
for example, lin-4 and let-7 in C. elegans
tRNA (8)
transfer RNA
RNA genomics: topics (8)
Structure comparison & classification
Structure-based alignments
RNA features, predicting secondary structure
RNA databases
RNA motif searching
Specialized RNA gene finding
Generalized RNA gene finders
What are tRNAs? (8)
the codons in an mRNA molecule do not directly recognize the amino acids they specify
the translation of mRNA into protein depends on adaptor molecules that can recognize and bind both to the codon and, at another site on their surface, to the amino acid
these adaptors consist of a set of small RNA molecules known as transfer RNAs (tRNAs), each about 80 nucleotides in length
Adapter function of tRNA (8)
Two regions of unpaired nucleotides situated at either end of the L-shaped molecule are crucial to the function of tRNA in protein synthesis
Anticodon: a set of three consecutive nucleotides that pairs with the complementary codon in an mRNA molecule
Short single-stranded region at the 3’ end of the molecule; this is the site where the amino acid that matches the codon is attached to the tRNA
How many tRNAs? (8)
the genetic code is redundant
several different codons can specify a single amino acid
two possible implications:
more than one tRNA for many of the amino acids
some tRNA molecules can base-pair with more than one codon
Both are true!
some amino acids have more than one tRNA
some tRNAs are constructed so that they require accurate base-pairing only at the first two positions of the codon and can tolerate a mismatch (or wobble) at the third position
In bacteria, wobble base-pairings make it possible to fit the 20 amino acids to their 61 codons with as few as 31 kinds of tRNA molecules
the exact number of different kinds of tRNAs differs from one species to the next
RNA Basics (8)
RNA bases: A, C, G, U
Canonical Base Pairs:
A-U (2 hydrogen bonds)
G-C (3 hydrogen bonds)
G-U (wobble pairing)
Bases can only pair with one other base
Steric freedom (wobble) (8)
XYU and XYC always encode the same amino acid
XYA and XYG usually do
Francis Crick surmised from these data that the steric criteria might be less stringent for pairing of the third base than for the other two
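The two observations about the third codon position can be checked directly against the standard genetic code. A short sketch, in which the helper name `third_base_exceptions` is hypothetical:

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, with codons enumerated in UCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def third_base_exceptions(b1, b2):
    """Return the XY prefixes for which XY+b1 and XY+b2 encode different amino acids."""
    return [
        x + y
        for x, y in product(BASES, repeat=2)
        if CODON_TABLE[x + y + b1] != CODON_TABLE[x + y + b2]
    ]
```

Calling `third_base_exceptions("U", "C")` returns an empty list, while `third_base_exceptions("A", "G")` returns the two prefixes where the rule fails: UG (UGA stop vs. UGG Trp) and AU (AUA Ile vs. AUG Met).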
Inosine (8)
Inosine in the wobble position can pair with A, U, or C
Selenocysteine (8)
Bacteria, archaea, and eucaryotes have available to them a 21st amino acid that can be incorporated directly into a growing polypeptide chain
is essential for the efficient function of a variety of enzymes, contains a selenium atom in place of the sulfur atom of cysteine
is produced from a serine attached to a special tRNA molecule that base-pairs with the UGA codon, a codon normally used to signal a translation stop
requires a selenocysteine-specific translation factor
Features of RNA secondary and tertiary structure (8)
RNA secondary structure: intermediate step in the formation of a 3D structure (like proteins)
RNA secondary structure consists primarily of the double-stranded regions of the molecule, formed by folding the single-stranded molecule back on itself to form loops
A run of bases downstream in the RNA sequence must be complementary to another upstream run
Watson-Crick base pairing between the complementary nucleotides G/C and A/U
Non-Watson-Crick G/U wobble base pairs
Complementary sequences in RNA molecules maintain RNA secondary structure
RNA Structure Representations (8)
Types of single- and double-stranded regions in RNA secondary structures (8)
the double-stranded regions will most likely form where a series of bases in the sequence can pair with a complementary set elsewhere in the sequence
more base pairing = increased energetic stability
Single-stranded regions destabilize neighboring double-stranded regions
Display of base pairs in an RNA secondary structure by a circle plot (8)
Linear RNA strand folded back on itself to create secondary structure
Circularized representation uses this requirement
Arcs represent base pairing
Example of complex interactions between RNA secondary structure elements (8)
Problems in RNA sequence analysis (8)
Primary sequence based techniques that generally work quite well for protein sequence analysis are not well suited for studying RNA
Most functional RNAs appear to be selected more for maintenance of a particular base-paired structure than conservation of primary sequence
RNA secondary structure induces strong pairwise correlation in RNA sequence, usually manifested as Watson-Crick complementarity
RNA sequence analysis therefore must work with this pattern of correlation in addition to primary sequence conservation
More flexible methods needed:
capture both primary and secondary structure consensus information
flexibly scoring insertions, deletions, and mismatches
Applications:
database searching for new RNAs
multiple RNA sequence alignment
Sequence and base-pairing patterns can be used to predict RNA structure - 2 approaches (8)
Ab initio prediction from sequence
Two possible approaches:
Energy minimization methods: choose complementary sequence sets that provide the most energetically stable molecules
advantage: accommodation of experimental data and alignment data
disadvantages: no tertiary interactions predicted, very computationally expensive
take into account patterns of base-pairing that are conserved during evolution
patterns of covariation on RNA molecules are a manifestation of secondary structure
the computational challenge is to discover these covariable positions against the background of other sequence changes
advantages: simple, both secondary and tertiary interactions predicted
limitation: need a sufficient number of sequences that can be aligned
Self-complementary regions in RNA sequences predict secondary structure (8)
for single-stranded RNA, repeats represent regions that can potentially self-hybridize to form double strands
dot matrix analysis:
first axis: direct strand 5’ -> 3’
second axis: complementary strand 5’ -> 3’
find identities
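The dot matrix described above can be sketched as follows: the sequence is compared with its own reverse complement (both read 5' -> 3') and identities are marked. Function names are hypothetical, and G-U pairs are ignored because only identities are scored.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Complementary strand of an RNA sequence, written 5' -> 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def self_complementarity_dots(seq):
    """dots[i][j] is 1 where seq[i] equals base j of the reverse complement."""
    rc = reverse_complement(seq)
    return [[int(a == b) for b in rc] for a in seq]
```

Diagonal runs of 1s in the returned matrix mark stretches that could fold back and base-pair with another part of the same molecule.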
sequence alignment as a method to determine structure (8)
Bases pair in order to form backbones and determine the secondary structure
aligning bases based on their ability to pair with each other gives an algorithmic approach to determining the optimal structure
Base pair maximization (8)
Scoring system
+1 per base pair
0 anything else
Calculate the best structure for a continuous subsequence from i to j in a complete sequence of length N
Key idea: the optimal score S(i,j) can be defined recursively in terms of optimal scores of smaller subsequences
Recursive definition of the best score for a subsequence i,j: 4 possibilities (8)
i,j are a base pair, added on to a structure for i+1…j-1
Score = +1
i is unpaired, added on to a structure for i+1…j
Score = 0
j is unpaired, added on to a structure for i…j-1
Score = 0
i,j are paired, but not to each other: the structure for i…j adds together substructures for two sub-sequences, i…k and k+1…j (bifurcation)
Optimal Score S(i,k) is independent of anything going on in k+1,…j, and vice versa
S(i,k) + S(k+1,j) is the score of the optimal structure on i,j conditional on i,j being paired, but not to each other
Optimal score is the maximum of all 4 possibilities
(To run this recursion efficiently, we just need to make sure that whenever we try to compute an S(i,j), we already have calculated the scores for smaller sub-sequences —> Dynamic programming!)
Dynamic programming algorithm (8)
Tabulate the scores S(i,j) in a triangular matrix
Initialize on the diagonal: subsequences of length 0 or 1 have no base pairs, so S(i,i) = S(i,i-1) = 0
By convention, the i, i-1 cells represent zero length sequences; the recursion must never access an empty matrix
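The recursion and table filling can be sketched in Python as below. This is an illustration under some assumptions: indices are 0-based, G-U counts as a pair, there is no minimum hairpin loop length, and only the score is returned (no traceback). The name `max_base_pairs` is hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq):
    """Maximum number of nested base pairs in seq (base pair maximization)."""
    n = len(seq)
    if n == 0:
        return 0
    # S[i][j] holds the best score for the subsequence i..j (inclusive).
    # Cells with j <= i stay 0: such subsequences cannot contain a pair.
    S = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):            # shorter subsequences first
        for i in range(n - length + 1):
            j = i + length - 1
            inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0
            best = max(S[i + 1][j], S[i][j - 1])   # i unpaired / j unpaired
            if (seq[i], seq[j]) in PAIRS:
                best = max(best, inner + 1)        # i and j pair with each other
            for k in range(i + 1, j):              # bifurcation at k
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]
```

Filling by increasing subsequence length guarantees that every cell the recursion reads has already been computed. The bifurcation loop makes the run time cubic in sequence length.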
Base pair maximization - Drawbacks (8)
Base pair maximization will not necessarily lead to the most stable structure
may create structure with many interior loops or hairpins which are energetically unfavorable
Comparable to aligning sequences with scattered matches - not biologically reasonable
Trouble with Pseudoknots (8)
Pseudoknots cause a breakdown in the Dynamic Programming Algorithm
In order to form a pseudoknot, checks must be made to ensure base is not already paired - this breaks down the recurrence relations
Simplifying assumptions for secondary structure prediction (8)
the most likely structure - energetically most stable
energy associated with any position in the structure is only influenced by local sequence and structure
Energy associated with a particular base pair in a double-stranded region is influenced only by the previous base pair
Energy minimization (8)
Thermodynamic Stability
Estimated using experimental techniques
Theory: Most stable is the most likely
No Pseudoknots due to algorithm limitations
uses dynamic programming alignment technique
Attempts to maximize the score taking into account thermodynamics
MFOLD and ViennaRNA
Minimum free energy method (8)
Every base is compared for complementarity to every other base
the energy of each predicted structure is estimated by the nearest-neighbor rule:
sum the negative base-stacking energies for each pair of bases in the predicted double-stranded regions
add positive energies of destabilizing (unpaired) regions
The complementary regions are evaluated by a dynamic programming algorithm to predict the most energetically stable molecule
Finding the most energetically favourable structure (8)
Obtaining dynamic programming matrix values
consider the minimum energy values obtained by all previous complementary base pairs
decrease by stacking energy
increase by the destabilizing energy associated with noncomplementary bases
Repeat for the entire matrix
The increase depends on the type and length of loop that is introduced by the noncomplementary base pair
internal loop
bulge loop
hairpin loop
MFOLD (8)
derives energy dot plot
the method can be instructed to find structures within a certain percentage of the minimum free energy
Covariance model (CM) (8)
Describes both the secondary structure and the primary sequence consensus of an RNA
Can be applied to several RNA analysis problems
consensus secondary structure prediction
multiple sequence alignment
database similarity searching
Covariance models are constructed automatically
from existing RNA sequence alignments
even from initially unaligned example sequences
Iterative training procedure
Optimal algorithm for RNA secondary structure prediction based on pairwise covariations in multiple alignments
-> Covariation ensures ability to base pair is maintained and RNA structure is conserved
Need to allow for insertions, deletions, and mismatches to describe a family of related RNAs
Each node describes columns in a multiple alignment instead of bases in an individual sequence
Specific base assignments are replaced with symbol emission probabilities assigned to the 16 possible pairwise nucleotide combinations or 4 singlet nucleotides
Inferring Structure by comparative sequence analysis (8)
first step is to calculate a multiple sequence alignment
Requires sequences be similar enough so that they can be initially aligned
Sequences should be dissimilar enough for covarying substitutions to be detected
Mutual Information (8)
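This card has no body in the notes. One common way to quantify covariation between two alignment columns i and j is the mutual information M(i,j), the sum over base pairs (x,y) of f_ij(x,y) · log2( f_ij(x,y) / (f_i(x) · f_j(y)) ). The sketch below assumes that definition; the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (bits) between two alignment columns of equal length."""
    n = len(col_i)
    f_i, f_j = Counter(col_i), Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (x, y), count in f_ij.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((f_i[x] / n) * (f_j[y] / n)))
    return mi
```

Two perfectly covarying columns such as `"GGCC"` and `"CCGG"` give 1 bit, whereas two fully conserved columns give 0. This is why the sequences must be dissimilar enough for covarying substitutions to be detected.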
Rfam (8)
annotating non-coding RNAs in complete genomes
The combined secondary structure and primary sequence profile of a multiple sequence alignment of ncRNAs can be captured by statistical models, called profile stochastic context-free grammars (SCFGs)
Rfam is a database of ncRNA families represented by multiple sequence alignments and profile SCFGs
Covariance model drawbacks (8)
Needs to be well trained
Not suitable for searches of large RNA and for database searches
Structural complexity of large RNA cannot be modeled
Runtime
Memory requirements
Can be used for scanning candidate RNAs identified by other methods
microRNA (9)
A family of 21 - 25-nucleotide small RNAs
Function: altering the expression levels of a diverse repertoire of genes in a sequence-dependent manner
at the transcriptional or post-transcriptional level
regulate many aspects of development and physiology
Large family: hundreds of members in worms, flies, plants and mammals…
microRNA biogenesis and microRNA mediated gene expression regulation (8)
Primary miRNAs (pri-miRNAs) are initially processed by the Drosha/Pasha complex (‘microprocessor’) into ~60-70 nucleotide precursor miRNAs (pre-miRNAs) in the nucleus
These pre-miRNAs are transported into the cytoplasm by Exportin 5
in a next step, they are cleaved by Dicer into an imperfect double-stranded duplex
one strand of this duplex is incorporated into the RISC (RNA-induced silencing complex)
This complex binds to the target gene and will lead to translational repression
However, in some cases, miRNAs regulate gene expression by mRNA cleavage rather than by translational repression
RISC (9)
RNA-induced silencing complex
include
Dicer
Argonaute proteins
TRBP (HIV-1 transactivation responsive element (TAR) RNA-binding protein)
double-stranded (ds) RNA-binding protein (PACT)
microRNA biogenesis (9)
miRNAs are encoded in genomes either as independent transcriptional units with their own promoters (solo miRNAs) or as clusters of several miRNA genes transcribed as a single pri-miRNA
A substantial fraction of animal miRNA genes are located in introns of protein-coding genes
Whereas some intronic miRNAs have antisense orientation relative to their host genes and thus do not directly depend on host gene transcription, sense-oriented intronic miRNAs are thought to be processed as part of the host-gene mRNA and their expression correlates with that of their hosts
Mirtrons, which are encoded in introns, do not rely on Drosha processing and instead use the splicing machinery to generate pre-miRNAs
Splicing can result in tailed mirtrons, which require additional trimming by the exosome to produce a functional pre-miRNA
MirScan (9)
Slide a 110-nt window along both strands of the C. elegans genome, discarding
repetitive elements
segments with skewed base compositions not observed in known miRNA stem loops
segments overlapping with annotated coding regions
Fold the window with the secondary structure-prediction program RNAfold
Identify predicted stem-loop structures with a minimum of 25 bp and a folding free energy of at least 25 kcal/mole
Identify similar C. briggsae sequences by BLAST and fold them by RNAfold
Criteria used by MiRscan to identify miRNA genes among aligned segments of two genomes (9)
Seven features derived from the consensus hairpin structure (example for mir232):
base pairing of the miRNA portion of the fold-back
base pairing of the rest of the fold-back
stringent sequence conservation in the 5’ half of the miRNA
slightly less stringent sequence conservation in the 3’ half of the miRNA
sequence biases in the first five bases of the miRNA (especially a U at the first position)
a tendency toward having symmetric rather than asymmetric internal loops and bulges in the miRNA region
the presence of two to nine consensus base pairs between the miRNA and the terminal loop region, with a preference for 4–6 bp
Argonaute (9)
family of proteins that contain two conserved domains termed PAZ and PIWI
Multiple paralogs are typically present in each organism
Argonaute proteins are crucial for the maturation and function of microRNAs (miRNAs) and are essential for RNA interference
eIF2C2 is a human Argonaute protein
Dicer (9)
RNAse III-type nuclease that also contains RNA helicase, PAZ and double-stranded RNA (dsRNA)-binding domains
Dicer processes linear, dsRNA into small interfering RNA (siRNA) duplexes and also excises mature miRNAs from pre-miRNAs
Micro-ribonucleoproteins (miRNPs) (9)
ribonucleoprotein complex containing miRNAs, an Argonaute protein (eIF2C2) and the proteins Gemin3 (an RNA helicase) and Gemin4
MicroRNAs (9)
~22nt noncoding RNAs derived from endogenous genes. Processed from one of the strands of longer (~75nt) hairpin-like precursors termed pre-miRNAs
miRNAs assemble in complexes termed miRNPs and recognize their mRNA targets by antisense complementarity. If the complementarity is extensive, the target mRNA is cleaved and the miRNA acts as an siRNA; if the complementarity is partial, the translation of the target mRNA is repressed
miRgonaute/siRgonaute (9)
argonaute protein bound to miRNAs or siRNAs
RNAi-induced silencing complexes (RISC) (9)
Multisubunit nuclease that directs target RNA destruction in RNA interference (RNAi)
the core components of RISCs are siRNAs and Argonaute proteins
RNA interference (9)
Initially defined as a technique in which experimental introduction in C. elegans of dsRNA homologous to a target mRNA led to degradation of the targeted mRNA
More broadly defined as degradation of target mRNAs by homologous siRNAs
small interfering RNAs (9)
~22 to 25nt RNAs derived from processing of linear dsRNA
siRNAs assemble in complexes termed RISCs and target homologous RNA sequences for endonucleolytic cleavage
Synthetic siRNAs are also incorporated into RISCs and cleave homologous RNA sequences
miRNA gene prediction algorithms (9)
miRscan
miRseeker
srnaloop
MirScan: Computational identification of stem loops in C. elegans (9)
Slide a 110-nt window along both strands of the C. elegans genome, discarding
segments with skewed base compositions not observed in known miRNA stem loops
Identify predicted stem-loop structures with a minimum of 25bp and a folding free energy of at least 25 kcal/mole
This procedure yielded ~40.000 pairs of potential miRNA hairpins
For each pair of potential miRNA hairpins, generate a consensus C. elegans / C. briggsae structure
slightly less stringent sequence conservation in the 3’ half of the miRNA
the presence of two to nine consensus base pairs between the miRNA and the terminal loop region, with a preference for 4-6 bp
-> miRNA base pairing; extension of base pairing; 5’ conservation; 3’ conservation; bulge symmetry; distance from loop; initial pentamer
-> for a given feature i with a value x_i, MiRscan assigns a log-odds score -> the overall score assigned to a candidate miRNA is simply the sum of the log-odds scores for the seven features
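The additive scoring can be sketched as below. It assumes that each log-odds score has the form log2(f_i(x_i) / g_i(x_i)), where f_i is the frequency of value x_i among known miRNAs and g_i its frequency among background hairpins. The function name and the frequency tables in the usage note are hypothetical, not MiRscan's actual data.

```python
import math


def candidate_score(feature_values, f_known, g_background):
    """Sum of per-feature log-odds scores for one candidate hairpin.

    feature_values: feature name -> observed value x_i
    f_known:        feature name -> {value: frequency among known miRNAs}
    g_background:   feature name -> {value: frequency among background hairpins}
    """
    return sum(
        math.log2(f_known[name][x] / g_background[name][x])
        for name, x in feature_values.items()
    )
```

With made-up tables in which a symmetric bulge is seen in 50% of known miRNAs but only 12.5% of background hairpins, that feature contributes log2(4) = 2 to the total. Features that are rarer in known miRNAs than in the background contribute negative scores.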
How many miRNAs are in the human genome? (9)
2300 true human mature miRNAs
Prediction of Mammalian MicroRNA targets (9)
TargetScan
thermodynamics-based modeling of RNA:RNA duplex interactions
comparative sequence analysis
Input:
miRNA that is conserved in multiple organisms
a set of orthologous 3’ UTR sequences from these organisms
Structures, energies, and scoring for predicted RNA duplexes (9)
search the UTRs in the first organism for segments of perfect Watson-Crick complementarity to bases 2-8 of the miRNA: “miRNA seed” and “seed matches”
extend each seed match with additional base pairs to the miRNA as far as possible in each direction, allowing G:U pairs, but stopping at mismatches
optimize base pairing of the remaining 3’ portion of the miRNA to the 35 bases of the UTR immediately 5’ of each seed match using the RNAfold program
assign a folding free energy G to each such miRNA:target site interaction
assign a Z score to each UTR
sort the UTRs in this organism by Z score and assign a rank R_i to each
predict as targets those genes for which both Z_i >= Z_c and R_i <= R_c for an orthologous UTR sequence in each organism, where Z_c and R_c are pre-chosen Z score and rank cutoffs
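The first step, finding seed matches, can be sketched as below. This covers only the perfect Watson-Crick match to bases 2-8; the extension, folding and Z-score steps are omitted. Function names and the example sequences are illustrative assumptions.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna):
    """UTR segment (5' -> 3') that pairs perfectly with miRNA bases 2-8."""
    seed = mirna[1:8]  # bases 2-8 in 1-based numbering
    return "".join(COMPLEMENT[b] for b in reversed(seed))


def find_seed_matches(mirna, utr):
    """Start positions (0-based) of all seed matches in a 3' UTR."""
    site = seed_site(mirna)
    hits = []
    start = utr.find(site)
    while start != -1:
        hits.append(start)
        start = utr.find(site, start + 1)  # allow overlapping matches
    return hits
```

For a miRNA beginning `UGAGGUAG...`, the seed is `GAGGUAG` and the site searched for in the UTR is `CUACCUC`.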
Factors limiting the sensitivity of predicting miRNA targets conserved in multiple genomes (9)
Incompleteness of orthologous gene annotations
Some targets may not meet the stringent seed matching, Z score or rank criteria
Some target sites may lie outside the 3’ UTR (plants)
Some targets may not be conserved in the complete set of organisms
Method does not model the simultaneous interaction of multiple miRNA species with the same UTR
The actual number of target genes regulated by each miRNA is likely to be substantially higher
Comparison of the mechanisms of miRNA biogenesis and action in plants and animals (9)
Predictions have yielded larger and more variable precursor miRNA molecules for plants than for animals
The Drosha gene that processes the pri-miRNA to the pre-miRNA in animals is absent from plant genomes
In plants, the Dicer-like 1 (DCL1, an RNase-III-like protein) appears to catalyze the processing of the primary miRNA transcript to form the miRNA:miRNA* complex
Summary of difference between plant and animal miRNA systems (9)
MicroRNA targets in Drosophila: Algorithm and analysis pipeline (9)
Source data consisting of miRNAs and 3’ UTRs are processed initially by the miRanda algorithm, which searches for complementarity matches between miRNAs and 3’ UTRs using dynamic programming alignment and thermodynamic calculation.
all results are then post-processed by first filtering out results not consistently conserved according to target sequence similarity with D. pseudoobscura and A. gambiae, then by sorting and ranking all remaining results.
Finally, all miRNA target gene predictions are annotated using data from FlyBase and stored for further analysis
A more recent analysis of the human genome by miRanda (9)
Multiplicity and cooperativity in miRNA-target interactions are key features of the control of translation by miRNAs (9)
One miRNA can target more than one gene (multiplicity). Some miRNAs appear to be very promiscuous, with hundreds of predicted targets, but most miRNAs control only a few genes
One gene can be controlled by more than one miRNA (cooperativity). Some target genes appear to be subject to highly cooperative control, but most genes do not have more than four target sites. Although specific values are likely to change with refinement of target prediction rules, the overall character of the distribution may well be a biologically relevant feature reflecting system properties of regulation by miRNAs.
TarBase (9)
the benchmark set
a database for experimentally supported miRNA-target gene interactions, reports around 130 mammalian entries
TarBase also reports the experiments that were performed to provide support for each miRNA-target gene interaction, which range from in vitro reporter silencing assays to in vivo miRNA overexpression studies
Biased set!
miRBase (9)
the microRNA database
microRNA sequences, targets and gene nomenclature
nonrepetitive DNA (10)
sequences that are unique: there is only one copy in a haploid genome
repetitive DNA (10)
sequences that are present in more than one copy in each genome
moderately repetitive DNA, complex repeats: short sequences (10-1000 copies, typically dispersed throughout the genome)
highly repetitive DNA, simple repeats: very short sequences (<100 bp), many thousands of copies in the genome, often organized as long tandem repeats
2 types of repetitive DNA (10)
Basic prototypes of human repetitive DNA (10)
tandemly repeated DNA (minisatellites, microsatellites, centromeric and telomeric repeats)
long interspersed nuclear element (LINE) retro(trans)posons
short interspersed nuclear element (SINE) retro(trans)posons
autonomous and nonautonomous endogenous retroviral elements
autonomous and nonautonomous DNA transposons
-> every second base lies within such a repeat: 50% of the genome (in plants even 80%)
“Simple sequence repeats” (tandem repeats) (10)
Microsatellites
Minisatellites
‘Cryptically simple repeats’: result from reshuffling of a limited number of DNA sequence motifs in various orientations
‘Low-complexity repeats’: usually derived from other simple repeats, although their periodic character may be obscured by mutations
Satellite and telomeric repeats
Micro- and mini-satellites (10)
from one to a dozen base pairs
examples: (A)n, (CA)n, (CGG)n
these may be formed by replication slippage
Minisatellites: a dozen to 500 base pairs
Simple sequence repeats of a particular length and composition occur preferentially in different species
In humans, an expansion of triplet repeats such as CAG is associated with at least 14 disorders (including Huntington’s disease)
Micro- and minisatellites are often polymorphic and the biological mechanisms behind this phenomenon are thought to be different for the two groups. Owing to the polymorphism and relatively uniform distribution in chromosomal DNA, micro- and minisatellites are among the most informative genetic markers
Satellite and telomeric repeats (10)
a separate category of tandem repeats is represented by satellite and telomeric repeats
unlike micro- and minisatellites, these repeats are confined to well-defined chromosomal regions
satellites are primarily found in the centromeric regions of chromosomes
telomeric repeats occupy chromosomal ends, or telomeres
Example of a telomeric repeat: TTAGGG (in humans)
Centromeric repeats (e.g. a 171 base pair repeat of a satellite DNA in humans)
Such repetitive DNA can span millions of base pairs, and it is often species-specific
Human genome: simple sequence repeats (10)
Simple sequence repeats (SSR) are perfect (or slightly imperfect) tandem repeats of k-mers. Microsatellites have k=1 to 12, while minisatellites have k from about a dozen to 500 base pairs
Micro- and minisatellites comprise 3% of the genome
AC, AT, and AG are the most common dinucleotide repeats
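Perfect tandem repeats of k-mers, as defined above, can be located with a regular expression using a backreference. A minimal sketch for DNA; the function name and the `min_copies` threshold are assumptions, and imperfect repeats are not handled.

```python
import re


def find_tandem_repeats(seq, k, min_copies=3):
    """Find perfect tandem repeats of k-mers.

    Returns a list of (start, unit, copies) tuples, where start is 0-based.
    """
    # ([ACGT]{k}) captures one unit; \1{m,} requires at least m further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // k)
        for m in pattern.finditer(seq)
    ]
```

For example, `find_tandem_repeats("TTCACACACAGG", 2)` reports a (CA)4 repeat starting at position 2. Note that with k=2 a homopolymer run such as AAAAAA is also reported, as (AA)3, so results for different k should be reconciled.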
“Complex repeats” (interspersed repeats) (10)
Constitute ~45% of the human genome
Derived from biologically active 'transposable elements' (TEs)
Involve RNA intermediates (retroelements) or DNA intermediates (DNA transposons)
Retroelements: reproduce via reverse transcription followed by integration into the host DNA
long-terminal repeat transposons (LTR)
long interspersed elements (LINEs): these encode a reverse transcriptase
short interspersed elements (SINEs): these include Alu repeats
DNA transposons: capable of integrating themselves into, and excising themselves from, the host genome, thus taking advantage of the host replication through this 'cut-and-paste' mechanism
DNA transposons constitute 3% of the human genome
Mobile genetic elements (transposons, transposable elements (TE)) - three different mechanisms for transposition (10)
(left to right)
Conservative transposition: the element itself moves from the donor site into the target site
Replicative transposition: the element moves a copy of itself to a new site via a DNA intermediate
Retrotransposition: the element makes an RNA copy of itself which is reverse-transcribed into a DNA copy
Class I transposable elements (10)
Transpose through an RNA intermediary which is
transcribed from genomic DNA
reverse-transcribed into DNA by a TE-encoded reverse transcriptase (RT)
reintegrates into a genome
Each replication cycle produces one new copy
class I TEs are the major contributors to the repetitive fraction in large genomes
Retrotransposons are divided into five orders based on mechanistic features, organization and reverse transcriptase phylogeny
LTR retrotransposons
DIRS-like elements
Penelope-like elements (PLE)
LINEs (Long Interspersed Elements)
SINEs (Short Interspersed Elements)
Group-specific antigen (gag) (10)
codes for core and structural proteins of the virus
Polymerase (pol) (10)
codes for reverse transcriptase, protease and integrase
envelope (env) (10)
codes for the retroviral coat proteins
LTR retrotransposons (10)
Contain long terminal repeats (LTRs) several hundred to several thousand base pairs in length
Both exogenous retroviruses and LTR retrotransposons contain:
gag gene, that encodes a viral particle coat
pol gene that encodes a reverse transcriptase, ribonuclease H, and integrase, which provide the enzymatic machinery for reverse transcription and integration into the host genome
Unlike LTR retrotransposons, exogenous retroviruses contain an env gene, which encodes an envelope
Some LTR retrotransposons may contain remnants of an env gene but their insertion capabilities are limited to the originating genome
most of the LTR sequences (85%) are found only as isolated LTRs, with the internal sequence being lost
LINEs (10)
do not have the long terminal repeats
have a poly-A tail at the 3’ end
Comprise about 21% of the human genome
contain Pol-II promoter and two ORFs
ORF 1: encodes a non-sequence-specific RNA binding protein, functions as a chaperone for mRNA
ORF 2: encodes an endonuclease, which makes a single stranded nick in the genomic DNA, and a reverse transcriptase, which uses the nicked DNA to prime reverse transcription of LINE RNA from 3’ end
Because they encode their own retrotransposition machinery, LINE elements are regarded as autonomous retrotransposons
SINEs (10)
evolved from RNA genes, such as tRNA genes
Short, up to 1000 bp long
Do not encode their own retrotransposition machinery (nonautonomous elements)
The most abundant SINEs: Alu repeats (10)
Full-length Alu elements are ~300 bp long
Commonly found in
introns
3’ untranslated regions of genes
intergenic genomic regions
Most abundant SINEs
the Alu gene family comprises more than 10% of the mass of the human genome
Alu sequences accumulate preferentially in gene-rich regions
not uniformly distributed in the human genome
The origin of Alu elements (10)
the origin and amplification of Alu elements are evolutionarily recent events
coincided with the radiation of primates in the past 65 million years
Ancestrally derived from the 7SL RNA gene, whose RNA forms part of the signal recognition particle (SRP), a ribonucleoprotein complex that associates with the ribosome
Origins of all Alu elements can be traced to an initial gene duplication early in primate evolution
subsequent and continuing amplification of these elements
The origins of a variety of SINEs can be traced to the genes of various small, highly structured RNAs, such as transfer RNA genes, the transcription of which depends on RNA polymerase III
The expansion of SINEs of different origins has occurred simultaneously in several diverse genomes
reasons for this simultaneous expansion unknown
A typical human Alu element (10)
Alu structure is bi-partite
3’ half contains an additional 31-bp insertion relative to the 5’ half
Alu monomers also exist in the human genome, as do various truncated copies of both monomers and dimers
Total length ~300bp, depending on the length of the 3’ oligo(dA)-rich tail
Central A-rich region
Flanked by short intact direct repeats derived from the site of insertion
The 5’ half of each sequence contains an RNA-polymerase-III promoter
3’ terminus almost always consists of a run of As (occasionally interspersed with other bases)
Alu subfamilies (10)
Mutations that accumulate in the source genes are subsequently inherited by their copies
the human Alu family is composed of several distinct subfamilies of different genetic ages that are characterized by a hierarchical series of mutations
A number of human Alu elements share common diagnostic sequence features and comprise subfamilies that have expanded in different evolutionary time frames
Older Alu subfamilies:
smallest number of diagnostic subfamily-specific mutations
largest number of random mutations (up to 20% pairwise divergence), which confirms their ancient origin
Younger Alu subfamilies:
increasing number of subfamily-specific mutations
smaller number of random mutations (as little as 0.1% pairwise divergence) that accumulate after the individual Alu elements integrate into the genome
Class II transposable elements (10)
Move by a conservative cut-and-paste mechanism: excision of the donor element, reinsertion elsewhere in the genome
Subclass I: “cut-and-paste” transposons
Contain terminal inverted repeats and encode a transposase that binds near the inverted repeats and mediates mobility
Subclass II
Replicate without double-strand cleavage
Pseudogenes (10)
these genes have a stop codon or frameshift mutation and do not encode a functional protein
they commonly arise from retrotransposition, or following gene duplication and subsequent gene loss
Segmental duplications (10)
Large, nearly identical copies of genomic DNA, which range in size from 1 to >200 kb and are present in at least two locations in the human genome
intrachromosomal
interchromosomal
Originate from the duplicative transpositions of small portions of chromosomal material
Contain both high-copy number repeats and gene sequences with intron-exon structures and, unlike other repeat classes, share no defining characteristics
Distribution among human chromosomes non-uniform
About 5% of the human genome consists of segmental duplications
Duplicated regions often share very high (99%) sequence identity
Classification of repeated sequences (10)
Repetitive Sequences
Dispersed Repeats
Class I Retrotransposons
Class II DNA Transposons
Tandem Repeats
Microsatellites
Gene Families
Pseudogenes
Segmental Duplications
Why study repeats (10)
Repetitive DNA is ubiquitous in eukaryotic genomes
Repeats are believed to play significant roles in genome evolution and disease
Mobile elements (transposons and retrotransposons) may contain coding regions that are hard to distinguish from other types of genes
Repeats often induce many local alignments, complicating sequence assembly, comparisons between genomes and analysis of large-scale duplications and rearrangements
De novo approaches to finding repetitive elements (10)
Generally start with a self-comparison with a sequence similarity detection method to identify repeated sequence
Use a clustering method to group related sequences into families
Detecting repetition by sequence alignment methods is relatively easy
Automatically defining biologically reasonable families is more difficult
Local sequence alignments do not usually correspond to the biological boundaries of the repeats
degraded or partially deleted copies
related but distinct repeats
segmental duplications covering more than one repeat
Difficulty in defining element boundaries then causes a variety of subsequent problems in clustering related elements into families
Approaches to finding repetitive elements (10)
Find all the repeats in a genome
k-mer approach
sequence self-comparison
periodicity approach
Build a consensus of each family of related sequences
Classify detected sequences
K-mer approach (to finding repetitive elements) (10)
Sequences are scanned for overrepresented strings of a certain length
Repeats that belong to the same family are compositionally similar and share some oligomers
If the repeats occur many times in a genome, then those oligomers should be overrepresented
Since repeats and transposons in particular are not exactly the same, some mismatches must be allowed when oligo frequencies are calculated
Challenge: to determine optimal size of an oligo (k-mer) and the number of mismatches allowed
these parameters should be different for different types of transposons, i.e., low versus high copy number, old versus young transposons, and transposon class
Some programs use a suffix tree data structure
Another approach is to use fixed-length k-mers as seeds and extend those seeds to define a repeat family
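The k-mer idea can be sketched as follows. This is only an illustration under simple assumptions: one strand, a fixed k, and a brute-force mismatch count. The names `overrepresented_kmers`, `approx_count`, `min_count` and `max_mismatches` are hypothetical and not taken from any of the tools mentioned in these notes.

```python
from collections import Counter

def overrepresented_kmers(seq, k, min_count):
    """Exact k-mer counts; keep the k-mers seen at least min_count times."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def approx_count(seq, kmer, max_mismatches):
    """Count windows of seq that match kmer with at most max_mismatches substitutions."""
    k = len(kmer)
    return sum(
        hamming(seq[i:i + k], kmer) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )

seq = "ACGTTACGTAACGAT"
print(overrepresented_kmers(seq, 4, 2))  # exact counting
print(approx_count(seq, "ACGT", 1))      # also counts the diverged copy ACGA
```

Allowing one mismatch picks up the diverged copy that exact counting misses, which is the reason the notes call for tolerated mismatches; choosing k and the mismatch budget remains the hard part.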
REPUTER (10)
Determines all exact repetitive substrings in complete genomes
Exact repeats are only a small fraction of all repeats of biological interest
However, they often form core blocks of approximate repeats
Running time: linear in the length of the genome
Reduced time and space complexity
RepeatFinder (10)
a clustering method for repeat analysis in DNA sequences
Idea:
First identify all exact repeats in the input sequence
then define repeat classes by merging and extending these short exact matches
Step 1: preprocessing (apply Reputer)
Step 2: merging and repeat generation
Identifying repeat families: manual approaches (10)
For widely studied genomes such as human and mouse, libraries of repeat families have been manually curated:
Repbase Update library (a database and an electronic journal of repetitive elements)
RepeatMasker library
PFAM (10)
Collection of Markov chains / models (sequence-to-model alignment)
TODO
General scheme for computer-assisted identification of repetitive DNA (10)
RepeatMasker (10)
Best known program
Uses precompiled representative sequence libraries to find homologous copies of known repeat families
Indispensable in genomes in which repeat families have already been analyzed
For new genomes, new repeat libraries first need to be manually compiled
de novo method desired that automates the process of compiling RepeatMasker libraries
program that screens DNA sequences for interspersed repeats and low complexity DNA sequences
identifies simple sequence repeats & Alu repeats and masks repetitive DNA (FASTA format)
Hi-C (11)
Three-dimensional genome structure
Ribo-Seq (11)
Ribosome-protected mRNA fragments (that is, active in translation)
Sanger sequencing (first generation) (11)
The insertion of a terminator base into the growing strand halts the copying process. This is a random event that results in a series of fragments of different lengths, depending on the base at which the copying stopped. The fragments are separated by size by running them through a gel matrix, with the shortest fragments at the bottom and largest at the top
The terminators are labelled with different fluorescent dyes, so each fragment will fluoresce a particular color depending on whether it ends with an A, C, G or T base
The sequence is ‘read’ by a computer. It generates a ‘sequence trace’, as shown here, with the colored peaks corresponding to fluorescent bands read from the bottom to the top of one lane of the gel
Today:
3 decades of gradual improvement
Read-lengths of up to ~1000bp
per-base “raw” accuracies as high as 99.999%
$0.50 per kilobase
Next-Generation Sequencing (NGS) Instruments (11)
Second generation technologies (ensemble of DNA molecules)
Roche/454
Illumina/Solexa
Life Technologies (SOLiD: Sequencing by Oligonucleotide Ligation and Detection)
Third generation technologies (single molecules)
Pacific Biosciences
Ion Torrent
Oxford Nanopore
Common features of next-generation DNA sequencing instruments (11)
Random fragmentation of DNA, ligation with custom linkers = “a library”
Library amplification on a solid surface (either bead or glass)
Direct step-by-step detection of each nucleotide base incorporated during the sequencing reaction
Hundreds of thousands to hundreds of millions of reactions imaged per instrument run = “massively parallel sequencing”
Shorter read lengths than capillary sequencers
A “digital” read type that enables direct quantitative comparisons
Shotgun sequencing with cyclic-array methods (11)
Common adaptors are ligated to fragmented genomic DNA
Array of millions of spatially immobilized PCR colonies or ‘polonies’
many copies of a single shotgun library fragment
All polonies are tethered to a planar array
a single microliter-scale reagent volume (e.g., for primer hybridization and then for enzymatic extension reactions) can be applied to manipulate all array features in parallel
imaging-based detection of fluorescent labels incorporated with each extension can be used to acquire sequencing data on all features in parallel
Successive iterations of enzymatic interrogation and imaging are used to build up a contiguous sequencing read for each array feature
DNA sequencing: conventional vs. second-generation (11)
Conventional: DNA fragmentation —> in vivo cloning and amplification —> Cycle sequencing —> Electrophoresis
Second-generation: DNA fragmentation —> In vitro adaptor ligation —> Generation of polony array —> Cyclic array sequencing
Features of NGS instruments (11)
Each platform: complex interplay of enzymology, chemistry, high-resolution optics, hardware, and software engineering
Highly streamlined sample preparation steps prior to DNA sequencing
Main idea: amplify single strands of a fragment library and perform sequencing reactions on the amplified strands
The fragment libraries are obtained by annealing platform-specific linkers to blunt-ended fragments generated directly from a genome or DNA source of interest
molecules then can be selectively amplified by PCR
no bacterial cloning step is required to amplify the genomic fragment in a bacterial intermediate
Helicos and Pacific Biosciences: “single molecule” sequencers, do not require any amplification of DNA fragments
All platforms offer paired-end read capability, i.e. sequences can be derived from both ends of the library fragments
good for sequencing large and complex genomes because paired-end reads can be more accurately placed (“mapped”) than single-ended short reads
Base extension (11)
a single-stranded DNA fragment (template) is anchored to a surface with the starting point of a complementary strand, called the primer, attached to one of its ends
when fluorescently tagged nucleotides (dNTPs) and polymerase are exposed to the template, a base complementary to the template will be added to the primer strand
Remaining polymerase and dNTPs are washed away, then laser light excites the fluorescent tag, revealing the identity of the newly incorporated nucleotide. The tag is then stripped away, and the process starts anew
Ligation (11)
An “anchor primer” is attached to a single-stranded template to designate the beginning of an unknown sequence
Short, fluorescently labeled “query primers” are created with degenerate DNA, except for one nucleotide at the query position bearing one of the four base types
The enzyme ligase joins one of the query primers to the anchor primer, following base-pairing rules to match the base at the query position in the template strand
The anchor-query-primer complex is then stripped away and the process repeated for a different position in the template
Amplification (11)
Because light signals are difficult to detect at the scale of a single DNA molecule, base-extension or ligation reactions are often performed on millions of copies of the same template strand simultaneously. Cell-free methods (a and b) for making these copies involve PCR on a miniaturized scale
a) Polonies - polymerase colonies - created directly on the surface of a slide or gel each contain a primer, which a template fragment can find and bind to. PCR within each polony produces a cluster containing millions of template copies
b) Droplets containing polymerase within an oil emulsion can serve as tiny PCR chambers to produce bead polonies. When a template fragment attached to a bead is added to each droplet, PCR produces 10 million copies of the template, all attached to the bead.
Next-generation sequencing (11)
Library preparation, DNA fragmentation and in vitro adaptor ligation
emulsion PCR
Pyrosequencing (454 sequencing)
Sequencing-by-ligation (SOLiD platform)
bridge PCR
Sequencing-by-synthesis (Solexa technology)
Clonal amplification of sequencing features: Roche 454 and SOLiD platforms (11)
Emulsion PCR
PCR amplification in the context of a water-in-oil emulsion
One of the PCR primers is tethered to the surface (5’-attached) of micron-scale beads that are also included in the reaction
Each clonally amplified bead bears on its surface PCR products corresponding to amplification of a single molecule from the template library
Roche/454 Library Construction and emPCR (11)
the library fragments are mixed with a population of agarose beads whose surfaces carry oligonucleotides complementary to the 454-specific adapter sequences on the fragment library, so each bead is associated with a single fragment
Each of these fragment:bead complexes is isolated into individual oil:water micelles that also contain PCR reactants
Thermal cycling (emulsion PCR) of the micelles produces approximately one million copies of each DNA fragment on the surface of each bead
These amplified single molecules are then sequenced en masse
Pyrosequencing (11)
Beads are arrayed into a picotiter plate that holds a single bead in each of several hundred thousand single wells
fixed location at which each sequencing reaction can be monitored
Enzyme-containing beads that catalyze the downstream pyrosequencing reaction steps are added
Each incorporation of a nucleotide by DNA polymerase results in the release of pyrophosphate, which initiates a series of downstream reactions that ultimately produce light by the firefly enzyme luciferase
The amount of light produced is proportional to the number of nucleotides incorporated (up to the point of detector saturation)
CCD camera that records the light emitted at each bead
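Because the light signal is proportional to the number of nucleotides incorporated in a flow, a read can be reconstructed from the per-flow intensities. The sketch below assumes a cyclic nucleotide flow order of T, A, C, G and already normalized intensities; both are assumptions made for this illustration, and `decode_flowgram` is a hypothetical helper, not part of any vendor software.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert normalized per-flow light intensities into a base sequence."""
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Intensity is roughly the homopolymer length for this flow.
        n = int(round(signal))
        bases.append(nucleotide * n)
    return "".join(bases)

flows = [1.02, 0.05, 2.1, 0.9, 0.0, 1.1, 0.1, 2.95]
print(decode_flowgram(flows))
```

Rounding is where long homopolymers go wrong in practice: a signal of 6.4 versus 6.6 is hard to separate, which is one reason insertion and deletion errors are typical for this chemistry.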
Illumina: sequencing-by-synthesis (11)
Single molecule amplification
starts with an Illumina-specific adapter library
takes place on the oligo-derivatized surface of a flow cell
sequencing templates are immobilized on a proprietary flow cell surface
Flow cell: 8-channel sealed glass microfabricated device
allows bridge amplification of fragments on its surface
uses DNA polymerase to produce multiple DNA copies, or clusters, that each represent the single molecule that initiated the cluster amplification
A separate library can be added to each of the eight channels, or the same library can be used in all eight, or combinations thereof
Solid-phase amplification creates up to 1000 identical copies of each single template molecule in close proximity
sufficient for reporting incorporated bases at the required signal intensity for detection during sequencing
Cluster generation (11)
Prepare genomic DNA sample: Randomly fragment genomic DNA and ligate adapters to both ends of the fragments
Attach DNA to surface: Bind single-stranded fragments randomly to the surface of flow cell
Bridge amplification: Add unlabeled nucleotides and enzyme to initiate solid-phase bridge amplification
Fragments become double-stranded: The enzyme incorporates nucleotides to build double-stranded bridges
Denature the double-stranded molecules: Denaturation leaves single-stranded templates anchored to the substrate
Complete amplification: Several million dense clusters of double-stranded DNA are generated in each channel of the flow cell
Sequencing by synthesis (11)
Determine first base: The first sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA polymerase
Image first base: After laser excitation the emitted fluorescence from each cluster is captured and the first base is identified
Determine second base: The next cycle repeats the incorporation of four labeled reversible terminators, primers, and DNA polymerase
Image second chemistry cycle: After laser excitation the image is captured as before, and the identity of the second base is recorded
Sequencing over multiple chemistry cycles: The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time
Align data: The data are aligned and compared to a reference, and sequencing differences are identified
Ligase-mediated sequencing (11)
Primer is annealed to the shared adapter sequences on each amplified fragment
DNA ligase is provided along with specific fluorescent-labeled 8mers, whose 4th and 5th bases are encoded by the attached fluorescent group
Each ligation step is followed by fluorescence detection, after which a regeneration step removes bases from the ligated 8mer (including the fluorescent group) and concomitantly prepares the extended primer for another round of ligation
Principles of two-base encoding (11)
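The card itself carries no text, so the following is a sketch of the commonly described SOLiD colour-space scheme rather than a reproduction of the original figure: each colour (0 to 3) encodes a pair of adjacent bases, and with A=0, C=1, G=2, T=3 the colour equals the XOR of the two base codes. The function names are made up for the example.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode_colors(seq):
    """One colour per overlapping dinucleotide: XOR of the two 2-bit base codes."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base, colors):
    """Decoding needs the known first base; each colour then fixes the next base."""
    seq = [first_base]
    for c in colors:
        seq.append(BASE[CODE[seq[-1]] ^ c])
    return "".join(seq)

colors = encode_colors("ATGGC")
print(colors)
print(decode_colors("A", colors))
```

Since every base is covered by two colours, a true single-base variant changes two adjacent colours, whereas an isolated colour change is more likely a measurement error.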
Pacific Biosciences (11)
Single Molecule Real Time (SMRT) technology
Nanophotonic structure, small volume of observation
Parallel, simultaneous detection of thousands of single-molecule sequencing reactions
Fluorescent nucleotides linked to the phosphate moiety, and not to the base
Continuous observation of DNA synthesis over thousands of bases without steric hindrance
Long reads, short run times, high quality
Oxford Nanopore (11)
Claimed advantages
high-throughput, ultralong reads at low cost
Single-molecule read-out
Millions of base pairs per hour
Minimal preparation (blood samples)
Simple sequence assembly
Full genome scans to identify
rare mutations contributing to mendelian diseases
Sequence variants that contribute to complex disorders
Systems
MinION: memory key-sized disposable unit, $1000, plugged into a laptop, gigabase of DNA
GridION: genome-scale sequencing
Problems
No data has been released so far
Current error rate 4% likely to be improved
Problem: base-calling informatics to interpret current changes
Previous attempts at single molecule sequencing unsuccessful: Pacific Biosciences, Helicos
Nanopore Sequencing (11)
Like electrophoresis, this technique draws DNA toward a positive charge. To get there, the molecule must cross a membrane by going through a pore whose narrowest diameter of 1.5 nanometers will allow only single-stranded DNA to pass
As the strand transits the pore, nucleotides block the opening momentarily, altering the membrane’s electrical conductance, measured in picoamperes (pA)
Physical differences between the four base types produce blockades of different degrees and duration.
A close-up of a blockade event measurement shows a conductance change when a 150-nucleotide strand of a single base type passed through the pore.
Refining this method to improve its resolution to single bases could produce a sequence readout such as the hypothetical example at bottom and yield a sequencing technique capable of reading a whole human genome in just 20 hours without expensive DNA copying steps and chemical reactions.
Base qualities (11)
Sequencing machines output the read sequence with a list of qualities, usually one per called base
Qualities are small numbers, typically between 0 and 40, which express the probability that the calling of the corresponding base went wrong
The PHRED software reads DNA sequencing trace files, calls bases and assigns a quality value to each base called
PHRED quality score Q of a base call, defined in terms of the estimated probability of error P: Q = -10 · log10(P)
A PHRED score of 20 (error probability 1%, i.e. 99% accuracy) is considered an acceptable level of accuracy
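A small worked example of the relation between PHRED score and error probability, using the standard definition Q = -10 · log10(P). The function names are chosen for this illustration only.

```python
import math

def phred_from_error(p):
    """Quality score for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def error_from_phred(q):
    """Estimated error probability for a quality score q."""
    return 10 ** (-q / 10)

for q in (10, 20, 30, 40):
    print(q, error_from_phred(q))
```

So Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1000.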
FastQC (11)
a quality control for high-throughput sequence data
FASTQ format (11)
De facto common format for data exchange between tools
Simple extension to the FASTA format: adds the ability to store a numeric quality score for each nucleotide
Minimal representation of a sequencing read
nothing about the relative levels of the 4 nucleotides
no attempt to deal with flow or color space data
PHRED scores are stored as single characters
Sanger FASTQ files use ASCII 33-126 to encode PHRED qualities from 0 to 93
broad range of error probabilities, from 1.0 (wrong base) through to 10^-9.3 (an extremely accurate read)
Other variants of FASTQ: Solexa, Illumina, ABI SOLID
Example FASTQ file:
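The example from the original notes is not preserved here. As a stand-in, the sketch below parses a made-up record (read name, sequence and quality string are invented for illustration) and decodes the Sanger-style qualities by subtracting 33 from each ASCII code. It assumes the simple four-lines-per-record layout and does not handle wrapped sequences.

```python
def parse_fastq(text):
    """Parse four-line FASTQ records into (name, sequence, qualities) tuples."""
    lines = text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Sanger encoding: PHRED quality = ASCII code - 33
        records.append((header[1:], seq, [ord(c) - 33 for c in qual]))
    return records

EXAMPLE = "@read1\nGATTACA\n+\nII5+!~#\n"  # hypothetical record
print(parse_fastq(EXAMPLE))
```

The characters `!` and `~` mark the two ends of the Sanger range, decoding to qualities 0 and 93.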
Quality trimming (11)
Sample barcoding and de-multiplexing (11)
Common sequence artefacts in NGS data (11)
Read errors
Base calling errors
Small insertions and deletions
Poor quality reads
Primer / adapter contamination
Removal of adapter sequences (11)
Necessary when the read length exceeds the length of the sequenced molecule, e.g. for small RNAs
Different scenarios requiring adapter removal:
Read runs into adapter: Trim the 3’ end
Full adapter in the beginning: Trim / discard the reads based on the residual minimum read length
Adapter within read: Trim the adapter region but retain reads only with a minimum read-length
Tools for adapter trimming: fastx_clipper (FastX-Toolkit), PRINSEQ
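The scenarios above can be sketched with exact string matching. Real trimmers tolerate mismatches and use quality information; this simplified version does not. The function `trim_adapter`, its parameters, and the adapter sequence used in the usage example are assumptions for illustration.

```python
def trim_adapter(read, adapter, min_overlap=3, min_length=15):
    """Remove a 3' adapter; return None if the remaining read is too short."""
    # Full adapter inside (or at the start of) the read: cut from its first base.
    pos = read.find(adapter)
    if pos == -1:
        # Read runs into the adapter: the read ends with a prefix of the adapter.
        for n in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
            if read.endswith(adapter[:n]):
                pos = len(read) - n
                break
    trimmed = read if pos == -1 else read[:pos]
    return trimmed if len(trimmed) >= min_length else None

ADAPTER = "AGATCGGAAGAGC"          # example adapter sequence
INSERT = "ACGTACGTACGTACGTACGT"    # made-up 20-nt insert
print(trim_adapter(INSERT + ADAPTER + "TTT", ADAPTER))  # adapter within read
print(trim_adapter(INSERT + "AGATC", ADAPTER))          # read runs into adapter
print(trim_adapter(ADAPTER + "ACGT", ADAPTER))          # adapter at the beginning
```

The `min_overlap` threshold trades sensitivity against accidentally clipping genuine sequence that happens to end in the first few adapter bases.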
Assembling a genome (11)
Fragment DNA and sequence
Find overlaps between reads
Assemble overlaps into contigs
Assemble contigs into scaffolds
Why are genomes hard to assemble? (11)
Biology: High ploidy, heterozygosity, repeat content
Sequencing technology: Large genomes, sequencing errors
Computational: Big data, complex structure
Accuracy: Hard to assess correctness
Read coverage (11)
Redundant coverage:
high-quality assembly
increased accuracy (error rate: single read 1%, eightfold coverage: 10^-16)
necessary to sequence polymorphic alleles within diploid or polyploid genomes
Human genome project
Sanger sequencing, 30 million reads, up to 800 bases
~24 Gb (gigabases) of data
~eightfold coverage
Second-generation sequencing
read length 35-400 bases
greater speed, lower cost
much greater coverage needed
Not all problems can be overcome with high coverage
main challenge: repeats
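The figures on this card can be checked with a little arithmetic. The sketch assumes a genome size of 3.2 billion nucleotides and, for the consensus error, the crude model that a position is miscalled only if all reads covering it are wrong independently. Function names are illustrative.

```python
def coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base (total bases / genome size)."""
    return n_reads * read_length / genome_size

def consensus_error(per_read_error, depth):
    """Probability that every one of `depth` independent reads is wrong at a base."""
    return per_read_error ** depth

total_bases = 30_000_000 * 800
print(total_bases)                                   # 24 billion bases, i.e. ~24 Gb
print(coverage(30_000_000, 800, 3_200_000_000))      # 7.5, roughly eightfold
print(consensus_error(0.01, 8))                      # about 1e-16
```

With reads of 35 to 400 bases the same coverage needs far more reads, and the independence assumption breaks down in repeats, which is why coverage alone does not solve assembly.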
k-mer uniqueness ratio (11)
percentage of the genome that is covered by unique sequences of length k or longer
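One possible way to compute this ratio is sketched below: a base counts as covered if it lies inside at least one k-mer that occurs exactly once in the sequence. This is an interpretation of the definition, restricted to one strand, and `kmer_uniqueness_ratio` is a hypothetical name.

```python
from collections import Counter

def kmer_uniqueness_ratio(genome, k):
    """Percentage of bases covered by at least one k-mer that occurs only once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    covered = [False] * len(genome)
    for i, kmer in enumerate(kmers):
        if counts[kmer] == 1:
            for j in range(i, i + k):
                covered[j] = True
    return 100.0 * sum(covered) / len(genome)

print(kmer_uniqueness_ratio("ACGTACGT", 4))
print(kmer_uniqueness_ratio("ACGTACGT", 5))
```

The ratio rises with k: longer words are more likely to be unique, which is the argument for longer reads in repeat-rich genomes.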
Simple “greedy” assembly algorithm (11)
Compare all pairs of reads with each other
Those that overlap most are merged first
Sequencing errors: allows for a small number of differences in the overlapping sequence (typically 1% - 10%)
Once all overlaps are computed, the reads with the longest overlap are concatenated to form a contig (contiguous sequence)
The process repeats, each time merging the sequences with the longest overlap, until all overlaps are used
This method fails for repetitive sequences longer than the read length. The greedy algorithm will assemble all copies of a repeat into a single instance, because all reads with the repetitive sequence overlap equally well
The greedy algorithm cannot tell how to connect the unique sequences on either end of a repeat, and it can easily assemble together distant portions of the genome into misassembled, “chimeric” contigs
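A toy version of the greedy procedure, assuming error-free reads from one strand and exact suffix-prefix overlaps (no tolerance for the 1% to 10% differences mentioned above). `overlap` and `greedy_assemble` are names invented for this sketch.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return reads  # no overlaps left: these are the contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]

print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

The all-against-all comparison is quadratic in the number of reads, and ties between equally good overlaps are broken arbitrarily, which is exactly where repeats lead the method astray.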
Long reads: overlap graph (11)
Based on a set of ten 8-bp reads, we can build an overlap graph in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges
Repeat sequences create a fork in the graph
Short reads: de Bruijn graph (11)
in a de Bruijn graph, a node is created for every k-mer in all the reads
here the k-mer size is 3
Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k-1 bases.
Note here we have only considered the forward orientation of each sequence to simplify the figure
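The construction described on this card can be sketched as a dictionary from each k-mer node to the set of k-mers that follow it in some read. Like the figure, the sketch considers the forward orientation only; the reads in the usage example are made up.

```python
def de_bruijn(reads, k):
    """Nodes are k-mers; an edge links successive k-mers of a read (overlap k-1)."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # every k-mer becomes a node
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return graph

graph = de_bruijn(["ACGTC", "GTCA"], 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Reads never have to be compared pairwise: two reads that share a k-mer automatically meet at the same node, which is why the approach suits large numbers of short reads.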
RNAseq (11)
comprehensive study of the transcriptome
Identifies the full set of transcripts, including large and small RNAs, novel transcripts from unannotated genes, rare transcripts, splicing isoforms and gene-fusion transcripts
Reveals the complex landscape and dynamics of the transcriptome from yeast to human at an unprecedented level of sensitivity and accuracy
Base-pair-level resolution and a much higher dynamic range of expression levels
RNAseq workflow (11)
Quality control
Alignment of reads to reference genome
Transcriptome assembly
Differential expression
Overview of the experimental steps in an RNA sequencing (RNA-seq) protocol (11)
Experimental design: number of replicates, depth of sequencing
Parameters: alignment rate, desired power, significance level, log-fold change
Data generation (11)
Data analysis (11)
Mapping billions of short reads onto genomes (11)
RNA-Seq assays produce short reads sequenced from processed mRNAs
Directly aligning these reads to the genome will produce the alignments shown in black but will fail to align the blue reads
A spliced-read mapper will also report the (blue) alignments spanning intron boundaries
Spaced seed indexing (e.g. MAQ) (11)
Indexing:
cut each position in the reference into equal-sized pieces (‘seeds’)
seeds are paired and stored in a lookup table
Matching
each read is also cut up
pairs of seeds are used as keys to look up matching positions in the reference
If the entire read aligns perfectly to the reference genome, then all of the seeds will also align perfectly
If there is one mismatch (e.g. a SNP), it must fall within one of the four seeds
the other three will still match perfectly
Two mismatches will fall in at most two seeds
leaving the other two to match perfectly
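The pigeonhole argument can be made concrete with a short sketch. It assumes reads whose length is a multiple of four and compares a read against a reference window of the same length; it illustrates the principle only and is not MAQ's actual index. All function names are hypothetical.

```python
from itertools import combinations

def split_into_seeds(seq, n_seeds=4):
    """Cut a sequence into n_seeds equal-sized, non-overlapping pieces."""
    size = len(seq) // n_seeds
    return [seq[i * size:(i + 1) * size] for i in range(n_seeds)]

def matching_seeds(read, ref_window, n_seeds=4):
    """For each seed, does it match the reference window exactly?"""
    pairs = zip(split_into_seeds(read, n_seeds), split_into_seeds(ref_window, n_seeds))
    return [a == b for a, b in pairs]

def perfect_seed_pairs(read, ref_window, n_seeds=4):
    """Seed pairs that could serve as exact lookup keys despite the mismatches."""
    ok = matching_seeds(read, ref_window, n_seeds)
    return [(i, j) for i, j in combinations(range(n_seeds), 2) if ok[i] and ok[j]]

ref = "ACGTACGTACGTACGT"
print(perfect_seed_pairs("ACGTAGGTACGTACGT", ref))  # one mismatch
print(perfect_seed_pairs("ACGTAGGTACGTACTT", ref))  # two mismatches
```

With up to two mismatches at least one of the six seed pairs still matches exactly, so looking up all pairs is guaranteed to find the candidate position.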
Suffix tree for ATCATG (11)
The Burrows-Wheeler transform (11)
produces a permutation of a string
Reversible - the original string can be recovered
Applications: string compression, pattern matching
Form successive circular permutations of the string
Sort these lines into alphabetical order
Report the last column: the Burrows-Wheeler transform
The Burrows-Wheeler Transform brings repeats together, facilitating compression
the transformed string can be compressed by run-length encoding: transcribe each repeated character once, followed by the number of times it is repeated
Because the transform is reversible, compressing the transformed string is equivalent to compressing the original string
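The three construction steps and the reversibility can be tried out with a naive implementation. It sorts all rotations explicitly, which is fine for short strings but far too slow for genomes, and it assumes the string ends in a unique terminator `$`. The run-length helper writes a count after every character, including counts of 1.

```python
from itertools import groupby

def bwt(s):
    """Last column of the sorted circular rotations of s (s must end with '$')."""
    if not s.endswith("$"):
        raise ValueError("string must end with the terminator '$'")
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(t):
    """Rebuild the original string by repeatedly prepending t and sorting."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    return next(row for row in table if row.endswith("$"))

def run_length_encode(s):
    """Each run of identical characters becomes the character plus its length."""
    return "".join(f"{ch}{len(list(group))}" for ch, group in groupby(s))

transformed = bwt("acaaca$")
print(transformed)
print(run_length_encode(transformed))
print(inverse_bwt(transformed))
```

On such a short string the runs are short, so the encoded form is not yet smaller; the gain appears on long, repeat-rich texts where the transform groups many identical characters.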
Use of the Burrows-Wheeler transform for searching for patterns in strings (11)
(review this again, e.g. as a YouTube video)
Find occurrences of pattern P (aca) within a string S (acaaca$)
S’: Burrows-Wheeler Transform of S
Any pattern P that appears in S is the prefix of one of the suffixes
P is the prefix of suffixes 4 and 5
The Burrows-Wheeler Transform provides an index of the suffix array that obviates the need to search for lines beginning with aca in the sorted suffix array of acaaca$
Assign to each alphabetic character in S a rank that specifies the number of times that character occurs previously in S
Attach the rank to each alphabetic character in the original string as a superscript
Compute the Burrows-Wheeler Matrix: the i-th occurrence of any character in the last column has the same rank as its i-th occurrence in the first column (the Burrows-Wheeler Transform of the original string)
Search for aca backwards:
starting with the final a: rows 2-5 begin with a
the character preceding the final a is c
the character in column 1 is preceded in the full string by the character in column 7 in the same row
for rows 2 and 3, the a in column 1 corresponds to the c in column 7, with ranks 0 and 1
we now know that the first and second occurrences of c are part of a ca substring
we also know that it is the first two occurrences of c (ranks 0 and 1) that begin with ca
both have an a in column 7, completing the pattern aca
the corresponding ranks are 2 and 3, indicating that the two occurrences of aca begin with the first and third appearances of a^
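The backward search on acaaca$ can be reproduced with a compact sketch. It uses the usual formulation with a table C (number of characters smaller than c) and an occurrence count over the transform, which is an equivalent bookkeeping to the rank superscripts used on this card. Counting with `str.count` is deliberately naive, row numbers are 0-based inside the code, and the function names are invented.

```python
def bwt_and_suffix_array(s):
    """Naive construction: sort rotation start positions, read off the last column."""
    sa = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return "".join(s[i - 1] for i in sa), sa

def backward_search(pattern, bwt, sa):
    """Return the sorted 0-based text positions where pattern occurs."""
    # C[c]: number of characters in the text that sort before c.
    C, total = {}, 0
    for c in sorted(set(bwt)):
        C[c] = total
        total += bwt.count(c)
    lo, hi = 0, len(bwt)  # current row interval [lo, hi) of the sorted matrix
    for c in reversed(pattern):
        if c not in C:
            return []
        # Map the interval through the last-to-first column correspondence.
        lo = C[c] + bwt[:lo].count(c)
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

bwt_string, sa = bwt_and_suffix_array("acaaca$")
print(bwt_string)
print(backward_search("aca", bwt_string, sa))
```

For the pattern aca the row interval shrinks from rows 2-5 (all rows starting with a) to rows 6-7 (starting with ca) and finally to rows 4-5 in 1-based numbering, matching the walkthrough above; the suffix array then translates these rows into text positions 0 and 3.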
Burrows-Wheeler Transform (e.g. Bowtie) (11)
Idea: suffix arrays from BWT are more efficient
Data compression: reorders the genome such that sequences that exist multiple times appear together in the data structure
Read is aligned one character at a time
Each successively aligned new character allows Bowtie to winnow the list of positions to which the read might map
If Bowtie cannot find a location where a read aligns perfectly, the algorithm backtracks to a previous character of the read, makes a substitution and resumes the search
first solve a simple subproblem - align one character
building on that solution solve a slightly harder problem - align two characters
and so on, until the entire read has been aligned
(30 times faster than MAQ)
Strategies for gapped alignments of RNA-seq reads to the genome: Exon-first approach (11)
Map full, unspliced reads (exonic reads)
Remaining reads are divided into smaller pieces and mapped to the genome
An extension process extends mapped pieces to find candidate splice sites to support a spliced alignment
Strategies for gapped alignments of RNA-seq reads to the genome: Seed-extend approach (11)
Store a map of all small words (k-mers) of similar size in the genome in an efficient lookup data structure
Each read is divided into k-mers, which are mapped to the genome via the lookup structure
Mapped k-mers are extended into larger alignments, which may include gaps flanked by splice sites
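A rough sketch of the seed step, assuming a plain Python dictionary as the lookup structure and non-overlapping seeds; real aligners use far more compact indexes and a proper extension step. The toy genome, the read, and the function names are made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer of the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def seed_hits(read, index, k):
    """Return (read_offset, genome_position) for each non-overlapping read k-mer found."""
    hits = []
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            hits.append((offset, pos))
    return hits


# Toy example: the run of G's plays the role of an intron missing from the read.
genome = "AAACGTGGGGGGTTCATT"
read = "ACGTTTCA"
print(seed_hits(read, build_kmer_index(genome, 4), 4))  # [(0, 2), (4, 12)]
```

The two seeds are 4 bases apart in the read but 10 bases apart in the genome; an extension step would interpret that 6-base difference as a candidate gap flanked by splice sites.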
Spliced read aligner (TopHat) (11)
RNA-Seq reads are mapped against the whole reference genome
Initially unmapped reads (IUM reads) are set aside
An initial consensus of mapped regions is computed
Sequences flanking potential donor / acceptor splice sites within neighboring regions are joined to form potential splice junctions
the IUM reads are indexed and aligned to these splice junction sequences
The seed and extend alignment to match reads to possible splice sites (from TopHat?) (11)
For each possible splice site, a seed is formed by combining a small amount of sequence upstream of the donor and downstream of the acceptor
This seed is used to query the index of reads that were not initially mapped
Any read containing the seed is checked for a complete alignment to the exons on either side of the possible splice
In the light gray portion of the alignment, TopHat allows a user-specified number of mismatches
Because reads typically contain low-quality base calls on their 3’-ends, TopHat only examines the first 28 bp on the 5’-end of each read
Transcriptome assembly strategies (11)
Depends on whether a reference genome assembly is available
Three categories:
reference-based strategy
de novo strategy
combined strategy
Reference-based transcriptome assembly (11)
Steps:
Splice-align reads to the genome (Splice aware aligners: TopHat, Blat, SpliceMap,…)
Build a graph representing alternative splicing events
Traverse the graph to assemble variants (Graph construction and traversal: Cufflinks, Scripture)
Assembled isoforms
Advantages:
Large assembly problem reduced to a small problem, parallel computing
Sensitivity, low abundance transcripts
Contamination, sequencing artifacts not a problem as they do not align to the genome
Disadvantages:
Depends on the quality of the reference genome
very large introns difficult to handle
Multi-mapping reads
Trans-spliced transcripts
Reference genome not always available (but a similar genome can be used)
Applications
Easier to use for simple transcriptomes (bacteria, archaea, lower eukaryotes): few introns and little alternative splicing
Plant and mammalian transcriptomes difficult to assemble due to complex alternative splicing patterns
Splice-aware aligners (11)
Blat
TopHat
SpliceMap
MapSplice
GSNAP
Graph construction and traversal methods (11)
Cufflinks
Scripture
Transcriptome assembly if the reference genome is not available (11)
de novo transcriptome assembly
Requirements:
deep sequencing and / or longer reads
thorough quality control
large memory / Multiple processors
Tools:
Velvet / Oases
Trinity
Trans-ABySS
De novo transcriptome assembly (11)
Generate all substrings of length k from the reads
Generate the De Bruijn graph
Collapse the De Bruijn graph
Traverse the graph
Does not require a reference genome
Useful for missing regions of reference genomes
Does not depend on correct alignment of reads to splice sites
Long introns not a concern
Requires huge computing resources
Needs higher sequencing depth
Sensitive to sequencing errors, especially in low abundance transcripts
Highly similar transcripts (e.g. paralogs) may be merged
Easy for bacterial, archaeal and lower eukaryotic transcriptomes
Higher eukaryotic transcriptomes challenging
Billions of reads
Parallel De Bruijn graph implementations available
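The first steps of this card (k-mer generation, graph construction, traversal) can be sketched as follows. This is a simplified illustration, not the algorithm of Velvet, Trinity or Trans-ABySS: there is no error correction, no coverage tracking, and the walk only checks out-degree. The reads and function names are invented for the example.

```python
from collections import defaultdict


def kmers(reads, k):
    """Yield all substrings of length k from every read."""
    for read in reads:
        for i in range(len(read) - k + 1):
            yield read[i:i + k]


def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(set)
    for kmer in set(kmers(reads, k)):
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Follow edges from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop instead of looping forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCG", "GGCGTC", "CGTCA"]
graph = de_bruijn(reads, 4)
print(walk_unbranched(graph, "ATG"))  # ATGGCGTCA
```

Overlapping reads are merged into one contig without any pairwise alignment, which is why the approach needs no reference genome but is sensitive to sequencing errors: each error creates spurious k-mers and therefore spurious branches.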
Combined transcriptome assembly (11)
Quality metrics for assessing transcriptome assemblies (11)
Accuracy
Completeness
Contiguity
Chimerism
Variant resolution
Accuracy (11)
defined as the percentage of the correctly assembled bases estimated using the set of expressed reference transcripts (N).
If reference transcripts are not available, then the reference genome can be used as an alternative
Completeness (11)
defined as the percentage of expressed reference transcripts covered by all the assembled transcripts
Contiguity (11)
defined as the percentage of expressed reference transcripts covered by a single, longest-assembled transcript
Chimerism (11)
the percentage of chimaeras that occur owing to misassemblies among all of the assembled transcripts
a chimeric transcript is one that contains non-repetitive parts from two or more different reference genes
They can arise from biological sources (gene fusions or trans-splicing), experimental sources (intermolecular ligation) or informatics sources (misassemblies)
Misassembled chimeric transcripts can be distinguished from true chimaeras by determining whether the number of reads spanning the chimeric junction is significant when compared to the number of reads spanning other segments of the transcript
Variant resolution (11)
the percentage of transcript variants assembled
this can be calculated by the average of the percentage of assembled variants within the reference set
Estimating transcript expression levels (11)
read counts need to be properly normalized to extract meaningful expression estimates
two main sources of systematic variability that require normalization
RNA fragmentation during library construction causes longer transcripts to generate more reads compared to shorter transcripts present at the same abundance in the sample
the variability in the number of reads produced for each run causes fluctuations in the number of fragments mapped across samples
RPKM
FPKM
TPM
RPKM (11)
reads per kilobase of transcript per million mapped reads
normalizes a transcript’s read count by both its length and the total number of mapped reads in the sample
FPKM (11)
fragments per kilobase of transcript per million mapped reads
when data originate from paired-end sequencing
accounts for the dependency between paired-end reads in the RPKM estimate
metric of choice for both gene and isoform quantification
TPM (11)
transcripts per million
the only difference from RPKM is that you normalize for gene length first, and then normalize for sequencing depth second
the sum of all TPMs in each sample is the same: easier to compare the proportion of reads that mapped to a gene in each sample
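The difference in normalization order between RPKM and TPM can be made concrete with a small sketch. The function names and the toy counts are assumptions for illustration; FPKM follows the same formula as RPKM with fragments counted instead of reads.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total_reads) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to one million."""
    per_kb = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    scale = sum(per_kb.values()) / 1e6
    return {g: per_kb[g] / scale for g in counts}


# Hypothetical read counts and transcript lengths (in bp) for one sample.
counts = {"geneA": 10, "geneB": 25, "geneC": 5}
lengths = {"geneA": 500, "geneB": 2500, "geneC": 1200}

print(rpkm(counts, lengths))
print(tpm(counts, lengths))
print(round(sum(tpm(counts, lengths).values())))  # 1000000
```

Because TPM values always sum to one million within a sample, a gene's TPM can be read directly as its share of the length-normalized signal; RPKM values have no such fixed total.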
Variation discovery (11)
Types of genomic variation (11)
SNP discovery (11)
Generate sequence reads
Map reads to the reference sequence
Identify differences
Variant call
conclusion that there is a nucleotide difference vs the reference genome at a given position in an individual genome or transcriptome
estimates of variant frequency and confidence measures
Variant information for filtering (11)
Base Qualities: Low quality indicates sequencing error
Read Positions: Bias indicates mapping issues
Genomic Strand: Bias indicates mapping issues
Genomic Position: PCR dupes; self-chain, homopolymers
Mapping Info: aligner-dependent quality score / flags
Removing artifacts (11)
The Variant Call Format (VCF) (11)
Variants are stored in this format (its binary counterpart, BCF, needs a dedicated tool to view)
An important text format in bioinformatics for storing gene sequence variations
Header:
contains metadata describing the body of the file
header lines start with #
Special keywords in the header are marked with ## (e.g. fileformat, fileDate, reference…)
Body:
divided into 8 mandatory columns and an unlimited number of optional columns that can be used to record other information about the sample(s)
CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, SAMPLE1, SAMPLE2
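To illustrate the header/body layout, here is a minimal sketch that splits a VCF text into meta lines, column names and records. The file content is a made-up example (the reference name and the variant are hypothetical), and the parser ignores most of the specification; for real data a dedicated library is the safer choice.

```python
# Made-up VCF content; columns are tab-separated.
VCF_TEXT = """##fileformat=VCFv4.2
##reference=hypothetical_reference.fa
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1
chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT\t0/1
"""


def parse_vcf(text):
    """Return (meta lines, column names, records as dicts) from VCF text."""
    meta, columns, records = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):      # keyword lines of the header
            meta.append(line[2:])
        elif line.startswith("#"):     # the single column-name line
            columns = line[1:].split("\t")
        elif line:                     # body: one variant per line
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


meta, columns, records = parse_vcf(VCF_TEXT)
print(columns[:8])  # the 8 mandatory columns
print(records[0]["REF"], records[0]["ALT"], records[0]["SAMPLE1"])  # A G 0/1
```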
Framework for variation discovery and genotyping from next-generation DNA sequencing (11)
Phase 1: NGS data processing (Typically by lane)
Input: Raw reads -> Mapping -> Local realignment -> Duplicate marking -> Base quality recalibration -> Output: Analysis-ready reads
Phase 2: variant discovery and genotyping (Typically multiple samples simultaneously but can be single alone)
Input: Analysis-ready reads -> Sample 1…N reads -> SNPs / Indels / Structural variation (SV) -> Output: Raw variants
Phase 3: Integrative analysis (Typically multiple samples simultaneously but can be single alone)
Input: Raw indels / SNPs / SVs -> External data: Pedigrees / Known variation / Population structure / Known genotypes -> Variant quality recalibration <-> Genotype refinement -> Analysis-ready Variants
Single-cell sequencing-based technologies (11)
Single-cell genomics: uncover cell lineage relationships
Single-cell transcriptomics: marker-based cell types
Single-cell epigenomics and proteomics: functional states of individual cells
High-throughput, multi-dimensional analyses of individual cells
Detailed knowledge of the cell lineage trees
Bulk vs single-cell RNA seq (11)
Bulk RNA-seq:
comparative transcriptomics
disease biomarker
homogeneous systems
~20.000 mRNA transcripts
Observe 80-95% of transcripts depending on sequencing depth
single-cell (sc) RNA-seq:
define heterogeneity
identify rare cell population
cell population dynamics
200-10.000 transcripts per cell
Observe 10-50% of the transcriptome
Common applications of single-cell RNA sequencing (11)
Deconvolving heterogeneous cell populations
Trajectory analysis of cell state transitions
Dissecting transcription mechanics
Network inference
Single cell isolation methods (11)
limiting dilution method:
with Pipette and 96-well plate
isolates individual cells, leveraging the statistical distribution of diluted cells
Micromanipulation:
with Microscope and Capillary pipette
involves collecting single cells using microscope-guided capillary pipettes
Fluorescence-activated cell sorting (FACS):
with Laser, FACS, Multispectral detector + Electronics
isolates highly purified single cells by tagging cells with fluorescent marker proteins
Laser capture microdissection:
with LCM and Cell
utilizes a laser system aided by a computer system to isolate cells from solid samples
Microfluidic technology:
with Microfluidics, Microparticle and lysis buffer, Cells from suspension, Oil, making droplet with single cell
for single-cell isolation
requires nanoliter-sized volumes
Single-cell mRNA-seq library preparation with Drop-seq (11)
Cells from suspension
Microparticle and lysis buffer
Oil
Items 1-3 come together to form a droplet in oil with a cell and a microparticle inside
Cell lysis (in seconds)
RNA hybridization
Break droplets
Reverse transcription with template switching
PCR (single-cell transcriptomes attached to microparticles (STAMPs) as template)
Sequencing and analysis
Each mRNA is mapped to its cell-of-origin and gene-of-origin
Each cell’s pool of mRNA can be analyzed
Big data challenges (11)
Large volume (number of samples and number of transcripts per each sample)
Variety (types of tissues and cells)
Variability due to cellular heterogeneity and different cell-cycle stages
Veracity (missing data, noise, and dropout events)
Human Cell Atlas (11)
stores and provides single-cell data contributed by labs around the world
Anyone can contribute data, find data, or access community
Sequencing depth vs the number of cells (11)
fewer cells, profiled at higher depth per cell: not enough cells sampled of a given type to identify a cluster
more cells, profiled at lower depth per cell: cells of a given type may not share enough transcriptional similarity to be identified as belonging to the same cluster
Dimensionality reduction methods (11)
PCA
t-SNE
UMAP
Principal component analysis (PCA) (11)
finds orthogonal features of maximum variation
t-distributed stochastic neighbor embedding (t-SNE) (11)
learns a low-dimensional embedding in which the distribution of pairwise distances among cells forms a reasonably good information theoretic approximation of the distribution of pairwise distances in the original, high-dimensional space
Batch effects (11)
occur when non-biological factors in an experiment cause changes in the data produced by the experiment
can lead to inaccurate conclusions when their causes are correlated with one or more outcomes of interest in an experiment
Causes:
different lots of reagents
different instruments
personnel differences
Laboratory conditions
N50 (E)
The length of the shortest contig in the smallest set of longest contigs that together cover 50% of the genome.
L50 (E)
The smallest number of contigs whose summed length makes up half of the genome size.
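Both statistics come out of the same pass over the contigs sorted from longest to shortest. The sketch below is a minimal version; the function name and the optional `genome_size` argument are assumptions made here. When `genome_size` is omitted, the total assembly length is used as the 50% reference, which is the usual convention for N50 (using a known genome size instead is commonly called NG50).

```python
def n50_l50(contig_lengths, genome_size=None):
    """Return (N50, L50) for a list of contig lengths."""
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    ordered = sorted(contig_lengths, reverse=True)
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 2 >= total:  # reached at least half of the reference size
            return length, count
    raise ValueError("contigs cover less than half of the target size")


# Total length 290, half is 145: 80 + 70 = 150 crosses the threshold.
print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```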
3 main methods for phylogenetic prediction (E)
Maximum parsimony methods
Distance methods
Maximum likelihood methods
Silva DB (E)
a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data
Alignments in bioinformatics (E)
Local / Global
Pairwise / Multiple Sequence Alignment (MSA)
MSA (E)
more than two sequences are aligned with each other
MSAs provide more evolutionary information than pairwise alignments
MSAs are used to find patterns of protein families, build phylogenetic trees, annotate new sequences, predict structures
DNA motifs (E)
short, recurring patterns in DNA that are presumed to have a biological function
Shine-Dalgarno (SD) sequence (E)
a ribosomal binding site in bacterial and archaeal messenger RNA, generally located around 8 bases upstream of the start codon AUG
six-base consensus sequence is AGGAGG
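A simple way to look for candidate Shine-Dalgarno sites is an exact search for the consensus followed by a nearby AUG. This is only a sketch: real sites often deviate from the consensus, the `max_gap` value of 12 is an arbitrary choice made here, and the mRNA string is invented for the example.

```python
def shine_dalgarno_sites(mrna, motif="AGGAGG", max_gap=12):
    """Return (motif_start, aug_start, gap) for each motif followed by AUG within max_gap bases."""
    sites = []
    start = mrna.find(motif)
    while start != -1:
        motif_end = start + len(motif)
        # Search for AUG starting no more than max_gap bases after the motif.
        aug = mrna.find("AUG", motif_end, motif_end + max_gap + 3)
        if aug != -1:
            sites.append((start, aug, aug - motif_end))
        start = mrna.find(motif, start + 1)
    return sites


# Hypothetical 5' end of a bacterial mRNA (0-based positions in the output).
print(shine_dalgarno_sites("GGCAGGAGGUAACAUAUGGCU"))  # [(3, 15, 6)]
```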
MEME suite (E)
Multiple EM for Motif Elicitation
(EM = Expectation Maximization)
Motif Discovery: De novo discovery of motifs in set(s) of sequences
Motif Enrichment: Analyze sequence set(s) for enrichment of known motifs or motifs you provide
Motif Scanning: Find matches to motif(s) in sequences
Motif Comparison: Compare query motif(s) to known motifs.
Output: different statistics, e.g. E-value, Log Likelihood Ratio, Relative Entropy, and the Sequence Logo
GC-content calculation (E)
GC-content = (#G + #C) / length of gene
e.g. = 0.4 = 40%
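The formula translates directly into code. A minimal sketch; the function name and the example sequence are chosen here for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


print(gc_content("ATGCATGCAT"))  # 0.4, i.e. 40%
```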
AsPicDB: Alternative Splicing Prediction Data Base (E)
a database designed to provide access to reliable annotations of the alternative splicing pattern of human genes, obtained with the ASPic and PIntron algorithms, and to the functional annotation of predicted isoforms
a database tool for alternative splicing analysis
PolyPhen-2 (E)
for predicting damaging effects of missense mutations
a tool which predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations.
SIFT (E)
a sequence homology-based tool that sorts intolerant from tolerant amino acid substitutions and predicts whether an amino acid substitution in a protein will have a phenotypic effect. SIFT is based on the premise that protein evolution is correlated with protein function. Positions important for function should be conserved in an alignment of the protein family, whereas unimportant positions should appear diverse in an alignment.
Procedure of SIFT: Get related sequences (A PSI-BLAST search against a database is executed on the query sequence) - choose closely related sequences - obtain alignment - calculate probabilities
OMIM (E)
An online catalog of human genes and genetic disorders
BRCA2 (E)
breast cancer 2 gene / protein
DNA damage repair
Mutations in BRCA2 - risk for cancer
Inheritance from parents
What type of the RNA secondary structure is shown?
. ( ( ( ( . . . . . . . . ) ) ) )
a) UGCUAAGCUUUUUUAGC
b) AGGGGAAAAAAAACCCC
(E)
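In dot-bracket notation, matching parentheses are paired bases and dots are unpaired bases, so the string above describes a hairpin (stem-loop): a 4-base-pair stem enclosing an 8-base loop. The sketch below extracts the pairs and checks whether a sequence can form them, assuming only Watson-Crick pairs (G-U wobble pairs are deliberately left out); the function names are made up for illustration.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def base_pairs(structure):
    """Return sorted 0-based (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def can_fold(structure, seq):
    """True if every bracketed pair in structure is a Watson-Crick pair in seq."""
    if len(structure) != len(seq):
        return False
    return all((seq[i], seq[j]) in WATSON_CRICK for i, j in base_pairs(structure))


structure = ".((((........))))"
print(base_pairs(structure))  # [(1, 16), (2, 15), (3, 14), (4, 13)]
for candidate in ("UGCUAAGCUUUUUUAGC", "AGGGGAAAAAAAACCCC"):
    print(candidate, can_fold(structure, candidate))
```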
FASTQ file (E)
normally uses 4 lines per sequence:
Line 1 begins with a @ character and is followed by a sequence identifier and an optional description (like a FASTA title line)
Line 2 is the raw sequence letters
Line 3 begins with a + character and is optionally followed by the same sequence identifier (and any description)
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence
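The four-line layout can be read with a short parser. This sketch assumes strictly four lines per record (no wrapped sequences) and Phred+33 quality encoding, which is common but not guaranteed for every file; the read in the example is made up.

```python
def parse_fastq(text):
    """Parse FASTQ text with exactly 4 lines per record into a list of dicts."""
    lines = text.strip().splitlines()
    if len(lines) % 4:
        raise ValueError("expected a multiple of 4 lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record starting at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append({
            "id": header[1:].split()[0],
            "seq": seq,
            # Assumption: Phred+33 encoding, so '!' is quality 0 and 'I' is quality 40.
            "phred": [ord(c) - 33 for c in qual],
        })
    return records


example = "@read1 example description\nACGT\n+\nII5!\n"
print(parse_fastq(example))
# [{'id': 'read1', 'seq': 'ACGT', 'phred': [40, 40, 20, 0]}]
```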
Number of runs (E)
Number of runs = Number of bases that we need to cover / Number of bases in one run
Coverage (E)
Coverage = Number of bases sequenced in the run / Genome size
Number of reads (E)
Number of reads = Number of bases sequenced / Number of bases in read
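The three formulas chain together when planning an experiment. A worked sketch with assumed numbers: a 3.2 Gb genome, a target of 30x coverage, 150-base reads and a hypothetical instrument yielding 40 Gb per run. Read and run counts are rounded up because partial reads or runs do not exist.

```python
import math


def coverage(bases_sequenced, genome_size):
    """Average number of times each genome position is sequenced."""
    return bases_sequenced / genome_size


def number_of_reads(bases_sequenced, read_length):
    """Reads needed to produce the given number of bases."""
    return math.ceil(bases_sequenced / read_length)


def number_of_runs(bases_needed, bases_per_run):
    """Instrument runs needed to produce the given number of bases."""
    return math.ceil(bases_needed / bases_per_run)


genome_size = 3_200_000_000          # human nuclear genome
bases_needed = 30 * genome_size      # target: 30x coverage
print(coverage(bases_needed, genome_size))             # 30.0
print(number_of_reads(bases_needed, 150))              # 640000000
print(number_of_runs(bases_needed, 40_000_000_000))    # 3
```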