How does prokaryotic gene density differ from eukaryotic?
Prokaryotes: 1 gene per 1,000-1,400 bases (~90% coding)
Eukaryotes: 1 gene per 100,000 bases (~1-2% coding)
→ Gene density: prokaryotes > eukaryotes
How does an EM-algorithm create a sequence logo?
Expectation:
estimate the probability of finding the site at any position of the sequences
Maximization: update expected base distributions
Repeat until convergence
e.g., MEME does this
What does the height of a sequence logo represent?
measure of conservation of the base at the position
information content/entropy in bits
Additional answers:
can be corrected for the background frequencies of the bases
data may include pseudocounts to compensate for small or missing samples
the maximum value for DNA bases is 2 bits (log2(4))
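The letter-stack height can be illustrated with a short sketch. It assumes a uniform background and applies neither pseudocounts nor a small-sample correction; the function name and the example sites are made up for illustration.

```python
import math
from collections import Counter

def column_information(column, alphabet="ACGT"):
    """Information content in bits of one alignment column: log2(|alphabet|) - entropy."""
    counts = Counter(column)
    total = sum(counts[base] for base in alphabet)
    if total == 0:
        raise ValueError("column contains no symbols from the alphabet")
    entropy = 0.0
    for base in alphabet:
        p = counts[base] / total
        if p > 0:
            entropy -= p * math.log2(p)
    return math.log2(len(alphabet)) - entropy

# Hypothetical aligned binding sites (one string per site)
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
for position, column in enumerate(zip(*sites), start=1):
    print(position, round(column_information(column), 2))
```

A fully conserved column reaches the 2-bit maximum, while a column with all four bases at equal frequency has height 0.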
Why is it essential to search for pseudogenes also?
pseudogenes: Nonfunctional sequences of genomic DNA that are originally derived from functional genes, but exhibit such degenerative features as premature stop codons and frameshift mutations that prevent their expression
might interfere with experiments
PCR and hybridization experiments
transcribed pseudogenes
interference with disease diagnostics and treatment
molecular record of dynamics and evolution of genomes
rate of nucleotide substitutions
rate of DNA loss
improvement of gene prediction and annotation efforts
What do “multiplicity” and “co-operativity” mean in the context of miRNA target interactions?
multiplicity: one miRNA can target more than one gene
co-operativity: one gene can be controlled by more than one miRNA
How does the positive prediction value change if the target strongly resembles the informant?
Gene prediction for D. melanogaster:
too diverged → number of mismatches is low because most of the sequence cannot be aligned
too close → number of mismatches is low because the sequence is unchanged
for D. melanogaster, the best accuracy is obtained using D. ananassae as informant (~1 substitution per synonymous site)
for human, mouse would be a good informant (~0.6 substitutions per synonymous site)
How can it happen that an alternative exon is added to a transcript that leads to a shortened protein product?
exon has alternative stop codon
alternative exon causes a frameshift → a previously out-of-frame stop codon located nearer to the start comes into frame
Name a possible origin of operons.
Role of horizontal gene transfer: advantage of transferring complete sets of genes and thereby conferring a defined phenotype on the recipient
possibly originating from thermophilic bacteria
How does increasing the window size affect the positive predictive value of an ORF? / How does increasing the frameshift of the ORF length affect the accuracy of the prediction?
Increasing the Window Size
Increased Sensitivity: More ORFs are detected.
Increased False Positives: More random sequences are detected as ORFs.
Balance between Sensitivity and Specificity: Can affect the positive predictive value (PPV) both positively and negatively.
Increasing the Frameshift (ORF Length)
Increased Error Rate: Frameshift errors lead to incorrect ORF identification.
Shortened or Lengthened ORFs: ORFs are incorrectly classified as false positives or negatives.
Decreased Prediction Accuracy: The accuracy of ORF prediction decreases due to increased frameshift errors.
Summary
Analysis Window: A larger window increases sensitivity and false positives.
Frameshift: Increased frameshift errors decrease prediction accuracy.
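The trade-off between sensitivity and PPV can be made concrete with a small calculation. This is only a sketch: the two parameter settings and all counts are invented for illustration and do not come from a real ORF finder.

```python
def sensitivity(tp, fn):
    """Fraction of real genes that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def ppv(tp, fp):
    """Fraction of predictions that are real genes: TP / (TP + FP)."""
    return tp / (tp + fp)

# Hypothetical outcomes for two parameter settings of an ORF finder
strict = {"tp": 80, "fp": 10, "fn": 20}
lenient = {"tp": 95, "fp": 60, "fn": 5}

for name, c in (("strict", strict), ("lenient", lenient)):
    print(name,
          "sensitivity =", round(sensitivity(c["tp"], c["fn"]), 2),
          "PPV =", round(ppv(c["tp"], c["fp"]), 2))
```

In this made-up example the lenient setting finds more true ORFs, but its PPV is lower because false positives grow faster than true positives.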
What are pseudogenes? What are the two main classes distinguished?
Pseudogene:
non-functional genes
derived from functional genes through mutations
degenerative properties —> inhibit expression
Classes:
conventional
processed
Explain the Ka/Ks ratio. What does the value say about conservation, what conclusions can be made about the selection pressure?
Ka: number of non-synonymous substitutions per non-synonymous site
Ks: number of synonymous substitutions per synonymous site
Higher Ka/Ks —> lower conservation
Ka/Ks = 1 ⇒ no selection pressure
Ka/Ks > 1 ⇒ positive selection pressure (positive selection)
Ka/Ks < 1 ⇒ negative selection pressure (purifying selection)
—> For pseudogenes, Ka/Ks = 1 is expected.
—> Experimental < 1: underestimated Ka/Ks as genes were compared with present-day genes, not the ancestral functional gene that gave rise to the processed pseudogene.
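The interpretation rules above can be written as a small helper. This is a sketch only: Ka and Ks are assumed to be already estimated, and the `tolerance` around 1 is an arbitrary illustrative choice, not a statistical test.

```python
def interpret_ka_ks(ka, ks, tolerance=0.1):
    """Return (ratio, label) for given Ka and Ks estimates.

    tolerance is a hypothetical cut-off for calling the ratio "about 1".
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if abs(ratio - 1.0) <= tolerance:
        label = "no selection pressure (neutral)"
    elif ratio > 1.0:
        label = "positive selection"
    else:
        label = "purifying (negative) selection"
    return ratio, label

print(interpret_ka_ks(0.02, 0.20))   # strongly conserved gene
print(interpret_ka_ks(0.19, 0.20))   # pseudogene-like, close to 1
```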
What are the three strategies for gene prediction? Give an example for each.
Content-based: Example (ORFs, codon usage, repeat periodicity, compositional complexity)
Site-based: Example (splice sites, TF binding sites, consensus sequences, polyadenylation signals, start/stop codons)
Comparative: Example (inference based on homology, protein sequence similarity, modular structure of proteins usually precludes finding complete gene)
Assign tasks to given tools.
Motif finding
MEME (Multiple EM for Motif Elucidation)
Gibbs sampler (GLAM)
FootPrinter (Phylogenetic Footprinting approach)
Representation of motifs
WebLogo3 → Sequence logo
PROSITE Database → PROSITE pattern
Gene prediction (prok.)
GeneMark/GeneMark.hmm
GLIMMER
EcoParse
TESTCODE
FgenesB
ORPHEUS
Gene prediction (eukar.)
only target:
Genscan
Augustus
GeneMarkS (eukaryotic)
+ extrinsic information:
GenomeScan (Blast hits)
single informant:
Twinscan (splice sites & start/stop from mouse-human)
SGP2 (conserved protein coding from mouse-human)
multiple informant:
N-SCAN (mouse-rat-chicken-human; no improvement over mouse-human as too little divergence)
CONTRAST (mouse-opossum-human)
Pseudogene identification
PseudoPipe
SNPs
Tolerance predictors:
SIFT (sort intolerant from tolerant substitutions)
PolyPhen
PANTHER PSEC
TopoSNP
Protein stability predictors:
FoldX, Rosetta
Splicing predictors:
ESEfinder, Human Splicing Finder
Cancer variant predictors:
FATHMM, CanDrA
RNA structure Prediction
(Base pair maximization)
MFOLD (energy minimization)
ViennaRNA (energy minimization)
RNAfold
miRNA gene prediction
miRscan
Prediction of miRNA targets
TargetScan
miRanda
Detection of repeats
Repeat finding:
REPuter
Clustering:
RepeatFinder
Repeat masking:
RepeatMasker
Read aligner
MAQ (spaced seed indexing)
Bowtie (Burrows-Wheeler transformation)
splice-aware aligners:
TopHat
Blat
SpliceMap
MapSplice
GSNAP
Graph construction and traversal
Cufflinks
Scripture
de novo transcriptome assembly
Velvet/Oases
Trinity
Trans-ABySS
What are the properties of a strong promoter?
DNA sequence that facilitates a high rate of transcription
efficiently binds to the RNA polymerase and promotes robust transcription initiation
strong promoter has a high affinity for the RNA polymerase, allowing efficient binding and initiation of transcription
presence of specific sequence motifs within the promoter region
Please state three differences between Whole Genome Shotgun and Clone-by-Clone sequencing.
physical mapping:
Clone-by-Clone: requires construction of a clone-based physical map; individual clones are then subcloned
Whole Genome Shotgun: mapping phase is skipped and the subclone library is constructed from the entire genome
assembly:
Clone-by-Clone: complex genomic regions are easier to resolve because the position of the contigs is already known (from the physical map)
Whole Genome Shotgun: order/position of contigs must be inferred from overlapping reads and read pairs, which can be problematic for tandemly repeated DNA (incorrect overlaps)
labor intensity:
Clone-by-Clone: physical mapping is labor-intensive, but after mapping the clones can be divided between different labs for sequencing (relevant because sequencing was slower at the start of the century)
Whole Genome Shotgun: less labor-intensive, but requires more computational resources
Which sequencing method out of Whole Genome and Clone-by-Clone would you use for prokaryotic and eukaryotic genomes?
Historically, clone-by-clone was used more commonly for eukaryotic genomes, as it helps to overcome the challenges posed by their highly repetitive and complex regions
WGS is particularly suitable for organisms with smaller genomes and less complex genomic structures
Approaches can be combined in a hybrid shotgun-sequencing approach
State four types of alternative splicing events.
How can you detect alternative splicing?
—> AS can be verified by analyzing RNA isoforms
using RT-PCR with primers that flank the alternatively spliced region → different lengths of PCR product
using microarrays (high-throughput approach) with exon-exon junction probes
Please describe the procedure / two effects of alternative splicing. What are the consequences if the protein product becomes larger as a result?
Process of Splicing (two steps):
5 critical bases: 5’ donor SS (GU), branch point (A), 3’ acceptor SS (AG)
first step:
cleavage at the 5’ SS
joining of the 5′ end of the intron to an A within the intron (the branch point)
—> lariat-like (lasso) intermediate —> intron forms a loop
second step:
cleavage at the 3′ splice site and ligation of the exons
—> result: excision of the intron as a lariat-like structure
Effects of AS:
multiple isoforms of a gene —> Protein diversity
Tissue-specific regulation of gene expression
Larger product:
Change protein stability
Change enzymatic/ signaling activity
Describe a method to analyze alternative splicing bioinformatically. Specifically, explain the required input data.
Alignment of ESTs (expressed sequence tags) against DNA sequence
Insertions and deletions in the ESTs relative to the mRNA are identified as potential alternative splices
Alternative splices are detected when two splices are mutually exclusive
Requires ESTs, which are cDNA sequences derived from mRNA with reverse transcriptase
Why do genes gather in operons/ what benefits do operons have?
Definition operon:
Multigene bacterial operons have one promoter and one transcriptional stop. The transcript holds more than one gene with multiple translational starts and stops.
Reason:
genes are regulated together → faster adaptation to environmental changes
efficient transcription and translation: genes are controlled by a single promoter region and are transcribed together; the RNA polymerase can process multiple genes in a single pass
Please explain four different methods for gene prediction in prokaryotes.
EcoParse: HMMs for gene prediction with different models for the intergenic region depending on operon or non-operon genes: “long intergenic region” and “short”. (p. 52)
might show different distribution of base frequencies as regulatory elements are missing for genes in an operon (e.g. no RBS in (-20)...(-1) region of start codon of the second gene)
ORPHEUS: Tool based on intrinsic and extrinsic information
DPS match → use as seed ORF and refine start and stop of ORF → derive codon usage → derive RBS weight matrix → full set of predicted genes
detects genes and RBS → can derive: operon or not
GeneMark:
Fifth-order Markov model
uses intrinsic information about frequency of hexamers in each of the frames and background
GeneMark.hmm:
HMM with states for start codons, typical/atypical (e.g. horizontal gene transfer/Class III gene) gene and stop codon for +/- strand
GLIMMER:
interpolated Markov models
detects patterns present in known gene sequences
TESTCODE:
in coding regions, every third base tends to be the same much more often than expected at random (amino acid composition bias + codon bias)
Which two classes of information are being used in gene prediction? Also, state two sub-classes for each.
intrinsic information
exon/intron length distribution
promoter and polyA signals
conserved splice signals
hexamer composition of exons/introns
reading frame consistency of exons
isochore differences
extrinsic information
EST (expressed sequence tag)
cDNA
protein-genome alignments
What is the Kozak sequence?
Sequence motif marking the translation initiation site in most eukaryotic mRNA transcripts
(region around the start codon)
5'-(gcc)gccRccAUGG-3'
(eukaryotic equivalent to Shine-Dalgarno)
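As an illustration, the consensus above can be turned into a regular expression. This is a deliberately strict sketch: real initiation sites deviate from the consensus, and the function name and test sequence are made up.

```python
import re

# (gcc)gccRccAUGG: optional GCC, then GCC, R = purine (A or G), CC, AUG, G
KOZAK = re.compile(r"(?:GCC)?GCC[AG]CCAUGG")

def find_kozak_starts(mrna):
    """Return 0-based positions of the AUG in every exact consensus match."""
    rna = mrna.upper().replace("T", "U")   # accept DNA-style input as well
    return [m.start() + m.group().index("AUG") for m in KOZAK.finditer(rna)]

print(find_kozak_starts("AAGCCACCAUGGCU"))
```

A practical scanner would score sites with a weight matrix instead of requiring an exact match.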
Sketch the architecture of GenScan.
Architecture:
Generalized HMM (GHMM)
models both strands at the same time; from intergenic state model can enter states for + strand genes or - strand genes
states:
N: intergenic region
P: promoter (sensor for TATA)
F: five-prime UTR
then either a single-exon gene or the model for a multiple-exon gene
single-exon genes are modeled by a single state (Esngl)
multiple exons:
state for initial exon models region from translational start to donor splice site Einit
3 states for different phases of introns (Ik for k: 0: between codons, 1: after first base, 2: after second base)
3 states for exons between introns also for keeping the phase information Ek
terminal exon Eterm
T: three-prime UTR
A: poly-A signal (sensor for the AATAAA consensus; the cap-site sensor belongs to the promoter state P)
reverts to N
Explain GeneMarkS.
parallel unsupervised training and prediction
based on GeneMark.hmm architecture:
non-homogeneous HMM -> coding regions
homogeneous HMM -> non-coding regions
coding capacity of sliding windows -> Bayesian decision rule
Explain each step in the given formula and sketch them.
How could the formula be improved?
Base pair maximization: Recursive definition of the best score for a subsequence i,j → four possibilities:
1: i,j are a base pair, added on to a structure for i+1…j-1, add +1
2: i is unpaired, added on to a structure for i+1…j
3: j is unpaired, added on to a structure for i…j-1
4: i,j are paired, but not to each other: the structure for i..j adds together substructures for two sub-sequences, i..k and k+1..j (bifurcation)
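The four cases translate directly into a dynamic-programming sketch (Nussinov-style base pair maximization). The pairing rules including G-U wobble pairs and the `min_loop` parameter are assumptions added for illustration; with `min_loop=0` the code follows the plain recursion above.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(seq, min_loop=0):
    """Maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by a pair
    (hypothetical extra constraint; 0 reproduces the plain recursion).
    """
    n = len(seq)
    if n == 0:
        return 0
    S = [[0] * n for _ in range(n)]           # S[i][j] = best score for i..j
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])             # cases 2 and 3
            if (seq[i], seq[j]) in PAIRS and j - i - 1 >= min_loop:
                inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)                  # case 1
            for k in range(i, j):                            # case 4: bifurcation
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]

print(max_base_pairs("GGGAAACCC"))
```

The triple loop gives O(n^3) time and O(n^2) memory; a traceback through S would recover the structure itself.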
Improvements:
It is more plausible that an RNA adopts a globally minimum energy structure, not the structure with the maximum number of base pairs → predict overall free energy
Additionally use thermodynamic information
negative stacking energy for matches
positive destabilizing energies for loops (size-dependent)
What are covariance models, and why are they used? Sketch a structure that cannot be predicted by such methods.
Statistical model that captures the patterns of covariation observed in an MSA. Covarying bases tend to coevolve, which maintains the base pair and conserves the RNA structure. RNA structure prediction can be improved by giving positions with greater covariation more weight.
Describes both the secondary structure and the primary sequence consensus of an RNA
Can be applied to several RNA analysis problems:
consensus secondary structure prediction
multiple sequence alignment
database similarity searching
Iterative training procedure
Optimal algorithm for RNA secondary structure prediction based on pairwise covariations in multiple alignments
Covariation ensures ability to base pair is maintained and RNA structure is conserved
Can’t predict: Pseudoknots
—> violate recursive definition of the optimal score S(i,j)
State the classes of Interspersed Repeats.
Interspersed repeats:
Retroelements:
LINEs (Long Interspersed Nuclear Elements) [autonomous]
SINEs (Short Interspersed Nuclear Elements) [nonautonomous]
LTRs (Long Terminal Repeat Retrotransposons)
DNA-Transposons
Name two features of interspersed repeats.
Involve RNA intermediates (Retroelements) or DNA intermediates (DNA transposons)
Mobility:
conservative transposition
replicative transposition
retrotransposition
Derived from biologically active ‘transposable elements’ (TEs)
What other three repetitive classes of sequences are there? How are they different from interspersed repeats?
Tandemly repeated DNA (Simple sequence repeats without interruption)
Microsatellites
one to a dozen base pairs
may be formed by replication slippage
Minisatellites
a dozen to 500 base pairs
Cryptically simple repeats
Low complexity repeats
Satellite and telomeric repeats
Segmental duplications
nearly identical copies ranging in size from 1 to >200 kb
originate from duplicative transpositions
Pseudogenes
derived from functional genes but with deleterious mutation
What are SNPs?
Single nucleotide polymorphisms (SNPs)
occurs when a single nucleotide is replaced by one of the other three nucleotides. SNPs found in a coding sequence are of great interest as they are more likely to alter the function of a protein.
most common type of genetic variation in humans.
account for 90% of the variation between individuals.
Which two types of SNPs are there and what are the differences?
Synonymous:
not causing a change in the amino acid
Non-synonymous:
A nonsynonymous or missense variant is a single base change in a coding region that causes an amino acid change in the corresponding protein
Explain the difference between transition and transversion in base changes.
transition: changes a purine to another purine (A ↔ G), or a pyrimidine to another pyrimidine (C ↔ T)
transversion: change from purine (A/G) to pyrimidine (T/C) or vice versa.
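The two definitions can be checked mechanically. A minimal sketch for DNA bases, with a made-up function name:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Classify a single-base change as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different DNA bases")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

print(classify_substitution("A", "G"))
print(classify_substitution("A", "C"))
```

Each base has one possible transition and two possible transversions.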
How can SNPs be linked to disease?
SNPs may be informative with respect to disease:
Functional variation. A SNP associated with a nonsynonymous substitution in a coding region will change the amino acid sequence of a protein.
Regulatory variation. A SNP in a noncoding region can influence gene expression.
Association. SNPs can be used in whole-genome association studies. SNP frequency is compared between affected and control populations.
Explain the differences between miRNA in animals and plants.
Number of miRNA genes present:
Plants: 100-200 genes
Animals: 100-500
Location within genome:
Plants: predominantly intergenic regions
Animals: intergenic regions, introns
Presence of miRNA clusters:
Plants: uncommon
Animals: common
miRNA biosynthesis:
Plants: Dicer-like
Animals: Drosha, Dicer
Mechanism of repression:
Plants: mRNA-cleavage (methylation?)
Animals: Translational repression
Location of miRNA-binding motifs:
Plants: predominantly in the ORF
Animals: predominantly in the 3’-UTR
Number of miRNA-binding sites within target sites:
Plants: Generally one
Animals: Generally multiple
Function of known target genes:
Plants: Regulatory genes - crucial for development, enzymes
Animals: Regulatory genes - crucial for development, structural proteins, enzymes
Describe the targetScan algorithm.
thermodynamics-based RNA:RNA duplex interactions
comparative sequence analysis
Input:
miRNA that is conserved in multiple species
set of 3’UTR sequences from these species
Method:
check the miRNA seed region (bases 2-8) for perfect complementarity to the 3’UTR
extend to G:U pairs but no mismatches
assign folding free energy G to miRNA:target
assign Z score to each UTR
sort UTRs by Z score -> assign rank
compare organisms -> conserved miRNAs
What kind of data do you need for targetScan? Do these data types have disadvantages?
miRNA that is conserved in multiple organisms
a set of orthologous 3‘ UTR sequences from these organisms
Disadvantages:
Incompleteness of orthologous gene annotations
Some targets may not meet the stringent seed matching, Z score, or rank criteria
Some target sites may lie outside the 3‘ UTR (plants)
Some targets may not be conserved in the complete set of organisms
⇒ The actual number of target genes regulated by each miRNA is likely to be substantially higher
Name two pros and cons for microarray and RNA-seq.
Hybridization (Microarrays):
Pro:
Relatively low cost
Well established in clinical use
Con:
Analysis only of pre-defined sequences
Dynamic range limited by scanner
high background-noise
cross-hybridization possible
Sequence-based (RNA-seq):
Pro:
identification of alternative splice variants/novel transcripts
high sensitivity
Con:
relatively high cost
high computational effort
prone to contamination
Give an overview of the experimental steps in an RNA sequencing (RNA-seq) protocol.
RNA extraction → target enrichment → cDNA → library prep → sequencing → Transcriptome/genome mapping → data analysis
Experimental design: number of replicates, depth of sequencing
Parameters: alignment rate, desired power, significance level, log-fold change
RNA-seq workflow
Quality control
Alignment of reads to reference genome
Transcriptome assembly
Differential expression
State three differences between pro- and eukaryotic genomes.
Feature | Prokaryotes | Eukaryotes
Size | Between 1s and 10s of Mb | Between 1s and 1,000s of Mb
Topology | Mostly circular | Mostly linear
Gene number | Most < 10,000 | Often > 10,000
 | Few | Many
Complexity | Low | High
Horizontal gene transfer | Frequent | Rare
Intergenic regions | Short (<100 kb) | Long (often >100 kb)
Genome duplication | None | Frequent (especially in plants)
Gene duplication | |
Repeated sequences | Minor components | Major components
Explain the FASTQ format.
Simple extension of FASTA —> store quality of bases
Line 1: @ followed by the read ID
Line 2: sequence
Line 3: + optionally followed by the same ID
Line 4: quality scores → PHRED scores (encoded as ASCII characters, 0-93)
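A four-line record can be decoded with a few lines of code. The sketch assumes the common Phred+33 encoding (ASCII 33-126 mapping to scores 0-93); the function name and the example read are invented.

```python
def parse_fastq_record(lines, offset=33):
    """Parse one 4-line FASTQ record into (read_id, sequence, phred_scores).

    offset=33 assumes Phred+33 encoding; older files may use another offset.
    """
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    scores = [ord(char) - offset for char in quality]
    return header[1:], sequence, scores

record = ["@read1", "GATTACA", "+", "IIIIII!"]
read_id, sequence, scores = parse_fastq_record(record)
print(read_id, sequence, scores)
# Error probability of a base call with Phred score Q is 10 ** (-Q / 10)
print([10 ** (-q / 10) for q in scores])
```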
Below is the output of the GeneMark.hmm program. Please explain what the column “Strand” means.
The Strand column indicates which DNA strand the gene lies on: forward (+) or reverse (-)
What is the normalization for the sequence length in NGS, and what is it used for?
Normalization for sequence length:
variations in read length across different samples or experiments
—> adjusting/standardizing length of reads or fragments
Purpose:
Comparability Across Samples
Data Quality Control
Accurate Quantification
Alignment Efficiency
Bias Reduction
Methods:
Trimming —> removing bases from the end, e.g., Trimmomatic
Subsampling -> select reads that are within common distribution
Length Filtering
Statistical Normalization -> TPM, etc.
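For the statistical normalization mentioned last, TPM (transcripts per million) is a common choice: read counts are first divided by the feature length and then scaled so that each sample sums to one million. A minimal sketch with hypothetical gene names, counts and lengths:

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from raw read counts and feature lengths in bp."""
    # Step 1: reads per kilobase removes the bias towards long transcripts
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    scale = sum(rates.values())
    if scale == 0:
        raise ValueError("no reads assigned to any gene")
    # Step 2: rescale so that the values of one sample sum to 1,000,000
    return {gene: rate / scale * 1_000_000 for gene, rate in rates.items()}

# Hypothetical example: equal read counts, but geneB is twice as long
counts = {"geneA": 100, "geneB": 100}
lengths = {"geneA": 1000, "geneB": 2000}
print(tpm(counts, lengths))
```

With equal counts, the shorter gene receives the higher TPM value, because the same number of reads is spread over fewer bases.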
What are N50 and L50 measures?
N50: length of the shortest contig/scaffold among the longest contigs/scaffolds that together cover at least 50% of the assembly. Higher N50 values → more contiguous (better) assembly
L50: smallest number of contigs/scaffolds needed to cover 50% of the assembly. Lower L50 values → more contiguous (better) assembly
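Both measures come out of one pass over the contig lengths sorted from longest to shortest. A minimal sketch with made-up contig lengths:

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # length is the N50, count is the number of contigs used (L50)
            return length, count
    raise ValueError("no contigs given")

# Hypothetical assembly: total length 300, half is 150
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))
```

Here the two longest contigs (80 + 70) already reach half of the assembly, so N50 is 70 and L50 is 2.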
Explain one approach to RNA secondary structure prediction.
predict the most stable secondary structure —> principle of free energy minimization
→ Zuker algorithm (dynamic programming)
Initialize matrix to store energy values
Specify the allowable base pair rules (A-U, G-C, G-U)
Fill the Matrix
Energy of different types of loops (hairpin, bulge, interior, and multibranch loops)
energy for stacking adjacent base pairs
Traceback lowest energy —> secondary structure
Explain the similarity-based approach to gene prediction.
Genes in different organisms are similar
uses known genes in one genome to predict (unknown) genes in another genome
-> known gene and a genome sequence -> find set of substrings of the genomic sequence whose concatenation best fits the gene
Name and explain two types of Alternative Splicing.
constitutive: > 1 product is always made from transcribed gene
regulated: different forms under different conditions/times/cell types
Explain the k-mer approach to finding the repeats in genomes.
split genome into k-mers of length k
count occurrences of each k-mer
filter for heavily repeated k-mers
Pros:
easy to implement/understand
scalable to large genomes
Cons:
k must be chosen manually → choosing the right k is critical
high memory usage (large genome, large k: many distinct k-mers to store)
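The three steps can be sketched in a few lines. The function name and the `min_count` threshold are illustrative assumptions; only the forward strand is counted, and no memory-saving tricks are applied.

```python
from collections import Counter

def repeated_kmers(genome, k, min_count=2):
    """Count all overlapping k-mers and keep those seen at least min_count times."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

print(repeated_kmers("ACGTACGTTT", 4))
```

For real genomes, dedicated k-mer counters are used because a plain dictionary of all k-mers quickly exhausts memory.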
What is repeat masking and why is it needed?
identify and mask repetitive genomic DNA
lower case = masked, upper case = not
Advantages:
significantly reduces size of genome —> faster/more efficient analysis
improve alignment: repeats can cause mismatches
improve gene prediction:
repeats can confuse algorithms -> false positives/missing real genes
focus on unique sequences —> better accuracy
Reducing False Positives in Functional Genomics
e.g. motif analysis
Tools: RepeatMasker
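Soft masking itself is simple once the repeat coordinates are known. The sketch below assumes the repeat intervals are already given (0-based, end-exclusive); how they were found is outside its scope, and the function name is made up.

```python
def soft_mask(sequence, intervals):
    """Lower-case the given repeat intervals and keep the rest upper case."""
    chars = list(sequence.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)

# Hypothetical repeat annotated at positions 2..5
print(soft_mask("ACGTACGTAC", [(2, 6)]))
```

Hard masking would replace the same positions with N instead of lower-case letters.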
What are the challenges in identifying motifs in biological sequences?
Sequence length —> long genomes require significant computing time
Motif Diversity —> shorter motifs = harder to detect against background
Noise and Background —> true motifs vs random sequence
Biological Context —> functional relevance of identified motifs
Explain the main steps of the genome assembly process.
What are two assumptions that can be made when predicting the role of amino acids substitution?
Conservation of Functional Sites: Highly conserved regions are critical for function, and substitutions here are likely impactful
Physicochemical Properties: Substitutions that significantly alter size, charge, hydrophobicity, or polarity are likely to affect protein function or stability.
Explain the TargetScan algorithm for predicting mammalian MicroRNA Targets (you can skip the formulas).
miRNA sequence (conserved in multiple species)
3’ UTR sequences from these organisms
search the UTRs in the first organism for segments of perfect Watson-Crick complementarity to bases 2–8 of the miRNA: “miRNA seed” and “seed matches”
extend each seed match with additional base pairs to the miRNA as far as possible in each direction, allowing G:U pairs, but stopping at mismatches
optimize base pairing of the remaining 3‘ portion of the miRNA to the 35 bases of the UTR immediately 5‘ of each seed match using the RNAfold program
assign a folding free energy G to each such miRNA:target site interaction
assign a Z score to each UTR
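The seed-matching step can be sketched as follows. This covers only the search for perfect Watson-Crick matches to miRNA bases 2-8; the extension, the RNAfold energy calculation and the Z score are not shown, and the sequences and function name are illustrative.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed_match_positions(mirna, utr):
    """Return 0-based UTR positions of perfect matches to miRNA bases 2-8.

    Both sequences are RNA written 5' to 3'. The target site is the
    reverse complement of the 7-nt seed.
    """
    seed = mirna[1:8]                          # bases 2-8 (1-based)
    site = seed.translate(COMPLEMENT)[::-1]    # reverse complement
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)      # allow overlapping matches
    return positions

mirna = "UGAGGUAGUAGGUUGUAUAGUU"   # illustrative miRNA sequence
utr = "AAACUACCUCAAAA"             # made-up 3' UTR fragment
print(seed_match_positions(mirna, utr))
```

In the full procedure the same search is repeated in the orthologous UTRs of the other species, and only conserved hits are kept.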
What is pan-genome?
entire set of genes from all strains within a clade
more generally -> union of genomes
Last modified 4 months ago