Past exam questions

Buffl

Genome Analysis

by Mista F.

How does prokaryotic gene density differ from eukaryotic?

Prokaryotes: 1 gene per 1,000-1,400 bases (~90% coding)

Eukaryotes: 1 gene per 100,000 bases (~1-2% coding)

→ Gene density: prokaryotes > eukaryotes

How does an EM-algorithm create a sequence logo?

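Expectation: using the current motif model, estimate the probability of finding the site at each position of each sequence
Maximization: update the expected base distributions of the motif positions from these probabilities
Repeat until convergence; the final base distributions are displayed as the sequence logo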

e.g., MEME does this

What does the height of a sequence logo represent?

measure of conservation of the base at the position
information content in bits (maximum entropy minus the observed entropy at the position)

Additional answers:

can be corrected for the background frequencies of the bases
data might include pseudocounts to overcome effects of missing data
the maximum value for DNA bases is 2 bits (log2(4))
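The height calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming uniform background frequencies and no small-sample correction; the function names `column_information` and `logo_heights` and the example sites are made up for this sketch.

```python
import math
from collections import Counter

def column_information(column, alphabet="ACGT", pseudocount=0.0):
    """Information content in bits: log2(alphabet size) minus the column entropy."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    entropy = 0.0
    for base in alphabet:
        p = (counts[base] + pseudocount) / total
        if p > 0:
            entropy -= p * math.log2(p)
    return math.log2(len(alphabet)) - entropy

def logo_heights(sequences):
    """Total stack height per position for equal-length aligned sites."""
    return [column_information(col) for col in zip(*sequences)]

# Hypothetical aligned binding sites, for illustration only
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
print(logo_heights(sites))
```

A fully conserved column reaches the maximum of 2 bits, while a column with all four bases at equal frequency has a height of 0.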

Why is it essential to search for pseudogenes also?

pseudogenes: Nonfunctional sequences of genomic DNA that are originally derived from functional genes, but exhibit such degenerative features as premature stop codons and frameshift mutations that prevent their expression
might interfere with experiments
- PCR and hybridization experiments
- transcribed pseudogenes
- interference with disease diagnostics and treatment
molecular record of dynamics and evolution of genomes
- rate of nucleotide substitutions
- rate of DNA loss
improvement of gene prediction and annotation efforts

What do “multiplicity” and “co-operativity” mean in the context of miRNA target interactions?

multiplicity: one miRNA can target more than one gene
co-operativity: one gene can be controlled by more than one miRNA

How does the positive prediction value change if the target strongly resembles the informant?

Gene prediction for D. melanogaster:

too diverged → number of mismatches low because most of the sequence cannot be aligned
too close → number of mismatches low because the sequence is unchanged, so conservation no longer separates coding from non-coding regions and the positive predictive value drops
for D. melanogaster, the best accuracy is obtained with D. ananassae as informant (~1 substitution per synonymous site)
for human, mouse would be a good informant (~0.6 substitutions per synonymous site)

How can it happen that an alternative exon is added to a transcript that leads to a shortened protein product?

exon has alternative stop codon
alternative exon leads to frame shift → former out of frame stop codon located nearer to the start comes in frame

Name a possible origin of operons.

Role of horizontal gene transfer: advantage of transferring complete sets of genes and thereby conferring a defined phenotype on the recipient
possibly originating from thermophilic bacteria

How does increasing the window size affect the positive predictive value of an ORF? / How does increasing the frameshift of the ORF length affect the accuracy of the prediction?

Increasing the Window Size

Increased Sensitivity: More ORFs are detected.
Increased False Positives: More random sequences are detected as ORFs.
Balance between Sensitivity and Specificity: Can affect the positive predictive value (PPV) both positively and negatively.

Increasing the Frameshift (ORF Length)

Increased Error Rate: Frameshift errors lead to incorrect ORF identification.
Shortened or Lengthened ORFs: ORFs are incorrectly classified as false positives or negatives.
Decreased Prediction Accuracy: The accuracy of ORF prediction decreases due to increased frameshift errors.

Summary

Analysis Window: A larger window increases sensitivity and false positives.
Frameshift: Increased frameshift errors decrease prediction accuracy.

What are pseudogenes? What are the two main classes distinguished?

Pseudogene:

non-functional genes
derived from functional genes through mutations
degenerative properties —> inhibit expression
Classes:
- conventional
- processed

Explain the Ka/Ks ratio. What does the value say about conservation, what conclusions can be made about the selection pressure?

Ka - number of non-synonymous substitutions per non-synonymous site

Ks - number of synonymous substitutions per synonymous site
Higher Ka/Ks —> lower conservation
Ka/Ks = 1 ⇒ no selection pressure
Ka/Ks > 1 ⇒ positive selection pressure (positive selection)
Ka/Ks < 1 ⇒ negative selection pressure (purifying selection)

—> For pseudogenes, Ka/Ks = 1 is expected.

→ In practice, values < 1 are observed: Ka/Ks is underestimated because the pseudogenes were compared with present-day genes, not with the ancestral functional gene that gave rise to the processed pseudogene.
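The distinction between synonymous and non-synonymous changes can be illustrated with a rough Python sketch. It only counts changed codons between two aligned, in-frame coding sequences; it does not normalize by the number of synonymous and non-synonymous sites, so it is not a real Ka/Ks estimator. The function names and the `tolerance` parameter are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def count_codon_changes(seq_a, seq_b):
    """Count codons that differ synonymously / non-synonymously (crude, unnormalized)."""
    syn = nonsyn = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn

def interpret_ka_ks(ratio, tolerance=0.05):
    """Map a Ka/Ks ratio to the selection regime (tolerance is an arbitrary choice)."""
    if abs(ratio - 1.0) <= tolerance:
        return "neutral (no selection pressure)"
    return "positive selection" if ratio > 1.0 else "purifying selection"

print(count_codon_changes("ATGCTTAAA", "ATGCTCAGA"))
print(interpret_ka_ks(0.2))
```

In the toy pair, CTT → CTC keeps leucine (synonymous) and AAA → AGA changes lysine to arginine (non-synonymous).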

What are the three strategies for gene prediction? Give an example for each.

Content-based: Example (ORFs, codon usage, repeat periodicity, compositional complexity)
Site-based: Example (splice sites, TF binding sites, consensus sequences, polyadenylation signals, start/stop codons)
Comparative: Example (inference based on homology, protein sequence similarity, modular structure of proteins usually precludes finding complete gene)

Assign tasks to given tools.

Motif finding	MEME (Multiple EM for Motif Elucidation); Gibbs sampler (GLAM); FootPrinter (phylogenetic footprinting approach)
Representation of motifs	WebLogo3 → sequence logo; PROSITE database → PROSITE pattern
Gene prediction (prok.)	GeneMark/GeneMark.hmm; GLIMMER; EcoParse; TESTCODE; FgenesB; ORPHEUS
Gene prediction (eukar.)	Target only: Genscan, Augustus, GeneMarkS (eukaryotic). Plus extrinsic information: GenomeScan (BLAST hits). Single informant: Twinscan (splice sites & start/stop from mouse-human), SGP2 (conserved protein coding from mouse-human). Multiple informants: N-SCAN (mouse-rat-chicken-human; no improvement over mouse-human as too little divergence), CONTRAST (mouse-opossum-human)
Pseudogene identification	PseudoPipe
SNPs	Tolerance predictors: SIFT (sort intolerant from tolerant substitutions), PolyPhen, PANTHER PSEC, TopoSNP. Protein stability predictors: FoldX, Rosetta. Splicing predictors: ESEfinder, Human Splicing Finder. Cancer variant predictors: FATHMM, CanDrA
RNA structure prediction	Base pair maximization; MFOLD (energy minimization); ViennaRNA RNAfold (energy minimization)
miRNA gene prediction	miRscan
Prediction of miRNA targets	TargetScan; miRanda
Detection of repeats	Repeat finding: REPuter. Clustering: RepeatFinder. Repeat masking: RepeatMasker
Read aligner	MAQ (spaced seed indexing); Bowtie (Burrows-Wheeler transform); splice-aware aligners: TopHat, BLAT, SpliceMap, MapSplice, GSNAP
Graph construction and traversal	Cufflinks; Scripture
De novo transcriptome assembly	Velvet/Oases; Trinity; Trans-ABySS

What are the properties of a strong promoter?

DNA sequence that facilitates a high rate of transcription

efficiently binds to the RNA polymerase and promotes robust transcription initiation
strong promoter has a high affinity for the RNA polymerase, allowing efficient binding and initiation of transcription
presence of specific sequence motifs within the promoter region

Please state three differences between Whole Genome Shotgun and Clone-by-Clone sequencing.

	Clone-by-Clone	Whole Genome Shotgun
physical mapping	requires construction of clone-based physical map and individual clones are subcloned	mapping phase skipped and subclone library is constructed from entire genome
assembly	easier to resolve complex genomic regions as position of contigs is already known (due to the physical mapping)	order/position of contigs needs to be inferred from overlapping reads and read pairs which can be problematic for tandemly repeated DNA (incorrect overlaps)
labor intensity	physical mapping is labor-intensive, but after mapping, clones can be divided between different labs for sequencing (relevant as sequencing was slower at the start of the century)	less labor-intensive, but requires more computational resources

Which sequencing method out of Whole Genome and Clone-by-Clone would you use for prokaryotic and eukaryotic genomes?

Historically, clone-by-clone was used more commonly for eukaryotic genomes, as it helps to overcome challenges with their highly repetitive and complex regions
WGS is particularly suitable for organisms with smaller genomes and less complex genomic structures
Approaches can be combined in a hybrid shotgun-sequencing approach

State four types of alternative splicing events.

How can you detect alternative splicing?

—> AS can be verified by analyzing RNA isoforms

using RT-PCR with primers that flank the alternatively spliced region → different lengths of PCR product
using microarrays (high-throughput approach) with exon-exon junction probes

Please describe the procedure / two effects of alternative splicing. What are the consequences if the protein product becomes larger as a result?

Process of Splicing (two steps):

5 critical bases: 5’ donor SS (GU), branch point (A), 3’ acceptor SS (AG)

first step:
- cleavage at the 5’ SS
- joining of the 5′ end of the intron to an A within the intron (the branch point)
- —> lariat-like (lasso) intermediate —> intron forms a loop
second step:
- cleavage at the 3′ splice site and ligation of the exons
- —> result: excision of the intron as a lariat-like structure

Effects of AS:

multiple isoforms of a gene —> Protein diversity
Tissue-specific regulation of gene expression

Larger product:

Change protein stability
Change enzymatic/ signaling activity

Describe a method to analyze alternative splicing bioinformatically. Specifically, explain the required input data.

Alignment of ESTs (expressed sequence tags) against DNA sequence
Insertions and deletions in the ESTs relative to the mRNA are identified as potential alternative splices
Alternative splices are detected when two splices are mutually exclusive

Requires ESTs, which are cDNA sequences derived from mRNA with reverse transcriptase

Why do genes gather in operons/ what benefits do operons have?

Definition operon:

Multigene bacterial operons have one promoter and one transcriptional stop. The transcript holds more than one gene with multiple translational starts and stops.

Reason:

genes are regulated together → faster adaptation to environmental changes
efficient transcription and translation: genes are controlled by a single promoter region and are transcribed together; the RNA polymerase can process multiple genes in a single pass

Please explain four different methods for gene prediction in prokaryotes.

EcoParse: HMMs for gene prediction with different models for the intergenic region depending on operon or non-operon genes: “long intergenic region” and “short”. (p. 52)
- might show different distribution of base frequencies as regulatory elements are missing for genes in an operon (e.g. no RBS in (-20)...(-1) region of start codon of the second gene)
ORPHEUS: Tool based on intrinsic and extrinsic information
- DPS match → use as seed ORF and refine start and stop of ORF → derive codon usage → derive RBS weight matrix → full set of predicted genes
- detects genes and RBS → can derive: operon or not
GeneMark:
- Fifth-order Markov model
- uses intrinsic information about frequency of hexamers in each of the frames and background
GeneMark.hmm:
- HMM with states for start codons, typical/atypical (e.g. horizontal gene transfer/Class III gene) gene and stop codon for +/- strand
GLIMMER:
- interpolated Markov models
- detects patterns present in known gene sequences
TESTCODE:
- every third base tends to be the same much more often than expected by chance in coding regions (amino acid composition bias + codon bias)

Which two classes of information are being used in gene prediction? Also, state two sub-classes for each.

intrinsic information
- exon/intron length distribution
- promoter and polyA signals
- conserved splice signals
- hexamer composition of exons/introns
- reading frame consistency of exons
- isochore differences
extrinsic information
- EST (expressed sequence tag)
- cDNA
- protein-genome alignments

What is the Kozak sequence?

Nucleotide motif marking the translation initiation site in most eukaryotic mRNA transcripts
(region around the start codon)
5'-(gcc)gccRccAUGG-3'
(eukaryotic counterpart of the Shine-Dalgarno sequence)

Sketch the architecture of GenScan.

Architecture:

Generalized HMM (GHMM)
models both strands at the same time; from intergenic state model can enter states for + strand genes or - strand genes
states:
- N: intergenic region
- P: promoter (sensor for TATA)
- F: five-prime UTR
- then either a single-exon gene or the model for a multiple-exon gene
- single-exon genes are modeled by a single state (Esngl)
- multiple exons:
  - state for initial exon models region from translational start to donor splice site Einit
  - 3 states for different phases of introns (Ik for k: 0: between codons, 1: after first base, 2: after second base)
  - 3 states for exons between introns also for keeping the phase information Ek
- terminal exon Eterm
- T: three-prime UTR
- A: poly-A signal (sensor for the AATAAA signal; the cap signal sensor belongs to the promoter state)
- reverts to N

Explain GeneMarkS.

parallel unsupervised training and prediction
based on GeneMark.hmm architecture:
- non-homogeneous HMM -> coding regions
- homogeneous HMM -> non-coding regions
- coding capacity of sliding windows -> Bayesian decision rule

Explain each step in the given formula and sketch them.

How could the formula be improved?

Base pair maximization: Recursive definition of the best score for a subsequence i,j → four possibilities:

1: i,j are a base pair, added on to a structure for i+1…j-1, add +1
2: i is unpaired, added on to a structure for i+1…j
3: j is unpaired, added on to a structure for i…j-1
4: i,j are paired, but not to each other: the structure for i..j adds together substructures for two sub-sequences, i..k and k+1..j (bifurcation)
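The four cases translate directly into a dynamic programming table. The following Python sketch is one way to write it down, assuming that A-U, G-C and G-U pairs are allowed and that there is no minimum hairpin loop length by default; the names `can_pair`, `max_base_pairs` and the `min_loop` parameter are chosen for this example only.

```python
def can_pair(a, b):
    """Allowed pairs in this sketch: Watson-Crick plus the G-U wobble pair."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})

def max_base_pairs(seq, min_loop=0):
    """Fill S(i,j) by increasing subsequence length and return S(0, n-1)."""
    n = len(seq)
    if n == 0:
        return 0
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # cases 2 and 3: i unpaired, or j unpaired
            best = max(S[i + 1][j], S[i][j - 1])
            # case 1: i and j pair with each other
            if can_pair(seq[i], seq[j]) and j - i > min_loop:
                inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            # case 4: bifurcation into i..k and k+1..j
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]

print(max_base_pairs("GGGAAACCC"))
```

The three nested loops give a running time cubic in the sequence length; a traceback through `S` would recover the actual structure.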

Improvements:

It is more plausible that an RNA adopts a globally minimum energy structure, not the structure with the maximum number of base pairs → predict overall free energy

Additionally use thermodynamic information

negative stacking energy for matches
positive destabilizing energies for loops (size-dependent)

What are covariance models, and why are they used? Sketch a structure that cannot be predicted by such methods.

Statistical model that captures the patterns of covariation that can be obtained from an MSA. Covarying positions tend to coevolve: compensatory changes maintain the base pair, so the RNA structure is conserved. RNA structure prediction can be improved by giving positions with greater covariation more weight.
Describes both the secondary structure and the primary sequence consensus of an RNA
Can be applied to several RNA analysis problems:
- consensus secondary structure prediction
- multiple sequence alignment
- database similarity searching
Iterative training procedure
Optimal algorithm for RNA secondary structure prediction based on pairwise covariations in multiple alignments
Covariation ensures ability to base pair is maintained and RNA structure is conserved

Can’t predict: Pseudoknots

—> violate recursive definition of the optimal score S(i,j)

State the classes of Interspersed Repeats.

Interspersed repeats:
- Retroelements:
  - LINEs (Long Interspersed Nuclear Elements) [autonomous]
  - SINEs (Short Interspersed Nuclear Elements) [nonautonomous]
  - LTRs (Long Terminal Repeat Retrotransposons)
- DNA-Transposons

Name two features of interspersed repeats.

Involve RNA intermediates (Retroelements) or DNA intermediates (DNA transposons)
- Mobility:
  - conservative transposition
  - replicative transposition
  - retrotransposition
Derived from biologically active ‘transposable elements’ (TEs)

What other three repetitive classes of sequences are there? How are they different from interspersed repeats?

Tandemly repeated DNA (Simple sequence repeats without interruption)
- Microsatellites
  - one to a dozen base pairs
  - may be formed by replication slippage
- Minisatellites
  - a dozen to 500 base pairs
- Cryptically simple repeats
- Low complexity repeats
- Satellite and telomeric repeats
Segmental duplications
- nearly identical copies ranging in size from 1 to >200 kb
- originate from duplicative transpositions
Pseudogenes
- derived from functional genes but with deleterious mutation

What are SNPs?

Single nucleotide polymorphisms (SNPs)
A SNP occurs when a single nucleotide is replaced by one of the other three nucleotides. SNPs found in a coding sequence are of great interest as they are more likely to alter the function of a protein.
most common type of genetic variation in humans.
account for 90% of the variation between individuals.

Which two types of SNPs are there and what are the differences?

Synonymous:
- not causing a change in the amino acid
Non-synonymous:
- A nonsynonymous or missense variant is a single base change in a coding region that causes an amino acid change in the corresponding protein

Explain the difference between transition and transversion in base changes.

transition: changes a purine to another purine (A ↔ G), or a pyrimidine to another pyrimidine (C ↔ T)
transversion: change from purine (A/G) to pyrimidine (T/C) or vice versa.
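As a small illustration, the rule can be written as a Python function; the name `classify_substitution` is made up for this sketch and only DNA bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base DNA change."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("expected DNA bases A, C, G or T")
    if ref == alt:
        raise ValueError("no substitution: both bases are identical")
    # same chemical class (purine/purine or pyrimidine/pyrimidine) -> transition
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

print(classify_substitution("A", "G"))
print(classify_substitution("A", "T"))
```

Of the twelve possible single-base changes, four are transitions and eight are transversions.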

How can SNPs be linked to disease?

SNPs may be informative with respect to disease:

Functional variation. A SNP associated with a nonsynonymous substitution in a coding region will change the amino acid sequence of a protein.

Regulatory variation. A SNP in a noncoding region can influence gene expression.

Association. SNPs can be used in whole-genome association studies. SNP frequency is compared between affected and control populations.

Explain the differences between miRNA in animals and plants.

Number of miRNA genes present:
- Plants: 100-200 genes
- Animals: 100-500
Location within genome:
- Plants: predominantly intergenic regions
- Animals: intergenic regions, introns

Presence of miRNA clusters:
- Plants: uncommon
- Animals: common
miRNA biosynthesis:
- Plants: Dicer-like
- Animals: Drosha, Dicer
Mechanism of repression:
- Plants: mRNA-cleavage (methylation?)
- Animals: Translational repression
Location of miRNA-binding motifs:
- Plants: predominantly in the ORF
- Animals: predominantly in the 3’-UTR
Number of miRNA-binding sites within target sites:
- Plants: Generally one
- Animals: Generally multiple
Function of known target genes:
- Plants: Regulatory genes - crucial for development, enzymes
- Animals: Regulatory genes - crucial for development, structural proteins, enzymes

Describe the targetScan algorithm.

TargetScan
- thermodynamics-based RNA:RNA duplex interactions
- comparative sequence analysis
- Input:
  - miRNA that is conserved in multiple species
  - set of 3’UTR sequences from these species
- Method:
  - check the miRNA seed region (bases 2-8) for perfect complementarity to the 3’UTR
  - extend with further base pairs, allowing G:U pairs but stopping at mismatches
  - assign a folding free energy G to each miRNA:target interaction
  - assign Z score to each UTR
  - sort UTRs by Z score -> assign rank
  - compare organisms -> conserved miRNAs

What kind of data do you need for targetScan? Do these data types have disadvantages?

Input:

miRNA that is conserved in multiple organisms
a set of orthologous 3‘ UTR sequences from these organisms

Disadvantages:

Incompleteness of orthologous gene annotations
Some targets may not meet the stringent seed matching, Z score, or rank criteria
Some target sites may lie outside the 3‘ UTR (plants)
Some targets may not be conserved in the complete set of organisms

⇒ The actual number of target genes regulated by each miRNA is likely to be substantially higher

Name two pros and cons for microarray and RNA-seq.

Hybridization (Microarrays):
- Pro:
  - Relatively low cost
  - Well established in clinical use
- Con:
  - Analysis only of pre-defined sequences
  - Dynamic range limited by scanner
  - high background-noise
  - cross-hybridization possible
Sequence-based (RNA-seq):
- Pro:
  - identification of alternative splice variants / novel transcripts
  - high sensitivity
- Con:
  - relatively high cost
  - high computational effort
  - prone to contamination

Give an overview of the experimental steps in an RNA sequencing (RNA-seq) protocol.

RNA extraction → target enrichment → cDNA → library prep → sequencing → Transcriptome/genome mapping → data analysis

Experimental design: number of replicates, depth of sequencing
Parameters: alignment rate, desired power, significance level, log-fold change

RNA-seq workflow

Quality control
Alignment of reads to reference genome
Transcriptome assembly
Differential expression

State three differences between pro- and eukaryotic genomes.

Feature	Prokaryotes	Eukaryotes
Size	Between 1s and 10s of Mb	Between 1s and 1,000s of Mb
Topology	Mostly circular	Mostly linear
Gene number	Most < 10,000	Often > 10,000
Pseudogenes	Few	Many
Complexity	Low	High
Horizontal gene transfer	Frequent	Rare
Intergenic regions	Short (<100 kb)	Long (often >100 kb)
Genome duplication	None	Frequent (especially in plants)
Gene duplication	Rare	Frequent
Repeated sequences	Minor components	Major components

Explain the FASTQ format.

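Simple extension of FASTA → additionally stores the quality of each base

Line 1: @ followed by the read ID
Line 2: sequence
Line 3: + (optionally followed by the ID again)
Line 4: quality scores → PHRED scores, one ASCII character per base (values 0-93)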
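A single four-line record can be decoded as follows. This is a minimal sketch assuming the common Phred+33 encoding (ASCII 33 to 126 for scores 0 to 93); the function names and the example read are invented for illustration.

```python
def parse_fastq_record(lines, offset=33):
    """Split one 4-line FASTQ record into (read_id, sequence, phred_scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # each quality character encodes one Phred score as ASCII code minus the offset
    scores = [ord(ch) - offset for ch in quality]
    return header[1:], sequence, scores

def error_probability(phred):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)

record = ["@read1\n", "ACGT\n", "+\n", "!I5~\n"]  # made-up example read
read_id, seq, scores = parse_fastq_record(record)
print(read_id, seq, scores)
```

A score of 20 means a 1 in 100 chance that the base call is wrong, a score of 40 means 1 in 10,000.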

Below is the output of the GeneMark.hmm program. Please explain what the column “Strand” means.

The Strand column gives the DNA strand on which the predicted gene lies, i.e., forward or reverse

What is the normalization for the sequence length in NGS, and what is it used for?

Normalization for sequence length:

variations in read length across different samples or experiments
—> adjusting/standardizing length of reads or fragments

Purpose:

Comparability Across Samples
Data Quality Control
Accurate Quantification
Alignment Efficiency
Bias Reduction

Methods:

Trimming —> removing bases from the end, e.g., Trimmomatic
Subsampling -> select reads that are within common distribution
Length Filtering
Statistical Normalization -> TPM, etc.
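As an example of the statistical route, TPM (transcripts per million) divides each read count by the transcript length and then rescales so that all values of a sample sum to one million. The Python sketch below illustrates this; the function name `tpm`, its dictionary inputs and the gene names are assumptions made for the example.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from raw read counts and transcript lengths in bp."""
    # step 1: reads per kilobase, which removes the transcript-length effect
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero")
    # step 2: rescale so that the sample sums to one million
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

# Hypothetical counts: both genes have the same number of reads,
# but geneB is twice as long, so its TPM is half that of geneA
values = tpm({"geneA": 100, "geneB": 100}, {"geneA": 1000, "geneB": 2000})
print(values)
```

Because every sample sums to the same total, TPM values can be compared between samples more directly than raw counts.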

What are N50 and L50 measures?

N50: length of the shortest contig/scaffold in the smallest set of longest contigs/scaffolds that together cover at least 50% of the assembly. Higher N50 values → better assembly contiguity
L50: number of contigs/scaffolds in that set, i.e., the number needed to cover 50% of the assembly. Lower L50 values → better assembly contiguity
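Both measures come out of one pass over the contig lengths sorted from longest to shortest. A minimal Python sketch, with the function name `n50_l50` and the example lengths chosen for illustration:

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # the first contig that pushes the running sum to half the assembly size
        if running >= half:
            return length, count
    raise ValueError("no contigs given")

# Toy assembly of 300 bp in total: 80 + 70 already reaches 150
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))
```

Here N50 is 70 and L50 is 2, because the two longest contigs already cover half of the 300 bp.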

Explain one approach to RNA secondary structure prediction.

predict the most stable secondary structure —> principle of free energy minimization
→ Zuker algorithm (dynamic programming)
1. Initialize matrix to store energy values
2. Specify the allowable base pair rules (A-U, G-C, G-U)
3. Fill the Matrix
  1. Energy of different types of loops (hairpin, bulge, interior, and multibranch loops)
  2. energy for stacking adjacent base pairs
4. Traceback lowest energy —> secondary structure

Explain the similarity-based approach to gene prediction.

Genes in different organisms are similar
uses known genes in one genome to predict (unknown) genes in another genome
-> known gene and a genome sequence -> find set of substrings of the genomic sequence whose concatenation best fits the gene

Name and explain two types of Alternative Splicing.

constitutive: > 1 product is always made from transcribed gene
regulated: different forms under different conditions/times/cell types

Explain the k-mer approach to finding the repeats in genomes.

split genome into k-mers of length k
count occurrences of each k-mer
filter for heavily repeated k-mers
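The three steps fit into a few lines of Python. This sketch uses overlapping (sliding-window) k-mers on one strand only and ignores reverse complements; the function name `repeated_kmers` and the `min_count` threshold are assumptions for this example.

```python
from collections import Counter

def repeated_kmers(genome, k, min_count=2):
    """Count all overlapping k-mers and keep those seen at least min_count times."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

print(repeated_kmers("ACGTACGTTT", k=4))
```

For real genomes the counting table is the memory bottleneck, which is why dedicated k-mer counters use compact or probabilistic data structures.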

Pros:

easy to implement/understand
scalable to large genomes

Cons:

manual k —> choosing right k is critical
high memory usage (large genome, large k: many distinct k-mers)

What is repeat masking and why is it needed?

identify and mask repetitive genomic DNA
lower case = masked, upper case = not
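Soft masking as described can be sketched in Python, given repeat coordinates that a repeat finder has already produced. The function name `soft_mask` and the half-open, 0-based `(start, end)` interval convention are assumptions of this example, not the output format of any particular tool.

```python
def soft_mask(sequence, intervals):
    """Lower-case the given 0-based, half-open (start, end) repeat intervals."""
    chars = list(sequence.upper())
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)

print(soft_mask("ACGTACGTAC", [(2, 5)]))
```

Soft masking keeps the sequence information; hard masking would replace the same positions with N instead.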

Advantages:

significantly reduces size of genome —> faster/more efficient analysis
improve alignment: repeats can cause mismatches
improve gene prediction:
- repeats can confuse algorithms -> false positives/missing real genes
- focus on unique sequences —> better accuracy
Reducing False Positives in Functional Genomics
- e.g. motif analysis

Tools: RepeatMasker

What are the challenges in identifying motifs in biological sequences?

Sequence length —> long genomes require significant computing time
Motif Diversity —> shorter motifs = harder to detect against background
Noise and Background —> true motifs vs random sequence
Biological Context —> functional relevance of identified motifs

Explain the main steps of the genome assembly process.

What are two assumptions that can be made when predicting the role of amino acids substitution?

Conservation of Functional Sites: Highly conserved regions are critical for function, and substitutions here are likely impactful
Physicochemical Properties: Substitutions that significantly alter size, charge, hydrophobicity, or polarity are likely to affect protein function or stability.

Explain the TargetScan algorithm for predicting mammalian MicroRNA Targets (you can skip the formulas).

Input:
- miRNA sequence (conserved in multiple species)
- 3’ UTR sequences from these organisms
search the UTRs in the first organism for segments of perfect Watson-Crick complementarity to bases 2–8 of the miRNA: “miRNA seed” and “seed matches”
extend each seed match with additional base pairs to the miRNA as far as possible in each direction, allowing G:U pairs, but stopping at mismatches
optimize base pairing of the remaining 3‘ portion of the miRNA to the 35 bases of the UTR immediately 5‘ of each seed match using the RNAfold program
assign a folding free energy G to each such miRNA:target site interaction
assign a Z score to each UTR
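The first step, the seed match search, can be sketched in Python. The example covers only that step (no extension, RNAfold call or Z score); the function names and the two sequences are illustrative assumptions, and both inputs are expected in 5'→3' orientation using the RNA alphabet.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Watson-Crick reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]

def seed_match_positions(mirna, utr):
    """Find all perfect matches to miRNA bases 2-8 (the seed) in a 3' UTR."""
    seed = mirna[1:8]                # bases 2-8 in 1-based numbering
    site = reverse_complement(seed)  # what the seed pairs with on the UTR
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)
    return site, positions

# Illustrative miRNA and toy UTR, both written 5'->3'
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAACUACCUCAAAGGGCUACCUCUU"
print(seed_match_positions(mirna, utr))
```

Each reported position is the 0-based start of a seed match in the UTR, from which the extension and folding steps would continue.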

What is pan-genome?

entire set of genes from all strains within a clade
more generally -> union of genomes

Information

Last changed
2 years ago
