Lecture & Exercises

Buffl

Methods of Genome Analysis (Methoden der Genomanalyse)

by Anton S.

How much of the human genome is transcribed and how much is coding

70% transcribed
2% coding

Does the number of ORFs / genes correlate with genome size ?

Yes
linear correlation

What's the average gene density of humans, and where do we stand compared to other organisms ?

H: 11 genes per Mb (10⁶ bases) / Fly: 76 / Yeast: 496

How much of the human genome is taken up by repeats ?

Name 2 methods for whole genome sequencing, their pros and cons and how they work

Clone by clone:
- easy assembly
- more work
Whole genome shotgun:
- complicated assembly (tandemly repeated DNA, genome-wide repeats)
- little preparation needed

Does the organism complexity correlate with the gene number ?

Name some functions of recurring DNA patterns in humans

ribosome binding sites
TF binding sites
nuclease binding sites

Give reasons why motif finding in eukaryotes is significantly harder than in prokaryotes

Prokaryotes have few TFs
Prokaryotes have operons
Eukaryotes have more TFs per gene
Eukaryotes have shorter motifs
Eukaryotes can have long-range effects due to DNA folding
Eukaryotes have non-exact matching mechanisms

What does the height of a letter in a sequence logo encode ?

Why does there need to be a small sample correction ?

What else does a sequence logo need to be corrected for ?

Height shows the information content at that position (range [0, 2] bits)
- If all NUCs are equally represented, the information is 0
With small samples (especially if the sample size is not divisible by 4), an observed information content of 0 is impossible even though it might be the true value.
- Pseudocounts try to address this by adding 1 to the count of each base in all cases

Different background probabilities
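
The information content of a single logo column can be sketched as follows. This is a minimal illustration that assumes a uniform background (0.25 per base), so it does not cover the background correction mentioned above. The function name `information_content` and the `pseudocount` parameter are made up for this example.

```python
import math
from collections import Counter

def information_content(column, pseudocount=0.0, alphabet="ACGT"):
    """Information content (bits) of one alignment column, uniform background assumed."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    entropy = 0.0
    for base in alphabet:
        p = (counts[base] + pseudocount) / total
        if p > 0:
            entropy -= p * math.log2(p)
    # Maximum entropy for DNA is log2(4) = 2 bits
    return math.log2(len(alphabet)) - entropy

print(information_content("AAAAAAAA"))             # fully conserved column
print(information_content("ACGTACGT"))             # all bases equally frequent
print(information_content("AAAA", pseudocount=1))  # small sample, pseudocount of 1 per base
```

With the pseudocount, a column of only four identical bases no longer reaches the full 2 bits, which is the cautious behaviour one wants for small samples.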

Why is regex unsuited for motif discovery ?

Regex finds all the patterns, but has no way of distinguishing good from bad hits

What are the components of a hidden Markov model and how is it decoded ?

Components:
- emission probabilities
- state transition probabilities
- states
The run of an HMM is decoded with dynamic programming, using the Viterbi algorithm
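
A minimal sketch of Viterbi decoding in log space. The two-state toy model (`GC_rich` / `AT_rich`) and all of its probabilities are invented for illustration. The sketch assumes every probability is greater than 0 so that the logarithm is defined.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy model, not taken from the lecture
states = ("GC_rich", "AT_rich")
start_p = {"GC_rich": 0.5, "AT_rich": 0.5}
trans_p = {s: {"GC_rich": 0.5, "AT_rich": 0.5} for s in states}
emit_p = {
    "GC_rich": {"G": 0.45, "C": 0.45, "A": 0.05, "T": 0.05},
    "AT_rich": {"G": 0.05, "C": 0.05, "A": 0.45, "T": 0.45},
}
print(viterbi("GCAT", states, start_p, trans_p, emit_p))
```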

How does the expectation maximization algorithm work ?

Make an initial guess about the motif
Estimate the probability of being the motif at each position in each sequence
Compute new probabilities based on the probable hits (factoring in hit probability and NUC frequency)
Update the original motif model

Initialization is key; Gibbs sampling tries to use statistical methods to improve EM

What is the basic workflow for predicting genes on new prokaryotic DNA ?

Translate the DNA in all 6 reading frames
Use a gene prediction program
Analyze regulatory sequences

What's the job of the ribosome ?

How does mRNA find the Ribosome ?

Consists of proteins and RNA
Translates mRNA into proteins

The ribosomal binding site is also transcribed from DNA into mRNA

Why is gene finding in Prokaryotes relatively easy ?

They have no introns (uninterrupted ORFs)
They have short intergenic regions
They have operons (multiple genes with the same promoter and regulatory sequence)
They have dense genomes

Give 3 different approaches to Gene prediction

Content based (uses bulk properties of the seq)
- e.g. ORFs
- Codon usage
- Repeat periodicity (fewer repeats in genes)
Site based
- Identification of sites that are typical for genes
- Donor / acceptor sites
- TF binding sites
- ATG starts
Comparative
- Compare the sequence with other genomes and see whether an annotation already exists

What can be said about the length of genes and how can this property be of use for prediction ?

Genes are usually long ORFs
By chance we would expect a stop codon every ~21 codons (64/3)
Short putative genes are probably false positives
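
The 64/3 estimate can be checked with a short calculation. This sketch assumes all 64 codons are equally likely and independent, which real genomes only approximate. The function names are made up for this example.

```python
def expected_codons_until_stop(n_stop=3, n_codons=64):
    """Mean spacing between stop codons in random sequence."""
    return n_codons / n_stop

def prob_orf_by_chance(length_in_codons, n_stop=3, n_codons=64):
    """Probability that a random stretch of this many codons contains no stop codon."""
    return (1 - n_stop / n_codons) ** length_in_codons

print(expected_codons_until_stop())   # about 21.3 codons
print(prob_orf_by_chance(100))        # chance of a random 100-codon open frame
```

Under these assumptions an open frame of 100 codons arises by chance in fewer than 1% of cases, which is why long ORFs are good gene candidates.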

What info does an intrinsic approach use for gene prediction ?

Give one example

Information that is contained in the sequence
- Unequal AA usage
- Codon preferences (different from organism to organism)
- Unequal codon usage

HMMs model the grammar of genes
- e.g. every third NUC is the same (AA is mainly encoded by the first 2 codon NUCs)
- asymmetry scores are converted to probabilities

What's the issue with training gene prediction on bulk data ?

Models are insensitive
Class-specific models increase prediction performance

Which tool uses extrinsic and intrinsic info for prokaryotic gene prediction ?

How does it work ?

ORPHEUS

DPS local alignments (DPS = DNA protein search)
Codon usage analysis (extend local alignments to start/stop codons)
RBS weight matrix (align -20 to -1 regions and identify ribosome binding sites)
Find the best gene start

What's the Shine-Dalgarno sequence ?

The ribosomal binding site
Purine-rich region at the 5’ end of the mRNA
AGGAG
5-10 bp upstream of the start codon

What are some prediction tasks on a eukaryotic genome ?

promoter prediction
splice site prediction
gene prediction

What's the codon bias ?

One AA is primarily encoded by one codon even though there are multiple synonymous codons.

Do introns split after a codon or not ?

They can split anywhere (also mid-codon)

Name the 3 most important sites of an intron and describe the splicing process

5’ splice site <exon>GT……<intron>
branch point <intron>……A…….<intron>
3’ splice site <intron>…AG<exon>

The 5’ site is cleaved; the freed intron end attaches to the branch point A and forms a loop (lariat).
The 3’ site is cleaved and the exons are joined

Where does the splicing happen ?

In the spliceosome

Describe how GENSCAN works

Designed to predict complete gene structures
Based on a general probabilistic model of genomic sequence composition

Based on a GHMM (generalized HMM)
Models both strands simultaneously
States are e.g.:
- N: intergenic region
- P: promoter
- E: exon
- A: poly-A signal

What's an EST ?

Expressed sequence tag
Expressed RNA that has been reverse transcribed into cDNA and added to a library

What's the exon chaining problem ?

Given a set of putative exons, find the maximum set of non-overlapping exons
Dynamic programming in O(n)
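
A minimal sketch of exon chaining as weighted interval scheduling. It assumes each putative exon comes with a score and integer, inclusive coordinates; the tuple layout and the function name `chain_exons` are made up for this example. This version sorts the exons first, so it runs in O(n log n); the O(n) bound on the card presumably refers to input whose endpoints are already sorted.

```python
from bisect import bisect_right

def chain_exons(exons):
    """exons: list of (start, end, score), end inclusive. Returns (best_score, chain)."""
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i]: optimal (score, chain) using only the first i exons (sorted by end)
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_right(ends, start - 1, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]

score, chain = chain_exons([(1, 5, 5), (4, 8, 6), (9, 12, 3), (6, 10, 4)])
print(score, chain)
```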

What are some similarity based approaches for gene finding ?

Comparison to EST databases (are my ESTs already in a database)
Comparison of the translated seq to known protein sequences
Comparison to homologous annotated sequences of related organisms

What's a CIS / TRANS alignment ?

CIS: aligning cDNA to its source genome
TRANS: aligning cDNA to a homologous genome

When doing dual-genome de novo gene prediction, what's the most important decision to make ?

The choice of the informant genome
Too similar (no new information gain)
Too distant (no similarities to infer from)
Ideal seq identity for the informant: 55% match

Name some metrics to measure gene prediction success

Sensitivity (what fraction of all real positives did I find)
Specificity (what percentage of all predicted positives are TPs)
Missed genes (how many genes didn’t I find)
Wrong genes (where did I predict a gene although there wasn’t any)

Joined genes (did I join 2 genes that are not connected)
Split genes (did I fail to concatenate 2 exons into one gene)
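
The first two metrics can be written out as below. This sketch follows the definitions on the card, where "specificity" is TP / (TP + FP) as is common in the gene prediction literature; elsewhere that quantity is usually called precision. The counts in the example are invented.

```python
def sensitivity(tp, fn):
    """Fraction of real features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predictions that are real: TP / (TP + FP)."""
    return tp / (tp + fp)

# Hypothetical counts: 80 genes found, 20 missed, 40 wrong predictions
print(sensitivity(tp=80, fn=20))
print(specificity(tp=80, fp=40))
```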

What are pseudo genes and what are the 2 types ?

Nonfunctional genes, derived from functional genes
have degenerative features, that prevent expression
differ from their paralogs at crucial points

Types:
- conventional
- processed (at least 8000)

How are conventional pseudo genes created ?

Gene duplication
One copy is released from evolutionary pressure
Acquires lots of mutations
1. either new gene with similar function
2. conventional non-functional pseudo gene

How are processed pseudogenes created ?

Reinsertion of mRNA into DNA
Has no:
1. introns
2. signaling sequences (TF binding sites, promoters)

Why do pseudo genes need to be analysed ?

High similarity to functional genes can interfere with:
- PCR
- in situ hybridization
Provide a molecular record of genome evolution
processed / retrotransposed pseudo genes can be used to verify exon struct predictions

How can alternative splicing be seen experimentally ?

RT-PCR using primers that flank the AS region
- run the product on a gel and check its length

Microarrays
- probes are exon-exon junctions
- depending on which exon-exon junctions give a signal, the splice product can be inferred

What has the analysis of AS products on protein structure revealed ?

Half of all splicing events affect variable regions or entire domains
The other half is non-trivial
- affects conserved and structured regions
  - => nonsense products
  - => new functions

Name 3 classification types for processed pseudogenes

True:
1. high seq similarity with Swiss-Prot / TrEMBL seq
Putative:
1. young pseudogenes without frame disruptions
Disrupted

What is the Ka / Ks ratio ?

Ka: rate of non-synonymous substitution
Ks: rate of synonymous substitution
Ka/Ks << 1 “purifying selection”
Ka/Ks >> 1 “diversify protein product” e.g. immune system genes
Processed pseudogenes: Ka/Ks ~1

What is positive / negative splicing control ?

positive : binding of an activator enables splicing
- default no splicing
negative: binding of a repressor inhibits splicing
- default splicing

Name 2 splicing categories and what they describe

constitutive:
- more than one splicing product is always made from a transcribed gene
regulated:
- different products are generated at different times, under different conditions

Is every splice site consistently used ?

No, some splice sites are only used some of the time

Name some alternative splicing events and their frequency

Retained intron 3%
Competing 5’-splice sites 18%
- e.g. an exon has 2 5’ splice sites, only one is used ==> different exon length
Competing 3’-splice sites 8%
Exon skipping 38%
Mutually exclusive exons

What does computational splice site identification rely on ?

EST libraries
- better library (coverage) => better prediction

What's a measure for dissimilarity between mRNA isoforms ?

Splice junction difference (SJD)
Number of splice junctions that are not shared by both isoforms / number of all junctions
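
A minimal sketch of the SJD ratio. It assumes a junction is represented as a (donor, acceptor) coordinate pair and that "all junctions" means the union over both isoforms; the card does not spell out either detail, so both are assumptions. The coordinates are invented.

```python
def splice_junction_difference(isoform_a, isoform_b):
    """Junctions present in only one isoform, divided by all distinct junctions."""
    a, b = set(isoform_a), set(isoform_b)
    all_junctions = a | b
    if not all_junctions:
        return 0.0
    return len(a ^ b) / len(all_junctions)

# Hypothetical junctions as (donor position, acceptor position)
iso1 = [(100, 200), (300, 400)]
iso2 = [(100, 200), (300, 450)]
print(splice_junction_difference(iso1, iso2))
```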

What's the relationship between alternative splicing and evolutionary conservation ?

Splicing is evolutionarily conserved
Exons that are not part of constitutive splice forms are mostly not conserved

Name the 3 categories of DNA sequence variations

SNPs (single nucleotide polymorphisms)
Simple tandem repeat polymorphisms
Insertions / deletions

Do SNPs occur equally often across the whole genome ?

No, they occur less often in CDS

But they can occur anywhere:
- coding regions
- introns
- regulatory regions
- intergenic regions

What's the primary discovery method for SNPs ?

Sequencing and subsequent alignment to a reference genome

What are synonymous and non-synonymous changes ?

Synonymous do not change the AA that is encoded by codon
Non-synonymous change the AA that is encoded by codon

Why is a SNP effect prediction model very useful ?

Experimental characterization of SNPs is:
- expensive
- time consuming
- difficult

Name 2 broad approaches to predict SNP effects

Homology based (mutations at conserved positions are likely to have a negative effect)
Structure based (does the mutation meaningfully change the structure)

Name some possible disruption to the functionality of Proteins that SNPs can cause

Size change in hydrophobic core ( new AA has larger side chain)
Introduction of buried charged residues
Disruption of protein-protein interactions
Interference with DNA binding
Mutation of catalytic residues

What does SIFT do and why is it fast ?

SIFT = sorts intolerant from tolerant
SIFT does a BLAST search for a given sequence
Gets tolerated / deleterious substitutions for every position
Substitutions with normalized probabilities below a cutoff are declared deleterious

Fast because it uses homology rather than structure

Name some types of non coding RNA

miRNA : translational regulation
siRNA : RNA interference
tRNA : transfer RNA

Why is a tRNA necessary ?

Codons don’t recognize the AA they encode
They recognize the anti codon of a tRNA which brings the AA

Name the main areas of a tRNA and highlight which are most essential

Crucial unpaired regions:
- 3’ end - AA binding
- Anticodon loop - contains the anticodon for matching
Other features:
- D loop
- T loop

Is there one tRNA for each AA or are there any other mechanisms?

No
One AA can have multiple tRNAs
Some tRNAs can base pair with more than one codon
Some tRNAs only require accurate matching at the first 2 NUCs of a codon and can wobble on the third

Give the basic RNA base pairings

A-U
G-C
G-U (wobble pairing, used to form loop in splicing)

Name some secondary structures of RNA

single stranded
double stranded
hairpin loop
bulge loop

Which secondary RNA structure is the hardest to predict

Pseudoknots

Why do primary-sequence-based methods not quite work for RNA ?

Evolutionary pressure is on base pairings, not on sequence
Methods need to capture primary and secondary information

Give the 2 approaches used to predict RNA secondary structure

Energy minimization
- pro: accommodation of experimental and alignment data
- con: no tertiary structure, computationally intensive
use patterns of covariation
- pro: simple
- con: database has to be large enough

Give an RNA secondary structure prediction algorithm that maximizes base pairings

Dynamic programming approach
- memory intensive
- might not create the most energetically favorable structure
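
A sketch of one such dynamic program, in the style of the Nussinov algorithm (the card does not name the algorithm, so that attribution is an assumption). It only counts the maximum number of base pairs, allows A-U, G-C and G-U pairs, and requires at least `min_loop` unpaired bases inside a hairpin. The table needs O(n²) memory, which matches the "memory intensive" remark.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j]: maximum number of pairs within seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                      # case: base j stays unpaired
            for k in range(i, j - min_loop):         # case: base j pairs with base k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))
```

Because every pair counts equally, the result ignores stacking and loop energies, which is why it might not match the energetically most favorable structure.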

What are miRNAs ?

Family of small RNAs, 21-25 nt long
alter the expression of genes in a seq-dependent manner

How are miRNAs created in a cellular environment ?

pri-miRNA is processed at the Drosha/Pasha complex
pre-miRNA is exported from nucleus to cytoplasm (Exportin 5)
pre-miRNA is cleaved by DICER
one pre-miRNA strand is incorporated into RISC

What's MirScan and how does it work ?

Program for identifying hairpin loops in C. elegans
How it works:
- Slide a 110-nt window along the genome (discard obvious non-miRNA)
- Compute the secondary structure for the window with RNAfold
- Identify potential hairpin loop candidates
- Use scoring criteria on candidates to filter

What's multiplicity and cooperativity in miRNAs ?

multiplicity:
- one miRNA can target multiple genes
cooperativity:
- one gene can be regulated by many miRNAs

Name some basic types of repeats in the human genome

tandemly repeated DNA
LINE
SINE

What are micro- and minisatellites ?

Microsatellites: repeats with a repeating unit of 1-12 bp
- Can be formed by replication slippage
Minisatellites: 12-500 bp

What are satellite repeats and how do they differ from micro-/minisatellites ?

confined to well defined region (e.g telomeric region)
Can span millions of bp and is species specific

Why should repeats be studied ?

repeats are believed to play a significant role in genome evolution and disease
Mobile elements may contain coding regions that are hard to distinguish from other types of genes
repeats induce local alignments complicating assembly

Describe how the k-mer approach for de novo repeat finding works

WTF there is no explanation

Comments:

Anton - hahaha, isn't k-mer just the search for motifs of length k with n mismatches

Benji - yes, but what kind of simple approach is that supposed to be

Anton - I would have said: just go over the sequence and see which ones of length k are overrepresented
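
The card has no official answer, so the following only sketches the guess from the comments: count every substring of length k and keep those that occur unusually often. The function name and the `min_count` threshold are made up; mismatches are not handled.

```python
from collections import Counter

def overrepresented_kmers(seq, k, min_count=2):
    """Count all k-mers in seq and return those seen at least min_count times."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: c for kmer, c in counts.items() if c >= min_count}

print(overrepresented_kmers("ACGTACGTACGTTTT", k=4, min_count=3))
```

A real de novo repeat finder would additionally have to merge overlapping k-mers and compare counts against an expected background, which this sketch leaves out.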

What is RepeatMasker

program that uses precompiled libraries to find known repeats
indispensable in genomes with analyzed repeat families
new genome => new library

What's the difference between class I and class II transposable elements ?

Class I:
- first transcribed from DNA into RNA
- reverse transcribed into DNA by TE-encoded RT
- e.g. LTR elements, LINEs
Class II:
- Move by conservative cut and paste method

Exercise: What is meant by the N50 and L50 numbers in ncbi summaries ?

N50:
- N50 is the shortest contig length that needs to be included for covering 50% of the genome.
- e.g 4,641,652
L50:
- L50 is the smallest number of contigs whose summed length makes up half of the genome size.
- e.g 1

Both are only relevant if there are multiple contigs
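
Both numbers can be computed in one pass over the contig lengths sorted from longest to shortest. This sketch uses the total assembly length as the "genome size", which is an assumption (the true genome size may differ). The contig lengths in the first example are invented; the second uses the single-contig value from the card.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            # length is the shortest contig needed, count is how many were needed
            return length, count
    raise ValueError("no contigs given")

print(n50_l50([80, 70, 50, 40, 30, 20, 10]))
print(n50_l50([4641652]))
```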

Exercise: What's a pitfall when trying to translate a DNA sequence ?

using the wrong codon table

Exercise: What's the MEME suite ?

A software suite for motif:
- discovery
- enrichment
- scanning

Exercise: What's the GC-content and how is it calculated ?

(counts of G + Counts of C) / seq length
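
The formula as a small function. The name `gc_content` is made up for this example, and ambiguous bases such as N are simply counted in the sequence length.

```python
def gc_content(seq):
    """(count of G + count of C) / sequence length."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("GGCCAATT"))
```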

Exercise: Does this seq have an ORF on the opposite strand ?

5´-TCAGCGTTTCAT-3´

Yes
Opposite strand : 3´-AGTCGCAAAGTA-5´
5´-ATGAAACGCTGA-3´ ATG … TGA
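
The same check can be done programmatically. This sketch assumes the standard stop codons (TAA, TAG, TGA) and treats an ORF as an ATG followed in frame by a stop codon; the function names are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Opposite strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_orfs(seq):
    """Return every ATG...stop stretch read in the frame of its ATG."""
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                orfs.append(seq[start:pos + 3])
                break
    return orfs

forward = "TCAGCGTTTCAT"
print(find_orfs(forward))                       # ORFs on the given strand
print(find_orfs(reverse_complement(forward)))   # ORFs on the opposite strand
```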

Exercise: What's GeneMark ?

family of gene prediction programs

Exercise: What’s the Codon Adaptation Index ?

Used to estimate codon usage bias quantitatively
For each AA:
- for each codon the ratio between its frequency and that of the most frequent codon is computed
=> most abundant codons have a relative adaptiveness of 1
CAI is the geometric mean over the weights associated with each codon
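
A minimal sketch of the two steps described above. The synonymous-codon table is passed in by the caller; the tiny two-amino-acid table and the reference counts below are invented for illustration. Codons with a weight of 0 are skipped here, which is a simplification (implementations often substitute a small value instead).

```python
import math
from collections import Counter

def relative_adaptiveness(reference_codons, synonymous_groups):
    """Weight of each codon = its count / count of the most frequent synonymous codon."""
    counts = Counter(reference_codons)
    weights = {}
    for codons in synonymous_groups.values():
        top = max(counts[c] for c in codons)
        if top > 0:
            for c in codons:
                weights[c] = counts[c] / top
    return weights

def cai(gene_codons, weights):
    """Geometric mean of the weights of the gene's codons (zero weights skipped)."""
    logs = [math.log(weights[c]) for c in gene_codons if weights.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs))

# Hypothetical reference set of highly expressed genes, reduced to Lys and Glu codons
groups = {"K": ["AAA", "AAG"], "E": ["GAA", "GAG"]}
reference = ["AAA"] * 3 + ["AAG"] + ["GAA"] * 2 + ["GAG"] * 2
w = relative_adaptiveness(reference, groups)
print(w)
print(cai(["AAA", "GAA"], w), cai(["AAG", "GAA"], w))
```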

Exercise: What happens if GeneMark is run twice on the same sequence but with different species ?

The gene locations stay the same
Classes change

Exercise: What is ASPic

A database for predicted alternative splicing isoforms

Exercise: What does the color denote in the graphical output of RNAfold ?

base-pairing probabilities

Exercise: What is breadth and depth in the context of sequencing coverage ?

breadth: How much of the reference genome was seen during sequencing
depth: How often was each Nuc sequenced
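
Both quantities can be derived from a per-position depth array. This sketch assumes reads are already aligned and given as 0-based (start, end) intervals with an exclusive end; the function name and the example reads are invented.

```python
def coverage_stats(reference_length, reads):
    """Return (breadth, mean depth) for aligned reads on a reference."""
    depth = [0] * reference_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    breadth = sum(1 for d in depth if d > 0) / reference_length  # fraction of positions seen
    mean_depth = sum(depth) / reference_length                   # average times each Nuc was read
    return breadth, mean_depth

print(coverage_stats(10, [(0, 5), (3, 8)]))
```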

Explain Clone by Clone sequencing

Genome is broken into chunks of 150000 bp
Chunks locations are mapped to genome
Chunks are inserted into BACs and grown in bacterial cells
Shotgun sequence amplified BACs

Which 2 simple repeats complicate genome assembly ?

tandemly repeated DNA
genome wide repeats

Name some genetic parts that are encoded in a genome

genome sequence variations
protein-coding genes
RNA-coding genes
pseudo genes
promoters / terminators
regulatory elements e.g binding motifs

Name some factors that influence TF binding

stochastic (are DNA and TF close by chance)
affinity = structural / seq match
complete match / high affinity is not always desirable

What's the PROSITE DB ?

Collection of known motifs from protein sequences

What's the log-odds score and why is it needed ?

The probability of a seq x decreases with length
gets infinitely small
Log-odds factor in the background distribution and compare the seq to a random seq as background
Using logs allows for summation
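
A minimal sketch of a log-odds score for a sequence under a position-specific motif model versus a background model. The two-position motif and the uniform background are invented for illustration; all probabilities must be greater than 0.

```python
import math

def log_odds_score(seq, pwm, background):
    """Sum over positions of log2(P(base | motif position) / P(base | background))."""
    return sum(math.log2(pwm[i][base] / background[base]) for i, base in enumerate(seq))

# Hypothetical motif of length 2 that prefers "AC"
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(log_odds_score("AC", pwm, background))   # positive: more motif-like than background
print(log_odds_score("TT", pwm, background))   # negative: less motif-like than background
```

Because the per-position terms are added rather than multiplied, the score stays in a manageable numeric range even for long sequences.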

What are some problems for motif discovery ?

The sequence is not known
start is not known
motifs differ from occurrence to occurrence
how do I discern them from random motifs

What's a consensus string ?

A seq of nucleotides that occur most often at their given position
hard to evaluate how good consensus really is

Explain the gibbs sampler

Gibbs sampler is a stochastic implementation of the EM algorithm

Sketch a rough gene prediction flowchart

Obtain new genomic DNA
Translate in all 6 reading frames and compare to protein seq db
Perform EST db search
Use gene prediction program
Analyze regulatory sequence of the gene

What are the 3 stop codons ?

Name the 3 different RNA polymerases

RNA polymerase 1 = rRNA
RNA polymerase 2 = mRNA
RNA polymerase 3 = tRNA

How many FPs would one get from simply predicting splice sites with AG GT and what can be done to improve this ?

30-100 FPs for every true one
include surrounding sequence into prediction

Give an approach for exon prediction that has 100% sensitivity

report everything that is flanked by AG and GT
The specificity is 0

What's TargetScan and what does it do ?

Thermodynamics-based modelling approach for RNA:RNA duplex interactions
Input:
- miRNAs from multiple organisms
- orthologous 3’ UTR sequences

How does miRNA regulate gene expression and which is more common ?

translational repression
- RISC with miRNA binds to mRNA and prevents translation
mRNA cleavage
- mRNA is cleaved by RISC
translational repression is more common

Name some MirScan ranking criteria

base pairing probability sum up to 21 nt
base pairing probability sum from 21 nt
5’ conservation
3’ conservation
bulge symmetry

How many miRNAs does a Human have ?

2300

What's the difference between interspersed repeats and tandemly repeated DNA ?

tandemly repeated DNA consists of repeats grouped together sequentially
interspersed repeats are spread across chromosomes and the entire seq

What's the difference between retroelements and DNA transposons ?

Retroelements reproduce by reverse transcription followed by integration into DNA
DNA transposons are capable of integrating and excising themselves ( cut-and-paste)

Name the 3 different mechanisms for transposition

Conservative transposition (no copy left behind)
Replicative transposition ( copy left behind )
Retrotransposition (RNA intermediary, copy left behind)

Sketch the classification of repeated sequences

What is REPuter and what are its pros and cons ?

Program to determine exact repetitive substrings in complete genomes
- Idea: exact matches are the core of approximate repeats
Pro: short running time O(n)
Con: only exact matches

What's the idea behind RepeatFinder ?

given all exact repeats
define repeat class by merging and extending them

Name some different types of NGS

Seq by synthesis (Illumina, Roche)
- Illumina: shine a laser on the labeled Nuc
- Roche: pyrophosphate release produces light
Ligase-mediated sequencing (Applied Biosystems)
Single molecule real-time sequencing (PacBio, Oxford Nanopore)
- no library amplification

Why are sequenced reads trimmed and what strategies exist ?

Read quality deteriorates towards the end of the read
Fixed length cut-off
Adaptive trimming (quality score cutoff)
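
The simplest form of adaptive trimming can be sketched as follows: bases are removed from the 3' end for as long as their quality score is below a cutoff. The function name, the cutoff of 20 and the example scores are assumptions; real trimming tools typically use sliding windows or running sums instead.

```python
def trim_3prime(seq, quals, min_quality=20):
    """Drop trailing bases whose Phred quality is below min_quality."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]

# Hypothetical read with decaying quality towards the 3' end
print(trim_3prime("ACGTAC", [30, 32, 31, 25, 12, 8]))
```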

What are barcodes for ?

To distinguish different samples when multiplexing

What are common NGS errors ?

Read errors
Base calling errors
small insertions / deletions

Can DNA seq influence Methylation ?

YES, DNA motifs are involved in regulating DNA methylation

Which program is used to evaluate the quality of genome assemblies ?

BUSCO

What's a pan-genome ?

no single reference genome
A set of reference genomes is a pan-genome

Name two minor pseudogene classes

Unitary pseudogenes (deactivated gene, no functional copy)
Polymorphic pseudogenes (exhibit variation within the population)

Information

Last changed
a year ago
