How much of the human genome is transcribed and how much is coding
70% transcribed
2% coding
Does the number of ORFs / genes correlate with genome size ?
Yes
linear correlation
What's the average gene density of humans and where do we stand compared to other organisms ?
Human: 11 genes per Mb (10⁶ bases) / Fly: 76 / Yeast: 496
How much of the human genome is taken up by repeats ?
44%
Name 2 methods for whole genome sequencing, their pros and cons and how they work
Clone by clone:
easy assembly
more work
Whole genome shotgun:
complicated assembly (tandemly repeated DNA, genome-wide repeats)
little preparation needed
Does the organism complexity correlate with the gene number ?
No
Name some functions of recurring DNA patterns in humans
ribosome binding sites
TF binding sites
nuclease binding sites
Give reasons why motif finding in eukaryotes is significantly harder than in prokaryotes
Prokaryotes have few TFs
Prokaryotes have operons
Eukaryotes have more TFs per gene
Eukaryotes have shorter motifs
Eukaryotes can have long range effects due to DNA folding
Eukaryotes have a non-exact matching mechanism
What does the height of a letter in a sequence logo encode ?
Why does there need to be a small sample correction ?
What else does a sequence logo need to be corrected for ?
The height of the letter stack shows the information content at that position (range [0,2] bits); within the stack each letter is scaled by its frequency
If all nucleotides are equally represented, the information is 0
With small samples (especially if the sample number is not divisible by 4) an information content of 0 cannot be observed even though it might be true.
Pseudocounts try to address this by adding 1 for each base in all cases
Different background probabilities
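A minimal sketch of the per-position information content described in this card, assuming a uniform background of 0.25 per nucleotide. The function name `column_information` and its `pseudocount` parameter are illustrative and not taken from any particular tool.

```python
import math
from collections import Counter

def column_information(column, pseudocount=0.0, alphabet="ACGT"):
    """Information content (bits) of one alignment column, uniform background assumed."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    info = math.log2(len(alphabet))  # maximum: 2 bits for DNA
    for base in alphabet:
        p = (counts[base] + pseudocount) / total
        if p > 0:
            info += p * math.log2(p)  # subtract the observed entropy
    return info

print(column_information("AAAA"))     # fully conserved column
print(column_information("ACGT"))     # all nucleotides equally represented
print(column_information("AAAA", 1))  # same column with a pseudocount of 1 per base
```

With three sequences, for example, no column can contain all four bases equally often, which is why small samples overestimate the information content.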
Why is regex unsuited for motif discovery ?
Regex finds all the patterns, but has no way of distinguishing good from bad hits
What are the components of a hidden Markov model and how is it decoded ?
Components:
emission probabilities
state transition probabilities
states
An HMM is decoded using dynamic programming (the Viterbi algorithm), which finds the most probable state path
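The decoding step can be sketched as follows. This is a log-space Viterbi implementation for a non-empty observation sequence; the two-state GC-rich / AT-rich model and all of its probabilities are invented for illustration only.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty observation sequence."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # scores[s] = best log-probability of any path ending in state s
    scores = {s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}
    backpointers = []
    for symbol in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: scores[r] + log(trans_p[r][s]))
            new_scores[s] = scores[prev] + log(trans_p[prev][s]) + log(emit_p[s][symbol])
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: H = GC-rich state, L = AT-rich state
states = ("H", "L")
start_p = {"H": 0.5, "L": 0.5}
trans_p = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit_p = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
print("".join(viterbi("GGCCAATT", states, start_p, trans_p, emit_p)))
```

Working in log space avoids numerical underflow on long sequences.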
How does the expectation maximization algorithm work ?
Make an initial guess about the motif
Estimate the probability of being the motif at each position in each sequence
Compute new probabilities based on the probable hits (factoring in hit probability and nucleotide frequency)
Update the original motif model
Initialization is key; Gibbs sampling tries to use statistical methods to improve EM
What is the basic workflow for predicting genes on new prokaryotic DNA ?
Translate the DNA in all 6 reading frames
Use a gene prediction program
Analyze regulatory sequences
What's the job of the ribosome ?
How does mRNA find the Ribosome ?
Consists of proteins and RNA
Translates mRNA into proteins
The ribosomal binding site is also transcribed from DNA into mRNA
Why is gene finding in Prokaryotes relatively easy ?
They have no introns (uninterrupted ORFs)
They have short intergenic regions
They have operons (multiple genes with the same promoter and regulatory sequence)
They have dense genomes
Give 3 different approaches to Gene prediction
Content based (uses bulk properties of the sequence)
e.g. ORFs
Codon usage
Repeat periodicity (fewer repeats in genes)
Site based
Identification of sites that are typical for genes
Donor / acceptor sites
ATG starts
Comparative
Compare the sequence with other genomes and see whether an annotation already exists
What can be said about the length of genes and how can this property be of use for prediction ?
Genes are usually long ORFs
By chance we would expect a stop codon every ~21 codons (64/3)
Short putative genes are probably false positives
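The 64/3 figure can be checked with a few lines of arithmetic. The sketch assumes all 64 codons are equally likely, which real genomes only approximate; the function names are illustrative.

```python
STOP_CODONS = 3
ALL_CODONS = 64

def expected_codons_between_stops():
    """Mean spacing of stop codons in random sequence (uniform codons assumed)."""
    return ALL_CODONS / STOP_CODONS

def prob_open_by_chance(n_codons):
    """Probability that n consecutive random codons contain no stop codon."""
    return ((ALL_CODONS - STOP_CODONS) / ALL_CODONS) ** n_codons

print(round(expected_codons_between_stops(), 1))
print(prob_open_by_chance(100))
```

Under this assumption a random stretch of 100 codons stays open less than 1% of the time, which is why long ORFs are good gene candidates.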
What info does an intrinsic approach use for gene prediction ?
Give one example
Information that is contained in the sequence itself
Unequal AA usage
Codon preferences (different from organism to organism)
Unequal codon usage
HMMs model the grammar of genes
e.g. every third nucleotide tends to be the same (an AA is mainly determined by the first 2 nucleotides of its codon)
Asymmetry scores are converted to probabilities
What's the issue with training gene prediction on bulk data ?
Models are insensitive
class specific models increase prediction performance
Which tool uses extrinsic and intrinsic info for prokaryotic gene prediction ?
How does it work ?
ORPHEUS
DPS local alignments (DPS = DNA protein search)
Codon usage analysis (extend local alignments to start/stop codons)
RBS weight matrix (align -20 to -1 regions and identify ribosome binding sites)
Find the best gene start
What's the Shine-Dalgarno sequence ?
The ribosomal binding site
Purine rich region at 5’ end
AGGAG
5-10 bp upstream
What are some prediction tasks on a eukaryotic genome ?
promoter prediction
splice site prediction
gene prediction
What's the codon bias ?
One AA is primarily encoded by one codon even though there are multiple others.
Do introns split after a codon or not ?
They split wherever they want (also mid codon)
Name the 3 most important sites of an intron and describe the splicing process
5’-splice site <exon>GT……<intron>
branch point <intron>……A…….<intron>
3’ splice site <intron>…AG<exon>
The 5’ splice site is cleaved and the free intron end joins the branch point A, forming a loop (lariat).
The 3’ splice site is cleaved and the two exons are joined
Where does the splicing happen ?
spliceosome
Describe how GENSCAN works
Designed to predict complete gene structures
based on a general probabilistic model of genomic sequence composition
based on a GHMM (generalized HMM)
models both strands simultaneously
States (each emitting a whole sequence feature) are e.g.:
N: intergenic region
P: promoter
E: exon
A: poly-A signal
What's an EST ?
Expressed sequence tag
expressed RNA that has been reverse-transcribed into cDNA and added to a library
What's the exon chaining problem ?
Given a set of putative exons, find the maximum set of non-overlapping exons
Dynamic programming in O(n)
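A sketch of exon chaining by dynamic programming. It assumes each putative exon is a `(start, end, score)` tuple with inclusive coordinates; the scores are an illustrative addition, and setting every score to 1 maximizes the number of exons. Because this version sorts the input first, it runs in O(n log n) rather than the O(n) quoted for pre-sorted endpoints.

```python
from bisect import bisect_left

def chain_exons(exons):
    """Return (best_total_score, chain) of non-overlapping exons."""
    exons = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in exons]
    best = [0.0]   # best[i] = best score using only the first i exons
    choice = []    # choice[i] = prefix index used when exon i is taken, else None
    for i, (start, end, score) in enumerate(exons):
        j = bisect_left(ends, start, 0, i)  # exons ending strictly before start
        take, skip = best[j] + score, best[i]
        if take > skip:
            best.append(take)
            choice.append(j)
        else:
            best.append(skip)
            choice.append(None)

    # Trace back the chosen exons
    chain, i = [], len(exons)
    while i > 0:
        j = choice[i - 1]
        if j is None:
            i -= 1
        else:
            chain.append(exons[i - 1])
            i = j
    return best[-1], chain[::-1]

candidates = [(1, 5, 5), (3, 8, 6), (6, 10, 4), (11, 15, 3)]
print(chain_exons(candidates))
```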
What are some similarity based approaches for gene finding ?
Comparison to EST databases (are my ESTs already in a database)
Comparison of the translated sequence to known protein sequences
Comparison to homologous annotated sequences of related organisms
What's a CIS / TRANS alignment ?
CIS: aligning cDNA to its source genome
TRANS: aligning cDNA to a homologous genome
When doing dual-genome de novo gene prediction, what's the most important decision to make ?
The choice of the informant genome
Too similar (no new information gain)
Too distant (no similarities to infer from)
Ideal sequence identity for the informant: 55% match
Name some metrics to measure gene prediction success
Sensitivity (how many of all true positives did I get: TP / (TP + FN))
Specificity (what percentage of all predicted positives are TP: TP / (TP + FP))
Missed genes (how many genes didn’t I find)
Wrong genes (where did I predict a gene although there wasn’t any)
Joined genes (did I join 2 genes that are not connected)
Split genes (did I fail to join 2 exons into one gene)
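A small sketch of the first two metrics at nucleotide level, using the definitions from this card (note that "specificity" here is TP / (TP + FP), as is common in the gene prediction literature). Representing annotations as collections of coding positions is an assumption made for illustration.

```python
def nucleotide_accuracy(true_positions, predicted_positions):
    """Return (sensitivity, specificity) with specificity defined as TP / (TP + FP)."""
    truth, predicted = set(true_positions), set(predicted_positions)
    tp = len(truth & predicted)
    fn = len(truth - predicted)
    fp = len(predicted - truth)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity

# True gene at 100-199, prediction at 150-249
print(nucleotide_accuracy(range(100, 200), range(150, 250)))
```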
What are pseudogenes and what are the 2 types ?
Nonfunctional genes, derived from functional genes
have degenerative features that prevent expression
differ from their paralogs at crucial points
Types:
conventional
processed (at least 8000)
How are conventional pseudo genes created ?
Gene duplication
One copy is released from evolutionary pressure
Acquires lots of mutations
Outcome is either a new gene with similar function
or a conventional non-functional pseudogene
How are processed pseudogenes created ?
Reinsertion of mRNA into DNA
Has no:
introns
regulatory signal sequences (TF binding sites, promoters)
Why do pseudo genes need to be analysed ?
high similarity to functional genes can interfere with:
PCR
in situ hybridization
Provide a molecular record of genome evolution
processed / retrotransposed pseudo genes can be used to verify exon struct predictions
How can alternative splicing be seen experimentally ?
RT-PCR using primers that flank the AS region
run the product through a gel and check its length
Microarrays
probes are exon-exon junctions
depending on which exon-exon junctions give a signal, the splice product can be inferred
What has the analysis of AS products on protein structure revealed ?
half of all splicing events affect variable regions or entire domains
the other half is non-trivial
affects conserved and structured regions
=> nonsense products
=> new functions
Name 3 classification types for processed pseudogenes
True:
high sequence similarity with a Swiss-Prot / TrEMBL sequence
Putative:
young pseudogenes without frame disruptions
Disrupted
What is the Ka / Ks ratio ?
Ka: rate of non-synonymous substitution
Ks: rate of synonymous substitution
Ka/Ks << 1: “purifying selection”
Ka/Ks >> 1: “diversify protein product” (positive selection), e.g. immune system genes
Processed pseudogenes: Ka/Ks ~1
What is positive / negative splicing control ?
positive : binding of an activator enables splicing
default no splicing
negative: binding of a repressor inhibits splicing
default splicing
Name 2 splicing categories and what they describe
constitutive:
more than one splicing product is always made from a transcribed gene
regulated:
different products are generated at different times, under different conditions
Is every splice site consistently used ?
No, some splice sites are only used some of the time
Name some alternative splicing events and their frequency
Retained intron 3%
Competing 5’-splice sites 18%
e.g. an exon has two 5’ splice sites, only one is used ==> different exon length
Competing 3’-splice sites 8%
Exon skipping 38%
Mutually exclusive exons
What does computational splice site identification rely on ?
EST libraries
better library (coverage) => better prediction
What's a measure for dissimilarity between mRNA isoforms ?
Splice junction difference (SJD)
Number of splice junctions that are not on both isoforms / number of all junctions
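A sketch of the SJD ratio as worded on this card. It assumes each isoform is given as a collection of `(donor, acceptor)` coordinate pairs and that "all junctions" means the distinct junctions across both isoforms; the published definition may normalize differently.

```python
def splice_junction_difference(isoform_a, isoform_b):
    """Fraction of distinct splice junctions that are not shared by both isoforms."""
    a, b = set(isoform_a), set(isoform_b)
    all_junctions = a | b
    if not all_junctions:
        return 0.0
    return len(a ^ b) / len(all_junctions)  # symmetric difference over union

iso1 = [(100, 200), (300, 400)]
iso2 = [(100, 200), (300, 450)]  # alternative 3' splice site on the second intron
print(splice_junction_difference(iso1, iso2))
```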
What's the relationship between alternative splicing and evolutionary conservation ?
splicing is evolutionarily conserved
exons that are not part of constitutive splice forms are mostly not conserved
Name the 3 categories of DNA sequence variations
SNPs (single nucleotide polymorphisms)
Simple tandem repeat polymorphisms
Insertions / Deletions
Do SNPs occur equally often across the whole genome ?
No, they occur less often in CDS
But they can occur anywhere:
coding regions
regulatory regions
intergenic regions
What's the primary discovery method for SNPs ?
sequencing and subsequent alignment to a reference genome
What are synonymous and non-synonymous changes ?
Synonymous changes do not change the AA that is encoded by the codon
Non-synonymous changes change the AA that is encoded by the codon
Why is a SNP effect prediction model very useful ?
experimental characterization of SNPs is:
expensive
time consuming
difficult
Name 2 broad approaches to predict SNP effects
homology based (mutations at conserved positions are likely to have a negative effect)
structure based (does the mutation meaningfully change the structure)
Name some possible disruption to the functionality of Proteins that SNPs can cause
Size change in hydrophobic core ( new AA has larger side chain)
Introduction of buried charged residues
protein-protein interaction
Interference with DNA binding
Mutation of catalytic residues
What does SIFT do and why is it fast ?
SIFT = Sorting Intolerant From Tolerant
SIFT runs a BLAST search for a given sequence
Derives tolerated / deleterious substitutions for every position
Substitutions with normalized probabilities below a cutoff are declared deleterious
Fast because it uses homology rather than structure
Name some types of non coding RNA
miRNA : translational regulation
siRNA : RNA interference
tRNA : transfer RNA
Why is a tRNA necessary ?
Codons don’t recognize the AA they encode
They recognize the anti codon of a tRNA which brings the AA
Name the main areas of a tRNA and highlight which are most essential
Crucial unpaired regions:
3’ end - AA binding
Anticodon loop - contains the anticodon for matching
Other features:
D loop
T loop
Is there one tRNA for each AA or are there any other mechanisms?
One AA can have multiple tRNAs
Some tRNAs can base pair with more than one codon
Some tRNAs only require accurate matching at the first 2 NUCs of a codon and can wobble on the third
Give the basic RNA base pairings
A-U
G-C
G-U (wobble pairing, used to form loop in splicing)
Name some secondary structures of RNA
single stranded
double stranded
hairpin loop
bulge loop
Which secondary RNA structure is the hardest to predict
Pseudo knots
Why do primary sequence based methods not quite work for RNA ?
Evolutionary pressure is on base pairings, not on sequence
Methods need to capture primary and secondary information
Give the 2 approaches used to predict RNA secondary structure
Energy minimization
pro: accommodates experimental and alignment data
con: no tertiary structure, computationally intensive
use patterns of covariation
pro: simple
con: database has to be large enough
Give an RNA secondary structure prediction algorithm that maximizes base pairings
Dynamic programming approach (Nussinov algorithm)
memory intensive
might not create the most energetically favorable structure
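A compact sketch of base-pair maximization by dynamic programming in the style of the Nussinov algorithm. The `min_loop` parameter (minimum number of unpaired bases in a hairpin) is an assumption added for realism, and only the pair count is returned, not the structure itself.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (A-U, G-C, G-U) in an RNA sequence."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]  # O(n^2) memory, O(n^3) time
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])        # i or j stays unpaired
            if (seq[i], seq[j]) in pairs:
                best = max(best, dp[i + 1][j - 1] + 1)    # i pairs with j
            for k in range(i + 1, j):                     # split into two substructures
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))
```

The full n × n table is what makes the approach memory intensive, and counting pairs ignores stacking energies, so the result need not be the most stable fold.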
What are miRNAs ?
family of small RNAs, 21-25 nt long
alter the expression of genes in a seq-dependent manner
How are miRNAs created in a cellular environment ?
pri-miRNA is processed at the Drosha/Pasha complex
pre-miRNA is exported from nucleus to cytoplasm (Exportin 5)
pre-miRNA is cleaved by DICER
one strand of the diced miRNA duplex is incorporated into RISC
What's MiRscan and how does it work ?
Program for identifying hairpin loops in C. elegans
How it works:
Slide a 110-nt window along the genome (discard obvious non-miRNA)
compute the secondary structure for the window with RNAfold
Identify potential hairpin loop candidates
Use scoring criteria on candidates to filter
What's multiplicity and cooperativity in miRNAs ?
multiplicity:
one miRNA can target multiple genes
cooperativity:
one gene can be regulated by many miRNAs
Name some basic types of repeats in the human genome
tandemly repeated DNA
LINE
SINE
What are micro- and minisatellites ?
Microsatellites: repeats with a repeating unit of 1-12 bp
Can be formed by replication slippage
Minisatellites: 12-500 bp
What are satellite repeats and how do they differ from micro-/minisatellites ?
confined to a well-defined region (e.g. telomeric region)
Can span millions of bp and are species specific
Why should repeats be studied ?
repeats are believed to play a significant role in genome evolution and disease
Mobile elements may contain coding regions that are hard to distinguish from other types of genes
repeats induce local alignments complicating assembly
Describe how the k-mer approach for de novo repeat finding works
(No explanation was given in the source material)
Comments:
Anton - hahaha, isn't k-mer simply the search for motifs of length k with n mismatches
Benji - yes, but what kind of simple approach is that supposed to be
Anton - I would have said: just walk over the sequence and see which substrings of length k are overrepresented
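Since the card has no official answer, the following is only a sketch of the idea guessed in the comments: count every substring of length k and report those that occur unusually often. Exact matching and the `min_count` threshold are simplifying assumptions; real de novo repeat finders are considerably more involved.

```python
from collections import Counter

def overrepresented_kmers(sequence, k, min_count=2):
    """Return all exact k-mers that occur at least min_count times."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

print(overrepresented_kmers("ACGTACGTTT", 4))
```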
What is RepeatMasker
program that uses precompiled libraries to find known repeats
indispensable in genomes with analyzed repeat families
new genome => new library
What's the difference between class I and class II transposable elements ?
Class I:
first transcribed from DNA into RNA
then reverse transcribed into DNA by a TE-encoded RT
e.g. LTR elements, LINEs
Class II:
Move by a conservative cut-and-paste mechanism
Exercise: What is meant by the N50 and L50 numbers in NCBI summaries ?
N50:
With contigs sorted from longest to shortest, N50 is the length of the shortest contig that needs to be included to cover 50% of the genome.
e.g. 4,641,652
L50:
L50 is the smallest number of contigs whose summed length makes up half of the genome size.
e.g. 1
Both are only informative if there are multiple contigs
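Both numbers can be computed in one pass over the contigs sorted from longest to shortest. This sketch uses the total assembly length as a stand-in for the genome size, which is an assumption.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a non-empty list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig is required")
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count  # N50 = this contig's length, L50 = contigs used

print(n50_l50([80, 70, 50, 40, 30, 20]))
print(n50_l50([4641652]))  # a single complete contig, as in the example on the card
```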
Exercise: What's a pitfall when trying to translate a DNA sequence ?
using the wrong codon table
Exercise: What's the MEME suite ?
A software suite for motif:
discovery
enrichment
scanning
Exercise: What's the GC-content and how is it calculated ?
(count of G + count of C) / sequence length
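The formula translates directly into code. Returning 0.0 for an empty sequence is a convention chosen here, not part of the definition.

```python
def gc_content(seq):
    """Fraction of G and C nucleotides in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_content("ATGC"))
print(gc_content("TCAGCGTTTCAT"))
```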
Exercise: Does this sequence have an ORF on the opposite strand ?
5´-TCAGCGTTTCAT-3´
Opposite strand : 3´-AGTCGCAAAGTA-5´
5´-ATGAAACGCTGA-3´ ATG … TGA
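The same check can be scripted. The sketch below builds the reverse complement and looks for an ATG followed by an in-frame stop codon; `has_simple_orf` is a hypothetical helper with no minimum length requirement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def has_simple_orf(seq):
    """True if some ATG is followed by an in-frame stop codon."""
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                return True
    return False

forward = "TCAGCGTTTCAT"
print(reverse_complement(forward))
print(has_simple_orf(forward), has_simple_orf(reverse_complement(forward)))
```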
Exercise: What's GeneMark ?
family of gene prediction programs
Exercise: What’s the Codon Adaptation Index (CAI) ?
Used to estimate codon usage bias quantitatively
For each AA:
for each codon, the ratio between its frequency and that of the most frequent codon is computed
=> most abundant codons have a relative adaptiveness of 1
CAI is the geometric mean over the weights associated with each codon
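A minimal sketch of the two steps on this card. The tiny codon table and the reference counts are invented for illustration, and skipping codons without a positive weight is a simplifying assumption.

```python
import math

def relative_adaptiveness(codon_counts, codon_to_aa):
    """Weight of each codon = its count / count of the most frequent synonymous codon."""
    best = {}
    for codon, count in codon_counts.items():
        aa = codon_to_aa[codon]
        best[aa] = max(best.get(aa, 0), count)
    return {codon: count / best[codon_to_aa[codon]]
            for codon, count in codon_counts.items()
            if best[codon_to_aa[codon]] > 0}

def cai(codons, weights):
    """Geometric mean of the weights of the codons in a gene."""
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0

# Hypothetical reference set with two amino acids only
codon_to_aa = {"AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E"}
reference_counts = {"AAA": 80, "AAG": 20, "GAA": 50, "GAG": 50}
weights = relative_adaptiveness(reference_counts, codon_to_aa)
print(weights)
print(cai(["AAA", "AAG"], weights))
```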
Exercise: What happens if GeneMark is run twice on the same sequence but with different species ?
The gene locations stay the same
Classes change
Exercise: What is ASPic
A database of predicted alternative splicing isoforms
Exercise: What does the color denote in the graphical output of RNAfold ?
base-pairing probabilities
Exercise: What is breadth and depth in the context of sequencing coverage ?
breadth: how much of the reference genome was seen during sequencing
depth: how often each nucleotide was sequenced
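Given a per-base depth vector (for example, parsed from a depth report of an alignment tool), both quantities are one-liners. The input format is an assumption made for illustration.

```python
def coverage_stats(per_base_depth):
    """Return (breadth, mean_depth) for a list of per-position read depths."""
    n = len(per_base_depth)
    if n == 0:
        return 0.0, 0.0
    breadth = sum(1 for d in per_base_depth if d > 0) / n  # fraction of positions seen
    mean_depth = sum(per_base_depth) / n                   # average reads per position
    return breadth, mean_depth

print(coverage_stats([0, 0, 5, 10, 5]))
```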
Explain Clone by Clone sequencing
Genome is broken into chunks of 150,000 bp
Chunk locations are mapped to the genome
Chunks are inserted into BACs and grown in bacterial cells
The amplified BACs are shotgun sequenced
Which 2 simple repeats complicate genome assembly ?
tandemly repeated DNA
genome wide repeats
Name some genetic parts that are encoded in a genome
genome sequence variations
protein-coding genes
RNA-coding genes
pseudo genes
promoters / terminators
regulatory elements e.g binding motifs
Name some factors that influence TF binding
stochastic (are DNA and TF close by chance)
affinity = structural / sequence match
complete match / high affinity is not always desirable
What's the PROSITE DB ?
collection of known motifs (patterns and profiles) from protein sequences
What's the log-odds score and why is it needed ?
The probability of a sequence x decreases with length
it becomes vanishingly small
Log-odds factor in background distributions and compare the sequence to a random sequence as background
Using logs allows for summation
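A sketch of log-odds scoring against a background model. The two-position motif matrix and the uniform background are made-up numbers, and the matrix is assumed to contain no zero probabilities (pseudocounts would normally ensure that).

```python
import math

def log_odds_score(seq, motif_probs, background):
    """Sum over positions of log2(P(base | motif position) / P(base | background))."""
    return sum(math.log2(motif_probs[i][base] / background[base])
               for i, base in enumerate(seq))

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
motif_probs = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.125, "C": 0.125, "G": 0.5, "T": 0.25},
]
for candidate in ("AG", "CT", "GA"):
    print(candidate, log_odds_score(candidate, motif_probs, background))
```

A positive score means the sequence is more likely under the motif model than under the background, a negative score the opposite, independent of sequence length.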
What are some problems for motif discovery ?
The sequence is not known
start is not known
motifs differ from occurrence to occurrence
how do I discern them from random motifs
What's a consensus string ?
A sequence of the nucleotides that occur most often at each position
hard to evaluate how good the consensus really is
Explain the gibbs sampler
Gibbs sampler is a stochastic implementation of the EM algorithm
Sketch a rough gene prediction flowchart
Obtain new genomic DNA
Translate into all 6 ORFs and compare to protein seq db
Perform EST db search
Use a gene prediction program
Analyze regulatory sequence of the gene
What are the 3 stop codons ?
UAA
UAG
UGA
Name the 3 different RNA polymerases
RNA polymerase 1 = rRNA
RNA polymerase 2 = mRNA
RNA polymerase 3 = tRNA
How many FPs would one get from simply predicting splice sites with AG GT and what can be done to improve this ?
30-100 FPs for every true one
include surrounding sequence into prediction
Give an approach for exon prediction that has 100% sensitivity
report everything that is flanked by AG and GT
The specificity is 0
What's TargetScan and what does it do ?
Thermodynamics-based modelling approach for RNA:RNA duplex interactions
Input:
miRNAs from multiple organisms
orthologous 3’ UTR sequences
How does miRNA regulate gene expression and which is more common ?
translational repression
RISC with miRNA binds to mRNA and prevents translation
mRNA cleavage
mRNA is cleaved by RISC
translational repression is more common
Name some MiRscan ranking criteria
base pairing probability sum up to 21 nt
base pairing probability sum from 21 nt on
5’ conservation
3’ conservation
bulge symmetry
How many miRNAs does a Human have ?
2300
What's the difference between interspersed repeats and tandemly repeated DNA ?
tandemly repeated DNA: repeats that are grouped together sequentially
interspersed repeats are spread across chromosomes and the entire sequence
What's the difference between retroelements and DNA transposons ?
Retroelements reproduce by reverse transcription followed by integration into DNA
DNA transposons are capable of integrating and excising themselves (cut-and-paste)
Name the 3 different mechanisms for transposition
Conservative transposition (no copy left behind)
Replicative transposition (copy left behind)
Retrotransposition (RNA intermediary, copy left behind)
Sketch the classification of repeated sequences
What is REPuter and what are its pros and cons ?
Program to determine exact repetitive substrings in a complete genome
Idea: exact matches are the core of approximate repeats
Pro: short running time, O(n)
Con: only exact matches
What's the idea behind RepeatFinder ?
given all exact repeats
define repeat classes by merging and extending them
Name some different types of NGS
Sequencing by synthesis (Illumina, Roche)
Illumina: a laser excites the fluorescently labeled nucleotide
Roche: pyrophosphate release produces light
Ligase-mediated sequencing (Applied Biosystems)
Single molecule real-time sequencing (PacBio, Oxford Nanopore)
no library amplification
Why are sequenced reads cut (trimmed) and what strategies exist ?
read quality deteriorates towards the end of the read
Fixed length cut-off
Adaptive trimming (quality score cutoff)
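Both strategies can be sketched in a few lines. The quality threshold of 20 and the simple "trim from the 3' end until the first good base" rule are assumptions; real trimmers typically use sliding windows or running sums.

```python
def trim_fixed(read, qualities, length):
    """Fixed length cut-off: keep only the first `length` bases."""
    return read[:length], qualities[:length]

def trim_adaptive(read, qualities, min_quality=20):
    """Adaptive trimming: drop trailing bases whose quality is below the cutoff."""
    end = len(read)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return read[:end], qualities[:end]

read = "ACGTACGT"
quals = [30, 32, 31, 30, 25, 18, 12, 8]  # Phred scores dropping towards the 3' end
print(trim_fixed(read, quals, 6))
print(trim_adaptive(read, quals))
```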
What are barcodes for ?
To distinguish different samples when multiplexing
What are common NGS errors ?
Read errors
Base calling errors
small insertions / deletions
Can DNA seq influence Methylation ?
YES, DNA motifs are involved in regulating DNA methylation
Which program is used to evaluate the quality of genome assemblies ?
BUSCO
What's a pan-genome ?
no single reference genome
A set of reference genomes is a pan-genome
Name two minor pseudogene classes
Unitary pseudogenes (deactivated gene, no functional copy)
Polymorphic pseudogenes (exhibit variation within the population)