What are split genes?
-> genes whose coding sequence (exons) is interrupted by non-coding introns
What are the three gene finding strategies?
content-based: ORFs, codon usage, …
site-based: TF binding sites, polyadenylation signals
comparative: protein similarity (databases)
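The content-based strategy can be illustrated with a minimal ORF scan. This is only a sketch: the function name `find_orfs`, the `min_codons` threshold and the restriction to the forward strand are assumptions made for the example, not part of any particular gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return forward-strand ORFs as (start, end) pairs.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    min_codons counts the codons before the stop (ATG included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG in the current open stretch
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14)]
```

A real content-based predictor would also scan the reverse strand and score codon usage rather than only looking for open frames.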
Briefly describe the eukaryotic gene structure.
Describe the three different RNA polymerases of eukaryotes.
RNA-Pol I: transcribes genes encoding ribosomal RNA
RNA-Pol II: transcribes genes encoding mRNA (and certain small nuclear RNAs)
RNA-Pol III: transcribes genes encoding tRNAs and other small RNAs
Describe splicing of pre-mRNA.
steps:
cleavage at the 5′ splice site (SS)
5′ end of the intron is joined to an A within the intron (the branch point) -> lariat-like intermediate (loop)
cleavage at the 3′ splice site
ligation of the exons
-> mature mRNA contains only exons; the introns are cut out and released as a loop (lariat)
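The net effect of these steps on the sequence can be sketched in a few lines. The helper names `splice` and `is_canonical` and the toy sequence are assumptions for illustration; the GU...AG check reflects the well-known canonical intron boundaries, which the notes above do not state explicitly.

```python
def splice(pre_mrna, introns):
    """Remove introns given as 0-based, end-exclusive (start, end) intervals."""
    exons = []
    pos = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[pos:start])  # exon preceding this intron
        pos = end
    exons.append(pre_mrna[pos:])  # last exon
    return "".join(exons)

def is_canonical(pre_mrna, intron):
    """True if the intron starts with GU (5' SS) and ends with AG (3' SS)."""
    start, end = intron
    return pre_mrna[start:start + 2] == "GU" and pre_mrna[end - 2:end] == "AG"

# Toy pre-mRNA: exon 1, one intron, exon 2 (sequence is made up)
pre = "AUGGCC" + "GUAAGUCUAACAG" + "UUUUAA"
intron = (6, 19)
print(splice(pre, [intron]))      # AUGGCCUUUUAA
print(is_canonical(pre, intron))  # True
```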
What are some challenges in finding genes in eukaryotes?
require correct reading frame
introns can interrupt exon in mid-codon
no golden rule for identification of donor & acceptor sites (signals are very weak)
Define sensitivity and specificity.
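Sens = TP/(TP+FN) -> fraction of true features that are predicted; lowered by false negatives (FN)
Spec = TP/(TP+FP) -> fraction of predictions that are correct; lowered by false positives (FP). This is the gene-prediction convention; elsewhere the same quantity is called precision or positive predictive value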
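As a worked example, the two formulas can be evaluated at the nucleotide level by treating the annotated and predicted coding positions as sets. The function name `nucleotide_accuracy` and the toy coordinates are assumptions for this sketch.

```python
def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) for two sets of coding positions.

    Specificity here is TP/(TP+FP), as defined in the notes above.
    """
    tp = len(annotated & predicted)  # coding and predicted coding
    fn = len(annotated - predicted)  # coding but missed
    fp = len(predicted - annotated)  # predicted coding but not coding
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tp / (tp + fp) if tp + fp else 0.0
    return sens, spec

annotated = set(range(10, 20))   # 10 true coding positions
predicted = set(range(15, 35))   # 20 predicted positions, 5 of them correct
print(nucleotide_accuracy(annotated, predicted))  # (0.5, 0.25)
```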
Name two different classes of gene prediction methods.
“Isolated” methods -> predict individual features
e.g. splice sites (NetGene)
“Integrated” methods -> predict genes in context
”Grammar” of genes
Certain elements in specific order are required
HMMgene
GenScan
Sketch the GenScan algorithm.
-> both strands are modeled at the same time
N -> intergenic
P -> promoter (sensors for TATA box and cap signal)
F -> 5′ UTR
T -> 3′ UTR
A -> poly-A signal
E -> exons
sngl -> single exon
multiple:
init -> first exon
I 0-2 -> introns (one state per phase)
E 0-2 -> internal exons between introns (one state per phase)
term -> last exon
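The "grammar" behind this state list can be sketched as a table of allowed transitions. This is a simplified assumption, not the actual GenScan model: it covers the forward strand only, collapses the phase-specific intron and internal-exon states into single `I` and `E` states, and ignores state durations and probabilities. The name `is_valid_parse` is hypothetical.

```python
# Allowed forward-strand transitions (simplified; phases collapsed)
ALLOWED = {
    "N": {"P"},               # intergenic -> promoter
    "P": {"F"},               # promoter -> 5' UTR
    "F": {"Esngl", "Einit"},  # 5' UTR -> single exon or first exon
    "Esngl": {"T"},           # single-exon gene -> 3' UTR
    "Einit": {"I"},           # first exon -> intron
    "I": {"E", "Eterm"},      # intron -> internal exon or last exon
    "E": {"I"},               # internal exon -> intron
    "Eterm": {"T"},           # last exon -> 3' UTR
    "T": {"A"},               # 3' UTR -> poly-A signal
    "A": {"N"},               # poly-A signal -> intergenic
}

def is_valid_parse(states):
    """True if every consecutive pair of states is an allowed transition."""
    return all(b in ALLOWED.get(a, set()) for a, b in zip(states, states[1:]))

gene = ["N", "P", "F", "Einit", "I", "E", "I", "Eterm", "T", "A", "N"]
print(is_valid_parse(gene))                                # True
print(is_valid_parse(["N", "P", "F", "Einit", "Eterm"]))   # False
```

The second path is rejected because two exons must be separated by an intron, which is the kind of ordering constraint an integrated method enforces.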
What are isochores?
very long stretches of DNA
homogeneous in base composition
different families —> GC content (30-60%)
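Base composition along a sequence can be profiled with a simple windowed GC calculation. This is a minimal sketch: `gc_content` and `windowed_gc` are hypothetical helper names, and the tiny window is for illustration only, since windows used to look at isochores would be far larger.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window, step=None):
    """GC content of consecutive windows; non-overlapping by default."""
    step = step or window
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

print(windowed_gc("ATATGCGC", 4))  # [0.0, 1.0]
```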
Explain and sketch GeneMarkS.
parallel unsupervised training and prediction
based on GeneMark.hmm architecture:
non-homogeneous HMM -> coding regions
homogeneous HMM -> non-coding regions
coding capacity of sliding windows -> Bayesian decision rule
GeneMarkS architecture:
GeneMark.hmm —> “coding” / “non-coding”
update parameters
run GeneMark.hmm
repeat until convergence
What is the minimal genome size required to efficiently perform automatic training of GeneMarkS?
10 Mb (megabases)
Define the two types of alignment.
cis: align cDNA to best match locus in source genome
trans: align cDNA or protein seq to homologous locus outside of source genome
Which two classes of information are used in gene prediction?
intrinsic:
exon/intron length distribution
promoter and polyA signals
isochore differences
conserved splice signals
hexamer composition of exons/introns
reading frame consistency of exons
extrinsic
EST
cDNA
protein-genome alignments
How does the positive predictive value of N-SCAN change if the target closely resembles the informant in the prediction of Drosophila melanogaster?
Gene prediction for D. melanogaster:
informant too diverged → number of mismatches is high because most of the sequence cannot be aligned
informant too close → number of mismatches is low because the sequence is largely unchanged, so the alignment adds little information and prediction accuracy (including the positive predictive value) suffers
for D. melanogaster, the best accuracy is obtained with D. ananassae as informant (~1 substitution per synonymous site)
for human, mouse would be a good informant (~0.6 substitutions per synonymous site)
How can you measure gene prediction success?
by nucleotide
Sensitivity/Specificity
by exon
Missed Exons (ME), Wrong Exons (WE)
by gene
Missed Genes (MG), Wrong Genes (WG)
average overlap statistics
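The exon-level measures can be sketched as follows. The notes only name ME and WE; the definitions used here (fraction of annotated exons with no overlapping prediction, and fraction of predicted exons with no overlapping annotation) are an assumption following common usage, and the function names and coordinates are made up for the example.

```python
def overlaps(a, b):
    """True if two end-exclusive (start, end) intervals share a position."""
    return a[0] < b[1] and b[0] < a[1]

def missed_and_wrong_exons(annotated, predicted):
    """Return (ME, WE) as fractions of annotated and predicted exons."""
    missed = [e for e in annotated
              if not any(overlaps(e, p) for p in predicted)]
    wrong = [p for p in predicted
             if not any(overlaps(p, e) for e in annotated)]
    me = len(missed) / len(annotated) if annotated else 0.0
    we = len(wrong) / len(predicted) if predicted else 0.0
    return me, we

annotated = [(0, 100), (200, 300), (400, 500)]
predicted = [(10, 90), (600, 700)]
me, we = missed_and_wrong_exons(annotated, predicted)
print(round(me, 3), we)  # 0.667 0.5
```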
Briefly describe the GFF file format.
seqname: name of the sequence, e.g. the chromosome
source
feature
start
end
score
strand
frame
attributes
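The nine columns above are tab-separated, one feature per line, so a record can be parsed with a few lines of code. This is a minimal sketch: `GffRecord` and `parse_gff_line` are hypothetical names, the example line is invented, and the attributes column is kept as a raw string because its syntax differs between GFF versions.

```python
from typing import NamedTuple, Optional

class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # None when the column is "."
    strand: str             # "+", "-" or "."
    frame: str              # "0", "1", "2" or "."
    attributes: str

def parse_gff_line(line):
    """Parse one tab-separated GFF feature line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return GffRecord(seqname, source, feature, int(start), int(end),
                     None if score == "." else float(score),
                     strand, frame, attrs)

# Invented example line
rec = parse_gff_line("chr1\tGenScan\texon\t1300\t1500\t0.95\t+\t0\tgene_id g1")
print(rec.feature, rec.start, rec.end)  # exon 1300 1500
```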