Buffl

Genome Analysis

by Mista F.

What are split genes?

—> genes with introns and exons

What are the three gene finding strategies?

content-based: ORFs, codon usage, …
site-based: TF binding sites, polyadenylation signals
comparative: protein similarity (databases)
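
As an illustration of the content-based strategy, a minimal ORF scan might look like the sketch below. It assumes the standard start codon ATG and stop codons TAA/TAG/TGA, looks at the forward strand only, and the `min_codons` threshold is a made-up parameter for the example; real gene finders combine this with codon usage and other signals.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons is a hypothetical length cutoff (start and stop codon count).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 2)]
```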

Briefly describe the eukaryotic gene structure.

Describe the three different RNA polymerases of eukaryotes.

RNA-Pol I: transcribes genes encoding ribosomal RNA
RNA-Pol II: transcribes genes encoding mRNA (and certain small nuclear RNAs)
RNA-Pol III: transcribes genes encoding tRNAs and other small RNAs

Describe splicing of pre-mRNA.

step 1
- cleavage at the 5′ splice site (SS)
- the 5′ end of the intron is joined to an A within the intron (the branch point) —> lariat-like intermediate (loop)
step 2
- cleavage at the 3′ splice site
- ligation of the exons

—> mature mRNA contains only the exons; the introns are cut out and released as a loop (lariat)
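
The net result of splicing (not the lariat chemistry itself) can be sketched in a few lines. The function name, the toy sequence, and the intron coordinates are assumptions made up for this example.

```python
def splice(pre_mrna, introns):
    """Join the exons of pre_mrna by removing the given introns.

    introns: list of (start, end) pairs, 0-based and end-exclusive,
    assumed not to overlap.
    """
    exons = []
    pos = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[pos:start])  # exon before this intron
        pos = end                          # skip over the intron
    exons.append(pre_mrna[pos:])           # last exon
    return "".join(exons)

# Toy pre-mRNA: exon "AUGGCC", intron "GUAAGUCUAACAG", exon "UUUUAA"
pre = "AUGGCC" + "GUAAGUCUAACAG" + "UUUUAA"
print(splice(pre, [(6, 19)]))  # AUGGCCUUUUAA
```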

What are some challenges in finding genes in eukaryotes?

the correct reading frame must be kept across exons
- introns can interrupt an exon in mid-codon
no golden rule for identifying donor & acceptor sites (the signals are very weak)
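
To see why mid-codon introns matter, the sketch below tracks how many bases of a split codon each exon inherits from the previous one. The function name and the exon lengths are hypothetical; the offset used here is just "coding bases so far, modulo 3", which is only one of several phase conventions in use.

```python
def exon_offsets(exon_lengths):
    """For each coding exon, return how many bases of its first codon
    were already supplied by the preceding exons (0, 1 or 2)."""
    offsets = []
    consumed = 0  # coding bases seen so far
    for length in exon_lengths:
        offsets.append(consumed % 3)
        consumed += length
    return offsets

# Exons of 7, 5 and 9 coding bases: the first intron splits a codon
# after its first base, the second intron falls between two codons.
print(exon_offsets([7, 5, 9]))  # [0, 1, 0]
```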

Define sensitivity and specificity.

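Sens = TP/(TP+FN) -> lowered by false negatives (FN)
Spec = TP/(TP+FP) -> lowered by false positives (FP); in gene prediction this ratio (the positive predictive value) is conventionally called specificity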
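
A small sketch of both measures at the nucleotide level, using the formulas above. The function name and the example coordinates are assumptions for illustration; positions are treated as plain integer sets.

```python
def nucleotide_accuracy(true_coding, predicted_coding):
    """Return (sensitivity, specificity) with Spec = TP/(TP+FP)."""
    true_coding = set(true_coding)
    predicted_coding = set(predicted_coding)
    tp = len(true_coding & predicted_coding)   # coding, predicted coding
    fn = len(true_coding - predicted_coding)   # coding, but missed
    fp = len(predicted_coding - true_coding)   # predicted, but not coding
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tp / (tp + fp) if tp + fp else 0.0
    return sens, spec

# True coding positions 10..19, predicted coding positions 15..24
print(nucleotide_accuracy(range(10, 20), range(15, 25)))  # (0.5, 0.5)
```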

Name two different classes of gene prediction methods.

“Isolated” methods —> Predict individual features
- e.g. splice sites (NetGene)
“Integrated” methods —> Predict genes in context
- ”Grammar” of genes
- Certain elements in specific order are required
  - HMMgene
  - GenScan

Sketch the GenScan algorithm.

—> Both strands at the same time

N —> intergenic
P —> promoter (sensors for the TATA box and the cap signal)
F —> 5′ UTR
T —> 3′ UTR
A —> poly-A signal (sensor for AATAAA)
E —> exons
- sngl —> single exon
- multiple:
  - init —> first exon
  - I 1-3 —> introns (one state per intron phase)
  - E 1-3 —> internal exons between introns (one state per phase)
  - term —> last exon
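
The "grammar" behind these states can be sketched as a table of allowed transitions. This is a simplified, assumed version for one strand only: intron and internal-exon phases are collapsed into single `I` and `E` states, and no durations or probabilities are modelled, so it is not the GenScan model itself.

```python
# Allowed successor states, forward strand only (simplified assumption).
ALLOWED = {
    "N": {"P"},               # intergenic -> promoter
    "P": {"F"},               # promoter -> 5' UTR
    "F": {"Esngl", "Einit"},  # 5' UTR -> single exon or first exon
    "Esngl": {"T"},           # single-exon gene -> 3' UTR
    "Einit": {"I"},           # first exon -> intron
    "I": {"E", "Eterm"},      # intron -> internal or last exon
    "E": {"I"},               # internal exon -> intron
    "Eterm": {"T"},           # last exon -> 3' UTR
    "T": {"A"},               # 3' UTR -> poly-A signal
    "A": {"N"},               # poly-A signal -> intergenic
}

def is_valid_parse(states):
    """Check that every consecutive pair of states is an allowed transition."""
    return all(b in ALLOWED.get(a, set()) for a, b in zip(states, states[1:]))

print(is_valid_parse(["N", "P", "F", "Einit", "I", "E", "I", "Eterm", "T", "A", "N"]))  # True
print(is_valid_parse(["N", "P", "F", "Einit", "Eterm", "T", "A", "N"]))                 # False
```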

What are isochores?

very long stretches of DNA
homogeneous in base composition
different families —> GC content (30-60%)
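
GC content along a sequence can be profiled with fixed-size windows, roughly as sketched below. The function name and the tiny window size are assumptions for the example; isochores span very long stretches, so real windows would be far larger.

```python
def gc_windows(seq, window, step=None):
    """Return (start, gc_fraction) for each full window along seq."""
    step = step or window  # default: non-overlapping windows
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = sum(base in "GC" for base in chunk) / window
        result.append((start, gc))
    return result

print(gc_windows("GGCCAATT", 4))  # [(0, 1.0), (4, 0.0)]
```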

Explain and sketch GeneMarkS.

parallel unsupervised training and prediction
based on GeneMark.hmm architecture:
- non-homogeneous HMM -> coding regions
- homogeneous HMM -> non-coding regions
- coding capacity of sliding windows -> Bayesian decision rule

GeneMarkS architecture:

run GeneMark.hmm with initial parameters —> label regions as “coding” / “non-coding”
update the model parameters from these labels
run GeneMark.hmm again
repeat until convergence
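
The label / re-estimate / repeat loop can be illustrated with a toy stand-in. This is not GeneMark.hmm: the two "models" are plain base-frequency tables, the windows and the initial labels are invented, and convergence simply means the labels stop changing.

```python
import math
from collections import Counter

def train(windows):
    """Estimate base frequencies with a pseudocount of 1 per base."""
    counts = Counter("".join(windows))
    total = sum(counts[b] for b in "ACGT") + 4
    return {b: (counts[b] + 1) / total for b in "ACGT"}

def log_likelihood(window, model):
    return sum(math.log(model[b]) for b in window)

def self_train(windows, initial_labels, max_iter=20):
    """Alternate parameter updates and relabelling until labels are stable.

    Labels are True for "coding", False for "non-coding".
    Returns (labels, number_of_iterations).
    """
    labels = list(initial_labels)
    for iteration in range(1, max_iter + 1):
        # Update parameters from the current labelling.
        coding = train([w for w, lab in zip(windows, labels) if lab])
        noncoding = train([w for w, lab in zip(windows, labels) if not lab])
        # Predict again with the updated parameters.
        new_labels = [log_likelihood(w, coding) > log_likelihood(w, noncoding)
                      for w in windows]
        if new_labels == labels:  # converged
            return labels, iteration
        labels = new_labels
    return labels, max_iter

windows = ["GCGCGC", "GGCCGC", "ATATAT", "AATTAT", "GCGGCC"]
print(self_train(windows, [True, False, False, False, False]))
# ([True, True, False, False, True], 2)
```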

What is the minimal genome size required to efficiently perform automatic training of GeneMarkS?

10 Mb (megabases)

Define the two types of alignment.

cis: align cDNA to the best-matching locus in the source genome
trans: align cDNA or protein sequence to a homologous locus outside of the source genome

Which two classes of information are used in gene prediction?

intrinsic:
- exon/intron length distribution
- promoter and polyA signals
- isochore differences
- conserved splice signals
- hexamer composition of exons/introns
- reading frame consistency of exons
extrinsic:
- EST
- cDNA
- protein-genome alignments

How does the positive predictive value of N-SCAN change if the informant closely resembles the target when predicting genes in Drosophila melanogaster?

Gene prediction for D. melanogaster:

too diverged → number of mismatches high because most of the sequence cannot be aligned
too close → number of mismatches low because the sequence is unchanged
for D. melanogaster, best accuracy when using D. ananassae as informant (~1 substitution per synonymous site)
for human, mouse would be a good informant (~0.6 substitutions per synonymous site)

How can you measure gene prediction success?

by nucleotide
- Sensitivity/Specificity
by exon
- Sensitivity/Specificity
- Missed Exons (ME), Wrong Exons (WE)
by gene
- Sensitivity/Specificity
- Missed Genes (MG), Wrong Genes (WG)
- average overlap statistics
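
At the exon level, the measures above can be sketched as follows. The sketch assumes one common convention: an exon only counts as correct if both boundaries match exactly, a missed exon is a true exon that no predicted exon overlaps, and a wrong exon is a predicted exon that overlaps no true exon. Function names and coordinates are made up for the example.

```python
def overlaps(a, b):
    """True if two (start, end) intervals (end-exclusive) share a position."""
    return a[0] < b[1] and b[0] < a[1]

def exon_level_stats(true_exons, predicted_exons):
    """Exon-level accuracy; assumes both exon lists are non-empty."""
    true_set, pred_set = set(true_exons), set(predicted_exons)
    exact = len(true_set & pred_set)  # both boundaries correct
    missed = sum(not any(overlaps(t, p) for p in pred_set) for t in true_set)
    wrong = sum(not any(overlaps(p, t) for t in true_set) for p in pred_set)
    return {
        "sensitivity": exact / len(true_set),
        "specificity": exact / len(pred_set),
        "ME": missed / len(true_set),
        "WE": wrong / len(pred_set),
    }

true_exons = [(0, 10), (20, 30), (40, 50), (60, 70)]
predicted = [(0, 10), (20, 28), (80, 90), (100, 110)]
print(exon_level_stats(true_exons, predicted))
# {'sensitivity': 0.25, 'specificity': 0.25, 'ME': 0.5, 'WE': 0.5}
```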

Briefly describe the GFF file format.

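nine tab-separated columns per feature:
seqname: chromosome (or contig/scaffold) the feature lies on
source: program or database that produced the feature
feature: feature type, e.g. exon or CDS
start: first position of the feature (1-based)
end: last position of the feature (inclusive)
score: confidence value, or "." if none
strand: + or -
frame: 0, 1 or 2 for coding features, otherwise "."
attributes: additional information, e.g. gene or transcript IDs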
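
A single GFF line can be split into these nine fields as sketched below. The record type, the function name and the example line are hypothetical, and the attributes column is kept as a raw string because its syntax differs between GFF versions.

```python
from typing import NamedTuple, Optional

class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int              # 1-based
    end: int                # inclusive
    score: Optional[float]  # None when the column is "."
    strand: str
    frame: str
    attributes: str

def parse_gff_line(line):
    """Parse one tab-separated GFF feature line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    return GffRecord(seqname, source, feature, int(start), int(end),
                     None if score == "." else float(score),
                     strand, frame, attrs)

# Made-up example line
rec = parse_gff_line('chr1\tGenScan\texon\t100\t200\t0.95\t+\t0\tgene_id "g1"')
print(rec.feature, rec.end - rec.start + 1)  # exon 101
```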

05 - Eukaryotic Gene Prediction
