Roughly describe a prokaryotic gene prediction workflow
Obtain new genomic DNA sequence
Translate all 6 reading frames and compare the translations to the protein database
Similarity search against an expressed sequence tag (EST) database of the same organism, or against cDNA sequences if available
Use a gene prediction tool to locate genes
Analyse the regulatory sequences of the predicted genes
What is gene prediction?
Identifying the regions of genomic DNA that encode genes
Name the start and stop codons
start: ATG
stop: TAA, TAG, TGA
(in mRNA, T is replaced by U: AUG; UAA, UAG, UGA)
What are the two main types of evidence for coding regions?
Intrinsic:
sufficient length of ORFs, long ORFs rarely occur by chance
special codon patterns in comparison to non-coding regions
ribosomal binding sites upstream of start codon
Extrinsic:
Similarity to known gene products
What is an open reading frame (ORF)?
stretch of DNA that can be translated into protein
starting with a start codon
ending with an in-frame stop codon
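The ORF definition above can be turned into a small scanner. This is a minimal sketch, assuming ATG as the only start codon and a single strand; the function name `find_orfs` and the `min_codons` parameter are illustrative choices, not part of the notes.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ORF on the given strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts every codon of the ORF, including start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        # Walk the sequence codon by codon in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == START:
                orf_start = i  # first start codon opens the ORF
            elif orf_start is not None and codon in STOPS:
                if (i + 3 - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, i + 3, frame))
                orf_start = None  # look for the next ORF in this frame
    return orfs
```

Raising `min_codons` reflects the intrinsic evidence that long ORFs rarely occur by chance.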
What is a Shine-Dalgarno (SD) sequence?
Purine-rich sequence in bacterial mRNA -> recognized by the ribosome (it base-pairs with the 3' end of the 16S rRNA) for translation initiation
5-10 bases upstream of the start codon
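A rough way to look for an SD-like site is to scan the bases upstream of a candidate start codon for the best match to a consensus. The sketch below assumes the commonly cited consensus AGGAGG (not given in these notes) and a hypothetical 20-base search window; `best_sd_match` is an illustrative name.

```python
SD_CONSENSUS = "AGGAGG"  # assumption: commonly cited consensus, not from the notes

def best_sd_match(seq, start_pos, window=20):
    """Scan the `window` bases upstream of start_pos for the best consensus match.

    Returns (matches, spacer), where spacer is the number of bases between
    the end of the motif and the start codon, or None if there is no room
    for the motif upstream of start_pos.
    """
    seq = seq.upper()
    k = len(SD_CONSENSUS)
    lo = max(0, start_pos - window)
    best = None
    for i in range(lo, start_pos - k + 1):
        site = seq[i:i + k]
        matches = sum(a == b for a, b in zip(site, SD_CONSENSUS))
        spacer = start_pos - (i + k)
        # Keep the first site with the highest number of matching bases.
        if best is None or matches > best[0]:
            best = (matches, spacer)
    return best
```

A candidate start codon with a high match count and a spacer in the range given above would count as additional intrinsic evidence.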
What is the Pribnow box?
conserved sequence found in bacterial promoters (consensus TATAAT)
typically centred about 10 bp upstream of the transcription start site (the -10 element)
What role does the ribosome binding site (RBS) play in gene expression?
= sequence upstream of start codon
—> helps direct the ribosome to correct translation start position
What is the significance of codon usage bias in gene prediction?
Codon usage bias refers to preference for certain codons over others in coding sequences
—> can help distinguish coding regions from non-coding regions
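Codon usage can be measured by counting codons in a known in-frame coding sequence. A minimal sketch; `codon_frequencies` is an illustrative name, and trailing bases that do not fill a codon are ignored.

```python
from collections import Counter

def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}
```

Comparing such a table for a candidate region against one built from known genes of the same organism is one way to use the bias as evidence for coding potential.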
Describe approaches to gene prediction from genomic sequence
What is a hidden markov model (HMM)?
= statistical model that assigns probabilities to sequences, which are emitted while moving between hidden (unobserved) states
particularly useful for identifying gene structures in genomic data
Name some intrinsic approaches to gene prediction
GeneMark
non-homogeneous HMMs for coding regions
homogeneous HMMs for non-coding regions
coding capacity of sliding windows is deduced through a Bayesian decision rule
GLIMMER
interpolated Markov model (IMM)
takes into account DNA oligomers of varying lengths, depending on the local composition of the sequence
EcoParse
HMM
maximum likelihood of a sequence under the coding and non-coding models
no sliding window
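The Bayesian decision rule over sliding windows mentioned for GeneMark can be illustrated on a toy scale. The sketch below is not GeneMark's actual model: it assumes two made-up independent-base models (`TOY_CODING`, `TOY_NONCODING`) and a hypothetical prior, and only shows how Bayes' rule converts two likelihoods into a posterior probability of "coding" per window.

```python
import math

# Made-up base frequencies for illustration only; all values must be > 0.
TOY_CODING = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
TOY_NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_likelihood(window, base_probs):
    """Log-probability of the window under an independent-base model."""
    return sum(math.log(base_probs[base]) for base in window)

def coding_posterior(window, coding_probs, noncoding_probs, prior_coding=0.5):
    """P(coding | window) via Bayes' rule, computed in log space."""
    log_c = log_likelihood(window, coding_probs) + math.log(prior_coding)
    log_n = log_likelihood(window, noncoding_probs) + math.log(1.0 - prior_coding)
    shift = max(log_c, log_n)  # subtract the maximum to avoid underflow
    weight_c = math.exp(log_c - shift)
    weight_n = math.exp(log_n - shift)
    return weight_c / (weight_c + weight_n)

def sliding_window_posteriors(seq, width, step, coding_probs, noncoding_probs):
    """Return (window_start, posterior) for each window along the sequence."""
    seq = seq.upper()
    return [
        (i, coding_posterior(seq[i:i + width], coding_probs, noncoding_probs))
        for i in range(0, len(seq) - width + 1, step)
    ]
```

A window would be called coding when its posterior exceeds a chosen threshold such as 0.5; real tools use far richer sequence models than these independent-base frequencies.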
Explain why there are 6 possible reading frames in a bacterial genome
-> 3 reading frames per strand (a codon can start at position 1, 2 or 3), and DNA has 2 strands: 3 x 2 = 6
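The six frames can be enumerated directly: three codon offsets on the given strand and three on its reverse complement. A minimal sketch, assuming the sequence contains only A, C, G and T; the function names and the frame labels ("+1" to "-3") are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Return {frame_label: list_of_codons} for frames +1..+3 and -1..-3."""
    frames = {}
    for label, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Split the strand into complete codons starting at this offset.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames
```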
Name some statistical properties of coding regions
Factors for unequal codon usage in coding regions:
Unequal AA usage
Unequal number of codons for different AA
codon preference
Factors for identifying the coding reading frame (e.g. RF1):
AA composition in RF1-3
Codon composition of RF1-3
Positional base frequency: freq. with which each of the four bases occupies each of the three positions within codons
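Positional base frequency can be computed by tallying bases separately for each of the three codon positions. A minimal sketch, assuming an in-frame sequence; `positional_base_frequencies` is an illustrative name, and characters other than A, C, G, T are skipped.

```python
def positional_base_frequencies(cds):
    """freqs[pos][base] = fraction of codons with `base` at codon position pos (0, 1, 2)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    counts = [dict.fromkeys("ACGT", 0) for _ in range(3)]
    for i in range(usable):
        base = cds[i]
        if base in counts[i % 3]:
            counts[i % 3][base] += 1
    freqs = []
    for pos_counts in counts:
        total = sum(pos_counts.values())
        freqs.append({b: (n / total if total else 0.0) for b, n in pos_counts.items()})
    return freqs
```

Running this on each of the three frames of a region and comparing the tables is one way to see which frame shows a position-specific, coding-like pattern.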
How are HMMs used for gene finding?
able to model “grammar”
words are codons
How does an HMM generate a sequence of nucleotides?
How does one find the most likely random walk?
random walk starting in the middle of any of the HMMs
Choosing one of the 61 codon models (one per sense codon: 64 codons minus the 3 stop codons) repeatedly results in a 'random gene'
Gene termination
Intergenic region
Start codon HMM
Transition is made back to the central state and the whole process repeated
—> Result: a sequence of nucleotides that is statistically similar to a contig of E. coli DNA (or any other training set) consisting of a collection of genes interspersed with intergenic regions
Most likely random walk:
Viterbi algorithm
find the most probable sequence of gene structures (hidden states) that explains the observed nucleotide sequence
identification of coding regions, introns, exons, and other genomic features
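The Viterbi algorithm can be sketched for a generic HMM given as plain dictionaries. This is a toy illustration, not the model of any particular gene finder: the two-state example in the usage note (N = non-coding, C = coding) and all its probabilities are made up, and every probability is assumed to be greater than zero so that the logarithms are defined.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""
    if not observations:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Usage with made-up numbers: with states `("N", "C")`, equal start probabilities, a 0.9 probability of staying in the same state, an AT-rich emission table for N and a GC-rich one for C, the sequence `"AATTGGCC"` decodes to four N states followed by four C states.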
Name the three different gene classes in E. coli and list some properties
Class I genes
intermediate codon usage bias
low/intermediate level of expression
environmental triggers —> high expression of some genes (rare)
Class II genes
high codon usage bias
highly expressed under exponential growth conditions
Class III genes
low codon usage bias
plasmids and insertion sequences
can be expressed at a fairly high level
includes genes coding for fimbriae, major pili, membrane proteins, …
For what purpose are extrinsic approaches used? Give an example
used to verify intrinsic methods
e.g. ORPHEUS