Homologs
general term for genes/proteins that have similar sequence and are derived from a common ancestral sequence
Orthologs
homologs derived through speciation
found in different species
perform similar functions
Orthologs arise from speciation events where a gene is passed down to descendant species
Example:
human alpha-globin and chimpanzee alpha-globin are orthologs. (Human alpha-globin and human beta-globin are paralogs.)
Human alpha-globin is more similar in sequence to chimpanzee alpha-globin than it is to human beta-globin.
-> In this case, gene duplication predates speciation.
Paralogs
homologs derived through gene duplication
exist within the same species or lineage
result of gene duplication events, where a gene is copied within the genome
duplicated copies evolve independently -> acquire new functions or diverge in function while still being related to the original gene
Analogs
Genes with similar sequence due to convergent evolution
not common ancestry
Functional analogs: proteins that have non-homologous sequences but perform the same molecular function
Homology search
matching a given sequence to other known genes or proteins in a database
genome sequence has been determined + the genes predicted -> next step: Homology search
many complete genomes have been sequenced -> it is common to compare one genome to another and determine how many genes they have in common
For distantly related species, it is often impossible to distinguish orthologs and paralogs
Tool: BLAST
BLAST
+ For protein sequences
BLAST – Basic Local Alignment Search Tool
The most commonly-used homology search tool
Does not align complete sequences, but finds subsequences with the best possible alignment
For protein sequences:
identical: amino acids are the same
positive: amino acids may be different, but have similar biochemical properties (size, charge)
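The difference between "identical" and "positive" positions can be sketched in a few lines of Python. This is only an illustration: the similarity groups below are a hypothetical simplification chosen for this example, whereas BLAST itself decides what counts as a positive from a substitution matrix such as BLOSUM62. The function name and the two example sequences are also made up.

```python
# Hypothetical, simplified similarity groups (assumption for illustration only;
# BLAST itself uses a substitution matrix such as BLOSUM62).
SIMILAR_GROUPS = [
    set("ILVM"),   # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("KRH"),    # positively charged
    set("DE"),     # negatively charged
    set("STNQ"),   # polar, uncharged
]

def count_identities_and_positives(aligned_a, aligned_b):
    """Count identical and 'positive' columns in two aligned protein strings.

    Both strings must have equal length; '-' marks a gap.
    Identical residues are also counted as positives, as in BLAST output.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identities = positives = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are neither identical nor positive
        if a == b:
            identities += 1
            positives += 1
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            positives += 1
    return identities, positives

# Made-up aligned fragments: K/R and L/I are similar, E/Q fall in different groups.
identities, positives = count_identities_and_positives("MKVLDE", "MRVIDQ")
print(identities, positives)
```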
BLAST-Scoring
Matches are scored by an E-value
E = the number of matches expected at random when searching a database of this size with a query of this length
E = 1 means 1 match would be expected at random from the database
E-value is similar to a probability (P-value)
E = 10^-6 means that there is only a 1 in a million chance of observing such a match at random
The lower the E-value, the greater the confidence that the sequences are homologous
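Why the E-value is "similar to" a P-value can be made concrete. Assuming that chance hits follow a Poisson distribution (the usual assumption in BLAST statistics), the probability of seeing at least one chance hit is P = 1 - e^(-E). The sketch below uses a hypothetical helper name and shows that P and E are almost equal for small E but diverge near E = 1.

```python
import math

def evalue_to_pvalue(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)

for e in (1.0, 1e-6):
    print(e, evalue_to_pvalue(e))
```

For E = 1 the probability is about 0.63 rather than 1, whereas for E = 10^-6 the two values agree to within a fraction of a percent.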
BLAT
BLAST-like alignment tool
For quick searches of the genome, genome browsers, such as the UCSC browser (http://genome.ucsc.edu/), use BLAT
It is similar to, but not the same as, BLAST
BLAT uses a faster algorithm based on 11-mers (=11 bases of DNA) or 4-mers (=4 amino acids) to find matches of:
95% or greater identity over 25 bases or more (DNA)
80% or greater identity over 20 amino acids or more (Proteins)
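The k-mer idea can be illustrated with a toy seed search. This is a simplified sketch, not BLAT's actual implementation: it assumes that the index is built from non-overlapping k-mers of the genome while the query is scanned with overlapping k-mers, and it uses k = 4 on a made-up DNA string so that the example stays readable (the notes give 11-mers for DNA). A real tool would then extend and join the seed hits into alignments.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Index non-overlapping k-mers of the genome by their start position."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index

def find_seed_hits(query, index, k):
    """Return (query_pos, genome_pos) pairs where a query k-mer is in the index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, gpos))
    return hits

# Made-up sequences for illustration only.
genome = "ACGTACGGTTCAGGCATTGA"
index = build_kmer_index(genome, k=4)
hits = find_seed_hits("GGTTCAGG", index, k=4)
print(hits)
```

Each hit only marks a candidate location; the genome position minus the query position gives the offset at which a full alignment would be attempted.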
alignment tools
very quickly map “short reads” (typically DNA sequences in the range of 35-250 bases) to a reference genome
used for next generation sequencing data.
Typically, they align complete sequences and expect a very close match with the reference sequence.
Examples: BWA, bowtie, Stampy, NextGenMap
Distinguish orthologs and paralogs from homology searches
Molecular evolutionists/systematists are usually interested in comparing orthologs.
difficult to distinguish orthologs and paralogs from homology searches
Method: “reciprocal best hits” is often used
Another approach: consider only “one-to-one” orthologs
defined as homologous genes that occur in only a single copy in each genome
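Selecting one-to-one orthologs amounts to a simple filter. The sketch below assumes that homologous genes have already been grouped (for example from BLAST results) into a dictionary that maps each group to the genes found per species. The data structure, function name, and gene names are hypothetical.

```python
def one_to_one_orthologs(homolog_groups, species):
    """Keep only groups with exactly one gene in every listed species."""
    return {
        name: members
        for name, members in homolog_groups.items()
        if all(len(members.get(sp, [])) == 1 for sp in species)
    }

# Hypothetical gene names, for illustration only.
groups = {
    "groupA": {"sp1": ["a1"], "sp2": ["a2"]},            # single copy in both
    "groupB": {"sp1": ["b1", "b1_dup"], "sp2": ["b2"]},  # duplicated in sp1
    "groupC": {"sp1": ["c1"]},                           # missing in sp2
}
single_copy = one_to_one_orthologs(groups, ["sp1", "sp2"])
print(list(single_copy))
```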
“reciprocal best hits”
Method: “reciprocal best hits” is often used to distinguish orthologs and paralogs from homology searches
Compares genes in two genomes using two steps:
1. Gene A from species 1 is used for a BLAST search of the species 2 genome. The best match (or “hit”) is gene A’
2. Gene A’ from species 2 is then used for a BLAST search of the species 1 genome
-> If the best match is gene A, then these are reciprocal best hits and are considered orthologs
the above method does not guarantee that genes A and A’ are true orthologs
the result could be misleading if there are independent gene duplication/loss events in the two species
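The two-step procedure can be sketched as follows, assuming the best BLAST hit of every gene has already been extracted into two dictionaries (species 1 to species 2 and the reverse). The function name and gene names are hypothetical, and the sketch inherits the caveat above: a reciprocal best hit is evidence for orthology, not proof.

```python
def reciprocal_best_hits(best_1_to_2, best_2_to_1):
    """Return (gene in species 1, gene in species 2) pairs that are each other's best hit."""
    pairs = []
    for gene_a, gene_a_prime in best_1_to_2.items():
        # Step 2: search back with A' and check that the best hit is A again.
        if best_2_to_1.get(gene_a_prime) == gene_a:
            pairs.append((gene_a, gene_a_prime))
    return pairs

# Hypothetical best-hit tables, for illustration only.
best_1_to_2 = {"A": "A'", "B": "A'", "C": "C'"}
best_2_to_1 = {"A'": "A", "C'": "D"}
rbh = reciprocal_best_hits(best_1_to_2, best_2_to_1)
print(rbh)
```

Here B also hits A', but A' points back to A, so only A and A' are reported; C and C' are rejected because the reverse search returns D.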
Prokaryotic Comparative Genomics
Over 44,000 complete prokaryotic genomes are publicly available, and this number is increasing.
For up-to-date numbers and information, see: http://bacteria.ensembl.org/index.html
Prokaryotes are divided into two major domains (or empires) of life, which allows very diverse comparisons:
a) Bacteria (or Eubacteria) - common commensal and pathogenic bacteria
b) Archaea (or Archaebacteria) - ancient group of mostly extremophiles (live at high temp., etc.)
First comparative genomic studies
two closely-related pathogenic bacteria:
Mycobacterium leprae (causes leprosy)
Mycobacterium tuberculosis (causes tuberculosis)
               M. leprae    M. tuberculosis
coding genes   1,604        3,959
pseudogenes    1,116        6
M. leprae appears to have lost the function of about half of its genes
-> This may explain why it has the longest doubling time of any known bacterium and why it cannot be cultured in the laboratory
(only grows in host cells)
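The gene counts given above for the two species can be turned into pseudogene fractions with a short calculation. The dictionary layout and function name are just one way to write it down.

```python
# Counts taken from the comparison above.
counts = {
    "M. leprae": {"coding": 1604, "pseudogenes": 1116},
    "M. tuberculosis": {"coding": 3959, "pseudogenes": 6},
}

def pseudogene_fraction(coding, pseudogenes):
    """Share of pseudogenes among all annotated genes (coding + pseudogenes)."""
    return pseudogenes / (coding + pseudogenes)

fractions = {species: pseudogene_fraction(**c) for species, c in counts.items()}
for species, fraction in fractions.items():
    print(f"{species}: {fraction:.1%}")
```

About 41% of the annotated M. leprae genes are pseudogenes (1,116 of 2,720), compared with well under 1% in M. tuberculosis.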
Pseudogenes
previously protein-encoding genes -> mutations that disrupt their coding sequence
(insertion of a stop codon or insertion/deletion of bases that causes a frameshift in the coding region)
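Both kinds of disruption can be checked for in a simple way. The sketch below is a minimal illustration, not a pseudogene-prediction method: it assumes an intron-free coding sequence given from start codon to stop codon, uses the standard stop codons, and treats a length that is not a multiple of three as a hint of a frameshift. The function name and test sequences are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def coding_disruptions(cds):
    """List simple signs that a coding sequence (start to stop) is disrupted."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons; a trailing partial codon is ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # Every codon except the last one should be a sense codon.
    for number, codon in enumerate(codons[:-1], start=1):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {number}")
            break
    return problems

intact = "ATGGCTGAATAA"      # ATG GCT GAA TAA
disrupted = "ATGTAGGAATAA"   # ATG TAG GAA TAA (stop codon at position 2)
print(coding_disruptions(intact))
print(coding_disruptions(disrupted))
```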
Gene Loss in Pathogenic Bacteria
A group of pathogenic bacteria (genus Mycoplasma) has the smallest genomes known among free-living bacteria (Mycoplasma genitalium: 580 Kb, 480 protein-coding genes)
represents an evolutionary derived state
NOT primitive bacteria that have gained only the genes necessary for survival
BUT their ancestors had many more genes that were lost over the course of evolution.
Which genes are lost?
Many genes involved in energy metabolism are lost
Metabolic intermediates and energy sources are taken from the host
Many genes required for amino acid and vitamin synthesis are lost -> provided by the host
Exception: aphid symbiont Buchnera aphidicola:
10% of its genes are involved in synthesis of essential amino acids not synthesized by the host.
BUT, it has lost the genes needed to synthesize amino acids that are made by the host -> true symbiont.
The smallest endosymbiont bacterial genome
genus Carsonella (160 Kb, 182 genes)
Similar to Buchnera, these bacteria live in sap-eating insects, which have a low-protein diet
Over half of the genes in the Carsonella genome are involved in translation and amino acid metabolism
Why are genes lost?
a) Selective advantage for smallness?
Bacteria with smaller genomes can replicate faster and outcompete those with larger genomes that replicate slower. This does not appear to be the case:
small changes in DNA content (gene-sized) do not appear to affect replication rate
many pathogens retain non-functional pseudogene DNA (for example, M. leprae)
small genomes are not more densely packed than large (same amount of intergenic DNA)
b) Mutation pressure
The major reason for gene loss is thought to be mutational
No selective pressure to maintain a gene -> it will eventually be lost due to mutation. Consistent with this:
- there is a bias towards deletions
- there is a bias towards mutations to A or T
Thus, obligate pathogens and endosymbionts tend to have small genomes and high %AT
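The %AT of a sequence is easy to compute. The sketch below counts only the unambiguous bases A, C, G and T; the function name and the example sequence are made up for illustration.

```python
def at_content(sequence):
    """Fraction of A and T among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return at / (at + gc)

example = at_content("ATTAGCATAT")  # 8 of 10 bases are A or T
print(f"{example:.0%}")
```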
Hyperthermophile Comparative Genomics: stages
hyperthermophiles: live at 80–100 °C, mostly Archaea, some bacteria
thermophiles: live at 50–65 °C, mostly Archaea, some bacteria
mesophiles: live below 50 °C, mostly bacteria, some Archaea
Are there specific genes that allow survival at very high temperatures?
What is important to consider?
tested by searching the COG (Clusters of Orthologous Groups) database: http://www.ncbi.nlm.nih.gov/COG/ for proteins present in hyperthermophiles, but not in thermophiles or mesophiles
Important: consider evolutionary relationships in the analysis
For example, there are some Archaea that do not live at high temperatures and some bacteria that do live at high temperature
-> In other words, the presence of the gene should follow the temperature, not the phylogenetic relationship.
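The logic of such a search can be sketched with sets. The example assumes a presence/absence table that maps each protein family to the genomes in which it occurs, and it uses the strict criterion "present in every hyperthermophile and in no other genome", which is an assumption of this sketch. Family and genome names are hypothetical.

```python
def temperature_specific_families(presence, hyperthermophiles):
    """Families found in every hyperthermophile and in no other genome."""
    hyper = set(hyperthermophiles)
    return sorted(
        family for family, genomes in presence.items()
        if set(genomes) == hyper
    )

# Hypothetical genomes: one archaeon and one bacterium per temperature class.
hyperthermophiles = ["hyper_archaeon", "hyper_bacterium"]
presence = {
    "family_follows_temperature": {"hyper_archaeon", "hyper_bacterium"},
    "family_follows_phylogeny": {"hyper_archaeon", "meso_archaeon"},
    "family_universal": {"hyper_archaeon", "hyper_bacterium",
                         "meso_archaeon", "meso_bacterium"},
}
specific = temperature_specific_families(presence, hyperthermophiles)
print(specific)
```

A family that is shared by the archaeal genomes regardless of temperature is excluded, because its presence follows the phylogeny rather than the temperature.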
are there genes specific to hyperthermophiles?
The result: one protein out of 2,791 is specific to hyperthermophiles:
The protein was reverse gyrase, a large protein (>1000 a.a.) that contains two protein domains, helicase and topoisomerase
It introduces twists into double-stranded circular DNA and may help prevent unwinding of DNA at high temperatures (DNA is normally denatured at high temp.).
Two major questions of comparative genomics:
a) What is conserved? - What are the common requirements for eukaryotic life?
b) What is different? - What makes each species unique?
Comparison of eukaryotic model organisms
2000: Drosophila genome was completed
possible to look at gene conservation across three major eukaryotic model organisms, D. melanogaster (fly), C. elegans (worm), and S. cerevisiae (yeast)
Although the human genome had not yet been completed, many genes were already known from human and/or mouse (mammal), and these could also be compared
Overall, the highest proportion of shared genes was between mammals and fly, with about 50% of the fly genes giving a significant BLAST (E < 10^-10) match to mammalian genes
About 35% of the worm genes and 37% of the yeast genes matched a mammalian gene.
Human disease genes in model organisms
Human genes known to be associated with disease from the OMIM (Online Mendelian Inheritance in Man) database were used as queries for protein BLAST searches of the fly, worm, and yeast genomes
A BLAST cut-off of E < 10^-6 was used to define significant hits
Of 289 human disease genes:
230 (80%) were found in the fly
212 (73%) were found in the worm
120 (42%) were found in yeast
Conclusion: Model organisms, especially Drosophila, can be very useful for studying human disease.
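The percentages follow directly from the counts, as this small check shows (289 disease genes in total; values rounded to whole percent).

```python
TOTAL_DISEASE_GENES = 289
found = {"fly": 230, "worm": 212, "yeast": 120}

# Percentage of human disease genes with a significant hit in each organism.
percent_found = {
    organism: round(100 * n / TOTAL_DISEASE_GENES)
    for organism, n in found.items()
}
print(percent_found)
```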
Drosophila genome analysis
By 2007, the complete genomes of 12 different Drosophila species had been sequenced
Although these are all from the same genus, the sequence divergence among the species is about the same as that among mammals
About 7,000 genes had single-copy orthologs in all 12 species and almost all of these showed evidence for expression and lacked transposable element insertions
This may represent the “core” Drosophila genome
Another 5,000 genes showed homology across all species but were not single copy
That is, they were multi-gene families with multiple paralogs in different species.
The number of predicted “unique” or “lineage specific” genes in the different species ranged from hundreds to thousands, but many of these lacked evidence for expression, so it is difficult to determine how many are real, functional genes.
plant genomes - Arabidopsis thaliana
The first plant genome to be sequenced was that of the mustard weed, Arabidopsis thaliana, in 2000.
small genome
major model system for plant genetics
125 Mb; 25,000 genes
more genes than the fly (14,000) or the worm (19,000), and slightly more than human (21,000)
Compared to fly and worm, Arabidopsis has more genes that are present as paralogs in the genome
A higher percentage of the Arabidopsis genes are part of multi-gene families -> gene duplication has played an important role in plant genome evolution
Comparison of plant genomes
In 2002, the genomes of two strains of rice, Oryza sativa, were sequenced.
=> This allowed the first comparative genomic analysis in plants
Rice has a genome size of 430 Mb. The number of rice genes is around 40,000-50,000, depending on the method used for prediction.
Results:
80-85% of the predicted Arabidopsis genes had a homolog in rice. Only ≈50% of the predicted rice genes had a homolog in Arabidopsis.
Of the genes shared by Arabidopsis and rice, 30.5% had a homolog in yeast, worm, or fly.
Of the rice genes with no homolog in Arabidopsis, 2.4% had a homolog in yeast, worm, or fly.
Are Arabidopsis genes a subset of rice genes, similar to a plant minimum genome, while rice has evolved additional new genes?
Horizontal gene transfer
Are there bacterial genes in the human genome?
Were genes transferred directly from bacteria to humans (or other vertebrates)?
IHGP human genome paper: “hundreds of human genes appear likely to have resulted from horizontal gene transfer from bacteria at some point in the vertebrate lineage”.
they identified 223 human proteins that had significant homology to bacterial proteins, but no match in yeast, worm, fly, plant, or any other (non-vertebrate) eukaryote that had been sequenced at the time.
How is that possible?
Possibilities:
a) Contamination? Unlikely – 35 genes were tested in humans by PCR and are real. Many have orthologs in other vertebrates.
b) Genes present in common ancestor of eukaryotes, but lost in yeast, worm, fly, plant, etc. Requires many independent cases of gene loss, but is possible.
c) Could be transfer from humans to bacteria. 113 of the genes are found in many diverse bacterial species, so this would require many independent gene transfers.
d) The authors prefer the scenario that the genes were transferred directly from bacteria to vertebrates (at least 113 of the genes).
Since then, a number of criticisms and alternate explanations have been published.
general finding is that when more non-vertebrate eukaryote genomes are searched, homologs to these genes can be found.
This supports independent gene loss in some eukaryotes and argues against bacteria-to-human horizontal gene transfer.
phylogenetic analysis of the sequences can be used to test the hypothesis. If there was horizontal transfer, the human genes should be more similar in sequence to the bacterial genes than to other non-vertebrate eukaryotes.
This does not appear to be the case.
At present, the number of potential bacterial genes in the human genome has dropped below 40, and will likely decrease as more diverse eukaryotic genomes are sequenced.
Human/Mouse comparison
After human, the next vertebrate genome to be sequenced was that of the mouse.
A comparison of the human and mouse genomes revealed that at least 98% of the genes had homologs between the two species
On a small scale, the order of genes was well conserved, but on a larger scale, there were many chromosomal re-arrangements.
This means that mouse and human genes are found in conserved blocks (that is, the same genes in same order = “synteny”)
However, the chromosomal locations of these blocks are not well conserved.
For example, gene blocks from mouse chromosome 16 are spread over 6 different human chromosomes.
These types of re-arrangements are common among vertebrate species.
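Finding conserved blocks of gene order can be sketched as follows. This is a simplified illustration, not the method used in the human/mouse comparison: it assumes each gene has a unique ortholog identifier shared by both genomes, looks only at blocks in the same orientation (inverted blocks are ignored), and uses made-up gene names.

```python
def conserved_blocks(order_a, order_b, min_genes=2):
    """Find runs of genes that are adjacent and in the same order in both genomes."""
    position_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends = (
            gene in position_b
            and current
            and position_b[gene] == position_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
            continue
        # The run is broken: store it if long enough, then start a new one.
        if len(current) >= min_genes:
            blocks.append(current)
        current = [gene] if gene in position_b else []
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks

# Made-up gene orders: two blocks are conserved but have swapped places.
mouse = ["g1", "g2", "g3", "g4", "g5", "g6"]
human = ["g4", "g5", "g6", "x", "g1", "g2", "g3"]
blocks = conserved_blocks(mouse, human)
print(blocks)
```

The gene order inside each block is conserved, while the position of the blocks relative to each other is not, which is the pattern described above.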
Other, unrelated pathogens also have small genomes
Rickettsia sp. – cause Rocky Mountain and Mediterranean spotted fever in humans (1.2 Mb)
Borrelia burgdorferi - causes Lyme disease in humans (1.4 Mb)
Symbiotic bacteria of aphids, Buchnera sp., have very small genomes (450–670 Kb), BUT they cannot live on their own, only inside their host.
=> association with a host -> promotes genome reduction