1_Comparative Genomics

Advanced EvoGen

by Lea H.

Homologs

general term for genes/proteins that have similar sequence and are derived from a common ancestral sequence

Orthologs

homologs derived through speciation

found in different species
perform similar functions
Orthologs arise from speciation events where a gene is passed down to descendant species

Example:

human alpha-globin and chimpanzee alpha-globin are orthologs. (Human alpha-globin and human beta-globin are paralogs.)

Human alpha-globin is more similar in sequence to chimpanzee alpha-globin than it is to human beta-globin.

-> In this case, gene duplication predates speciation.

Paralogs

homologs derived through gene duplication

exist within the same species or lineage
result of gene duplication events, where a gene is copied within the genome
duplicated copies evolve independently -> acquire new functions or diverge in function while still being related to the original gene

Analogs

Genes with similar sequence due to convergent evolution, not common ancestry
Functional analogs: proteins that have non-homologous sequences but perform the same molecular function

Homology search

matching a given sequence to other known genes or proteins in a database
genome sequence has been determined + the genes predicted -> next step: Homology search
many complete genomes have been sequenced -> it is common to compare one genome to another and determine how many genes they have in common
For distantly related species, it is often impossible to distinguish orthologs and paralogs
Tool: BLAST

BLAST

+ For protein sequences

BLAST – Basic Local Alignment Search Tool

The most commonly-used homology search tool
Does not align complete sequences, but finds subsequences with the best possible alignment

For protein sequences:

identical: amino acids are the same
positive: amino acids may be different, but have similar biochemical properties (size, charge)

BLAST-Scoring

Matches are scored by an E-value

E = the number of matches expected at random when searching a database of this size with a query of this length
E = 1 means 1 match would be expected at random from the database
E-value is similar to a probability (P-value)
E = 10^-6 means that there is only a 1 in a million chance of observing such a match at random
The lower the E-value, the greater the confidence that the sequences are homologous
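
The link between E-value and P-value can be made concrete. The sketch below assumes the standard Karlin-Altschul form of BLAST statistics, which these notes do not spell out: E = K * m * n * e^(-lambda * S) for a hit of raw score S, and P = 1 - e^(-E) for the probability of at least one such chance hit. The values of K and lambda are illustrative placeholders, not parameters from a real search.

```python
import math

def expected_hits(score, query_len, db_len, k=0.13, lam=0.32):
    """E = K * m * n * exp(-lambda * S) (Karlin-Altschul form).

    k and lam are illustrative placeholders; real values depend on the
    scoring matrix and gap penalties used in the search.
    """
    return k * query_len * db_len * math.exp(-lam * score)

def e_to_p(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

for e in (10.0, 1.0, 1e-6):
    print(f"E = {e:g} -> P = {e_to_p(e):.6g}")
```

For small E the two values are almost identical (P ≈ E), which is why E = 10^-6 can be read as roughly a one-in-a-million chance. For E = 1, P is about 0.63, and for large E the P-value saturates near 1 while E keeps growing.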

BLAT

BLAST-like alignment tool

For quick searches of the genome, genome browsers, such as the UCSC browser (http://genome.ucsc.edu/), use BLAT
It is similar to, but not the same as, BLAST
BLAT uses a faster algorithm based on 11-mers (=11 bases of DNA) or 4-mers (=4 amino acids) to find matches of:
- 95% or greater identity over 25 bases or more (DNA)
- 80% or greater identity over 20 amino acids or more (Proteins)
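
The k-mer idea can be illustrated with a toy seed lookup. This is only a rough sketch under stated assumptions: it uses k = 4 on a short made-up DNA string for readability (not the 11-mers mentioned above), it indexes non-overlapping k-mers of the target and scans overlapping k-mers of the query, and it leaves out the extension and stitching steps that a real tool performs. All function names are hypothetical.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return index

def seed_hits(query, index, k):
    """Return (query_pos, genome_pos) pairs where a query k-mer is in the index."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], ()):
            hits.append((q, g))
    return hits

def percent_identity(a, b):
    """Percent of identical positions between two equal-length aligned strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

genome = "ACGTTGCAAGGCTTACGATCGATTACA"  # made-up toy sequence
index = build_index(genome, k=4)
print(seed_hits("GGCTTACG", index, k=4))             # [(3, 12)]
print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))  # 90.0
```

A seed hit only marks a candidate location. The identity thresholds listed above would then be checked on the extended alignment around it, for example with a function like `percent_identity`.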

alignment tools

very quickly map “short reads” (typically DNA sequences in the range of 35-250 bases) to a reference genome
used for next generation sequencing data.
Typically, they align complete sequences and expect a very close match with the reference sequence.

Examples: BWA, bowtie, Stampy, NextGenMap
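
The idea of aligning a complete read and expecting a near-perfect match can be sketched with a naive scan. This is an illustration only: the tools named above use indexed data structures rather than a linear scan, and the function name, the mismatch limit, and the sequences are assumptions made for the example.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best end-to-end placement, or None.

    The whole read must fit at the position (no clipping, no gaps), and at
    most max_mismatches substitutions are tolerated.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best

reference = "TTGACCGTAAGCTAGGCTTA"     # made-up toy reference
print(map_read("CCGTAAGC", reference))  # (4, 0): exact match
print(map_read("CCGTTAGC", reference))  # (4, 1): one mismatch tolerated
print(map_read("AAAAAAAA", reference))  # None: no close match anywhere
```

In contrast to BLAST, nothing here searches for the best local subsequence: a read either maps over its full length or is reported as unmapped.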

Distinguish orthologs and paralogs from homology searches

Molecular evolutionists/systematists are usually interested in comparing orthologs.
difficult to distinguish orthologs and paralogs from homology searches
Method: “reciprocal best hits” is often used
Another approach: consider only “one-to-one” orthologs
- defined as homologous genes that occur in only a single copy in each genome

“reciprocal best hits”

Method: “reciprocal best hits” is often used to distinguish orthologs and paralogs from homology searches
Compares genes in two genomes using two steps:
- 1. Gene A from species 1 is used for a BLAST search of the species 2 genome. The best match (or “hit”) is gene A’
- 2. Gene A’ from species 2 is then used for a BLAST search of the species 1 genome
  -> If the best match is gene A, then these are reciprocal best hits and are considered orthologs
the above method does not guarantee that genes A and A’ are true orthologs
the result could be misleading if there are independent gene duplication/loss events in the two species
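
The two steps above can be sketched in a few lines. The sketch assumes the BLAST results have already been reduced to (query, subject, bit score) rows, for example columns 1, 2 and 12 of BLAST tabular output (-outfmt 6); the gene names and scores are hypothetical.

```python
def best_hits(blast_rows):
    """Keep the highest-bit-score subject for each query.

    Each row is (query_id, subject_id, bit_score).
    """
    best = {}
    for query, subject, bits in blast_rows:
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(sp1_vs_sp2, sp2_vs_sp1):
    """Return sorted (gene_sp1, gene_sp2) pairs that are each other's best hit."""
    forward = best_hits(sp1_vs_sp2)
    reverse = best_hits(sp2_vs_sp1)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Hypothetical toy results: geneB's best hit is A_prime, but A_prime's best
# hit is geneA, so only geneA / A_prime is reciprocal.
sp1_vs_sp2 = [("geneA", "A_prime", 310.0), ("geneA", "X", 95.0),
              ("geneB", "A_prime", 120.0)]
sp2_vs_sp1 = [("A_prime", "geneA", 305.0), ("A_prime", "geneB", 118.0),
              ("X", "geneA", 90.0)]
print(reciprocal_best_hits(sp1_vs_sp2, sp2_vs_sp1))  # [('geneA', 'A_prime')]
```

The toy data also shows the limitation noted above: geneB is discarded even though it is a homolog of A_prime, and if geneA had been lost in species 1, geneB and A_prime would be reported as a reciprocal pair despite not being true orthologs.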

Prokaryotic Comparative Genomics

Over 44,000 complete prokaryotic genomes are publicly available, and this number is increasing.

For up-to-date numbers and information, see: http://bacteria.ensembl.org/index.html

Prokaryotes are divided into two major domains (or empires) of life, which allows very diverse comparisons:

a) Bacteria (or Eubacteria) - common commensal and pathogenic bacteria

b) Archaea (or Archaebacteria) - ancient group of mostly extremophiles (live at high temp., etc.)

First comparative genomic studies

two closely-related pathogenic bacteria:
- Mycobacterium leprae (causes leprosy)
- Mycobacterium tuberculosis (causes tuberculosis)

                M. leprae    M. tuberculosis
coding genes    1,604        3,959
pseudogenes     1,116        6

M. leprae appears to have lost the function of about half of its genes

-> This may explain why it has the longest doubling time of any known bacterium and why it cannot be cultured in the laboratory

(only grows in host cells)
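
As a quick check of the "about half" statement, the gene counts given above can be turned into fractions. Only the four numbers from the comparison are used; the function name is hypothetical.

```python
counts = {
    "M. leprae": {"coding": 1604, "pseudogenes": 1116},
    "M. tuberculosis": {"coding": 3959, "pseudogenes": 6},
}

def pseudogene_fraction(coding, pseudogenes):
    """Share of all annotated genes (coding + pseudogenes) that are pseudogenes."""
    return pseudogenes / (coding + pseudogenes)

for species, c in counts.items():
    print(f"{species}: {100 * pseudogene_fraction(**c):.2f}% pseudogenes")

# Functional gene count of M. leprae relative to M. tuberculosis
ratio = counts["M. leprae"]["coding"] / counts["M. tuberculosis"]["coding"]
print(f"coding genes, M. leprae / M. tuberculosis: {ratio:.2f}")
```

This gives about 41% pseudogenes in M. leprae against about 0.15% in M. tuberculosis, and M. leprae retains roughly 0.41 times as many coding genes as its relative.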

Pseudogenes

previously protein-encoding genes -> mutations that disrupt their coding sequence

(insertion of a stop codon or insertion/deletion of bases that causes a frameshift in the coding region)
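
A crude check for the two kinds of disruption can be sketched as follows. It assumes the standard stop codons (TAA, TAG, TGA) and a DNA coding sequence given from start codon to stop codon; the length test is only a rough proxy for a frameshift, and the function name and sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def disruption(cds):
    """Classify a coding sequence as intact or name the first problem found."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        return "frameshift (length not a multiple of 3)"
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    for number, codon in enumerate(codons[:-1], start=1):
        if codon in STOP_CODONS:
            return f"premature stop at codon {number}"
    if not codons or codons[-1] not in STOP_CODONS:
        return "no stop codon at end"
    return "intact"

print(disruption("ATGGCTTGGTAA"))     # intact
print(disruption("ATGGCTTAGTGGTAA"))  # premature stop at codon 3
print(disruption("ATGGCTGGTAA"))      # frameshift (length not a multiple of 3)
```

Real pseudogene annotation compares the sequence to a functional homolog rather than checking it in isolation, so this is best read as an illustration of the definitions, not as a detection method.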

Gene Loss in Pathogenic Bacteria

a group of pathogenic bacteria (genus Mycoplasma) has the smallest genomes known in free-living bacteria (Mycoplasma genitalium: 580 Kb, 480 protein-coding genes)
represents an evolutionary derived state
NOT primitive bacteria that have gained only the genes necessary for survival
BUT their ancestors had many more genes that were lost over the course of evolution.

Which genes are lost?

Many genes involved in energy metabolism are lost
Metabolic intermediates and energy sources are taken from the host
Many genes required for amino acid and vitamin synthesis are lost -> provided by the host
Exception: aphid symbiont Buchnera aphidicola:
- 10% of its genes are involved in synthesis of essential amino acids not synthesized by the host.
- BUT, it has lost the genes needed to synthesize amino acids that are made by the host -> true symbiont.

The smallest endosymbiont bacterial genome

genus Carsonella (160 Kb, 182 genes)
Similar to Buchnera, these bacteria live in sap-eating insects, which have a low-protein diet
Over half of the genes in the Carsonella genome are involved in translation and amino acid metabolism

Why are genes lost?

a) Selective advantage for smallness?

Hypothesis: bacteria with smaller genomes replicate faster and outcompete those with larger, slower-replicating genomes. This does not appear to be the case:

small changes in DNA content (gene-sized) do not appear to affect replication rate
many pathogens retain non-functional pseudogene DNA (for example, M. leprae)
small genomes are not more densely packed than large (same amount of intergenic DNA)

b) Mutation pressure

The major reason for gene loss is thought to be mutational

No selective pressure to maintain a gene -> it will eventually be lost due to mutation. Consistent with this:

- there is a bias towards deletions

- there is a bias towards mutations to A or T

Thus, obligate pathogens and endosymbionts tend to have small genomes and high %AT
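
%AT is simple to compute from a sequence. A minimal sketch, assuming plain DNA strings in which any character other than A, C, G, T (for example N) is ignored; the function name and example strings are hypothetical.

```python
def at_content(seq):
    """Percent of A and T among the unambiguous bases (A, C, G, T) of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * sum(b in "AT" for b in bases) / len(bases)

print(at_content("ATATGCAT"))  # 75.0
print(at_content("atNNgc"))    # 50.0
```

Applied to whole genomes, this is the statistic in which obligate pathogens and endosymbionts are expected to score high.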

Hyperthermophile Comparative Genomics: stages

hyperthermophiles: live at 80–100 °C, mostly Archaea, some bacteria

thermophiles: live at 50–65 °C, mostly Archaea, some bacteria

mesophiles: live below 50 °C, mostly bacteria, some Archaea

Are there specific genes that allow survival at very high temperatures?

What is important to consider?

tested by searching the COG (Clusters of Orthologous Groups) database: http://www.ncbi.nlm.nih.gov/COG/ for proteins present in hyperthermophiles, but not in thermophiles or mesophiles
Important: consider evolutionary relationships in the analysis
For example, there are some Archaea that do not live at high temperatures and some bacteria that do live at high temperature
-> In other words, the presence of the gene should follow the temperature, not the phylogenetic relationship.
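
The search logic can be sketched as a presence/absence filter. The sketch assumes each protein family is represented by the set of genomes in which it occurs and requires a family to be present in every hyperthermophile and absent from all other genomes; whether the original analysis was this strict is not stated here. Genome and family names are hypothetical.

```python
def temperature_specific(families, hyperthermophiles, others):
    """Return families found in every hyperthermophile and in no other genome."""
    hyper, rest = set(hyperthermophiles), set(others)
    return sorted(
        name for name, genomes in families.items()
        if hyper <= set(genomes) and not (rest & set(genomes))
    )

hyperthermophiles = ["archaeon_hot", "bacterium_hot"]
others = ["archaeon_meso", "bacterium_meso"]
families = {
    "fam_temperature": {"archaeon_hot", "bacterium_hot"},
    "fam_archaea": {"archaeon_hot", "archaeon_meso"},  # follows phylogeny
    "fam_universal": {"archaeon_hot", "bacterium_hot",
                      "archaeon_meso", "bacterium_meso"},
}
print(temperature_specific(families, hyperthermophiles, others))
# ['fam_temperature']
```

Including a hot-living bacterium and a mesophilic archaeon in the comparison is what removes `fam_archaea`: without them, a family that simply follows the archaeal lineage would look temperature-specific.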

are there genes specific to hyperthermophiles?

The result: one protein out of 2,791 is specific to hyperthermophiles:

The protein was reverse gyrase, a large protein (>1000 a.a.) that contains two protein domains, helicase and topoisomerase
It introduces twists into double-stranded circular DNA and may help prevent unwinding of DNA at high temperatures (DNA is normally denatured at high temp.).

Two major questions of comparative genomics:

a) What is conserved? - What are the common requirements for eukaryotic life?

b) What is different? - What makes each species unique?

Comparison of eukaryotic model organisms

2000: Drosophila genome was completed
possible to look at gene conservation across three major eukaryotic model organisms, D. melanogaster (fly), C. elegans (worm), and S. cerevisiae (yeast)
Although the human genome had not yet been completed, many genes were already known from human and/or mouse (mammal), and these could also be compared
Overall, the highest proportion of shared genes was between mammals and fly, with about 50% of the fly genes giving a significant BLAST (E < 10^-10) match to mammalian genes
About 35% of the worm genes and 37% of the yeast genes matched a mammalian gene.

Human disease genes in model organisms

Human genes known to be associated with disease from the OMIM (Online Mendelian Inheritance in Man) database were used as queries for protein BLAST searches of the fly, worm, and yeast genomes
A BLAST cut-off of E < 10^-6 was used to define significant hits
Of 289 human disease genes:
- 230 (80%) were found in the fly
- 212 (73%) were found in the worm
- 120 (42%) were found in yeast

Conclusion: Model organisms, especially Drosophila, can be very useful for studying human disease.
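
The percentages follow directly from the counts, and the cut-off rule is easy to state in code. The counts are the ones given above; the helper function and its example E-values are hypothetical.

```python
total = 289
found = {"fly": 230, "worm": 212, "yeast": 120}

for organism, n in found.items():
    print(f"{organism}: {n}/{total} = {100 * n / total:.0f}%")

def has_significant_hit(e_values, cutoff=1e-6):
    """True if at least one BLAST hit for a query has E below the cut-off."""
    return any(e < cutoff for e in e_values)

print(has_significant_hit([3e-4, 2e-9]))  # True
print(has_significant_hit([0.5, 1e-5]))   # False
```

A disease gene counts as "found" in a model organism if its best hit passes this test, so the 80%, 73% and 42% figures depend on the chosen cut-off.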

Drosophila genome analysis

By 2007, the complete genomes of 12 different Drosophila species had been sequenced
Although these are all from the same genus, the sequence divergence among the species is about the same as that among mammals
About 7,000 genes had single-copy orthologs in all 12 species and almost all of these showed evidence for expression and lacked transposable element insertions
This may represent the “core” Drosophila genome
Another 5,000 genes showed homology across all species but were not single copy
That is, they were multi-gene families with multiple paralogs in different species.
The number of predicted “unique” or “lineage specific” genes in the different species ranged from hundreds to thousands, but many of these lacked evidence for expression, so it is difficult to determine how many are real, functional genes.
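
The three categories above can be expressed as a simple classification of homolog groups by copy number. This is a sketch with three hypothetical species and made-up groups, not the procedure used in the 12-genome analysis.

```python
def classify_group(copies, all_species):
    """Classify one homolog group from its per-species gene copy counts."""
    counts = [copies.get(species, 0) for species in all_species]
    if all(c == 1 for c in counts):
        return "single-copy ortholog"
    if all(c >= 1 for c in counts):
        return "multi-copy family"
    return "lineage-restricted"

species = ["sp1", "sp2", "sp3"]  # hypothetical species labels
groups = {
    "groupA": {"sp1": 1, "sp2": 1, "sp3": 1},
    "groupB": {"sp1": 3, "sp2": 1, "sp3": 2},
    "groupC": {"sp2": 1},
}
for name, copies in groups.items():
    print(name, "->", classify_group(copies, species))
```

Groups of the first kind correspond to the candidate "core" genome, the second kind to multi-gene families with paralogs, and the third kind to predicted lineage-specific genes, which still need expression evidence before they can be treated as real.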

plant genomes - Arabidopsis thaliana

The first plant genome to be sequenced was that of the mustard weed, Arabidopsis thaliana, in 2000.
- small genome
- major model system for plant genetics
- 125 Mb; 25,000 genes
- more genes than the fly (14,000) or the worm (19,000), and slightly more than human (21,000)

Compared to fly and worm, Arabidopsis has more genes that are present as paralogs in the genome
A higher percentage of the Arabidopsis genes are part of multi-gene families -> gene duplication has played an important role in plant genome evolution

Comparison of plant genomes

In 2002, the genomes of two strains of rice, Oryza sativa, were sequenced.

=> This allowed the first comparative genomic analysis in plants
Rice has a genome size of 430 Mb. The number of rice genes is around 40,000-50,000, depending on the method used for prediction.

Results:

80-85% of the predicted Arabidopsis genes had a homolog in rice. Only ≈50% of the predicted rice genes had a homolog in Arabidopsis.
Of the genes shared by Arabidopsis and rice, 30.5% had a homolog in yeast, worm, or fly.
Of the rice genes with no homolog in Arabidopsis, 2.4% had a homolog in yeast, worm, or fly.

Are Arabidopsis genes a subset of rice genes, similar to a plant minimum genome, while rice has evolved additional new genes?

Horizontal gene transfer

Are there bacterial genes in the human genome?

Were genes transferred directly from bacteria to humans (or other vertebrates)?

IHGP human genome paper: “hundreds of human genes appear likely to have resulted from horizontal gene transfer from bacteria at some point in the vertebrate lineage”.
they identified 223 human proteins that had significant homology to bacterial proteins, but no match in yeast, worm, fly, plant, or any other (non-vertebrate) eukaryote that had been sequenced at the time.

How is that possible?

Possibilities:

a) Contamination? Unlikely – 35 genes were tested in humans by PCR and are real. Many have orthologs in other vertebrates.

b) Genes present in common ancestor of eukaryotes, but lost in yeast, worm, fly, plant, etc. Requires many independent cases of gene loss, but is possible.

c) Could be transfer from humans to bacteria. 113 of the genes are found in many diverse bacteria species, so this would require many independent gene transfers.

d) The authors prefer the scenario that the genes were transferred directly from bacteria to vertebrates (at least 113 of the genes).

Since then, a number of criticisms and alternate explanations have been published.

general finding is that when more non-vertebrate eukaryote genomes are searched, homologs to these genes can be found.
This supports independent gene loss in some eukaryotes and argues against bacteria-to-human horizontal gene transfer.
phylogenetic analysis of the sequences can be used to test the hypothesis. If there was horizontal transfer, the human genes should be more similar in sequence to the bacterial genes than to other non-vertebrate eukaryotes.
This does not appear to be the case.

At present, the number of potential bacterial genes in the human genome has dropped below 40, and will likely decrease as more diverse eukaryotic genomes are sequenced.

Human/Mouse comparison

After human, the next vertebrate genome to be sequenced was that of the mouse.
A comparison of the human and mouse genomes revealed that at least 98% of the genes had homologs between the two species
On a small scale, the order of genes was well conserved, but on a larger scale, there were many chromosomal re-arrangements.
This means that mouse and human genes are found in conserved blocks (that is, the same genes in same order = “synteny”)
However, the chromosomal locations of these blocks are not well conserved.
For example, gene blocks from mouse chromosome 16 are spread over 6 different human chromosomes.

These types of re-arrangements are common among vertebrate species.
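
Conserved blocks can be found by walking along one genome and checking whether neighbouring genes are also neighbours in the other. The sketch below assumes one-to-one orthologs listed in their order along a chromosome of genome A, each with the chromosome and gene index of its partner in genome B. The gene names, chromosome labels and the `min_genes` threshold are hypothetical.

```python
def synteny_blocks(orthologs, min_genes=2):
    """Group consecutive genes whose partners are adjacent in genome B.

    orthologs: list of (gene, chrom_b, index_b), sorted by position in
    genome A. Adjacent means same chromosome and index differing by 1,
    in either orientation.
    """
    blocks, current = [], []
    for entry in orthologs:
        if current:
            _, prev_chrom, prev_index = current[-1]
            _, chrom, index = entry
            if chrom == prev_chrom and abs(index - prev_index) == 1:
                current.append(entry)
                continue
            if len(current) >= min_genes:
                blocks.append(current)
        current = [entry]
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks

orthologs = [
    ("g1", "chrP", 10), ("g2", "chrP", 11), ("g3", "chrP", 12),
    ("g4", "chrQ", 5), ("g5", "chrQ", 4),   # same block, inverted order
    ("g6", "chrR", 2), ("g7", "chrS", 9),   # isolated genes
]
for block in synteny_blocks(orthologs):
    print([gene for gene, _, _ in block], "on", block[0][1])
```

In this toy input, a single stretch of genome A breaks into a three-gene block on chrP and a two-gene block on chrQ, which mirrors the pattern described above: gene order is conserved within blocks while the blocks themselves sit on different chromosomes.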

Other, unrelated pathogens also have small genomes

Rickettsia sp. – cause Rocky Mountain and Mediterranean spotted fever in humans (1.2 Mb)
Borrelia burgdorferi - causes Lyme disease in humans (1.4 Mb)
symbiotic bacteria of aphids, Buchnera sp., have very small genomes (450–670 Kb); BUT they cannot live on their own, only inside their host.

=> association with a host -> promotes genome reduction

