What are homologs?
genes/proteins with similar sequence
derived from common ancestor
can be classified into orthologs and paralogs
What are Orthologs? Name one example.
homologs derived through speciation (found in different species)
e.g. human alpha-globin and chimpanzee alpha-globin
What are Paralogs? Name one example.
homologs derived through gene duplication
e.g. human alpha-globin and human beta-globin
What are Analogs? Name one example.
genes/proteins that perform similar functions, BUT don't share a common ancestral gene
similarity is due to convergent evolution
e.g. wings of birds and wings of insects
What is Homology search?
What are some tools and databases?
bioinformatics technique to identify sequences that are similar to a given sequence
first step after a genome sequence has been determined and its genes predicted
—> many complete genomes are sequenced —> compare one genome to another —> determine how many genes they have in common
distantly related species —> often impossible to distinguish orthologs from paralogs
tools: BLAST, FASTA
databases: GenBank, UniProt, PDB
What does BLAST mean?
What is the aim?
How are matching amino acids classified?
Basic Local Alignment Search Tool
most commonly used homology search tool
does not align complete sequences, BUT finds the subsequences with the best possible alignment
identical: amino acids are the same
positive: amino acids may be different, but have similar biochemical properties (size, charge)
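A minimal Python sketch of how identical and positive positions could be counted in an aligned pair. The similarity groups below are a toy assumption for illustration only; BLAST itself counts a position as positive when the substitution-matrix score (e.g. BLOSUM62) is greater than zero, and the function name is hypothetical.

```python
# Toy similarity groups (an assumption, NOT the BLOSUM62 matrix BLAST uses).
SIMILAR_GROUPS = [set("KRH"), set("DE"), set("ILVM"), set("FYW"), set("ST"), set("NQ")]

def count_identities_positives(a: str, b: str) -> tuple:
    """Count identical and positive columns in two aligned sequences.

    Gaps are written as '-'. Identical positions also count as positives,
    as in BLAST output.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    identities = positives = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gap columns are neither identical nor positive
        if x == y:
            identities += 1
            positives += 1
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            positives += 1
    return identities, positives
```

For the aligned pair `KLDE-S` / `RLDGTS`, K/R is a positive under the toy groups, L/L, D/D and S/S are identical, and the gap column is skipped.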
How does BLAST work?
matches are scored by E-value —> number of matches expected at random when searching a database of this size with a query of this length
related to a probability (p-value), but not the same: an E-value can exceed 1
E=1 —> 1 match would be expected at random from the database
the lower the E-value, the greater the confidence that the sequences are homologous
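As a rough worked example of why the E-value depends on database size and query length, the Karlin-Altschul relation E = m * n * 2^(-S') (bit score S', query length m, database length n) can be sketched in Python. The function names and numbers are illustrative assumptions; real BLAST uses effective lengths and further corrections.

```python
import math

def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E = m * n * 2**(-S') for bit score S', query length m, database length n."""
    return query_len * db_len * 2.0 ** (-bit_score)

def p_value(e_value: float) -> float:
    """Probability of at least one chance hit, assuming the number of
    chance hits is Poisson-distributed: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)
```

With the same bit score, doubling the database length doubles E; for small E the p-value is almost equal to E, which is why the two are easily confused.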
How does BLAT differ from BLAST?
faster algorithm based on 11-mers (DNA) or 4-mers (amino acids) to find matches of:
95% or greater identity over 25 bases or more (DNA)
80% or greater identity over 20 amino acids or more (Proteins)
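The k-mer idea can be sketched as follows: index the k-mers of the target once, then look up each k-mer of the query to find candidate match positions (seeds). This is a simplified sketch under assumptions (non-overlapping target k-mers, exact matches only, hypothetical function names), not BLAT's actual implementation.

```python
from collections import defaultdict

def build_index(target: str, k: int) -> dict:
    """Map each non-overlapping k-mer of the target to its start positions."""
    index = defaultdict(list)
    for start in range(0, len(target) - k + 1, k):
        index[target[start:start + k]].append(start)
    return index

def find_seeds(query: str, index: dict, k: int) -> list:
    """Return (query_pos, target_pos) for every overlapping query k-mer
    that occurs in the index."""
    seeds = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            seeds.append((q, t))
    return seeds
```

Only regions around seeds need to be aligned in detail, which is where the speed-up comes from; in the notes above k would be 11 for DNA and 4 for amino acids.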
Which alignment tools quickly map ‘short reads’? How do they work?
very quickly map 35-250 base sequences to a reference genome
align the complete read and expect a very close match with the reference sequence
BWA
bowtie
Stampy
NextGenMap
What are the steps of comparing genes in two genomes?
use gene A from species 1 as the query for a BLAST search of the species 2 genome
—> best match/hit is gene A’
use gene A’ from species 2 as the query for a BLAST search of the species 1 genome
—> best match/hit is gene A —> A and A’ are likely orthologs (reciprocal best hits)
BUT: this does not guarantee that they are true orthologs
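The two-way search above can be sketched in Python, assuming the best BLAST hit of every gene has already been collected into two dictionaries (hypothetical inputs; running BLAST itself is not shown).

```python
def reciprocal_best_hits(best_1_to_2: dict, best_2_to_1: dict) -> list:
    """Return (gene_in_species_1, gene_in_species_2) pairs that are each
    other's best hit.

    best_1_to_2 maps each species 1 gene to its best hit in species 2;
    best_2_to_1 maps each species 2 gene to its best hit in species 1.
    """
    pairs = []
    for gene, hit in best_1_to_2.items():
        if best_2_to_1.get(hit) == gene:
            pairs.append((gene, hit))
    return sorted(pairs)
```

A pair is only reported when both searches agree, so a gene whose best hit points back to a different gene (for example a paralog) is dropped.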
What are the two major domains that prokaryotes are divided into? What are the advantages of studying prokaryotes?
Bacteria —> common commensal (live with a host without harming it) and pathogenic bacteria
Archaea —> ancient group of mostly extremophiles (live at high temp, …)
small genome
little repetitive DNA
no introns
medical importance
Which two closely related pathogenic bacteria were the subject of one of the first comparative genomic studies?
Mycobacterium leprae —> causes leprosy
1600 coding genes, 1100 pseudogenes
Mycobacterium tuberculosis —> causes tuberculosis
4000 coding genes, 6 pseudogenes
What are pseudogenes?
segments of DNA that resemble functional genes, but are non-functional
previously protein-encoding genes with mutations —> lost ability to encode proteins
mutation that introduces a premature stop codon
insertion/deletion of bases that causes a frameshift in the coding region
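A minimal sketch of how the two disruption types could be flagged in a candidate coding sequence. It assumes the sequence starts in frame and uses the standard stop codons; the function name and labels are hypothetical, and real pseudogene annotation compares against a functional homolog rather than relying on such simple checks.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def disruption_signs(cds: str) -> list:
    """Return simple signs that a coding sequence may be disrupted."""
    cds = cds.upper()
    signs = []
    # A length that is not a multiple of 3 hints at an insertion/deletion.
    if len(cds) % 3 != 0:
        signs.append("frameshift")
    # Split into complete codons, reading in frame from the first base.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # A stop codon anywhere before the last codon truncates the protein.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        signs.append("premature_stop")
    return signs
```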
What is special about M. leprae?
lost the function of about half of its genes
—> longest doubling time of any known bacterium
—> cannot be cultured in the laboratory (only grows in host cells)
What is special about Mycoplasma genitalium?
smallest genome known in free-living bacteria (580 Kb)
ancestors had many more genes, which were lost over the course of evolution
Name three pathogenic bacteria with a small genome.
Mycoplasma genitalium (580Kb)
Rickettsia sp. —> Rocky Mountain spotted fever (1.2Mb)
Borrelia burgdorferi —> Lyme disease (1.4Mb)
Which genes are lost? Name some examples
many genes involved in energy metabolism:
—> metabolic intermediates and energy sources taken from host (e.g. Buchnera aphidicola)
many genes required for amino acid and vitamin synthesis
—> provided by host (e.g. Carsonella)
Why are genes lost?
Selective advantage of small size
bacteria with smaller genomes can replicate faster (outcompete others)
BUT: small changes in DNA content don't affect replication rate
many pathogens retain non-functional pseudogene DNA
small genomes are not more densely packed
Mutation pressure
if there is no selective pressure to maintain a gene —> lost due to mutation
there is bias towards
deletions
mutations to A or T
—> obligate pathogens and endosymbionts tend to have small genomes and high %AT
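The %AT of a sequence is simple to compute; a small sketch, assuming the sequence is a plain string and that ambiguous bases such as N should be ignored (the function name is hypothetical):

```python
def at_fraction(seq: str) -> float:
    """Fraction of A and T among the unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    return sum(b in "AT" for b in bases) / len(bases)
```

Comparing this value between an obligate pathogen or endosymbiont and a free-living relative is one way to see the mutation bias described above.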
What are Hyperthermophiles, Thermophiles and Mesophiles?
Hyperthermophiles: live at 80-100 °C (mostly Archaea, some bacteria)
Thermophiles: live at 50-65 °C (mostly Archaea, some bacteria)
Mesophiles: live under 50 °C (mostly bacteria, some Archaea)
Are there genes specific to hyperthermophiles?
tested by searching the COG (Clusters of Orthologous Groups) database
must consider evolutionary relationships
some Archaea don't live at high temperatures
some bacteria do live at high temperatures
—> presence of gene should follow temperature, not phylogenetic relationship
—> only 1 of 2800 was specific to hyperthermophiles
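The logic of such a search can be sketched as a filter on presence/absence patterns: keep only the groups found in every hyperthermophile genome and in no other genome. The data layout, function name and genome labels are assumptions for illustration, not the COG database's actual interface.

```python
def groups_specific_to(presence: dict, target_genomes: set) -> list:
    """Return the groups present in every target genome and in no other.

    presence maps a group id to the set of genomes that contain it.
    """
    target = set(target_genomes)
    return sorted(group for group, genomes in presence.items()
                  if set(genomes) == target)

# Hypothetical example data.
presence = {
    "cogA": {"hyper_archaeon", "hyper_bacterium"},
    "cogB": {"hyper_archaeon", "meso_archaeon"},
    "cogC": {"hyper_archaeon", "hyper_bacterium", "meso_bacterium"},
}
hyperthermophiles = {"hyper_archaeon", "hyper_bacterium"}
specific = groups_specific_to(presence, hyperthermophiles)
```

In this toy data `cogB` follows phylogeny (it is found only in Archaea) and `cogC` is also present in a mesophile, so only `cogA` follows temperature.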
What may help prevent unwinding DNA at high temperatures?
extra twists (positive supercoils) introduced into double-stranded circular DNA
by reverse gyrase
large protein (>1000 amino acids)
2 protein domains (helicase, topoisomerase)
What are the two major questions of comparative genomics?
What is conserved? - What are common requirements for eukaryotic life?
What is different? - What makes each species unique?
What are the three major eukaryotic model organisms?
D. melanogaster (fly)
C. elegans (worm)
S. cerevisiae (yeast)
Which of fly, worm and yeast shares the highest proportion of genes with human?
Fly, about 50%
Worm, about 35%
Yeast, about 37%
How many human disease-associated genes were found in fly, worm and yeast?
230/289 (80%) in fly
212/289 (73%) in worm
120/289 (42%) in yeast
What was the first plant genome to be sequenced, and how many genes does it have compared to fly, worm and human? Why is it special?
Arabidopsis thaliana - 25,000 genes
fly - 14,000 genes
worm - 19,000 genes
human - 21,000 genes
Arabidopsis has more genes present as paralogs
higher percentage of Arabidopsis genes are part of a multi-gene family
—> gene duplication played a huge role in plant genome evolution
How many genes does rice have?
What is its genome size?
What is its scientific name?
Oryza sativa
40,000-50,000 genes
genome size: 430 Mb
Compare rice to Arabidopsis.
Then compare both to yeast, worm and fly.
80-85% of Arabidopsis genes have a homolog in rice
50% of rice genes have a homolog in Arabidopsis
30% of the genes shared by Arabidopsis and rice have a homolog in yeast, fly or worm
What is horizontal gene transfer (HGT)?
movement of genetic material between organisms in a manner other than traditional reproduction
—> allows for direct transfer of genes between different species
Are there bacterial genes in the human genome?
early studies suggested yes, BUT
subsequent analyses significantly reduced the number (fewer than 40 potential bacterial genes)
Were genes transferred directly from bacteria to humans or other vertebrates?
the original hypothesis suggested yes, BUT
later studies: these genes are more likely the result of independent gene loss in some eukaryotes
Why was contamination unlikely to explain the presence of bacterial-like genes in humans?
35 genes tested in humans by PCR —> confirmed to be real
many of these have orthologs in other vertebrates
What is one possibility for the presence of bacterial-like genes in humans other than horizontal gene transfer?
these genes were present in the common ancestor of eukaryotes
—> lost in yeast, worm, fly, plant, …
would require many independent cases of gene loss
How have subsequent findings challenged the idea of bacteria-to-human horizontal gene transfer?
when more non-vertebrate eukaryote genomes are searched
—> homologs to these genes can be found
supports independent gene loss in some eukaryotes
How does phylogenetic analysis help test the hypothesis of horizontal gene transfer?
tests hypothesis by comparing sequences
if true —> human genes should be more similar to bacterial genes than to genes of non-vertebrate eukaryotes
What is the current understanding of the number of bacterial genes in the human genome?
below 40, expected to decrease as more eukaryotic genomes are sequenced
What are the possible explanations for the presence of bacterial-like genes in the human genome, as discussed in the IHGP human genome paper?
contamination: unlikely. 35 genes were tested by PCR —> real, and many have orthologs in other vertebrates
genes were present in the common ancestor of eukaryotes and lost in yeast, fly, …
—> requires many independent cases of loss
transfer from human to bacteria: the 113 genes are found in many diverse bacterial species —> requires many independent gene transfers
authors' preferred explanation: genes were transferred directly from bacteria to vertebrates (at least 113 genes)
What percentage of human genes have homologs in the mouse genome?
at least 98%
How is the order of genes conserved between the human and mouse genomes on a small and large scale?
small scale: well conserved —> same genes found in same order (synteny)
large scale: many chromosomal re-arrangements
e.g. gene blocks from mouse chromosome 16 are spread over 6 different human chromosomes
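One simple way to make such blocks visible is to walk along the genes of one chromosome in order and group neighbouring genes whose orthologs lie on the same chromosome in the other species. The sketch below assumes the ortholog assignments are already known; the gene names and chromosome labels are hypothetical, and real synteny tools also check gene order and orientation within each block.

```python
from itertools import groupby

def chromosome_blocks(ordered_genes: list, ortholog_chrom: dict) -> list:
    """Group consecutive genes whose orthologs share a chromosome.

    ordered_genes: genes in their order along one chromosome of species 1.
    ortholog_chrom: gene -> chromosome of its ortholog in species 2.
    """
    blocks = []
    for chrom, run in groupby(ordered_genes, key=lambda g: ortholog_chrom[g]):
        blocks.append((chrom, list(run)))
    return blocks

# Hypothetical genes along a mouse chromosome and the human chromosome
# carrying each ortholog.
genes = ["g1", "g2", "g3", "g4", "g5"]
human_chrom = {"g1": "3", "g2": "3", "g3": "21", "g4": "21", "g5": "3"}
blocks = chromosome_blocks(genes, human_chrom)
```

Within a block the genes are neighbours in both species (small-scale conservation), while the changes of chromosome between blocks reflect large-scale rearrangements.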