Genome size and chromosomes
180 Mb
5 linear chromosomes (X, 2, 3, 4, Y)
Major chromosomes: X, 2, 3
2 and 3 have left and right arms on either side of the centromere (2L, 2R, 3L, 3R)
X/1: only one arm
Y: only present in males and almost completely heterochromatic
4: very small (1% of genome) -> “dot” chromosome
How much DNA was sequenced?
Sequenced amount: 120 Mb of euchromatin (heterochromatin was not sequenced because it does not clone well in bacterial vectors)
Who sequenced the Genome?
Sequenced by
the private, for-profit company Celera Genomics (led by Craig Venter) using WGS (whole-genome shotgun sequencing)
in collaboration with publicly funded projects (BDGP = Berkeley Drosophila Genome Project, EDGP = European Drosophila Genome Project), which used a clone-by-clone approach
Why the Drosophila Genome? (Advantages)
Historical importance
Large research community (around 5,000 people worldwide)
Powerful research tools (already 2,500 known genes)
Modest genome size
(Good test for WGS before sequencing the human genome)
Genome Sequencing Strategy (and duration)
Some sequence (around 29 Mb from 1,000 BACs) was already available from public projects (BDGP/EDGP)
Shotgun Sequencing (Celera Genomics):
Genomic DNA broken into 3 size classes (2 Kb, 10 Kb, 130 Kb) and cloned into plasmids (2 Kb, 10 Kb) or BACs (130 Kb)
Inserts sequenced with forward and reverse primers to get “paired reads” for over 70% of clones
Total reads: 1,903,468 from 2 Kb & 1,278,386 from 10 Kb & 19,738 from 130 Kb
3,201,592 total reads (aver. read length = 551 bp)
Total seq ≈ 1.7 Gb ≈ 12.5x coverage (assembled by computer)
Gap closure left to the public projects
Four months using 300 ABI 3700 sequencers (96 capillaries per machine), liquid-handling robots and 50 people
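Worked check of the read totals and coverage: a minimal sketch that recomputes the numbers above from the per-library read counts. Measuring coverage against the 120 Mb of euchromatin is an assumption; the notes do not say which genome size or read trimming the quoted 12.5x is based on, and the raw ratio computed here comes out somewhat higher than that figure.

```python
# Read counts per insert-size library, copied from the notes above.
reads_per_library = {"2 kb": 1_903_468, "10 kb": 1_278_386, "130 kb": 19_738}

AVG_READ_LEN_BP = 551          # average read length from the notes
EUCHROMATIN_BP = 120_000_000   # assumption: coverage is measured against 120 Mb

total_reads = sum(reads_per_library.values())
total_bases = total_reads * AVG_READ_LEN_BP
raw_coverage = total_bases / EUCHROMATIN_BP

print(f"total reads:    {total_reads:,}")
print(f"total sequence: {total_bases / 1e9:.2f} Gb")
print(f"raw coverage:   {raw_coverage:.1f}x (against 120 Mb)")
```

The read total reproduces the 3,201,592 given above, and the total sequence lands close to the quoted 1.7 Gb.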
Assembly Strategy
with Celera Assembler
Steps:
Screener: mask (hide) all known repetitive sequences, so they are not used in alignment (to prevent wrong interpretation of overlaps)
Overlapper: pairwise comparison of all reads for overlaps (require at least 40 bp of overlap with < 6% mismatch in unmasked sequence)
Unitigger: build unique contigs of overlapping fragments supported by paired reads
Scaffolder: combine unitigs with paired BAC read data to make “scaffolds” (collection of ordered unitigs with approx. known distances between them)
Repeat resolution:
fill-in sequence around masked repeats
Rocks = contigs supported by at least 2 mate pairs
Stones = contigs supported by 1 mate pair plus overlap with another mate pair supported contig
Pebbles = best overlap tiling across gaps without mate pair support
Consensus: collapse overlaps into single sequence using highest quality sequence reads.
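Sketch of the Overlapper criterion: a toy illustration of "at least 40 bp of overlap with < 6% mismatch", not the Celera Assembler's actual algorithm. The function name `suffix_prefix_overlap` is hypothetical, the comparison is ungapped and brute force, and repeat masking from the Screener step is ignored.

```python
MIN_OVERLAP_BP = 40           # threshold from the notes
MAX_MISMATCH_FRACTION = 0.06  # threshold from the notes

def suffix_prefix_overlap(a: str, b: str) -> int:
    """Return the longest length at which a suffix of `a` matches a prefix
    of `b` within the thresholds, or 0 if no such overlap exists.
    Hypothetical helper: ungapped comparison, no repeat masking."""
    max_len = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink down to the minimum.
    for length in range(max_len, MIN_OVERLAP_BP - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches / length < MAX_MISMATCH_FRACTION:
            return length
    return 0

# Two toy reads that share a 50 bp stretch at the end of one and start of the other.
shared = "ACGTTGCAAG" * 5
read_1 = "A" * 30 + shared
read_2 = shared + "C" * 30
overlap_len = suffix_prefix_overlap(read_1, read_2)
no_overlap_len = suffix_prefix_overlap("A" * 100, "C" * 100)
```

A real overlapper has to avoid this all-against-all brute force, which is why the comparison count in the next section becomes the bottleneck.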
Processor time for Assembly
Processor time: the first assembly step alone (pairwise comparison of 3.2 million reads for overlaps) takes about 5 x 10^12 comparisons
-> Celera's supercomputer was capable of 32 million comparisons per second, so the initial comparison would still require around 48 h
-> Parallel processing
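Worked check of the comparison count: a minimal sketch, assuming a plain all-against-all comparison of n reads, i.e. n(n-1)/2 pairs, and a constant rate of 32 million comparisons per second. Under these assumptions the result is roughly 5 x 10^12 comparisons and about 44 hours, the same order as the "around 48 h" quoted above.

```python
N_READS = 3_201_592                 # total reads from the notes
COMPARISONS_PER_SECOND = 32_000_000 # rate from the notes

# Every read is compared once with every other read: n * (n - 1) / 2 pairs.
comparisons = N_READS * (N_READS - 1) // 2
seconds = comparisons / COMPARISONS_PER_SECOND
hours = seconds / 3600

print(f"pairwise comparisons: {comparisons:.2e}")
print(f"time on one machine:  {hours:.1f} h")
```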
What 3 assemblies did Celera do?
Celera made 3 assemblies
Joint: all shotgun + public data (best)
12.5 x WGS: only shotgun data
6.5 x WGS: only about half of the shotgun data (worst)
Gene Annotation methods
Ab initio prediction:
Programs Genie and Genscan were used to predict ORFs
Genie incorporated Drosophila specific parameters (intron signals, codon usage, EST (expressed sequence tag) data)
-> Better predictions
Genscan was not customized for Drosophila
-> Many false positives
Experimental identification: EST sequences from BDGP plus database information from gene and protein sequences (around 2.500 genes)
A gene was considered real if it was predicted by both programs, or if it was predicted by one program and supported by an EST or protein match
Annotation “jamboree”: over 40 Drosophila researchers from around the world met at Celera for two weeks to find and classify as many genes as possible
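Sketch of the acceptance rule for predicted genes: the rule above written as a boolean function. The function name and parameter names are hypothetical and only restate the rule as given in these notes; the actual annotation pipeline involved manual curation as well.

```python
def accept_gene(genie: bool, genscan: bool,
                est_match: bool, protein_match: bool) -> bool:
    """Return True if a gene model counts as real under the rule in the notes.
    Hypothetical helper; each argument says whether that evidence exists."""
    predicted_by_both = genie and genscan
    predicted_by_one = genie or genscan
    has_experimental_support = est_match or protein_match
    return predicted_by_both or (predicted_by_one and has_experimental_support)

# A Genscan-only prediction with no EST or protein match is rejected,
# which is how many Genscan false positives would be filtered out.
genscan_only = accept_gene(genie=False, genscan=True,
                           est_match=False, protein_match=False)
genie_with_est = accept_gene(genie=True, genscan=False,
                             est_match=True, protein_match=False)
```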
Total number of protein-coding genes (2000)
Number with unknown function
Current number of annotated protein-encoding genes (2020)
Total prediction: around 13,600 genes, fewer than C. elegans (the worm, which has around 20,000 genes)
About 8,000 of the Drosophila genes (60%) were of unknown function
Current number of annotated protein-encoding genes = 13,969
Updates of latest Drosophila genome release (2020)
includes complete euchromatin of X, 2L, 2R, 3L and 3R plus about 9 Mb of heterochromatin sequence
including improved scaffolds of Y and 4 plus improved gene annotations
What was the conclusion after sequencing the D. melanogaster genome?
WGS works for a eukaryote with a relatively large genome and repetitive DNA, but even 12.5x coverage leaves many gaps and much “finishing” work.