Name two methods (or their developers) for sequencing DNA
Two methods developed in the mid-1970s:
Maxam-Gilbert (chemical) – rarely used
Sanger (dideoxy, enzymatic) – Developed by Frederick Sanger and is still used today with little change to the basic method, although great improvements have been made in efficiency and automation. This method was used for all of the early genome projects, including the initial sequencing of the human genome.
How much of the human genome sequence is protein-coding?
less than 2% of the human genome.
What is ab initio prediction? Also name its pros and cons.
use computer programs (such as Genscan or Genie) to identify genes from raw DNA sequence data.
Look for long open reading frames (ORFs) that begin with a start codon (ATG) and end with a stop codon (TAA, TAG, TGA), but contain no internal stop codons.
Can also incorporate intron splice signals, or codon bias information.
Pro – fast and easy to implement on computer, no additional experimental work required.
Con – computer algorithms are not good when there are many, long introns. No experimental evidence that the predicted genes are “real”.
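The ORF search described above can be sketched in a few lines. This is only a minimal illustration, not how Genscan or Genie work: it scans the three forward reading frames, ignores introns and the reverse strand, and the `min_codons` cutoff is a hypothetical parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, orf) tuples for ORFs on the forward strand.

    An ORF here begins at ATG and runs to the first in-frame stop codon.
    min_codons is an assumed length cutoff (start and stop codons included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ATG in this frame
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```

Real gene finders add splice-signal and codon-bias models on top of this kind of scan.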
What is:
Le^-m
L/n
Le^-m = total gap length (L = genome length, m = coverage)
L/n = average gap size, where L = genome length and n = number of random fragments sequenced
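A small numeric sketch of these two quantities. The read length and the relation m = n × read length / L are assumptions added here for illustration; the function name is hypothetical.

```python
import math

def shotgun_gap_stats(genome_len, n_fragments, read_len):
    """Expected gap statistics for random shotgun sequencing.

    Assumes coverage m = n_fragments * read_len / genome_len.
    Returns (m, total gap length L*e^-m, average gap size L/n).
    """
    m = n_fragments * read_len / genome_len
    total_gap = genome_len * math.exp(-m)
    avg_gap = genome_len / n_fragments
    return m, total_gap, avg_gap

# Example: 1 Mb genome, 10,000 reads of 500 bp each (5x coverage)
m, total_gap, avg_gap = shotgun_gap_stats(1_000_000, 10_000, 500)
print(m, round(total_gap), avg_gap)
```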
Name the two assumptions for random shotgun sequencing
1. Sampling with replacement (starting genomic DNA >> sequenced DNA)
2. All pieces of DNA clone with equal frequency. This is not 100% true. Repetitive DNA, inverted repeats, high %GC or low %GC fragments often do not clone well in bacteria. In general, the highly-repetitive parts of a genome (usually around the centromeres and telomeres) that contain few genes are known as heterochromatin and are not cloned or sequenced in genome projects. Thus, what is usually published is the euchromatic genome.
How do you estimate the probability that a base is not sequenced in random shotgun sequencing?
from the Poisson distribution as P = e^-m, where e is the base of the natural logarithm and m is the sequence coverage
m = coverage, a = number of times a base is sequenced. In general P(a) = m^a e^-m / a!, which gives e^-m for a = 0 (base never sequenced).
What does m stand for in context of random shotgun sequencing?
Coverage (m) = average number of times each base is sequenced
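The Poisson estimate can be checked numerically. A minimal sketch using only the standard library; the function names are hypothetical.

```python
import math

def prob_sequenced_a_times(m, a):
    """Poisson probability that a base is sequenced exactly a times at coverage m."""
    return m ** a * math.exp(-m) / math.factorial(a)

def prob_not_sequenced(m):
    """Special case a = 0: the base is never sequenced."""
    return math.exp(-m)

# Fraction of bases expected to be missed at a few coverage levels
for m in (1, 5, 10):
    print(m, prob_not_sequenced(m))
```

At 5x coverage the missed fraction is below 1%, which is why high coverage is needed before gaps become rare.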
Describe the sequencing method for a genome called “Whole Genome Shotgun” and list pros and cons
entire genome randomly cloned into small-insert vectors and a large number are sequenced
Typically plasmid vectors (2–10 Kb) are used.
raw sequence is then assembled by computer to reconstruct the genome
This approach was used by the private company Celera Genomics for sequencing the human genome.
Pro: can start almost immediately, save time of clone mapping, faster, less expensive
Con: harder to assemble, may be many gaps, “finishing” (filling in the gaps) is more difficult.
Describe the sequencing method for a genome called “clone by clone” and list pros and cons
Genome is cloned into large-insert vectors (BAC, YAC, P1; 100–200 Kb) and these are mapped to form a minimal overlapping set.
each clone is broken up and sequenced individually by the shotgun method.
This method is also known as “hierarchical shotgun sequencing”, because each of the BAC clones is then sequenced and assembled using the shotgun approach. Advantage: it is easier to assemble a BAC clone than an entire genome.
This approach was used by the publicly-funded international human genome project (IHGP).
For genome projects, typically BAC (bacterial artificial chromosome, approx. 100 Kb) vectors are used.
Pro: less redundant sequencing, easily subdivided among labs, easier assembly, clones available for “finishing” (filling in the gaps) and further research.
Con: clone mapping is difficult and requires a lot of time. For the human genome, more time was spent mapping clones than sequencing them.
What is the C-value and what do you need to know about it?
size of the genome = C-value
C = the amount of DNA in a single haploid cell
C is nearly constant within species, but varies greatly among species.
There is no strong correlation between an organism’s complexity and its genome size (the C-value paradox; for example, Amoeba has a genome of over 600 Gb).
Homo sapiens: C-value = 3.2 Gb
E. coli: C-value = 4.7 Mb
D. melanogaster: C-value = 180 Mb
What is the scale of genomes
1,000 base pairs (bp) = 1 kilobase (Kb); scale of individual genes
1,000,000 bp = 1,000 Kb = 1 megabase (Mb); scale of bacterial genomes
1,000,000,000 bp = 1,000,000 Kb = 1,000 Mb = 1 gigabase (Gb); scale of vertebrate genomes
Give a quick overview of “next-generation” sequencing
commercial methods such as Illumina and Roche FLX (454)
use massively-parallel methods to simultaneously sequence millions of short pieces of DNA (read lengths are usually in the range of 50–800 bp).
read lengths typically shorter than those of Sanger sequencing and the error rates are higher
have largely replaced the Sanger method for genome sequencing.
There are also new technologies that allow for longer read lengths (10,000 bp or more).
What is the biggest problem for genomics
the longest continuous stretch of sequence that can be read by a single sequencing reaction is ≈ 1,000 bp, and the high-quality portion is typically only 500–700 bp. Thus, to sequence a genome (or any large piece of DNA), many short reads need to be put together in the correct order
Name two methods which can sequence larger pieces of DNA and their pros and cons
1. Sequence Walking (Primer Walking) – a new primer is designed to match end of previous sequence
Pro: minimum amount of sequencing, no assembly required
Con: slow and expensive, must wait for the results of each reaction before performing the next, must design custom primers each time
2. Shotgun Sequencing – sequence all pieces at once, then assemble them in order
Pro: faster, more cost-efficient, high-throughput parallel processing, can use universal primers
Con: requires more (redundant) sequencing, assembly can be difficult
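The "assemble them in order" step can be illustrated with a toy greedy overlap merge. This is a sketch under strong assumptions (error-free reads, all from the same strand, no repeats), not how production assemblers work; `min_len` is a hypothetical minimum overlap.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left: the remaining pieces are separated by gaps
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))  # ['ATGGCGTACCTTA']
```

When no overlap remains, the function returns several pieces, which corresponds to an assembly with gaps.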
Steps of the modern read method
1. All 4 fluorescently-labeled ddNTPs are used in 1 reaction, each a different “color”
2. Fragments are separated in matrix-filled capillary tubes, 1 capillary per reaction
3. Laser detects fluorescence automatically as each fragment exits capillary
4. Computer software “calls bases” and processes sequence files (if sequences were processed by a human at 15 min. per sample, it would take 7 people a full-time week to process 1 day’s output from an automated sequencer)
96 samples can be run in parallel
Steps of the original read method
1. Four reactions are run, one with each ddNTP, radioactive labeling
2. Fragments are separated on a polyacrylamide “slab” gel, 1 lane per ddNTP reaction
3. The entire gel is dried and exposed to X-ray film
4. Sequence is interpreted from band order by human inspection
Steps for sequencing DNA
1. DNA and primers heated to denature, then cooled to anneal
2. Polymerase adds nucleotides complementary to the template, starting at end of primer
3. Occasionally a ddNTP is incorporated and the reaction stops (no more bases can be added after the terminator).
4. DNA fragments of different lengths are separated and the sequence is “read”
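The logic of steps 3 and 4 can be sketched as follows. This is a deterministic simplification: instead of simulating random ddNTP incorporation, it produces one fragment for every possible stop position, each tagged with its terminating base. The function names are hypothetical, and the template is assumed to be written in the order the polymerase copies it.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def chain_termination_fragments(template):
    """Return (length, terminating base) for every possible terminated fragment.

    template: region downstream of the primer, in the order it is copied.
    """
    synthesized = "".join(COMPLEMENT[b] for b in template)
    return [(i + 1, base) for i, base in enumerate(synthesized)]

def read_sequence(fragments):
    """Sort fragments by length (as the gel or capillary does) and read the terminal bases."""
    return "".join(base for _, base in sorted(fragments))

fragments = chain_termination_fragments("TACG")
print(read_sequence(fragments))  # ATGC
```

The sequence that is read is the newly synthesized strand, which is the complement of the template.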
Requirements for sequencing DNA
1. Template DNA (to be sequenced), typically purified plasmid + insert DNA
2. Specific primer DNA (≈20 nucleotides complementary to one strand of the template)
3. DNA polymerase (enzyme that replicates DNA)
4. Deoxynucleotides (dATP, dGTP, dCTP, dTTP = dNTPs), high concentration.
5. Dideoxynucleotides (ddATP, ddGTP, ddCTP, ddTTP = ddNTPs), low concentration.
These are known as “terminators”: no base can be added after them, so synthesis stops at random positions and produces fragments of different lengths.
6. Labeled nucleotides (radioactive dNTPs or fluorescent ddNTPs), low concentration
What is repetitive DNA?
long stretches of the same DNA sequence repeated many times in tandem.
This makes up most of the heterochromatin. It is enriched at centromeres and telomeres.
What are transposable elements
pieces of DNA that can replicate and move within the genome.
Also known as “Interspersed repetitive DNA”, “jumping genes” or “selfish DNA”.
Make up about 1/2 of the human genome.
Many copies are “dead” or partial TE sequences that can no longer “jump” and are just relics of previously-active TEs.
TEs can be in heterochromatin or euchromatin.
What are pseudogenes
genes that are no longer functional (often duplicates of functional genes).
Typically have a stop codon or frame-shift within their ORF.
May have lost their promoter and not be transcribed.
Pseudogenes are typically in euchromatin.
What is comparative prediction? Also name its pros and cons.
look for sequences sharing homology with other, known genes.
Can compare different species.
If ORFs are conserved between species, they are likely functional.
This is usually done by searching public databases with programs such as BLAST (basic local alignment search tool).
Pro – fast and easy to implement on computer. Homology may give a hint to gene function.
Con – overlooks unique or fast-evolving genes. Requires sequences from related species.
What is experimental identification? Also name its pros and cons.
mRNA is isolated from the organism and converted to cDNA by reverse transcription.
cDNA is then sequenced
These sequences are often referred to as ESTs (Expressed Sequence Tags), because they are usually not full-length cDNAs, but only part of an expressed sequence.
Pro – experimental evidence that genes are expressed. Intron/Exon boundaries can be determined by comparing cDNA sequence (which lacks introns) to genome sequence.
Con – requires more experimental work and sequencing. Genes expressed at low levels or regulated temporally or spatially may be overlooked.
Do introns and intergenic regions encode proteins?
No, but they may contain important gene regulatory information or may be transcribed into functional, non-coding RNA.
What is a contig?
A contig (from “contiguous”) is a set of overlapping DNA or protein pieces (reads) that come from the same genetic source.[1] Such a contig can be used to derive the original DNA sequence of that genetic source (e.g. the sequence of a chromosome).
What is this
Chromatogram
What does this picture describe when we are talking about sequencing many pieces of DNA?
How to sequence many pieces of DNA when their sequence is unknown and you therefore cannot design a specific primer:
Clone the DNA into a known vector such as a plasmid.
Then design primers that are complementary to the plasmid DNA.
If you design primers to the plasmid sequence on both sides of the insert, you can generate mate pairs (paired reads), which are very important for genome assembly.
This is because you know that the sequences from the two sides come from the same template, and if you know the size of the template, you know how much space should lie between the two sequence reads.
E.g. if the template is 2 Kb and each sequence read is 500 bp, you know that there should be about 1 Kb between the two reads.
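The mate-pair arithmetic from the example, as a tiny sketch (the function name is hypothetical, and both reads are assumed to have the same length):

```python
def mate_pair_gap(insert_size_bp, read_len_bp):
    """Expected unsequenced distance between two paired reads from one insert."""
    return insert_size_bp - 2 * read_len_bp

print(mate_pair_gap(2000, 500))  # 1000
```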
How many codons exist for how many amino acids (AA)?
61 codons for 20 AA, plus 3 stop codons (64 codons in total).
Many AA are encoded by more than one codon.
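The numbers can be verified by enumerating all triplets of the four bases. A minimal sketch; the stop codons are the three named earlier (TAA, TAG, TGA), and the function name is hypothetical.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

def codon_counts():
    """Return (total codons, amino-acid-coding codons, stop codons)."""
    all_codons = ["".join(c) for c in product("ACGT", repeat=3)]
    sense = [c for c in all_codons if c not in STOP_CODONS]
    return len(all_codons), len(sense), len(STOP_CODONS)

print(codon_counts())  # (64, 61, 3)
```

With 61 codons for 20 AA, most AA must be encoded by more than one codon.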