Basic Genomics

BEG

by Tanja P.

Name two methods (or their developers) for sequencing DNA

Two methods developed in the mid-1970s:

Maxam-Gilbert (chemical) – rarely used
Sanger (dideoxy, enzymatic) – Developed by Frederick Sanger and is still used today with little change to the basic method, although great improvements have been made in efficiency and automation. This method was used for all of the early genome projects, including the initial sequencing of the human genome.

How much of the human genome sequence is protein coding?

Less than 2% of the human genome.

What is ab initio prediction? Name its pros and cons

Uses computer programs (such as Genscan or Genie) to identify genes from raw DNA sequence data.
Looks for long open reading frames (ORFs) that begin with a start codon (ATG) and end with a stop codon (TAA, TAG, TGA), but contain no internal stop codons.
Can also incorporate intron splice signals or codon bias information.
Pro – fast and easy to implement on computer, no additional experimental work required.
Con – computer algorithms perform poorly when there are many long introns. No experimental evidence that the predicted genes are “real”.
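
To make the ORF rule above concrete, here is a minimal sketch of an ORF scan. It is not how Genscan or Genie work (those use statistical gene models); the function name `find_orfs` and the `min_codons` cutoff are made up for illustration, and only the three forward-strand reading frames are scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ORF on the forward strand.

    start is the index of the A in ATG, end is one past the stop codon.
    min_codons (hypothetical cutoff) counts the start and stop codons too.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # First in-frame stop closes the ORF, so no internal stops.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

A real gene finder would also scan the reverse complement and, as the card says, add splice-signal or codon-bias information.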

What do Le^-m and L/n represent?

Le^-m = total gap length

L/n = average gap size

where L = genome length, n = number of random fragments sequenced, m = sequence coverage, and e = the base of the natural logarithm
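
A small worked example of the two formulas, as a sketch. The card does not say how m is obtained; the code assumes m = n × read length / L, and the function name and the example numbers (1 Mb genome, 10,000 reads of 500 bp) are made up for illustration.

```python
import math


def shotgun_gap_stats(genome_len, n_reads, read_len):
    """Return (coverage m, total gap length L*e^-m, average gap size L/n)."""
    # Assumption: coverage is total sequenced bases divided by genome length.
    m = n_reads * read_len / genome_len
    total_gap = genome_len * math.exp(-m)
    avg_gap = genome_len / n_reads
    return m, total_gap, avg_gap


m, total_gap, avg_gap = shotgun_gap_stats(1_000_000, 10_000, 500)
print(f"coverage m = {m:g}")
print(f"total gap length = {total_gap:.0f} bp")
print(f"average gap size = {avg_gap:g} bp")
```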

Name the two assumptions for random shotgun sequencing

1. Sampling with replacement (starting genomic DNA >> sequenced DNA)

2. All pieces of DNA clone with equal frequency. This is not 100% true. Repetitive DNA, inverted repeats, and fragments with high or low %GC often do not clone well in bacteria. In general, the highly repetitive parts of a genome (usually around the centromeres and telomeres) that contain few genes are known as heterochromatin and are not cloned or sequenced in genome projects. Thus, what is usually published is the euchromatic genome.

How do you estimate the probability that a base is not sequenced in random shotgun sequencing?

From the Poisson distribution as P = e^-m, where e is the base of the natural logarithm and m is the sequence coverage.

m = coverage; a = number of times a base is sequenced. In general P(a) = m^a e^-m / a!, so a base that is never sequenced (a = 0) has P = e^-m.
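
The Poisson estimate can be checked numerically with a short sketch. The function name is hypothetical; setting a = 0 reproduces P = e^-m.

```python
import math


def prob_sequenced_a_times(m, a):
    """Poisson probability that a base is sequenced exactly a times at coverage m."""
    return m ** a * math.exp(-m) / math.factorial(a)


# Fraction of bases expected to be missed (a = 0) at a few coverage levels.
for m in (1, 5, 10):
    print(f"m = {m:2d}: P(not sequenced) = {prob_sequenced_a_times(m, 0):.6f}")
```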

What does m stand for in context of random shotgun sequencing?

Coverage (m) = average number of times each base is sequenced

Describe the genome sequencing method called “Whole Genome Shotgun” and list its pros and cons

The entire genome is randomly cloned into small-insert vectors and a large number of clones are sequenced.
Typically plasmid vectors (2–10 Kb) are used.
The raw sequence is then assembled by computer to reconstruct the genome.
This approach was used by the private company Celera Genomics for sequencing the human genome.
Pro: can start almost immediately, saves the time needed for clone mapping, faster, less expensive.
Con: harder to assemble, may leave many gaps, “finishing” (filling in the gaps) is more difficult.

Describe the genome sequencing method called “clone by clone” and list its pros and cons

The genome is cloned into large-insert vectors (BAC, YAC, P1; 100–200 Kb) and these are mapped to form a minimal overlapping set.
Each clone is broken up and sequenced individually by the shotgun method.
This method is also known as “hierarchical shotgun sequencing”, because each of the BAC clones is then sequenced and assembled using the shotgun approach. Advantage: it is easier to assemble a BAC clone than an entire genome.
This approach was used by the publicly-funded international human genome project (IHGP).
For genome projects, typically BAC (bacterial artificial chromosome, approx. 100 Kb) vectors are used.
Pro: less redundant sequencing, easily subdivided among labs, easier assembly, clones available for “finishing” (filling in the gaps) and further research.
Con: clone mapping is difficult and requires a lot of time. For the human genome, more time was spent mapping clones than sequencing them.

What is the C-value and what do you need to know about it?

C-value = size of the genome
C = the amount of DNA in a single haploid cell
C is nearly constant within a species, but varies greatly among species.
There is no strong correlation between an organism’s complexity and its genome size (the C-value paradox; for example, an Amoeba with a genome of over 600 Gb).
Homo sapiens: C-value = 3.2 Gb
E. coli: C-value = 4.7 Mb
D. melanogaster: C-value = 180 Mb

What is the scale of genomes?

1,000 base pairs (bp) = 1 kilobase (Kb); scale of individual genes

1,000,000 bp = 1,000 Kb = 1 megabase (Mb); scale of bacterial genomes

1,000,000,000 bp = 1,000,000 Kb = 1,000 Mb = 1 gigabase (Gb); scale of vertebrate genomes
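
As a quick check of the unit scale, here is a sketch that converts a raw base-pair count into the largest fitting unit. The helper name `format_bp` is made up; the example values are the C-values quoted in this deck.

```python
def format_bp(n_bp):
    """Express a base-pair count in Gb, Mb, Kb or bp, whichever fits best."""
    for unit, size in (("Gb", 10**9), ("Mb", 10**6), ("Kb", 10**3)):
        if n_bp >= size:
            return f"{n_bp / size:g} {unit}"
    return f"{n_bp} bp"


for name, size in (("Homo sapiens", 3_200_000_000),
                   ("D. melanogaster", 180_000_000),
                   ("E. coli", 4_700_000)):
    print(f"{name}: {format_bp(size)}")
```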

Give a quick overview of “next generation” sequencing

Commercial methods such as Illumina and Roche FLX (454)

Use massively-parallel methods to simultaneously sequence millions of short pieces of DNA (read lengths are usually in the range of 50–800 bp).
Read lengths are typically shorter than those of Sanger sequencing and the error rates are higher.
Have largely replaced the Sanger method for genome sequencing.
There are also new technologies that allow for longer read lengths (10,000 bp or more).

What is the biggest problem for genomics?

The longest continuous stretch of sequence that can be read by a single sequencing reaction is ≈ 1,000 bp, and the high-quality portion is typically only 500–700 bp. Thus, to sequence a genome (or any large piece of DNA), many short reads need to be put together in the correct order.

Name two methods for sequencing larger pieces of DNA, with their pros and cons

1. Sequence Walking (Primer Walking) – a new primer is designed to match the end of the previous sequence

Pro: minimum amount of sequencing, no assembly required

Con: slow and expensive, must wait for the results of each reaction before performing the next, must design custom primers each time

2. Shotgun Sequencing – sequence all pieces at once, then assemble them in order

Pro: faster, more cost-efficient, high-throughput parallel processing, can use universal primers

Con: requires more (redundant) sequencing, assembly can be difficult
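
The “assemble them in order” step of shotgun sequencing can be illustrated with a toy greedy overlap merge. This is only a sketch: real assemblers also handle reverse complements, sequencing errors and repeats, and the function names and the `min_len` overlap threshold are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best = (0, None, None)  # (overlap length, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


print(greedy_assemble(["GCGTACC", "ATGGCGT", "TACCTGA"]))
```

Reads that share no sufficiently long overlap stay as separate sequences, which is how gaps show up in an assembly.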

Steps of the modern (automated) read method

1. All 4 fluorescently-labeled ddNTPs are used in 1 reaction, each a different “color”

2. Fragments are separated in matrix-filled capillary tubes, 1 capillary per reaction

3. Laser detects fluorescence automatically as each fragment exits capillary

4. Computer software “calls bases” and processes sequence files (if sequences were processed by a human at 15 min. per sample, it would take 7 people a full-time week to process 1 day’s output from an automated sequencer)

96 samples can be run in parallel.

Steps of the original read method

1. Four reactions are run, one with each ddNTP, radioactive labeling

2. Fragments are separated on a polyacrylamide “slab” gel, 1 lane per ddNTP reaction

3. The entire gel is dried and exposed to X-ray film

4. Sequence is interpreted from band order by human inspection

Steps for sequencing DNA

1. DNA and primers heated to denature, then cooled to anneal

2. Polymerase adds nucleotides complementary to the template, starting at end of primer

3. Occasionally a ddNTP is incorporated and the reaction stops (no more bases can be added after the terminator).

4. DNA fragments of different lengths are separated and the sequence is “read”
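
The logic of steps 2 to 4 can be sketched as an idealized simulation: one terminated fragment for every position of the new strand, then a separation by length and a read-out of the terminal base. This ignores the chemistry entirely; the function name is made up, and the template is assumed to be written 3'→5' so that the read comes out 5'→3'.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sanger_read(template_3to5, seed=0):
    """Recover the newly synthesized strand from chain-terminated fragments."""
    # Step 2: the polymerase builds the strand complementary to the template.
    new_strand = "".join(COMPLEMENT[base] for base in template_3to5)
    # Step 3: idealized, one fragment stops at every position;
    # each is recorded as (fragment length, terminal ddNTP base).
    fragments = [(i + 1, new_strand[i]) for i in range(len(new_strand))]
    random.Random(seed).shuffle(fragments)  # fragments start out as a mixture
    # Step 4: separate by length (shortest first) and read the terminal bases.
    fragments.sort()
    return "".join(base for _, base in fragments)


print(sanger_read("TACGGCAT"))
```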

Requirements for sequencing DNA

1. Template DNA (to be sequenced), typically purified plasmid + insert DNA

2. Specific primer DNA (≈20 nucleotides complementary to one strand of the template)

3. DNA polymerase (the enzyme that replicates DNA)

4. Deoxynucleotides (dATP, dGTP, dCTP, dTTP = dNTPs), high concentration.

5. Dideoxynucleotides (ddATP, ddGTP, ddCTP, ddTTP = ddNTPs), low concentration.

These are known as “terminators” (no base can be added after them; synthesis stops at random positions, which yields fragments of different lengths)

6. Labeled nucleotides (radioactive dNTPs or fluorescent ddNTPs), low concentration

What is repetitive DNA?

Long stretches of the same DNA sequence repeated many times in tandem.
This makes up most of the heterochromatin. It is enriched at centromeres and telomeres.

What are transposable elements (TEs)?

Pieces of DNA that can replicate and move within the genome.
Also known as “interspersed repetitive DNA”, “jumping genes” or “selfish DNA”.
Make up about half of the human genome.
Many copies are “dead” or partial TE sequences that can no longer “jump” and are just relics of previously-active TEs.
TEs can be in heterochromatin or euchromatin.

What are pseudogenes?

Genes that are no longer functional (often duplicates of functional genes).
Typically have a stop codon or frame-shift within their ORF.
May have lost their promoter and not be transcribed.
Pseudogenes are typically in euchromatin.

What is comparative prediction? Name its pros and cons

Looks for sequences sharing homology with other, known genes.
Can compare different species.
If ORFs are conserved between species, they are likely functional.
This is usually done by searching public databases with programs such as BLAST (basic local alignment search tool).

Pro – fast and easy to implement on computer. Homology may give a hint to gene function.

Con – overlooks unique or fast-evolving genes. Requires sequences from related species.

What is experimental identification? Name its pros and cons

mRNA is isolated from the organism and converted to cDNA by reverse transcription.
The cDNA is then sequenced.
These sequences are often referred to as ESTs (Expressed Sequence Tags), because they are usually not full-length cDNAs, but only part of an expressed sequence.

Pro – experimental evidence that genes are expressed. Intron/Exon boundaries can be determined by comparing cDNA sequence (which lacks introns) to genome sequence.

Con – requires more experimental work and sequencing. Genes expressed at low levels or regulated temporally or spatially may be overlooked.

Do introns and intergenic regions encode proteins?

No, but they may contain important gene regulatory information or may be transcribed into functional, non-coding RNA.

What is a contig?

A contig (from “contiguous”: adjoining, connected) is a set of overlapping DNA or protein pieces (reads) that come from the same genetic source. Such a contig can be used to derive the original DNA sequence of that genetic source (e.g. the sequence of a chromosome).

What is this

Chromatogram

What does this picture describe in the context of sequencing many pieces of DNA?

How to sequence many pieces of DNA when you don’t know their sequence and so cannot design a specific primer:

Clone the DNA into a known vector such as a plasmid.

Then design primers that are complementary to the plasmid DNA.
If you design primers to both sides of the insert site in the plasmid, you can generate mate pairs (paired reads), which are very important for genome assembly.

This is because you know that the sequences from the two sides come from the same template, and if you know the size of the template you know how much space should lie between the two sequence reads.

E.g. if the template is 2 Kb and each sequence read is 500 bp, you know that there should be about 1 Kb between the two sequence reads.
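
The spacing arithmetic from the example (2 Kb template, two 500 bp reads) can be written as a small sketch. The function name is hypothetical; a negative result would mean the two reads overlap.

```python
def mate_pair_gap(template_len, left_read_len, right_read_len):
    """Unsequenced distance between two paired reads from the same template."""
    return template_len - left_read_len - right_read_len


# 2 Kb template with a 500 bp read from each end.
print(mate_pair_gap(2_000, 500, 500), "bp between the two reads")
```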

How many codons exist for how many amino acids (AA)?

61 codons for 20 AA, plus 3 stop codons (64 codons in total)

Many AA are encoded by more than one codon
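
These counts can be verified by building the standard genetic code table, here from its usual compact form: 64 one-letter symbols in T, C, A, G order, with `*` marking a stop codon. A sketch; the variable names are made up.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
codons_per_aa = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print("total codons:", len(CODON_TABLE))
print("stop codons:", stop_codons)
print("sense codons:", sum(codons_per_aa.values()))
print("amino acids:", len(codons_per_aa))
print("AA with a single codon:", sorted(aa for aa, n in codons_per_aa.items() if n == 1))
```

Only methionine (M) and tryptophan (W) have a single codon; every other amino acid has between two and six.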
