Definition of transcriptomics
= subfield of functional genomics that focuses on gene expression
-> typically with a focus on mRNA (transcripts),
-> non-coding RNA can also be analyzed
What is the level of gene expression?
= amount of mRNA from each gene that is present in a particular cell, tissue, or organism; considered as a phenotype
-> sometimes “intermediate phenotype” since it is very close to the genotype in the pathway which determines the final phenotype of an organism
How is a phenotype created?
DNA -> mRNA -> protein -> interactions with internal/external environment -> phenotype
Because mRNAs correspond to particular genes in the genome, it is often possible to establish a link between a genotype and an expression phenotype
Which methods can be used to analyse the transcriptome and what questions can they answer?
Large-scale, high-throughput methods
Affymetrix "Affy" GeneChips™
How much transcript is there from each gene (expression level)?
How does expression level change over development (expression profile)?
How does environment/treatment affect gene expression?
What is EST sequencing?
How does it work and what are the pros and cons?
= first genome-wide method used to investigate gene expression
-> has been applied to, e.g., Drosophila and human
EST databases are used to estimate expression levels when there is no other experimental evidence.
the more times an EST corresponding to a particular gene was sequenced, the higher the expression level of that gene.
reverse transcription of mRNA (typically not all of it) to cDNA
sequence a large number of cDNAs
-> resulting sequence fragments = EST (Expressed Sequence Tags)
gives an estimate of absolute mRNA abundance (if cDNA library is random)
very useful for gene discovery (annotation of expressed regions of genomes and intron/exon boundaries)
expensive, time-consuming, requires large-scale sequencing
redundant sequencing (ESTs from highly-expressed genes are sequenced many times)
must sequence hundreds of thousands of ESTs to get a good representation of genes expressed at low levels
the ESTs only reveal gene expression levels in the particular tissue or sample that was used for mRNA preparation
possibly missing genes expressed in specific tissues, cells, developmental stages, etc.
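The counting idea behind EST-based expression estimates can be sketched as follows. This is a minimal illustration, assuming each sequenced EST has already been assigned to a gene; the function name and the example gene list are hypothetical and not taken from any real EST database.

```python
from collections import Counter

def est_expression(est_gene_hits):
    """Return {gene: (EST count, fraction of all ESTs)}.

    est_gene_hits: one gene name per sequenced EST (assumed to be
    already matched to its gene).
    """
    counts = Counter(est_gene_hits)
    total = sum(counts.values())
    return {gene: (n, n / total) for gene, n in counts.items()}

# Hypothetical library of six ESTs, for illustration only.
ests = ["Adh", "Adh", "Adh", "Act5C", "Act5C", "w"]
for gene, (n, frac) in sorted(est_expression(ests).items()):
    print(f"{gene}: {n} ESTs ({frac:.2f} of library)")
```

A gene that is hit by only one EST, or by none, shows why very deep sequencing is needed before genes expressed at low levels are represented reliably.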
How is a Microarray constructed?
And what are its pros and cons?
attach specific DNA sequences (probes/spots) to a solid surface (often a glass microscope slide)
-> many thousands of probes, each matching a different gene
can quickly, cost-efficiently do many comparisons and replicates
do not measure absolute, but relative abundance; statistical interpretation may be difficult
What types of DNA can be used as probes on Microarrays?
cDNA or EST sequences
(-) need to be cloned for each gene that is used
(+) longer probe sequences, which may give better hybridization signals (especially for cross-species comparisons)
PCR-amplified genomic DNA
(+) can be made to all predicted genes in the genome
(+) specifically designed to reduce crosshybridization (by avoiding sequence regions that are similar in two or more genes)
Synthesized oligonucleotides (typically 36–80 bases long)
advantages same as PCR-amplified genomic DNA
How can the gene expression be measured using hybridizations (“hybs”) in Microarrays?
Purification of RNA from the samples to be compared
Reverse transcription of mRNA to cDNA while labeling with fluorescent dye (one sample “red”, the other “green”)
Place equal amounts of the two labeled cDNA solutions on the same microarray (under a coverslip) and hybridize overnight
Removal of excess and unbound DNA by washing
Let array dry and scan with laser to create graphical image
Analyze image to determine relative expression difference (red/green signal intensity for each spot)
Explain the two main approaches that are used to determine which genes are differentially expressed in Microarrays.
choose an arbitrary fold-difference to define genes that are differentially expressed
e.g. fold change = 2: gene must have an expression level that is at least 2 times higher in one sample than in the other to be considered differentially expressed
a log scale is often used: a gene with a log2 expression ratio ≥ 1 (or ≤ –1) is considered differentially expressed
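The fold-change criterion can be written as a short check. This is a sketch only: the function names, the default cutoff of 2, and the intensity values are assumptions made for illustration.

```python
import math

def log2_ratio(expr_a, expr_b):
    """log2 of the expression ratio between sample A and sample B."""
    return math.log2(expr_a / expr_b)

def differentially_expressed(expr_a, expr_b, min_fold=2.0):
    """True if expression differs by at least min_fold in either direction."""
    return abs(log2_ratio(expr_a, expr_b)) >= math.log2(min_fold)

# Hypothetical signal intensities (sample A, sample B) for three spots.
spots = {"gene1": (200.0, 100.0), "gene2": (150.0, 100.0), "gene3": (50.0, 400.0)}
for gene, (a, b) in spots.items():
    print(gene, round(log2_ratio(a, b), 2), differentially_expressed(a, b))
```

Working on the log2 scale makes the cutoff symmetric: a two-fold increase gives +1 and a two-fold decrease gives -1.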
= statistical method, such as a t-test, binomial test, ANOVA, or a Bayesian method is used to calculate a p-value for each gene
null hypothesis: gene is expressed equally in the two samples
-> reject, if p-val < cutoff and consider gene as differentially expressed
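the typical p-val cutoff of 0.05 is not appropriate when thousands of genes are tested at once (multiple testing: about 5% of the genes with no true difference would still pass the cutoff); instead use a False Discovery Rate (FDR) of 5%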
-> The approaches can be displayed graphically with a “volcano plot”, which shows the fold-change in expression between two samples (on a log2 scale) on the X-axis, and the p-value (typically on a –log10 scale) on the Y-axis.
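The FDR idea and the volcano-plot coordinates can be sketched in a few lines. The notes do not name a specific FDR procedure; the Benjamini-Hochberg procedure used here is an assumption (it is one common choice), and the function names and p-values are hypothetical.

```python
import math

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a list of booleans, True where a test is significant at the given FDR.

    Benjamini-Hochberg: sort the m p-values, find the largest rank k with
    p_(k) <= (k / m) * fdr, and call the k smallest p-values significant.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            max_rank = rank
    significant = [False] * m
    for idx in order[:max_rank]:
        significant[idx] = True
    return significant

def volcano_point(log2_fold_change, pvalue):
    """(x, y) coordinates of one gene on a volcano plot."""
    return (log2_fold_change, -math.log10(pvalue))

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(p < 0.05 for p in pvals), "genes pass a plain p < 0.05 cutoff")
print(sum(benjamini_hochberg(pvals)), "genes pass at FDR 5%")
```

With these made-up values, fewer genes pass the FDR criterion than the plain p-value cutoff, which is the point of correcting for multiple testing.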
Affymetrix "Affy" GeneChips™
How is a GeneChip constructed and how can the expression level be estimated in this approach?
Name the pros and cons.
Affymetrix produces and sells microarrays (GeneChips) for several model species (e.g. human, Drosophila)
arrays made by photolithography
-> = specific oligonucleotide probes (25 bases long) are synthesized directly on the array surface
For each gene, 20 different probes corresponding to different regions of the transcript are present on the array
20 “mismatch controls” that are identical to the above probes except for a single mismatched nucleotide at the center of the sequence (base 13)
Estimate expression level of each gene by the intensity difference between match and mismatch probes, averaged over all 20 probes per gene.
-> not a competitive hybridization of two different samples. Only one sample is hybridized per array.
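The averaging described above can be sketched as follows. This is a simplified illustration of the match-minus-mismatch idea only, not Affymetrix's actual analysis software; the function name and the intensity values are hypothetical.

```python
def genechip_expression(perfect_match, mismatch):
    """Mean (match - mismatch) intensity difference over all probe pairs of one gene."""
    if len(perfect_match) != len(mismatch):
        raise ValueError("each match probe needs exactly one mismatch control")
    if not perfect_match:
        raise ValueError("at least one probe pair is required")
    diffs = [pm - mm for pm, mm in zip(perfect_match, mismatch)]
    return sum(diffs) / len(diffs)

# Hypothetical intensities for the 20 probe pairs of one gene.
pm = [500 + 10 * i for i in range(20)]
mm = [200] * 20
print(genechip_expression(pm, mm))
```

Subtracting the mismatch signal is meant to remove the part of each probe's intensity that comes from non-specific hybridization.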
can buy pre-made chips
high quality control
easy to use
can be expensive
requires Affymetrix machines
short probes (25 bases) are not good for divergent species
useful only for (model) species for which a GeneChip is commercially available
What is SAGE (Serial Analysis of Gene Expression)?
How does it work and name the pros and cons?
= method that is similar to EST sequencing, but more efficient because only short “tags” of around 10–15 bases are sequenced from each cDNA
Before sequencing, the tags are concatenated so that many of them can be sequenced in a single Sanger sequencing reaction
-> requires annotation of the genome, so that the tags can be accurately mapped back to their corresponding genes
Purify mRNA (poly-A) from sample
Use biotinylated oligo dT primer to synthesize double-stranded cDNA
cut cDNA with a restriction enzyme, such as NlaIII which recognizes the sequence CATG and cuts, on average, every 256 bp
purify only the 3' poly dT ends of the cut cDNA in a streptavidin column (binds to biotin attached to the oligo dT primer)
ligate an adapter (short synthesized DNA sequence) to the cut end. The adapter contains a restriction site for the restriction enzyme BsmFI (recognizes GGGAC, but cuts 15 bp away from this sequence into the cDNA fragment)
cut with BsmFI to release each adapter together with its short tag, then ligate two of these fragments to each other tail-to-tail to create “ditags”. PCR amplify the ditags with primers complementary to the adapter sequence
cut again with NlaIII to remove adapters, leaving a 30 bp ditag
ligate many ditags end-to-end (up to 1 kb total length), then sequence thousands of these (typically 30–40 tags per Sanger sequencing reaction)
-> Each 15-bp tag should give a unique match to a transcript in the genome (random match is very unlikely) and should always be after the 3' most NlaIII site (if not: genes may be missed)
-> To quantify the expression level of a gene: count the number of times that the tag for that gene is sequenced. At least 10,000–50,000 tags should be sequenced to get an accurate estimate of expression (≈300–1000 sequencing reactions).
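The tag logic can be sketched in code. This is a minimal illustration under stated assumptions: the cDNA is given as a sense-strand string, the tag is taken as the 15 bases directly downstream of the 3'-most CATG, and the function names, sequences, and tag-to-gene table are all hypothetical.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # NlaIII recognition sequence

def expected_site_spacing(site):
    """Average spacing of a site in random sequence with equal base frequencies."""
    return 4 ** len(site)

def sage_tag(cdna, tag_length=15):
    """Return the tag directly 3' of the 3'-most CATG, or None if there is none."""
    pos = cdna.rfind(ANCHOR_SITE)
    if pos == -1:
        return None  # no NlaIII site: this transcript yields no tag
    start = pos + len(ANCHOR_SITE)
    tag = cdna[start:start + tag_length]
    return tag if len(tag) == tag_length else None

def count_tags(sequenced_tags, tag_to_gene):
    """Count sequenced tags per gene; tags without a match are 'unmapped'."""
    return Counter(tag_to_gene.get(tag, "unmapped") for tag in sequenced_tags)

print(expected_site_spacing(ANCHOR_SITE))  # 4**4 = 256 bp on average

# Hypothetical transcript with two CATG sites; only the 3'-most one is used.
transcript = "TTCATGACGT" + "CATG" + "ACGTACGTACGTACG" + "TTAAAA"
tag = sage_tag(transcript)
print(count_tags([tag, tag, "GGGGGGGGGGGGGGG"], {tag: "geneX"}))
```

The same arithmetic gives 4**15 (about one billion) possible 15-base tags, which is why a random match of a tag to the wrong transcript is unlikely.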
gives an estimate of absolute transcript abundance
more efficient than large-scale EST sequencing, because many fewer sequencing reactions are required.
still requires much sequencing, which can be expensive
not accurate for rare transcripts
sometimes difficult to map tags to genes
must be repeated for each sample (tissue, sex, treatment, etc.)
High throughput RNA sequencing (RNA-seq)
Why is RNA-seq better than EST sequencing?
Pros and Cons?
follows the same scheme outlined above for EST sequencing.
difference is that a pool of cDNA is used for next generation sequencing.
With this approach hundreds of millions of short EST sequences (usually 50–300 bases in length) can be generated quickly and at low cost.
These sequences can be mapped back to their corresponding gene in the genome and used to quantify gene expression.
RNA-seq can also be applied to species with unknown or un-annotated genomes to assemble the transcriptome de novo.
produces direct "counts" of gene expression with very large sample sizes (number of reads per gene), which is good for statistical analysis and comparison of expression between samples
Can be applied to non-model organisms.
Can assemble transcriptomes de novo.
may be more expensive than microarray technologies (but the costs of RNA-seq are dropping rapidly)
Requires more complex bioinformatic analysis than arrays
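The counting step behind RNA-seq expression estimates can be sketched as follows. This is a toy illustration, assuming reads are already mapped to positions on a single chromosome and genes are simple non-overlapping intervals. The scaling to counts per million (CPM) is an added assumption, since the notes do not name a normalization; it is shown because raw counts depend on how deeply each sample was sequenced. All names and numbers are hypothetical.

```python
def count_reads_per_gene(read_positions, gene_intervals):
    """Count mapped reads falling inside each gene.

    read_positions: mapped start coordinate of each read.
    gene_intervals: {gene: (start, end)} with inclusive coordinates.
    """
    counts = {gene: 0 for gene in gene_intervals}
    for pos in read_positions:
        for gene, (start, end) in gene_intervals.items():
            if start <= pos <= end:
                counts[gene] += 1
                break  # intervals are assumed not to overlap
    return counts

def counts_per_million(read_counts):
    """Scale raw counts so that samples of different depth can be compared."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were assigned to any gene")
    return {gene: n / total * 1_000_000 for gene, n in read_counts.items()}

# Hypothetical annotation and read positions.
genes = {"geneA": (100, 200), "geneB": (300, 400)}
reads = [100, 150, 200, 250, 350]  # the read at 250 falls between genes
counts = count_reads_per_gene(reads, genes)
print(counts)
print(counts_per_million(counts))
```

Real analyses use dedicated read mappers and counting tools, which is part of the more complex bioinformatics that RNA-seq requires compared with arrays.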