Why do we sequence, and what does sequencing mean? / Influence of sequencing initiatives on biodiversity.
• Documentation: comparative genomics for documenting different species.
• Use: identifying natural products with medical relevance.
• Protection: identifying resistance genes in beech trees.
Why do we sequence?
To read genetic information so we can find genes, detect disease variants, understand how organisms work and adapt, compare genomes, and uncover evolutionary relationships for medicine, agriculture, and biotechnology.
What does sequencing mean?
Determining the exact order of building blocks in a molecule—most commonly the order of DNA bases (A, T, G, C).
Influence of sequencing initiatives on biodiversity?
They’ve accelerated species discovery, enabled rapid monitoring via DNA barcoding and environmental DNA, provided reference genomes revealing adaptations and population health, and directly strengthened conservation decisions by identifying genetic vulnerabilities.
Why should RNA be sequenced?
An organism/tissue can be examined under different experimental conditions, e.g. which gene is upregulated or downregulated.
• A complete genome, including regulatory regions and introns, is not needed if you are only interested in the expressed genes/proteins.
How is RNA-seq reconstructed? Compare.
• Reference-based approach.
• De novo approach.
Why should k not be too small in Trinity assembly?
• The k-mer size should not be too small so that parts of the sequence that occur more frequently can be distinguished from each other.
• For repetitive regions, k should be larger than the repeat unit.
• Ideally choose so that each k-mer occurs only once.
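The effect of k on k-mer uniqueness can be checked with a small sketch. The toy sequence and the function names are illustrative assumptions, not part of Trinity.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every k-mer (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def fraction_unique(seq: str, k: int) -> float:
    """Fraction of distinct k-mers that occur exactly once in seq."""
    counts = kmer_counts(seq, k)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Hypothetical toy sequence containing the repeat unit "ACG" three times.
toy = "TTACGACGACGGA"
for k in (2, 3, 4, 8):
    print(k, round(fraction_unique(toy, k), 2))
```

With a small k, the repeat produces the same k-mer several times; once k spans the repeated stretch, every k-mer becomes unique.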
Why should the overlap in Trinity assembly be k-1?
k-1 gives the maximum possible overlap of the k-mers, which increases specificity.
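If the overlap were k (the full k-mer), the contig could not be extended and would consist of only a single k-mer.
If the overlap were smaller than k-1, k-mers could be joined less specifically, so errors can occur.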
Difference between FASTA and FASTQ.
• FASTQ: sequence plus quality values.
• FASTA: header line and sequence only, without quality values.
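A minimal sketch of reading FASTQ records and decoding their quality values. It assumes the common 4-line record layout and Phred+33 encoding; the record used in the usage note is made up.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (ln.rstrip("\n") for ln in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: Phred+33 encoding, i.e. quality = ASCII code minus 33.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


def error_probability(phred: int) -> float:
    """Convert a Phred score Q into the error probability 10^(-Q/10)."""
    return 10 ** (-phred / 10)
```

For example, `list(parse_fastq(["@read1", "ACGT", "+", "II5!"]))` yields one record; a FASTA file would carry only the header and the sequence, so there would be nothing to decode.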
What does a database do? Name the 3 points.
• Collect and select individual pieces of information.
• Combine and integrate information into a database.
• Present the information, e.g. through a GUI/API.
What can I conclude if I do not find something in a database?
The database may be incomplete.
Why do gene databases and protein structure databases (PDB) grow at different rates?
• Protein structure databases (PDB): linear growth, limited by laboratory work.
• Sequence databases: exponential growth, because technology becomes better and faster.
How is NCBI divided?
Its sequence records are split between RefSeq and GenBank.
What are intrinsic/extrinsic information?
• Intrinsic = information that can be extracted from the sequence itself.
• Extrinsic = information obtained by comparison with other sequences/models.
What are PSSMs and how are they calculated?
• Position-specific scoring matrices used to identify motifs in biological sequences (DNA, RNA, protein).
1. Training data.
2. Create a count matrix.
3. Add pseudocounts.
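4. Convert the counts to probabilities and calculate, for each position, the log-odds score of every residue against the background distribution; the expected value of these scores per position is the Kullback-Leibler distance.
5. The sum of the per-position scores gives the PSSM score for the sequence.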
Difference between a scoring matrix (BLOSUM62) and a PSSM.
• BLOSUM62 is position-independent: the score for substituting G with A is the same at every position.
• PSSM is position-specific: the score for substituting G with A depends on the position.
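A minimal sketch of the PSSM steps, assuming the common log-odds formulation with a uniform background over DNA; the per-column expected value of these log-odds terms is the Kullback-Leibler distance. The pseudocount of 1, the alphabet, and the function names are illustrative assumptions.

```python
import math
from typing import Dict, List

ALPHABET = "ACGT"  # assumption: DNA motif; a protein PSSM would use 20 letters


def build_pssm(sites: List[str], pseudocount: float = 1.0,
               background: float = 0.25) -> List[Dict[str, float]]:
    """Build a log2-odds PSSM from equal-length, ungapped training sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all training sites must have the same length")
    pssm = []
    for pos in range(length):
        # Count matrix column, initialised with the pseudocount.
        counts = {b: pseudocount for b in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log-odds of the observed frequency versus the background.
        pssm.append({b: math.log2((counts[b] / total) / background)
                     for b in ALPHABET})
    return pssm


def score(pssm: List[Dict[str, float]], seq: str) -> float:
    """Sum the position-specific scores of seq (same length as the PSSM)."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must equal PSSM length")
    return sum(column[base] for column, base in zip(pssm, seq))
```

Unlike a substitution matrix, the same letter receives a different score in each column, which is what makes the model position-specific.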
Name the 3 sub-ontologies of Gene Ontology.
Cellular Component, Molecular Function, Biological Process.
What is GO used for, and what can it do?
• Assign GO terms to genes for functional annotation.
• Standardized terminology makes categorization possible.
• Each gene is described with specific GO terms; the path from a term up to the root is fixed by the ontology, so only the specific term has to be stored per gene.
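Why storing only the most specific term is enough can be sketched with a toy graph. The term names and relationships below are a simplified illustration, not the real GO graph.

```python
from typing import Dict, List, Set

# Hypothetical toy ontology: each term maps to its direct parents.
# A term may have several parents, so this is a DAG, not a tree.
PARENTS: Dict[str, List[str]] = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
    "glycolytic process": ["cellular metabolic process"],
}


def ancestors(term: str, parents: Dict[str, List[str]] = PARENTS) -> Set[str]:
    """Return all terms reachable by following parent links from term."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen
```

A gene annotated with "glycolytic process" implicitly carries all four ancestor terms, which can be recomputed from the graph instead of being stored with every gene.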
Advantage of HMM over PSSM and BLOSUM62?
• HMMs can find motifs even when the length changes due to insertions/deletions, and motifs can be used to assign function to sequences.
• PSSMs are position-specific but only work with fixed length; insertions/deletions strongly change scores. PSSMs can still recognize motifs and assign function.
• BLOSUM62 is position-independent and gives no information about function.
What does KEGG provide?
• KEGG shows the organization of genes in different metabolic pathways.
How is KEGG structured / how does it work?
• KEGG assigns each gene a KO term that functionally describes it.
• The database groups genes as orthologs of experimentally characterized genes/proteins.
• KEGG Mapper creates metabolic pathways as a network of KO terms.
Relationship between sequence similarity and homology?
• Homology refers only to evolutionary relationships.
• Sequence similarity is not part of the concept of homology; it measures how much two sequences match.
• Sequences that are more similar than expected by chance are evolutionarily related, i.e. homologous.
What evidence exists for GO terms?
• Experimental annotation.
• Electronic annotation, the most common type.
• Curated non-experimental annotation.
What is functional annotation/GO good for? Properties of GO.
• Describing gene function.
• Restricted vocabulary.
• Hierarchical organization.
• Directed acyclic graph.
• 3 sub-ontologies.
• Evidence codes.
What are orthologs and how do they arise? / What happens during a speciation event at the molecular level, and what are the resulting sequences of a gene called?
• Orthologs are homologous sequences that were separated by a speciation event.
1. Speciation event: reproductive isolation of two populations leads to separation of genetic lineages.
2. Isolated evolution of the sequences.
3. The resulting sequences are called orthologs.
What is the difference between de novo assembly and reference-based assembly?
• In reference-based assembly, a sufficiently well-annotated genome of the organism already exists. The reads can then simply be mapped to it. It is like a puzzle with the solution picture already available.
• De novo assembly joins overlapping sequence reads with sufficient similarity into longer sequences, also called contigs. The contigs reconstruct the original transcript. De novo assembly is a puzzle without a template.
• If a genome is available, reference-based assembly is always the better and easier approach. For transcript analysis, for example, different isoforms can be compared more easily with the reference-based approach. Split reads and exon-intron boundaries can also be identified, as well as which introns were spliced out. It can also reveal which transcripts are completely missing from the dataset.
You are doing transcriptome assembly and are interested in exon-intron boundaries.
• Which assembly approach do you use? Reference-based approach.
• What do you need for this? A reference genome.
You are analyzing centromeres, which are highly repetitive regions.
• Which sequencing strategy do you use?
• Why does it make sense to use a second sequencing method as well?
• PacBio and Nanopore, because they provide long reads that can span the repeats.
• PacBio and Nanopore have the “problem” that raw reads are only about 85% accurate, so around 15% of the bases contain errors.
• PacBio reads often show insertions.
• Nanopore sequencing often shows deletions, especially when the same nucleotide repeats several times.
• Illumina sequencing is about 99% accurate but can only generate short reads.
• Therefore Nanopore and Illumina are combined, and the sequences are compared. Nanopore provides long reads that are corrected by short Illumina reads.
What is the basic principle of Illumina sequencing?
a. What are the advantages?
• Many sequences can be sequenced in parallel.
• It is automatable.
• Read accuracy is high, about 99.3% per read.
b. What are the disadvantages?
• Only short reads can be generated.
• Sequencing quality decreases toward the end of the read.
Why does quality decrease with increasing read length in Illumina sequencing?
• A cluster consists of many copies of a sequence. When sequencing starts, nucleotides are incorporated.
• It can happen that not all DNA polymerases in the cluster incorporate a nucleotide, even though they should have.
• As a result, these sequences are no longer synchronized with the cluster.
• Over the course of sequencing, these errors accumulate, so by the end of 125 sequenced bases the signal is no longer as good as at the beginning.
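The accumulation of out-of-sync strands can be illustrated with a simple model. It assumes every strand misses an incorporation independently with the same probability per cycle and never catches up; `p_lag` is a hypothetical parameter, not a measured value.

```python
def in_phase_fraction(cycles: int, p_lag: float) -> float:
    """Fraction of strands in a cluster still synchronized after `cycles`.

    Simplified model: each strand fails to incorporate a nucleotide with
    probability p_lag per cycle and stays out of phase afterwards.
    """
    if cycles < 0 or not 0.0 <= p_lag <= 1.0:
        raise ValueError("cycles must be >= 0 and p_lag within [0, 1]")
    return (1.0 - p_lag) ** cycles


# Hypothetical per-cycle failure probability of 0.5 %.
for cycle in (1, 25, 50, 125):
    print(cycle, round(in_phase_fraction(cycle, 0.005), 3))
```

Because the in-phase fraction shrinks geometrically, the signal from the correct base becomes steadily weaker relative to the lagging strands, which is reflected in lower quality values toward the read end.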
What is an isoform, and what mechanism lies behind it?
• An isoform is one of several possible transcripts of a gene; in eukaryotes, isoforms arise through alternative splicing.
• Multiple transcripts, and therefore multiple proteins, can arise from one gene.
Which two patterns occur, and what causes them?
• Cassette exons: different exons can be combined, but not every exon has to be present; this is exon skipping.
• Intron retention.
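Exon skipping can be sketched by enumerating which cassette exons are included. The exon sequences and the function name are made up for illustration; intron retention is not modelled here.

```python
from itertools import product
from typing import List, Sequence, Set


def enumerate_isoforms(exons: Sequence[str], cassette: Set[int]) -> List[str]:
    """Return every transcript obtained by keeping or skipping cassette exons.

    exons: exon sequences in genomic order.
    cassette: indices of exons that may be skipped; all others are constitutive.
    """
    optional = sorted(cassette)
    isoforms = []
    for choice in product([True, False], repeat=len(optional)):
        skipped = {idx for idx, keep in zip(optional, choice) if not keep}
        isoforms.append(
            "".join(e for i, e in enumerate(exons) if i not in skipped)
        )
    return isoforms
```

With n cassette exons, this yields 2^n candidate isoforms of a single gene, for example `enumerate_isoforms(["AAA", "CCC", "GGG"], {1})` gives the full transcript and the one that skips the middle exon.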
What is the difference between a scaffold and a contig?
• Contig: a contig is a set of reads connected by overlap of their sequences. All reads belong to one and only one contig, and each contig contains at least one read. Contigs have no Ns.
• Scaffold: consists of ordered and oriented, but usually non-overlapping, contigs separated by gaps of approximately known length. Scaffolds are usually formed by identifying contig pairs that each contain a read from a read pair.
What information do we need to create a scaffold?
• We need information about how many Ns are between the individual contigs.
• Read-pair information is needed to connect contigs, since they do not have to overlap.
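A minimal sketch of how contigs and gap estimates combine into a scaffold. The gap estimate assumes a known fragment (insert) size and a read pair whose two reads lie on neighbouring contigs; all names and numbers are illustrative.

```python
from typing import List


def estimate_gap(insert_size: int, bases_in_a: int, bases_in_b: int) -> int:
    """Estimate the gap between two contigs from one spanning read pair.

    bases_in_a: bases from the start of the first read to the end of contig A.
    bases_in_b: bases from the start of contig B to the outer end of the mate.
    Simplification: insert_size = bases_in_a + gap + bases_in_b.
    """
    return max(0, insert_size - bases_in_a - bases_in_b)


def build_scaffold(contigs: List[str], gaps: List[int]) -> str:
    """Join ordered, oriented contigs, filling each gap with Ns."""
    if not contigs or len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between adjacent contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        parts.append("N" * gap)  # unknown bases of approximately known number
        parts.append(contig)
    return "".join(parts)
```

In practice many read pairs support each link, so the gap size would be averaged over them rather than taken from a single pair.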
What is a model, and what should it do?
• A structured representation of something, highlighting relevant features and ignoring random similarities.
• It describes the variability between different instances of the modeled object.
• It helps classify new members of the set of candidates.
• It helps identify outliers that deviate from the average more than expected.
What does overfitting mean in the context of models?
• Overfitting means that too many explanatory variables were used to describe a model, so it also fits random noise in the training data.
• This makes the model too narrow, e.g. it describes one very specific pig breed well but does not generalize to others.
Why are training data used here?
• A training dataset contains examples used to learn patterns and relationships in the data.
• The algorithm’s weights are adjusted using the training data, meaning the algorithm learns from them.
We compare two sequences: how can we tell that they contain conserved regions?
• By applying a PSSM to score them.
• If a long time has passed and conserved sections are still present, these are evolutionarily important sequence regions.
• They are functionally relevant, and functionally relevant regions have not changed over time.
What is the difference between BLOSUM62 and a PSSM?
BLOSUM62
• Position-independent: scores the probability of amino acid substitutions regardless of position.
• The model is derived from alignments of homologous sequences.
• This allows similarity between two sequences to be evaluated.
PSSM
• Position-specific.
• A model is generated from training data.
• A model for a specific motif.
• Amino acid occurrence at each position is described by a probability distribution derived from a count matrix.
• Sequence is compared with the model.
• This allows recognition of conserved patterns (motifs) in the sequence being compared.
Why is a Hidden Markov Model suitable for classifying protein domains?
• Length-independent.
• The model allows partial domains, duplications, and links.
• It is complex enough to cover all possible cases without overfitting.
• PSSMs are faster, but worse, because they have a fixed length and cannot handle length variation.
Name the essential differences between GenBank and RefSeq.
• GenBank is not curated: records are author-submitted, there can be multiple records for the same locus, and records can contradict each other.
• RefSeq is curated, created by NCBI from existing data, revised as new data emerge, and usually contains single records for major organisms.
What are conserved sequences?
Regions that have not changed over time because they are functionally relevant.
Describe the most important steps from sequence dataset to functional annotation.
• QC.
• Trimming.
• Assembly.
• Assembly validation.
• Annotation, e.g. domains & motifs, orthology search, homology search.
What are the error profiles of Illumina, PacBio, and Oxford Nanopore?
• Base substitutions are most common in Illumina.
• Deletions are most common in Oxford Nanopore.
• Insertions are most common in PacBio.
What can sequencing errors cause in terms of false negatives and false positives?
• Errors are less serious when detecting conservation between species; at worst they weaken the signal, which may cause conserved regions to be missed. That is a false negative.
• If you compare different individuals of the same species, where explicit differences matter, reconstruction errors can make a huge difference. That is a false positive.
What defines a paralogous gene, and by which event can paralogous genes arise?
• Paralogous genes are similar in sequence but located at different positions in the genome.
• Paralogous genes arise through duplication events.
What defines an orthologous gene, and by which event can orthologous genes arise?
• Orthologous genes differ in sequence from each other but derive from the same ancestral gene.
• Orthologs arise through speciation events.
How can orthologous genes be identified?
• By sequence similarity, comparing the entire sequence with all models in a database.
• If similarity is stronger than expected by chance, the sequence is assumed to be an ortholog.
How do homology and sequence identity differ, and how are they related?
• Homology means that two sequences share a common ancestor.
• It is not a statement about sequence similarity.
• It is a yes/no concept.
• Homology does not say anything about function.
• Sequences that are not related are not more similar than expected by chance.
• Sequences that are more similar than expected by chance are homologous.
• This does not mean that homologous sequences must always be more similar than chance.
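What "more similar than expected by chance" means can be sketched with a shuffle test on two ungapped sequences of equal length. This is an illustration under simplified assumptions; real search tools such as BLAST derive E-values from alignment score statistics instead.

```python
import random


def identity(a: str, b: str) -> float:
    """Fraction of identical positions in two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shuffle_p_value(a: str, b: str, n_shuffles: int = 1000,
                    seed: int = 0) -> float:
    """Empirical p-value: how often a shuffled b matches a at least as well."""
    rng = random.Random(seed)
    observed = identity(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys order
        if identity(a, "".join(letters)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)
```

A small p-value supports the inference of homology; a large one does not exclude it, because homologous sequences may have diverged beyond detectable similarity.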
How is an HMM conceptually built?
• A position-specific but length-independent model for identifying conserved regions.
• Generated via a multiple sequence alignment, building on PSSMs.
• Position-specific columns are transformed into match states.
• Add delete states and insert states.
• Has start and end states and can represent repeated domains through joining states.
What are the basic properties of a motif?
• A motif is usually not a stable, independently folding region.
• It represents extrinsic information in protein sequence analysis.
• Typically a short sequence region associated with a function.
• Fixed length, usually without gaps.
• Relatively high false-positive rate, because motifs are short and therefore likely to occur by chance.
• Represented by a regular expression or by a PSSM.
• PSSMs often include additional prior information for unobserved data.
• Commonly used to predict binding sites and modification sites.
• PSSMs can also be used to annotate domains quickly.
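A motif written as a regular expression can be searched as sketched below. The example uses the PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P}; the test sequence in the usage note and the function name are made up.

```python
import re
from typing import List, Tuple

# N-{P}-[ST]-{P}: Asn, anything but Pro, Ser or Thr, anything but Pro.
# The lookahead (?=...) lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(seq: str,
               pattern: "re.Pattern[str]" = N_GLYCOSYLATION
               ) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for every hit in seq."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]
```

For the hypothetical sequence `"MKNGSAANPTA"`, only the first asparagine matches, because the second one is followed by a proline. Such a short pattern matches many sequences by chance, which is the high false-positive rate mentioned above.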
Describe the different phases of de novo assembly of transcriptome data via Trinity. Assign the individual steps to the corresponding algorithm.
Jellyfish:
• Extract and count k-mers (k=25) from the reads.
• Store k-mers in a hash table with their frequencies.
• Sort by frequency.
Inchworm:
• Contig assembly from k-mers.
• Start at the top of the hash table, i.e. the most frequent k-mers.
• Search the hash table for k-mers that overlap with the chosen k-mer by k-1.
• Each k-mer may only be used once.
• Extend the contig in 5’ and 3’ direction.
• Continue until no further extension is possible.
• Then take the next most frequent unused k-mer and assemble a new contig.
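The counting and greedy extension steps can be sketched as follows. This is a simplified illustration, not Trinity's implementation: it ignores reverse complements and error filtering, and the reads in the usage note are made up.

```python
from collections import Counter
from typing import Dict, List


def count_kmers(reads: List[str], k: int) -> Counter:
    """Jellyfish-like step: count all k-mers across the reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def _best(candidates: List[str], unused: Dict[str, int]) -> str:
    # Most frequent candidate; ties broken alphabetically for determinism.
    return min(candidates, key=lambda km: (-unused[km], km))


def greedy_contigs(counts: Counter, k: int) -> List[str]:
    """Inchworm-like step: extend seeds by k-mers overlapping by k-1."""
    if k < 2:
        raise ValueError("k must be at least 2")
    unused = dict(counts)  # each k-mer may be used only once
    contigs = []
    while unused:
        contig = _best(list(unused), unused)  # most frequent unused k-mer
        del unused[contig]
        while True:  # extend in 3' direction
            suffix = contig[-(k - 1):]
            cands = [suffix + b for b in "ACGT" if suffix + b in unused]
            if not cands:
                break
            nxt = _best(cands, unused)
            del unused[nxt]
            contig += nxt[-1]
        while True:  # extend in 5' direction
            prefix = contig[:k - 1]
            cands = [b + prefix for b in "ACGT" if b + prefix in unused]
            if not cands:
                break
            nxt = _best(cands, unused)
            del unused[nxt]
            contig = nxt[0] + contig
        contigs.append(contig)
    return contigs
```

For example, `greedy_contigs(count_kmers(["ATGGCGTG", "GGCGTGCA"], 4), 4)` starts from one of the k-mers seen twice and extends it in both directions until the two overlapping reads are merged into one contig.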
Chrysalis:
• Groups contigs that overlap by k-1 into clusters and extends them using paired-end information.
• Assigns reads to the clusters that support the contigs.
• Converts the clusters into de Bruijn graphs.
Butterfly:
• Simplifies the de Bruijn graph.
• Records only where nucleotides vary.
• Checks which reads favor which path.
• These preferences can then be translated into different transcripts.
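The de Bruijn graph used in the last two phases can be illustrated with the generic textbook construction: nodes are (k-1)-mers and every k-mer is an edge. This is a sketch under that assumption, not Trinity's data structure, and the k-mers in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, List


def de_bruijn(kmers: List[str]) -> Dict[str, List[str]]:
    """Build a de Bruijn graph: each k-mer links its prefix to its suffix."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def branch_nodes(graph: Dict[str, List[str]]) -> List[str]:
    """Nodes with more than one distinct successor, i.e. where paths diverge."""
    return sorted(n for n, succ in graph.items() if len(set(succ)) > 1)
```

For the k-mers `["ACGT", "CGTA", "CGTC"]`, the node `CGT` has two successors. Such branch points are the places where sequences vary, and the reads supporting each branch decide which paths are reported as separate transcripts.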
Why are adapters needed for sequencing?
DNA polymerase cannot synthesize de novo; it needs a primer with a free 3' end to extend. Adapters provide a known sequence to which the primer can bind.