Why do we sequence, and what does sequencing mean? / Influence of sequencing initiatives on biodiversity.
• Documentation: comparative genomics for documenting different species.
• Use: identifying natural products with medical relevance.
• Protection: identifying resistance genes in beech trees.
Why should RNA be sequenced?
• Gene expression in an organism/tissue can be compared across experimental conditions, e.g. which genes are upregulated or downregulated.
• A complete genome, including regulatory regions and introns, is not needed if you are only interested in the expressed genes/proteins.
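As a minimal sketch of the up/down-regulation comparison, the following computes a log2 fold change between two conditions. The gene names, read counts, and the pseudocount of 1 are illustrative assumptions, not data from these notes. Real analyses also normalize for library size and use replicates.

```python
import math

def log2_fold_change(control, treated, pseudocount=1.0):
    """Return log2(treated / control); the pseudocount avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))

# Hypothetical read counts per gene: (control, treated)
counts = {"geneA": (99, 399), "geneB": (199, 49), "geneC": (100, 100)}

for gene, (control, treated) in counts.items():
    lfc = log2_fold_change(control, treated)
    direction = "up" if lfc > 0 else "down" if lfc < 0 else "unchanged"
    print(f"{gene}: log2FC = {lfc:+.2f} ({direction})")
```

A positive value means the gene is upregulated in the treated condition, a negative value means it is downregulated.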
How are transcripts reconstructed from RNA-seq data? Compare.
• Reference-based approach.
• De novo approach.
Why should k not be too small in Trinity assembly?
• k should not be too small, so that sequence segments that occur more than once can still be distinguished from each other.
• For repetitive patterns, k should be larger than the repetitive pattern.
• Ideally choose k so that each k-mer occurs only once.
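The effect of k-mer size on uniqueness can be illustrated with a small sketch. The toy sequence below is an assumption chosen so that the 4-base repeat "ACGT" occurs twice. The helper names are hypothetical and not part of Trinity.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every k-mer (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def repeated_kmers(seq, k):
    """Return the k-mers that occur more than once, with their counts."""
    return {kmer: n for kmer, n in kmer_counts(seq, k).items() if n > 1}

# Toy sequence (illustrative assumption) containing the repeat "ACGT" twice.
seq = "TTACGTGGACGTCA"

for k in (3, 4, 5):
    print(k, repeated_kmers(seq, k))
```

With k = 3 or k = 4 some k-mers are ambiguous. Once k exceeds the repeat length (k = 5), every k-mer in this toy sequence is unique.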
Why should the overlap in Trinity assembly be k-1?
• An overlap of k-1 is the maximum possible overlap between two different k-mers, which increases specificity.
• If the overlap were k, the k-mers would be identical, so the contig could not be extended and would consist only of a single k-mer.
• If the overlap is smaller than k-1, errors can occur.
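A minimal sketch of contig extension with an overlap of k-1 bases is shown below. It is a simplified greedy illustration under the assumption that every k-mer is unique and k is at least 2. It is not Trinity's actual implementation, and the function name and toy read are hypothetical.

```python
def extend_contig(seed, kmers):
    """Greedily extend seed to the right using k-mers that overlap by k-1 bases.

    Stops when no unused k-mer matches, or when more than one matches
    (an ambiguous branch that this simple sketch does not resolve).
    """
    k = len(seed)
    contig = seed
    unused = set(kmers) - {seed}
    while True:
        suffix = contig[-(k - 1):]                      # last k-1 bases of the contig
        candidates = [km for km in unused if km[:k - 1] == suffix]
        if len(candidates) != 1:
            return contig
        contig += candidates[0][-1]                     # each step adds exactly one base
        unused.remove(candidates[0])

# Toy read (illustrative assumption) broken into k-mers and reassembled.
read = "ATGGCGTGCA"
k = 4
kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
print(extend_contig("ATGG", kmers))
```

Each extension step adds one new base, because consecutive k-mers share k-1 bases.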
Difference between FASTA and FASTQ.
• FASTQ: sequence plus quality values.
• FASTA: header line plus sequence only, no quality values.
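To make the difference concrete, here is a minimal sketch that parses a single FASTQ record and decodes its quality string. It assumes the common four-line layout and Phred+33 encoding. The record itself is made up for illustration.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (read_id, sequence, phred_scores)."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Phred+33 encoding (assumed): quality score = ASCII code - 33
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores

# Made-up record: a FASTA entry would contain only ">read1" and "ACGT".
record = ["@read1", "ACGT", "+", "II5!"]
read_id, seq, scores = parse_fastq_record(record)
print(read_id, seq, scores)

# Error probability for a Phred score Q is 10 ** (-Q / 10)
print([10 ** (-q / 10) for q in scores])
```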
What does a database do? Name the 3 points.
• Collect and select individual pieces of information.
• Combine and integrate information into a database.
• Present the information, e.g. through a GUI/API.
What can I conclude if I do not find something in a database?
The database may be incomplete.
Why do gene databases and protein structure databases (PDB) grow at different rates?
• Protein structure databases (PDB): linear growth, limited by laboratory work.
• Sequence databases: exponential growth, because technology becomes better and faster.
How is NCBI divided?
RefSeq and GenBank
What is intrinsic/extrinsic information?
• Intrinsic = information that can be extracted from the sequence itself.
• Extrinsic = information obtained by comparison with other sequences/models.
What are PSSMs and how are they calculated?
• Position-specific scoring matrices used to identify motifs in biological sequences (DNA, RNA, protein).
1. Training data.
2. Create a count matrix.
3. Add pseudocounts.
4. Calculate the Kullback-Leibler distance for each position.
5. The sum of the Kullback-Leibler distances for each position gives the PSSM score for the sequence.
Difference between a scoring matrix (BLOSUM62) and a PSSM.
• BLOSUM62 is position-independent: the score for substituting G with A is the same at every position.
• PSSM is position-specific: the score for substituting G with A depends on the position.
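The PSSM calculation steps listed above can be sketched as follows. The notes name the per-position term a Kullback-Leibler distance. This sketch uses the closely related log-odds score log2(p/q) against a uniform background, whose expected value under the position's distribution is that Kullback-Leibler divergence. Treat this as an assumption about the intended formula. The training motifs, the pseudocount of 1, and the function names are illustrative.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pssm(training, pseudocount=1.0, background=None):
    """Build a log-odds PSSM (one dict per position) from equal-length sequences."""
    background = background or {base: 1 / len(ALPHABET) for base in ALPHABET}
    pssm = []
    for pos in range(len(training[0])):
        counts = Counter(seq[pos] for seq in training)        # count matrix column
        total = len(training) + pseudocount * len(ALPHABET)   # total incl. pseudocounts
        column = {}
        for base in ALPHABET:
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log2(freq / background[base])  # log-odds vs. background
        pssm.append(column)
    return pssm

def score(pssm, seq):
    """Sum the position-specific scores of seq (must have the PSSM's length)."""
    return sum(column[base] for column, base in zip(pssm, seq))

# Hypothetical training motifs, already aligned and of fixed length.
training = ["TATA", "TATA", "TACA", "TATT"]
pssm = build_pssm(training)
print(score(pssm, "TATA"), score(pssm, "GCGC"))
```

Unlike BLOSUM62, the score of a base here depends on the column: in this toy matrix, T scores highly at position 1 but poorly at position 2.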
Name the 3 sub-ontologies of Gene Ontology.
Cellular Component, Molecular Function, Biological Process.
What is GO used for, and what can it do?
• Assign GO terms to genes for functional annotation.
• Standardized terminology makes categorization possible.
• Each gene is annotated only with its most specific GO terms; the path from there up through the hierarchy is always the same, so less data has to be stored.
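The idea that only the specific term needs to be stored can be sketched with a small hand-written hierarchy. The term names and parent links below are a simplified assumption for illustration. The real Gene Ontology has more intermediate terms, and a term can have several parents.

```python
# Simplified, hand-written hierarchy: child term -> list of parent terms.
# This is an illustrative assumption, not an excerpt of the real GO graph.
PARENTS = {
    "hexokinase activity": ["kinase activity"],
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from term."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

# Only the most specific term is stored for the gene; the rest is derived.
gene_annotation = {"geneX": "hexokinase activity"}
print(sorted(ancestors(gene_annotation["geneX"])))
```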
Advantage of HMM over PSSM and BLOSUM62?
• HMMs can find motifs even when the length changes due to insertions/deletions, and motifs can be used to assign function to sequences.
• PSSMs are position-specific but only work with fixed length; insertions/deletions strongly change scores. PSSMs can still recognize motifs and assign function.
• BLOSUM62 is position-independent and gives no information about function.
What does KEGG provide?
• KEGG shows the organization of genes in different metabolic pathways.
How is KEGG structured / how does it work?
• KEGG assigns each gene a KO term that functionally describes it.
• The database contains orthologs with experimentally characterized genes/proteins.
• KEGG Mapper creates metabolic pathways as a network of KO terms.
Relationship between sequence similarity and homology?
• Homology refers only to evolutionary relationships.
• Sequence similarity is not part of the concept of homology; it measures how much two sequences match.
• Sequences that are more similar than expected by chance are inferred to be evolutionarily related, i.e. homologous.
What evidence exists for GO terms?
• Experimental annotation.
• Electronic annotation, most common.
• Curated non-experimental annotation.
What is functional annotation/GO good for? Properties of GO.
• Describing gene function.
• Restricted vocabulary.
• Hierarchical organization.
• Directed acyclic graph.
• 3 sub-ontologies.
• Evidence codes.
What are orthologs and how do they arise? / What happens during a speciation event at the molecular level, and what are the resulting sequences of a gene called?
• Orthologs are homologous sequences that were separated by a speciation event.
1. Speciation event: reproductive isolation of two populations leads to separation of genetic lineages.
2. Isolated evolution of the sequences.
3. The resulting sequences are called orthologs.
What is the difference between de novo assembly and reference-based assembly?
• In reference-based assembly, a sufficiently well-annotated genome of the organism already exists. The reads can then simply be mapped to it. It is like a puzzle with the solution picture already available.
• De novo assembly joins overlapping sequence reads with sufficient similarity into longer sequences, also called contigs. The contigs reconstruct the original transcript. De novo assembly is a puzzle without a template.
• If a genome is available, reference-based assembly is always the better and easier approach. For transcript analysis, for example, different isoforms can be compared more easily with the reference-based approach. Split reads and exon-intron boundaries can also be identified, as well as which introns were spliced out. It can also reveal which transcripts are completely missing from the dataset.
You are doing transcriptome assembly and are interested in exon-intron boundaries.
• Which assembly approach do you use? Reference-based approach.
• What do you need for this? A reference genome.
You are analyzing centromeres, which are highly repetitive regions.
• Which sequencing strategy do you use? PacBio and Nanopore, because they provide long reads.
• Why does it make sense to use a second sequencing method as well?
• PacBio and Nanopore have the drawback that raw reads are only about 85% accurate, so around 15% of the bases contain errors.
• PacBio reads often show insertions.
• Nanopore sequencing often shows deletions, especially when the same nucleotide repeats several times.
• Illumina sequencing is about 99% correct but can only generate short reads.
• Therefore Nanopore and Illumina are combined, and the sequences are compared. Nanopore provides long reads that are corrected by short Illumina reads.
What is the basic principle of Illumina sequencing?
a. What are the advantages?
• Many sequences can be sequenced in parallel.
• It is automatable.
• Read accuracy is high, about 99.3% per read.
b. What are the disadvantages?
• Only short reads can be generated.
• Sequencing quality decreases toward the end of the read.
Why does quality decrease with increasing read length in Illumina sequencing?
• A cluster consists of many copies of a sequence. When sequencing starts, nucleotides are incorporated.
• It can happen that not all DNA polymerases in the cluster incorporate a nucleotide, even though they should have.
• As a result, these sequences are no longer synchronized with the cluster.
• Over the course of sequencing, these errors accumulate, so by the end of 125 sequenced bases the signal is no longer as good as at the beginning.
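The accumulation of desynchronized strands can be sketched with a simple model. It assumes that each strand in a cluster independently fails to incorporate a nucleotide with a fixed probability per cycle. The value 0.005 is a made-up illustration, not a measured Illumina parameter.

```python
def in_phase_fraction(cycle, p_lag=0.005):
    """Fraction of strands in a cluster still synchronized after `cycle` cycles,
    assuming each strand lags independently with probability p_lag per cycle."""
    return (1 - p_lag) ** cycle

# The synchronized fraction, and with it the signal purity, shrinks every cycle.
for cycle in (1, 25, 50, 100, 125):
    print(cycle, in_phase_fraction(cycle))
```

Under this assumption the fraction of in-phase strands decays geometrically, which is why the signal at base 125 is noisier than at base 1.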
What is an isoform, and what mechanism lies behind it?
• An isoform is one of the possible transcripts of a gene; in eukaryotes it arises through alternative splicing.
• Multiple proteins can therefore arise from one gene.