Explain what kind of protein structure feature (1D, 2D or 3D) inter-residue contacts are and why.
Inter-residue contacts are a 2D feature of protein structure, because each contact describes a relationship between a pair of residues (the full set can be written as an L x L contact map).
Helices are stabilized by hydrogen-bond formation, typically between residues i and i+4. Does this mean that secondary structure is a 2D feature of structure? (<2 bullets/sentences)
No, secondary structure is a 1D feature. It can be expressed as a string or a single character per residue.
Q3 measures the percentage of residues correctly predicted in one of 3 states (H: helix, E: strand, O: other). Methods predicting H|E|O equally well may have a lower Q3 than those predicting O best. Why? (1-2 sentences)*
This can happen because the methods with the higher Q3 are biased towards O, e.g. they were trained on an unbalanced dataset in which O is the most prevalent state, so they get the frequent class right more often. Alternatively, if the methods predicting H/E equally well were trained on a balanced dataset that differs from the real distribution, their average per-residue performance can go down.
Experimental high-resolution structures are known for fewer than 10% of the ~20k human proteins. Give three reasons (3 bullets).
Answer:
The protein has not attracted special interest yet.
All attempts to determine the structure have failed.
The protein is disordered and has no defined structure.
Why do methods using one-hot encoding perform worse than those using evolutionary information (max. 2 bullets/sentences)?*
One-hot encoding methods cannot capture the specific importance (conservation) of each position in a sequence.
They cannot consider information from related sequences and can be biased by individual variations.
THE breakthrough in protein prediction originated from using evolutionary information (originally in 1992 to predict secondary structure). Evolutionary information is typically stored as a PSSM. What is the precursor data to derive a PSSM (1 bullet).*
A multiple sequence alignment (MSA) of the query protein and its homologs.
You want to develop a method that predicts binding residues (e.g. enzymatic activity and DNA-binding). Your entire data set of proteins with experimentally known binding residues amounts to 500 sequence-unique proteins with a total of 6,000 binding and 54,000 non-binding residues. You can only use a simple artificial neural network (of the feed-forward style) with one hidden layer, but due to the complexity of your problem you need at least 100 hidden units. Thus, even a simple one-hot encoding (or evolutionary information) for a single residue with 20 units is not supported by the data. Explain why. What could you do, instead, how could you solve this?*
One-hot encoding with 20 input features and 100 hidden units would result in 20 × 100 = 2,000 free parameters (ignoring biases). A common rule of thumb is to have 10 samples per free parameter, which means you would need 2,000 × 10 = 20,000 samples. However, you only have 6,000 binding residues (the lower of the two numbers). Since 6,000 is less than 20,000, there is a danger of over-training/over-fitting the model.
Solution: With 6,000 samples the rule of thumb allows roughly 6,000 / 10 = 600 free parameters, which with 100 hidden units leaves at most 600 / 100 = 6 input nodes per residue (see the sketch below). Instead of a 20-unit one-hot encoding, you could express each residue by a few condensed scores, e.g. based on evolutionary coupling, distance, solvent accessibility, or biophysical properties.
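A minimal sketch of the parameter-count argument above; the 10-samples-per-free-parameter rule of thumb is taken from the answer, and bias terms are ignored for simplicity:

```python
# Rule-of-thumb check: free parameters vs. available samples
# (assumes a single hidden layer, no bias terms, 10 samples per parameter)

def max_input_nodes(n_samples: int, n_hidden: int, samples_per_param: int = 10) -> int:
    """Largest number of input nodes the data can support."""
    max_params = n_samples // samples_per_param   # e.g. 6,000 / 10 = 600
    return max_params // n_hidden                 # e.g. 600 / 100 = 6

n_binding = 6_000                                 # limiting class in the question
print(max_input_nodes(n_binding, n_hidden=100))   # -> 6
print(20 * 100, "parameters for one-hot;", 20 * 100 * 10, "samples needed")
```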
What problem do pLMs address (short bullet)? What are embeddings from pLMs (short bullet)?*
Answer: pLMs address the sequence-annotation gap, i.e. the lack of labeled/annotated/experimental data, by enabling transfer learning. Embeddings are the values taken from the last hidden layer(s) of the pLM; they encode what the model has learned about the "grammar" of protein sequences.
Protein Language Models (pLMs) copy models from Natural Language Processing (NLP). Those learn grammar by putting words in their context in sentences. Name the three analogies for grammar, word, and sentence in pLMs (grammar: 1 bullet, word: 1-2 words, sentence: 1 word).*
Grammar: Understanding of biophysics, protein structure/function/evolution/constraints.
Word: Amino acid/residue.
Sentence: Protein/domain.
Where do we get per-residue and per-protein embeddings from? What are their dimensions?*
Answer: Per-residue embeddings are taken directly from (the last hidden layer of) the pLM. For a protein of length L, the per-residue embedding has dimension L × 1024 (for a model like ProtT5 with 1024 units). A per-protein embedding is obtained by "pooling", i.e. averaging the per-residue embeddings over all residues of the protein, resulting in a single vector of 1024 numbers that describes the entire protein.
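A minimal numpy sketch of the pooling step, assuming a hypothetical per-residue embedding array of shape (L, 1024) as produced by a ProtT5-like model (random numbers stand in for real pLM output):

```python
import numpy as np

L, d = 120, 1024                       # protein length, embedding dimension (ProtT5-like)
per_residue = np.random.rand(L, d)     # placeholder for real pLM output of shape (L, 1024)

# "Pooling": average over the residue axis -> one 1024-dim vector per protein
per_protein = per_residue.mean(axis=0)
print(per_residue.shape, per_protein.shape)   # (120, 1024) (1024,)
```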
How do we use pLMs for protein prediction tasks (<2 sentences/telegram style)?*
We use pLMs to encode proteins, i.e. the embeddings from pLMs serve as features that are then used as input to a second-step method trained in supervised learning to solve the downstream task.
You want to use a neural network with two output nodes for classification. How do you get a single, binary prediction? Could you further benefit from the two output values? (2 bullets/sentences)
- Take the class with the larger output value as the prediction.
- Yes: comparing the two output values (e.g. their difference) gives a measure of prediction confidence.
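A small sketch of both points, using two hypothetical output values:

```python
import numpy as np

out = np.array([0.31, 0.69])        # hypothetical values of the two output nodes

pred = int(np.argmax(out))          # take the larger value as the predicted class
confidence = abs(out[1] - out[0])   # larger gap between outputs -> more reliable prediction
print(pred, round(confidence, 2))   # 1 0.38
```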
In an artificial neural network we typically initialize the weights with random numbers. What would happen if all initial weights had the same value and in which step of the learning process could this cause problems?
Answer: All the neurons would behave the same, making the network act like a single neuron. This would cause problems during backpropagation of the error, as every weight would receive the same update and there would be no way to make the weights differ.
Why do we need an activation function in a neural network node? What would be the consequence for a multi-layer neural network if the activation function were missing?
-An activation function introduces non-linearity into the network.
-If missing: all hidden layers could be collapsed into a single layer, i.e. the whole network would only compute a linear function (see the sketch below).
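A quick numerical illustration of the collapse: two linear layers without an activation in between are equivalent to a single linear layer (random matrices are used here as stand-ins for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=5)          # input vector
W1 = rng.normal(size=(4, 5))     # "hidden layer" weights
W2 = rng.normal(size=(3, 4))     # "output layer" weights

two_layers = W2 @ (W1 @ x)       # no activation between layers
one_layer  = (W2 @ W1) @ x       # a single equivalent layer
print(np.allclose(two_layers, one_layer))   # True -> the hidden layer adds nothing
```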
What is the primary goal of genome assembly in a sequencing project?
Answer: To reconstruct the original sequence of a genome by arranging the sequenced DNA fragments (reads) in the correct order, creating a (near-)complete representation of the genome.
What role do repeat regions play in complicating genome assembly, and what strategies or technologies can be used to address these challenges?
Answer: Repeat regions are sequences that occur multiple times in the genome, making it difficult to determine their exact placement and orientation, which can cause misassemblies or gaps in the final assembly.
Long-read sequencing (e.g., PacBio or Oxford Nanopore): Longer reads can span entire repeat regions, allowing for accurate placement.
Paired-end reads: Provide information about the distance between two sequences, helping to resolve repeats by linking regions on either side of the repeat.
Scaffolding with mate-pair reads: These can span even larger distances, providing more support for assembly across repeat regions.
How do paired-end reads help with genome and transcriptome assemblies? Why do we consider using them instead of single-end reads?
Answer: Paired-end reads provide information about the distance and orientation between two reads from opposite ends of the same DNA fragment, which allows for more accurate mapping to a reference genome. They are preferred over single-end reads because they can more effectively identify insertions, deletions, and rearrangements.
Describe at least 2 differences between the Overlap-Layout-Consensus (OLC) vs. De Bruijn graph approach to building genomes.
Read Length Handling: OLC uses whole reads to find overlaps, whereas De Bruijn graphs use k-mers (substrings of the reads; see the sketch below).
Memory Usage: OLC is more memory-intensive (better suited for smaller datasets), while De Bruijn graphs are more memory-efficient (scale better to large datasets).
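A tiny sketch of the k-mer idea behind De Bruijn graphs, using a toy read and an arbitrarily chosen k: each read is decomposed into k-mers, and edges connect k-mers that overlap by k-1 bases.

```python
def kmers(read: str, k: int):
    """Decompose a read into overlapping k-mers (the building blocks of a De Bruijn graph)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

read = "ATGGCGTGCA"                 # toy read
for km in kmers(read, k=4):
    print(km[:-1], "->", km[1:])    # (k-1)-mer prefix -> (k-1)-mer suffix edge
```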
How does long-read sequencing (e.g., Iso-Seq) enhance the annotation of splice variants (different isoforms/mRNAs of the same gene), and why is short-read sequencing limited in this context?
Long-read transcriptome sequencing can capture entire mRNA transcripts from end to end.
-> allows for the direct observation of splice variants, capturing the full diversity of splicing events in a single read.
Short reads, due to their limited length, often require the reconstruction of splicing events from multiple fragmented reads
-> can miss complex alternative splicing events.
What is one of the key differences between microarrays and RNA-Seq in terms of their detection methods and capabilities?
Answer: One key difference is dynamic range; RNA-Seq has a larger dynamic range than microarrays, allowing for more precise measurement of both high and low expressed genes without the saturation limits of microarrays.
A research team has assembled a reference genome for tomato (Solanum lycopersicum) and now seeks to identify the locations of genes in the genome. To aid this, they extracted RNA from 8 distinct plant samples (tissues and developmental stages) and plan to sequence it. How can RNA sequencing (RNA-Seq) contribute to identifying gene locations and structures within the genome?
Answer: RNA-Seq captures the transcripts, which can be mapped back to the reference genome to identify gene locations. By aligning the RNA reads to the genome, researchers can determine exon-intron boundaries and find actively transcribed regions, which helps annotate gene locations.
How does Iso-Seq help with annotating splice junction variations?
Answer: RNA-Seq (including Iso-Seq) captures the full diversity of transcript variants, including alternative splicing events. By sequencing RNA, researchers can identify different splice junctions where exons are joined in various combinations. This reveals the presence of multiple isoforms of the same gene, allowing for the annotation of splice variants that might be missed by other methods like short-read sequencing or microarrays.
Why did the researchers sample RNA from eight distinct samples instead of only one?
Answer: Genes can be expressed differently in various tissues or at different developmental stages.
-> the researchers maximize their chances of capturing a broad range of gene expression and detecting tissue- or stage-specific transcripts.
How do you differentiate technical replicates from biological replicates? Why are both needed?
Answer: Technical replicates are repeated measurements of the same biological sample to account for variability in the experimental technique.
Biological replicates involve independent samples from different biological subjects or conditions to capture natural biological variation.
Both are needed because technical replicates ensure accuracy and consistency of the experimental method, while biological replicates ensure the findings are biologically meaningful and not limited to a single sample.
Given the following transcript compatibility matrix (transcripts t, reads r), define suitable equivalence classes and determine their counts:
Class 1: Maps to t2 only (read r1), Count: 1.
Class 2: Maps to t1 and t5 (reads r2, r3), Count: 2.
Class 3: Maps to t2, t3, and t4 (read r4), Count: 1.
Class 4: Maps to t1, t3, t4, and t5 (read r5), Count: 1
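A minimal sketch of how such equivalence classes can be derived programmatically; the read-to-transcript mapping below is reconstructed from the answer above, since the original matrix is not reproduced here.

```python
from collections import defaultdict

# Read -> set of compatible transcripts (taken from the answer above)
compat = {
    "r1": {"t2"},
    "r2": {"t1", "t5"},
    "r3": {"t1", "t5"},
    "r4": {"t2", "t3", "t4"},
    "r5": {"t1", "t3", "t4", "t5"},
}

# Equivalence class = unique set of transcripts; its count = number of reads mapping to it
classes = defaultdict(int)
for read, transcripts in compat.items():
    classes[frozenset(transcripts)] += 1

for transcripts, count in classes.items():
    print(sorted(transcripts), count)
```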
When researchers compare gene expression between samples or genes, they need to normalize the count data. Name three normalization methods you know and name the biases they account for.
Answer: Normalization methods include Reads Per Kilobase of transcript per Million mapped reads (RPKM), Fragments Per Kilobase of transcript per Million mapped reads (FPKM), and Transcripts Per Million (TPM), as well as DESeq2 (median-of-ratios) normalization. RPKM/FPKM/TPM account for biases from sequencing depth and gene length; DESeq2 additionally corrects for differences in library composition.
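A minimal sketch of one of these methods (TPM), using toy counts and gene lengths; it first corrects for gene length, then for sequencing depth:

```python
import numpy as np

counts  = np.array([500, 1000, 250], dtype=float)    # toy read counts per gene
lengths = np.array([2000, 4000, 1000], dtype=float)  # gene lengths in bp

rpk = counts / (lengths / 1_000)       # reads per kilobase: corrects for gene length
tpm = rpk / rpk.sum() * 1_000_000      # scale to a million: corrects for sequencing depth
print(tpm)                             # TPM values sum to 1e6 within a sample
```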
What can be at least two reasons (not sequencing related) why the same gene has different expression levels between two samples?
Biological differences: Genes can be expressed differently in various tissues or at different developmental stages. For example, a gene in muscle tissue might have different expression than the same gene in brain tissue.
Differences in experimental conditions: This could include factors like a treatment vs. a control setup.
Regulatory mechanisms: Variations in transcriptional regulation can also lead to different expression levels.
What is the format called in which you get RNA-Seq data once it has been sequenced? How many lines represent each read? What do each of these lines represent?
Answer: The format is typically FASTQ. Each read is represented by four lines:
Line 1: Contains the read identifier and an optional description.
Line 2: Contains the raw sequence.
Line 3: Begins with a '+' and can optionally repeat the identifier.
Line 4: Contains the quality scores for the sequence, encoded as ASCII characters.
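A minimal sketch of reading such four-line FASTQ records (a plain, uncompressed file is assumed; the filename in the usage comment is hypothetical):

```python
def read_fastq(path: str):
    """Yield (identifier, sequence, quality) tuples from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()    # line 2: raw sequence
            fh.readline()                   # line 3: '+' separator (optionally repeats the id)
            qual = fh.readline().rstrip()   # line 4: ASCII-encoded quality scores
            yield header[1:], seq, qual

# for name, seq, qual in read_fastq("sample_R1.fastq"):
#     print(name, len(seq))
```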
Discovering miRNA binding-site targets can be done with a computational approach or an experimental approach. If sensitivity measures the proportion of actual positives that are correctly identified as such, and specificity measures the proportion of actual negatives that are correctly identified as such, argue for cases A, B, C & D.
HITS-CLIP/CLIP-seq: An experimental technique that identifies binding sites of RNA-binding proteins on RNA molecules.
Specificity (A): Measures how well HITS-CLIP/CLIP-seq correctly identifies non-binding sites (fewer false positives).
Sensitivity (B): Measures how well the method identifies true binding interactions (good at capturing true positives).
Computational Prediction: Algorithms that predict RNA-binding sites based on models rather than experimental data.
Specificity (C): Evaluates how accurately computational predictions avoid false positives, meaning they are effective at correctly identifying negatives.
Sensitivity (D): Assesses the ability of computational methods to identify true binding sites; a higher value means better performance at detecting true interactions.
Which enzyme is responsible for the processing of pri-miRNAs into mature miRNAs?
A) Dicer B) Drosha C) Exportin 5 D) Argonaute
Answer: A) Dicer.
Describe the elements of an ideal scRNA-Seq experiment.
Answer: An ideal scRNA-Seq experiment would be universal in terms of cell size, type, and state. It would involve in situ measurements and have no minimum input number of cells to be assayed. It would have a 100% capture rate (every cell is assayed), and 100% sensitivity (every transcript in every cell is detected). Every transcript would be identified by its full-length sequence, and transcripts would be correctly assigned to cells (no doublets). The experiment would also be cost-effective, easy to use, and open source.
Describe the difference between Cell duplication rate and Barcode Collision rate.
Cell duplication occurs when a single cell is sampled or sequenced more than once, leading to data redundancy and potential over-representation of certain cell types. This arises from errors in physical sampling and library preparation.
Barcode collision happens when different cells receive the same barcode, causing them to be misidentified as a single cell. This error originates from the experimental design and limitations of the barcode system, such as insufficient barcode diversity or high cell loading densities.
Which downstream analysis is shown here? Describe how and why this analysis gives us insight into our data.
The downstream analysis shown is ordering cells in pseudotime. After clustering cells based on their gene expression profiles, this method orders them along developmental trajectories, providing insight into the temporal progression of cell states. It allows for branch point analysis to identify where cells make decisions about their fate and for meta-stable state analysis to find more or less stable states within the trajectories.
What is a tandem mass spectrum and what is recorded in it? Draw an exemplary tandem mass spectrum of a peptide of your choice (masses do not have to correspond to real amino acids but positions of peaks need to be logical) with axis-labels and peak-annotations (fragments + number).*
A tandem mass spectrum (MS2) is a mass spectrum of a specific analyte (e.g., a peptide) that was selected from an MS1 scan. It records the fragmentation characteristics of the analyte in the form of fragment ions. The spectrum is typically a bar chart with m/z on the x-axis and intensity on the y-axis. Peaks are annotated with y- and b-ions (y1, b1, y2, b2, ...), starting with 1 at low m/z and increasing in number with increasing m/z.
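A small sketch of how the b- and y-ion peak positions of such a spectrum could be computed, using monoisotopic residue masses and simplified assumptions (unmodified peptide, singly charged fragments only):

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide
RES = {"P": 97.05276, "E": 129.04259, "T": 101.04768, "I": 113.08406, "D": 115.02694}
PROTON, WATER = 1.00728, 18.01056

def fragment_ions(peptide: str):
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    b, y = [], []
    for i in range(1, len(peptide)):
        b.append((f"b{i}", sum(RES[aa] for aa in peptide[:i]) + PROTON))
        y.append((f"y{i}", sum(RES[aa] for aa in peptide[-i:]) + WATER + PROTON))
    return b, y

for name, mz in sum(fragment_ions("PEPTIDE"), []):
    print(name, round(mz, 3))
```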
Describe in 2 sentences what specific pattern of peaks is visible in MS1 for a peptide and why.*
• How: The peptide is visible as an isotope pattern (envelope) in MS1, where isotope peaks appear at a spacing of 1/z to the right of the monoisotopic peak; the intensities depend on the number of heavy atoms in the molecule.
• Why: The isotope pattern allows us to estimate the charge and thus the neutral mass of the analyte; otherwise, we would only know the m/z ratio.
What is the main difference between MS1 and MS2 spectra in tandem mass spectrometry and why are they both necessary in mass spectrometry-based proteomics?*
MS1:
-intact peptide masses
-no fragmentation took place
-required to get the peptide mass.
MS2:
-Fragmentation spectrum showing ladder of peaks
-Used to identify peptides (sequencing).
The figure below shows an enlarged portion of an MS1 scan. Assume that the average mass of amino acids is 100 Da. Based on the Figure, estimate the length of the underlying major peptide. Does the depicted isotope envelope look as expected for a peptide? If not, what may be the reason?*
The distance between isotope peaks in the figure is approximately 0.25 m/z, which indicates a charge of z = 4 (spacing = 1/z = 0.25). The peptide's mass is then estimated as 350 m/z × 4 = 1,400 Da. Assuming an average amino acid mass of 100 Da, the peptide length is approximately 1,400 / 100 = 14 amino acids.
The depicted isotope envelope does not look as expected for a single peptide: the distorted pattern, in which every other peak has a higher intensity, suggests that a second peptide, likely with a charge of 2, is also present.
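The same arithmetic as a tiny sketch; the values are taken from the answer above, and the average amino acid mass of 100 Da is the simplification given in the question:

```python
isotope_spacing = 0.25                 # m/z distance between isotope peaks read off the figure
charge = round(1 / isotope_spacing)    # spacing = 1/z  ->  z = 4

mz = 350                               # approximate peak position (m/z)
mass = mz * charge                     # ~1400 Da (neglecting the proton masses)
length = mass / 100                    # average amino acid mass assumed to be 100 Da
print(charge, mass, length)            # 4 1400 14.0
```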
Name and explain (1 sentence each) the main steps of a database search engine for analyzing bottom-up mass spectrometry-based proteomics data.*
In silico digestion of the protein database based on the protease used.
Peptide candidate selection based on the precursor's m/z value.
In silico fragmentation of the candidates based on the fragmentation method.
Matching of experimental and theoretical spectra based on a fragment mass tolerance.
Scoring of the matches, e.g. based on the number of matched fragments.
Candidate selection: the peptide candidate with the highest score is reported as the identified peptide.
Describe how the target-decoy approach is used in proteomics to estimate FDR. Which error is controlled by this approach and why is this necessary?*
Answer: The target-decoy approach is used to estimate the FDR. Decoys are proteins known to be absent from the sample, generated by methods like shuffling or reversing the target protein sequences. The number and properties of the decoy proteins must match the target proteins so that false positives have a 50/50 chance of being matched in either the target or decoy space. This method is used to control false positives (Type I errors). This is necessary because incorrectly identified peptides and proteins can lead to generating false hypotheses, for instance, in biomarker selection.
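A minimal sketch of the resulting FDR estimate for a list of peptide-spectrum matches sorted by score; the target/decoy labels below are hypothetical:

```python
# PSMs sorted from best to worst score; True = target hit, False = decoy hit
is_target = [True, True, True, False, True, True, False, True]

targets = decoys = 0
for i, hit in enumerate(is_target, start=1):
    targets += hit
    decoys += not hit
    fdr = decoys / max(targets, 1)   # estimated FDR at this score cutoff
    print(f"top {i}: {targets} targets, {decoys} decoys, FDR ~ {fdr:.2f}")
```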
Describe in 2 sentences why predicting peptide properties, such as the retention time or fragment ion intensities, is useful for data analysis. In other terms, which problem is addressed that classic database search engines have when attempting to assign a peptide sequence to a spectrum?*
- Classic search engines don't use retention time (RT) or fragment ion intensities.
-> This makes it hard to differentiate similar peptides.
- With additional matching features (e.g. predicted vs. observed RT and intensities), correct and incorrect matches can be separated better.
Metaproteomics is an umbrella term for experimental approaches to study all proteins in microbial communities and microbiomes. Most commonly, such samples contain hundreds of different bacterial strains, all of which share large portions of their coding genome. Describe one major problem you foresee when analyzing data from such a sample that is substantially impairing the confident (<1% FDR) identification of specific proteins.
Problem: a large number of proteins, and therefore their corresponding peptides, are shared across the numerous strains.
-> Protein inference becomes a challenge: because of the many shared peptides, it is difficult to confidently attribute a particular protein to a specific strain, often resulting in the identification of very large protein groups instead.
Assume you were told that a researcher filtered the results of a proteomics experiment only based on a 1% peptide FDR threshold. What value or range of FDR would you expect on protein level as when not applying any additional filters? Briefly (max 4 sentence), explain your answer and why you came to this solution.*
The protein-level FDR would most likely be higher than the 1% peptide FDR threshold. This is because it is common for several different peptide sequences to belong to the same protein. When you have multiple peptides associated with a single protein, if just one of those peptides is a false positive (which is allowed at a 1% FDR), the entire protein is also considered a false positive. This accumulation of false positives at the protein level increases the overall FDR.
Name a similarity measure that can be used for comparing two spectra. Explain briefly how it works and what it quantifies. Explain why, and if so which, normalization is necessary.*
Euclidean distance. It quantifies the distance between the "arrow tips" of two spectrum vectors. Normalization, such as L2 (unit-length, "p = 2") normalization, is necessary so that the measure does not simply quantify differences in the lengths (overall intensities) of the vectors.
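A small sketch of the comparison, using two toy intensity vectors binned to the same m/z grid: after L2 normalization, the Euclidean distance reflects only the shape of the spectra, not their overall intensity.

```python
import numpy as np

a = np.array([0.0, 10.0, 55.0, 5.0, 30.0])    # toy binned intensity vector
b = np.array([0.0, 22.0, 110.0, 8.0, 62.0])   # similar shape, roughly twice the intensity

a_n = a / np.linalg.norm(a)                   # L2 ("p = 2") normalization to unit length
b_n = b / np.linalg.norm(b)

print(np.linalg.norm(a - b))                  # large: dominated by the intensity difference
print(np.linalg.norm(a_n - b_n))              # small: the spectra have nearly the same shape
```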
Why is the automatic identification of metabolites similar to what is done in database searching for peptides in proteomics - so difficult? Provide and briefly describe (max. 2 sentences each) two reasons.*
Large chemical diversity: Unlike peptides, which are made from a limited set of 20 amino acids, metabolites have a vast and diverse chemical space. This makes it challenging to create a comprehensive database of all possible metabolites.
Complex fragmentation patterns: Metabolites often have unpredictable and complex fragmentation patterns compared to the well-understood b- and y-ions of peptides. This makes it difficult to match an observed mass spectrum to a theoretical one.
Briefly describe a deep learning method/tool/model which is used in computational mass spectrometry. Why do you think deep learning has advantages over classic approaches in this case? Why do you think deep learning has a disadvantage or may impair the analysis in general?*
Prosit: predicts MS/MS fragment ion intensities and retention time (RT) from the peptide sequence.
Advantage: can capture complex patterns and provides additional information for peptide/protein identification.
Disadvantage: depends on having a lot of clean training data.