Describe what protein structure in 1D, 2D and 3D means (each max 1 bullet):
• 1D=string (secondary structure, sequence, accessibility),
• 2D: (inter-residue) distances/contacts,
• 3D: 3D coordinates of C-alpha/residue/atoms
Most methods predicting secondary structure in 3 states ... predict strands much worse than helix. Comment.
• traditional argument: strand less local than helix; seems intuitively reasonable, but IS wrong;
• real reason: in 4 words: fewer strands than others
How can sequencing mistakes challenge per-protein predictions?
• affect amino acid composition; might cut into motifs important for function (binding, location);
• may impact MSA generation; a wrong DNA sequence -> exons may be overlooked, gene structure missed; a single-nucleotide shift could lead to 5% PIDE to the native protein, i.e., a totally wrong sequence
Where do you get evolutionary information from?
multiple sequence alignments/MSAs/profiles/PSSMs/protein families/databases of related proteins/PSI-BLAST or other comparison methods; comparison of related proteins
Why not enough data for neural net? What can be done?
• one-hot encoding: 20 input features; 100*20 = 2k free parameters for a single residue;
• rule of thumb: 10 samples per free parameter -> need 2k*10 = 20k;
• effective samples (LOWER of the two numbers!!!): 10k; since 10k < 20k: not enough -> danger of over-training/over-fitting;
• we could do it by reducing the number of input features to some value ≤10 -> 10*100 = 1k free parameters, *10 = 10k samples needed, i.e., free parameters * 10 = number of binding samples, i.e., just about ok;
• alternative (only 1 pt): cut connections; only one point because this violates the assumption of full connectivity in 'you need 100'.
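The arithmetic in this answer can be checked with a few lines of Python. This is a minimal sketch: the 100 hidden units and the 10k effective samples are read off the answer, biases are ignored, and the function names are illustrative only.

```python
def free_parameters(n_inputs: int, n_hidden: int) -> int:
    """Input-to-hidden weights of a fully connected layer (biases ignored)."""
    return n_inputs * n_hidden


def samples_needed(n_params: int, per_param: int = 10) -> int:
    """Rule of thumb: about 10 samples per free parameter."""
    return n_params * per_param


available = 10_000  # effective samples, as stated in the answer

for n_inputs in (20, 10):
    params = free_parameters(n_inputs, n_hidden=100)
    needed = samples_needed(params)
    verdict = "enough" if available >= needed else "not enough"
    print(f"{n_inputs} inputs: {params} parameters, {needed} samples needed -> {verdict}")
```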
Fill in the blanks:
X_test
y_train
0.2
42
y
fit
predict
pred
What problem do pLMs address? What are embeddings?
• transfer learning; sequence-annotation gap; lack of labelled/annotated/experimental data;
• values from last hidden layer(s) of pLM learning to map language; contain grammar of proteins
How can we profit from pLMs for protein prediction?
transfer learning: use embeddings from pLMs as input to 2nd-step methods trained to solve downstream tasks in supervised learning
Protein Language Models (pLMs) copy models from Natural Language Processing (NLP). Those learn grammar by putting words in their context in sentences. Name the three analogies for grammar | word | sentence in pLMs (grammar: 1 bullet, word: 1-2 words, sentence: 1 word). (2 Points)
• Grammar: understanding of biophysics/protein,
• Word: amino acid/residue,
• Sentence: protein/domain
Difference between per-residue and per-protein embeddings?
• native: per-residue; the pLM, e.g. ProtT5, has 1024 units representing each residue, i.e. each position in a protein; per-residue = L×1024 for a protein of length L;
• per-protein: pooling, i.e. averaging over all residues in the protein, so that 1024 numbers describe the entire protein instead of the L×1024 per-residue embedding numbers; like amino acid composition
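Mean pooling from per-residue to per-protein embeddings can be sketched as follows. The toy matrix uses 4 dimensions instead of the 1024 of ProtT5, and the function name is made up for illustration.

```python
from statistics import fmean


def per_protein_embedding(per_residue):
    """Average an L x D per-residue embedding over the L residues -> D numbers."""
    return [fmean(column) for column in zip(*per_residue)]


# Toy protein with L = 3 residues and D = 4 dimensions (ProtT5 would use D = 1024).
per_residue = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
pooled = per_protein_embedding(per_residue)
print(pooled)  # [2.0, 1.0, 2.0, 2.0]
```

The result has a fixed length regardless of L, which is why it can be fed to a per-protein predictor directly.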
Using pLM per protein embeddings, subcellular location can be predicted at levels similar to the best methods using per-residue evolutionary information. Sketch a simple solution to improve over this ("FIT in space", defined as: has to be legible and fit into space provided, anything written outside or too small will be ignored). (3 Points)
input: running window of embeddings; pick most informative running window; learn which position is most informative;
Sketch two methods predicting Molecular Function from the GO (GeneOntology), one using only sequence similarity, the other using embeddings from protein Language Models (pLMs) ("FIT in space", defined as: has to be legible and fit into space provided, anything written outside or too small will be ignored) (4 Points)
• e.g. PSI-BLAST against annotated sequences, transfer annotations, look for n neighbors with agreeing annotations, and so forth.
• e.g. determine the n nearest annotated sequences in embedding space, transfer annotations, learn the relationship between a particular embedding dimension and a GO term
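The embedding-based variant (nearest annotated neighbors, then annotation transfer) could look roughly like this. It is a sketch under assumptions: the two-dimensional embeddings, the protein names, and the choice of Euclidean distance are toy values chosen for illustration, not taken from any published method.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_go_terms(query, annotated, k=1):
    """Return the union of GO terms of the k annotated proteins closest to the query."""
    ranked = sorted(annotated, key=lambda entry: euclidean(query, entry["embedding"]))
    terms = set()
    for entry in ranked[:k]:
        terms |= entry["go_terms"]
    return terms


# Hypothetical lookup set: per-protein embeddings with known GO annotations.
annotated = [
    {"id": "protA", "embedding": [0.0, 1.0], "go_terms": {"GO:0003677"}},
    {"id": "protB", "embedding": [5.0, 5.0], "go_terms": {"GO:0016301"}},
]
print(transfer_go_terms([0.2, 0.9], annotated))  # {'GO:0003677'}
```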
Read counts from RNA sequencing experiments need to be normalized. Name the two major biases that we need to account for and state which of the following three normalization methods account for both of these biases: Counts per million (CPM), Transcripts per million (TPM), Fragments per kilobase million (FPKM)
Gene length, CPM: no, TPM: yes, FPKM: yes
Sequencing depth, CPM: yes, TPM: yes, FPKM: yes
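A small sketch makes the table concrete: CPM divides only by library size, while FPKM and TPM also divide by gene length. The gene names, counts, and lengths are toy values.

```python
def cpm(counts):
    """Counts per million: corrects for sequencing depth only."""
    total = sum(counts.values())
    return {g: c / total * 1e6 for g, c in counts.items()}


def fpkm(counts, lengths_bp):
    """Fragments per kilobase per million: depth and gene length."""
    total = sum(counts.values())
    return {g: c / (lengths_bp[g] / 1e3) / (total / 1e6) for g, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rate = {g: c / (lengths_bp[g] / 1e3) for g, c in counts.items()}
    scale = sum(rate.values())
    return {g: r / scale * 1e6 for g, r in rate.items()}


counts = {"geneA": 100, "geneB": 100}
lengths = {"geneA": 1000, "geneB": 2000}  # geneB is twice as long

print(cpm(counts))           # identical values: length bias is not removed
print(fpkm(counts, lengths)) # geneA gets twice the value of geneB
print(tpm(counts, lengths))  # same ratio, values sum to one million
```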
DESeq2 is a popular method for differential gene expression analysis. Explain how DESeq2 is able to borrow information across genes to correct dispersion estimates. Refer to MLE, prior mean and MAP (as shown in the legend on the top right) in your explanation.
DESeq2 performs dispersion shrinkage. First, the dispersion for each gene is calculated (MLE). Then the prior mean is fitted to these estimates (prior mean). Finally, an empirical Bayes approach shrinks the gene-specific dispersion towards the expected dispersion (MAP). MLE: maximum likelihood estimate; MAP: maximum a posteriori.
RNA-seq count data follows a negative binomial distribution. Name the main difference of this distribution to the Poisson distribution and explain the source of this difference.
The negative binomial distribution accounts for overdispersion (variance > mean). This is because transcripts are present at slightly different levels in each sample.
STAR is a so-called splice-aware alignment tool. Which of the following statements is true:
The different parts of the read that are mapped separately are called ’beads’.
STAR is a pseudoalignment tool.
STAR uses maximal mappable prefixes for a partial alignment.
× STAR uses maximal mappable prefixes for a partial alignment.
Kallisto uses pseudoalignment to produce a fast estimation of transcript expression values. Looking at the figure below, answer the following questions:
What type of graph is shown here?
What do the nodes represent?
What do the three colored paths represent?
What type of graph is shown here? Transcriptome de Bruijn graph
What do the nodes represent? k-mers
What do the three colored paths represent? Three different transcripts
t1,t2: 1
t1,t3: 1
t1,t3,t4: 2
t3,t4: 1
t2: 1
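The idea behind such equivalence-class counts can be sketched with a toy pseudoaligner: each k-mer of a read is looked up in a k-mer index of the transcriptome, and the transcript sets are intersected. The transcripts, reads, and k = 3 below are invented for illustration and do not correspond to the figure; the k-mer skipping and the expectation-maximization step of the real tool are left out.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def pseudoalign(read, index, k):
    """Intersect the transcript sets of all indexed k-mers of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if hits is None:
            continue  # simplification: k-mers missing from the index are ignored
        compatible = set(hits) if compatible is None else compatible & hits
    return frozenset(compatible or ())


transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACTTT", "t3": "GGGTACGGA"}
index = build_index(transcripts, 3)

reads = ["ACGTAC", "TACGGA", "ACTTT", "ACGTAC"]
classes = Counter(pseudoalign(r, index, 3) for r in reads)
for eq_class, n in classes.items():
    print(",".join(sorted(eq_class)), n)
```

Reads are never aligned base by base; only the set of compatible transcripts per read is recorded, which is what makes the approach fast.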
Name one advantage and one disadvantage of single-cell sequencing compared to bulk RNA-sequencing:
Advantage: single-cell resolution offers better understanding of cell type heterogeneity in tissues, ...
Disadvantage: higher cost, lower sequencing depth, ...
Given that beads are loaded into droplets at Poisson rate µ, which of the following formulas is the correct one for estimating the cell duplication rate?
1 − e^(−µ) − µ·e^(−µ)
1 − e^(−µ)
e^(−µ) − µ
1 − e^(−µ)
× 1 − e^(−µ) − µ·e^(−µ)
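The marked answer is the Poisson probability of two or more events, P(k ≥ 2) = 1 − P(0) − P(1). A short sketch evaluates it, together with the same quantity conditioned on a non-empty droplet; which of the two the question intends is not recoverable from the text, so both are shown. The value µ = 0.1 is an arbitrary example.

```python
import math


def p_at_least_two(mu):
    """P(k >= 2) for k ~ Poisson(mu): 1 - P(0) - P(1)."""
    return 1 - math.exp(-mu) - mu * math.exp(-mu)


def p_at_least_one(mu):
    """P(k >= 1) for k ~ Poisson(mu): 1 - P(0)."""
    return 1 - math.exp(-mu)


mu = 0.1  # arbitrary example loading rate
print(p_at_least_two(mu))                       # fraction of all droplets with >= 2
print(p_at_least_two(mu) / p_at_least_one(mu))  # fraction of non-empty droplets with >= 2
```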
Which statement about UMIs in single-cell sequencing is false:
UMIs can be used to address PCR bias
UMIs are sequence barcodes of length 8 bp
UMIs are used for demultiplexing individual cells
How many unique barcodes can be generated with a DNA sequence of length L?
4^L
Complete the following sentences on barcode quality control with one word: Barcodes with a low count depth, few detected genes, and a high fraction of mitochondrial counts are indicative of ... cells.
Cells with unexpectedly high counts and a large number of detected genes may represent ...
Barcodes with a low count depth, few detected genes, and a high fraction of mitochondrial counts are indicative of dead cells.
Cells with unexpectedly high counts and a large number of detected genes may represent doublets.
Question part A: t-SNE is a widely used method in single-cell analysis workflows. Answer the following questions with the correct answer:
t-SNE is a ...
clustering
cell-type annotation
visualization
× visualization
Question part B: technique, which does not use
gradient descent
the Riemannian metric
the Kullback-Leibler divergence
internally.
× the Riemannian metric
Why does t-SNE use a t-distribution rather than a Gaussian distribution for modeling the neighborhood in low-dimensional space?
Because the heavier tails of this distribution allow it to better model larger distances.
The figure below shows a UMAP of cells, which form multiple distinct clusters in two-dimensional space. Name and explain two pitfalls of transformation methods such as UMAP or t-SNE. Hint: think about the relationship between the clusters shown here.
cluster size is meaningless
distances between clusters are meaningless
patterns can be misleading
Name one example of a small and one example of long non-coding RNA class and describe them in one sentence:
miRNA, siRNA: RNA interference
circRNA, lncRNA, eRNA, pseudogenes, lincRNA, NAT: regulatory functions, artifacts,
Which of the following statements is true?
The miRNA is incorporated into the RNA-induced silencing complex together with an argonaute protein that facilitates target recognition
The miRNA binds directly to RNA that have matching binding sites in their 3’ UTR. The miRNA-RNA duplex is called RNA-induced silencing complex
The miRNA binds directly to the DNA forming the RNA-silencing complex and prevents the transcription of RNA
× The miRNA is incorporated into the RNA-induced silencing complex together with an argonaute protein that facilitates target recognition
Name three features / properties commonly used by computational miRNA prediction tools:
Seed sequence match
Sequence conservation
Site accessibility
Minimum free energy/binding energy
miRNA prediction tools are known to suffer from a high false positive rate. Give one reason for this and explain in one sentence.
Sequence-based methods do not consider tissue- or cell-type-specific expression.
What is the competing endogenous RNA hypothesis? Describe in 1–2 sentences.
It suggests that transcripts can act as microRNA sponges, thus influencing expression levels of otherwise suppressed transcripts.
SPONGE is a network inference tool for constructing ceRNA gene-regulatory networks. It is neither computationally feasible nor meaningful to test all possible interactions. Name two strategies to identify the most promising gene pairs to test.
miRNA target site prediction / seed match etc
gene-miRNA regression coefficient negative
What is a method to identify interesting nodes in a network?
Degree / any centralities
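Degree, the simplest centrality, can be computed directly from an edge list. The gene names and edges below are invented for illustration.

```python
from collections import Counter


def degree(edges):
    """Number of edges touching each node of an undirected network."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


# Hypothetical ceRNA network as a list of undirected edges.
edges = [("geneA", "geneB"), ("geneA", "geneC"), ("geneA", "geneD"), ("geneB", "geneC")]
hub, deg = degree(edges).most_common(1)[0]
print(hub, deg)  # geneA 3
```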
In tandem mass spectrometry, so called MS1 and MS2 (or MS/MS) spectra are acquired. Describe the purpose of the two types of spectra, what information they contain, and roughly how often (or the frequency) those spectra are acquired.
slide 74-80
The figure below shows an enlarged portion of an MS1 scan. Based on the shown isotope envelope, estimate the length of the underlying peptide(s). Assume that the average mass of amino acids is 100 Da. Explain why two different (whole number) peptide lengths are possible for the depicted envelope. Explain briefly why you think that 1 or 2 peptides are present here.
1 peptide precursor with charge 4 (0.25 m/z difference between isotopes) –> 350 m/z * 4 z / 100 Da/AA = 14 AAs
Another peptide precursor with charge 2 (0.5 m/z difference between isotopes) –> 350*2/100 = 7 AAs; Second peptide visible due to the distorted isotope pattern (every other peak shows higher intensity, which we would not expect to see if only a single isotope pattern is present).
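The estimate above follows from two steps: the isotope spacing in m/z is roughly 1/z, and m/z times z divided by the average residue mass gives the length. A minimal sketch using the numbers from the answer, ignoring proton masses and with illustrative function names:

```python
def charge_from_spacing(isotope_spacing_mz):
    """Isotope peaks are about 1 Da apart, so their spacing in m/z is about 1/z."""
    return round(1.0 / isotope_spacing_mz)


def estimated_length(mz, charge, avg_residue_mass=100.0):
    """Approximate peptide length from m/z and charge (proton masses ignored)."""
    return round(mz * charge / avg_residue_mass)


for spacing in (0.25, 0.5):
    z = charge_from_spacing(spacing)
    print(f"spacing {spacing} m/z -> charge {z} -> about {estimated_length(350.0, z)} AAs")
```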
Draw a tandem mass spectrum of a peptide of your choice (masses do not have to be correct) with axis-labels and peak-annotations (fragments + number). Describe what one can see in it.
• A tandem mass spectrum (MS2) is the same as any mass spectrum: a list of (m/z, intensity) tuples represented by a bar chart (x-axis m/z, y-axis intensity), with the difference that this spectrum was acquired from a specific analyte (e.g. peptide) selected in MS1 and thus has the precursor mass from which it originated attached.
• It records the fragmentation characteristics of the analyte (e.g. peptide) in the form of fragment ions.
• The image should contain an x-axis (m/z), a y-axis (intensity), and peaks which show some sort of laddering. Peaks are annotated with y_n and b_n, starting (at low m/z) with 1 and increasing ion numbers with increasing m/z. Some peaks may be skipped. The last y- and b-ion can only be length(sequence)-1.
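The b/y ladder can be generated in silico. The sketch below computes singly charged b- and y-ion m/z values from monoisotopic residue masses; the example peptide is arbitrary, modifications and higher charge states are ignored, and the function name is illustrative.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(peptide):
    """Singly charged b- and y-ion m/z values, numbered 1 .. len(peptide) - 1."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {}
    for i in range(1, len(peptide)):
        ions[f"b{i}"] = sum(masses[:i]) + PROTON           # N-terminal fragment
        ions[f"y{i}"] = sum(masses[-i:]) + WATER + PROTON  # C-terminal fragment
    return ions


ions = fragment_ions("PEPTIDE")
for name in sorted(ions, key=lambda n: ions[n]):
    print(f"{name}: {ions[name]:.4f}")
```

For a peptide of length 7 the ladder stops at b6 and y6, matching the length(sequence)-1 rule.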
What is the difference between digestion and fragmentation in a bottom-up mass spectrometry-based experiment.
Digestion: Proteases (e.g. trypsin) are used to enzymatically cleave proteins into peptides. While this increases sample complexity, the physico-chemical properties of peptides are more homogeneous than those of entire proteins, which allows us to analyze them using MS.
Fragmentation: Fragmentation happens on peptide level and induces peptide-backbone breakage resulting in (y- and b-) fragments that enable us to determine the primary sequence of the peptide.
When matching theoretical peaks (e.g. fragments) to observed peaks in a mass spectrum, the observed m/z is subject to variation. If you only had the choice to build a mass analyzer which is able to determine the m/z of analytes with either high accuracy or high precision, which one would you prefer for the observed m/z? Describe the difference between the two terms (one sentence each) and provide your reasoning for your choice which one is preferred (at most 2 sentences).
Slide 32
Describe one reason why de novo peptide sequencing is – compared to e.g. database searching – more difficult and is often not applied in practice.
De novo peptide sequencing is more difficult because it relies solely on the observed spectrum without any reference, making it prone to errors due to missing or ambiguous fragment peaks.
Which of the following statements is true about a database search engine used in proteomics:
Mimics the wet-lab workflow in silico to assign peptide sequences to spectra.
Searches the best matching spectrum in a library of previously identified spectra.
Generates sequence tags from experimental spectra which are then searched against a database to identify proteins.
× Mimics the wet-lab workflow in silico to assign peptide sequences to spectra.
Explain (1 sentence each) the five main steps of a database search engine for analyzing bottom-up tandem mass spectra acquired in a proteomics experiment. Assume the database digest is done already. Give one reason, why the standard approach may yield incorrect identifications.
Slide 47
peptide candidate selection based on precursor m/z
in silico fragmentation based on fragmentation method (b- and y-ion m/z calculation)
merging of spectra based on tolerance of the mass analyzer
scoring merged spectra based on e.g. the number of matching (theoretical/observed) fragments or explained intensity
selection of the highest scoring candidate
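The steps listed above can be strung together in a toy search function. This is a sketch, not how any particular search engine is implemented: the candidate names and all m/z values are made-up numbers, the theoretical fragments are given directly instead of being computed from sequences, and the score is simply the number of matched fragments.

```python
def count_matches(theoretical, observed, tol):
    """Number of theoretical fragments with an observed peak within the tolerance."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


def search(spectrum, candidates, precursor_tol=0.01, fragment_tol=0.02):
    """Select candidates by precursor m/z, score them, return the best (score, name)."""
    # Step 1: candidate selection based on precursor m/z
    selected = [c for c in candidates
                if abs(c["precursor_mz"] - spectrum["precursor_mz"]) <= precursor_tol]
    # Steps 2-4: theoretical fragments (precomputed here), matching, scoring
    scored = [(count_matches(c["fragments"], spectrum["peaks"], fragment_tol), c["name"])
              for c in selected]
    # Step 5: highest-scoring candidate (None if nothing was selected)
    return max(scored, default=None)


# Made-up numbers for illustration, not real peptide masses.
spectrum = {"precursor_mz": 400.20, "peaks": [100.0, 200.0, 300.0]}
candidates = [
    {"name": "candidate_1", "precursor_mz": 400.20, "fragments": [100.01, 200.0, 250.0]},
    {"name": "candidate_2", "precursor_mz": 400.20, "fragments": [100.0, 200.01, 300.0]},
    {"name": "candidate_3", "precursor_mz": 500.00, "fragments": [100.0, 200.0, 300.0]},
]
print(search(spectrum, candidates))  # (3, 'candidate_2')
```

Note that the best candidate is returned even if its score is poor, which is one way the standard approach can yield incorrect identifications.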
Briefly describe the FDR approach in proteomics? Specifically mention: Why do we need decoys? How can they be generated? How do decoys simplify the FDR formula?
FDR estimation is done using the target-decoy approach.
Decoys are proteins which are known to be absent in the sample which can be generated using e.g. shuffling of AAs in target sequence, reversing target sequence or randomly generated sequences.
The number and properties of decoy proteins have to match those of the target proteins, such that false positives have a 50/50 chance of being matched in either the target or the decoy space.
The method is used to control FPs (type I errors). In proteomics, FPs would give rise to incorrectly identified peptides and thus proteins which are actually not present. This may lead to false hypotheses, e.g. in biomarker selection.
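Because a false match is equally likely to land in the target or the decoy space, the number of decoy hits above a score threshold estimates the number of false target hits, so the FDR reduces to roughly decoys / targets. A minimal sketch with invented scores; other variants of the formula exist, and this one is an assumption chosen for simplicity.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as (#decoy hits) / (#target hits) at or above the threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Invented peptide-spectrum matches: (score, is_decoy)
psms = [(9.1, False), (8.7, False), (8.0, False), (7.5, True), (7.2, False), (6.0, True)]

print(fdr_at_threshold(psms, 7.0))  # 0.25
print(fdr_at_threshold(psms, 8.0))  # 0.0
```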
Why do we have to recalculate FDR at peptide- and protein-level even though at spectrum-level, we have already filtered our final list at 1% FDR. Give an example to illustrate your reasoning.
Accumulation of error, Slide 84
Metaproteomics is an umbrella term for experimental approaches to study all proteins in microbial communities and microbiomes from environmental sources. An example is the human gut, which is estimated to host about 10¹³–10¹⁴ microbial cells from thousands of different bacterial strains, all of which share large portions of their genome, resulting in a very large search space of potentially millions of protein entries in the FASTA file. Describe one computational problem you foresee when analyzing a mass spectrometry-based metaproteomics experiment which can substantially impair the number of confidently (<1% FDR) identified peptides or proteins.
Base problem: a large number of proteins (and thus peptides) are shared across the thousands of strains. Minor sequence variations will result in very similar peptides.
Option 1: Scoring candidate peptides may fail because of the large number of peptides which may only exhibit minor sequence variations (e.g. permutations)
Option 2: FDR estimation may fail because we generate decoys by e.g. reversing target proteins. Given the very large search space, the decoy peptides may be too similar to target peptides to differentiate them from one another. The target and decoy score distributions may overlap significantly, and thus no reasonable FDR estimation can be done.
Option 3: Protein inference will be a problem due to the large number of peptides shared across proteins from different strains. We may only identify very large protein groups and be unable to pinpoint a particular protein which may be present.
Other options are also possible.
Which of the following statements is true about the averagine model:
Estimate the mass of peptide based on its m/z and charge.
Estimate the amino acid sequence of a peptide.
Estimate the relative intensities of the isotopes when the sequence of the peptide is unknown.
× Estimate the relative intensities of the isotopes when the sequence of the peptide is unknown.
Name a similarity/dissimilarity measure that can be used for comparing two spectra. Explain briefly how it works and what it quantifies. Explain why, and if so which, normalization is or is not necessary for this measure.
e.g. normalized spectral contrast angle: quantifies the angle between two spectra represented as intensity vectors. No additional normalization is necessary since it already contains an L2 norm (p = 2).
e.g. Euclidean distance: quantifies the distance between the "arrow tips" of the two vectors. Normalization is necessary, e.g. L2 (p = 2), to avoid quantifying the difference in length of the vectors (spectra).
Cos, Jaccard, ...
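The normalized spectral contrast angle can be sketched in a few lines. It assumes both spectra have already been binned or matched into intensity vectors of equal length, and it uses the common form 1 − 2·arccos(cos)/π; the example vectors are arbitrary.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (L2 norm)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spectral_angle(a, b):
    """Normalized spectral contrast angle: 1 = identical direction, 0 = orthogonal."""
    a, b = l2_normalize(a), l2_normalize(b)
    cos = sum(x * y for x, y in zip(a, b))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding slightly outside [-1, 1]
    return 1 - 2 * math.acos(cos) / math.pi


print(spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # close to 1: same shape, different scale
print(spectral_angle([1.0, 0.0], [0.0, 1.0]))            # 0.0: no shared peaks
```

Because the vectors are L2-normalized inside the function, scaling all intensities of one spectrum does not change the result.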
Why do mass spectrometrists prefer to use an indexed (or relative) retention time rather than the (absolute) retention time? Describe how indexed retention time can be calculated for peptides. Describe one preferred property a peptide should have in order to use it for indexed retention time calculation.
slide 39
Why is the automatic identification of metabolites – similar to what is done in database searching for peptides in proteomics – so difficult? Provide and briefly describe (max. 2 sentences each) two reasons.
slide 100
Briefly describe a deep learning method/tool/model which is used in computational mass spectrometry. Why do you think deep learning has advantages over classic approaches in this use case? Why do you think deep learning has a disadvantage or may impair the analysis in this case?
slides 58-64