Argue for or against the assumption that it is possible at all to predict protein function from sequence (only first 3 bullets count, 4pts)
Pro:
Dogma: Sequence determines Structure determines Function
Proteins with a high sequence similarity OFTEN have a similar function
Against:
Function depends on possible complex partners, tissue / cell localization, chemical environment -> things that aren't directly related to the sequence
List three reasons why the same protein may have more than one molecular function (3 bullets, 3pts).
Multi-domain proteins -> each domain can have a different binding partner -> different functions
Post-translational modifications such as phosphorylation can change their activity or specificity
Moonlighting proteins: Some proteins also have a secondary function that may be performed in a different cellular location or under different conditions
What problem could be caused by a moonlighting protein to a method predicting function? (Max 1 sentence, 1pt)
Is this related to the multi-domain problem (yes/no + argument, 2pts; total 3pts)
The prediction method could focus on one function of the protein and ignore the other(s), or misclassify the protein because of the conflicting labels of its functions.
No, this is not related to the multi-domain problem, because moonlighting proteins can perform their different functions using the same domain.
What is the main idea behind the concept of homology-based inference (HBI; graph and/or few bullets, 2pts).
Claim "most database annotations are based on HBI". Comment (correct: yes, now + few bullets, 2pts).
Assume you want to predict location in 10 states. You can use HBI, what could you gain from ML that HBI cannot get (bullets, 2-3 pts).
Main idea: if the sequence similarity of two proteins Q (query protein with unknown annotation) and E (protein with experimental annotation) is higher than a threshold T, then copy the annotation of E to Q (see the sketch after these bullets)
The claim is true!
HBI is cost efficient
HBI is relatively easy
ML can:
do novelty prediction, whereas HBI is unusable if no similar sequences with annotations are known
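A minimal, runnable sketch of the HBI rule (toy sequences; a naive identity score stands in for a real alignment, and the threshold T = 0.4 is an arbitrary assumption):

```python
def seq_identity(a, b):
    # Naive pairwise identity: toy stand-in for a real alignment score.
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def hbi_transfer(query, annotated_db, threshold=0.4):
    # If similarity(Q, E) exceeds the threshold T, copy E's annotation to Q.
    best_ann, best_sim = None, threshold
    for seq, annotation in annotated_db.items():
        sim = seq_identity(query, seq)
        if sim > best_sim:
            best_ann, best_sim = annotation, sim
    return best_ann  # None = HBI fails: no similar annotated sequence known

db = {"MKTAYIAKQR": "nucleus", "MLSRAVCGTS": "mitochondrion"}
print(hbi_transfer("MKTAYIAKQL", db))  # -> 'nucleus' (9/10 positions identical)
```

The `return None` branch is exactly the novelty case where ML can still predict but HBI cannot.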
Name the three different hierarchies in the GO, and provide an example for each (not number but name, 3pts).
List three challenges for developing a method that predicts molecular function (in terms of GO numbers; three bullets, 3 pts; total 6 pts).
Hierarchies:
Biological process -> cell cycle control
Molecular function -> RNA binding
Cellular component -> activin complex
Challenges:
Non-trivial data structure: labels form a DAG (toy sketch after this list)
Constant changes of labels -> annotations quickly become outdated
Little high-quality/experimental data available
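A toy illustration of why the DAG structure is non-trivial (term IDs are made up): annotating a protein with one GO term implicitly annotates it with all ancestor terms, so the labels are not independent classes.

```python
# Toy GO-style DAG: child term -> list of parent terms (IDs invented).
parents = {
    "GO:0000003": ["GO:0000002"],
    "GO:0000002": ["GO:0000001"],
    "GO:0000004": ["GO:0000001"],
}

def ancestors(term):
    # A predicted term implies all of its ancestors up to the root.
    out = set()
    for p in parents.get(term, []):
        out |= {p} | ancestors(p)
    return out

print(ancestors("GO:0000003"))  # {'GO:0000002', 'GO:0000001'}
```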
You want to develop a method predicting protein sub cellular location in 10 states. You consider using a simple artificial neural network (single hidden layer, ANN) or a Support Vector Machine (SVM).
Name two advantages of SVM over ANN (bullets, 2pts).
One solution of your problem might be a hierarchical system (sketch how), name two disadvantages of such a hierarchy (bullets, 2pts; total 5 pts).
Advantages of SVM over ANN:
SVMs require less data
SVM models are more lightweight
SVMs are faster to train
Chain SVM classifiers by predicting the localization at each level separately, from top to bottom (sketch below). Disadvantages:
Errors at higher levels of the hierarchy propagate to the lower levels and cannot be corrected
The SVMs zoom into experimental mistakes in the labels (errors are amplified rather than averaged out)
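A toy sketch of the chained hierarchy with scikit-learn SVMs; features and labels are random placeholders, not real protein data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))      # stand-in for per-protein feature vectors
top = rng.integers(0, 2, 200)       # level 1: e.g. intracellular vs secreted
sub = rng.integers(0, 5, 200)       # level 2: 5 finer locations per branch

clf_top = SVC().fit(X, top)                                           # level-1 SVM
clf_sub = {t: SVC().fit(X[top == t], sub[top == t]) for t in (0, 1)}  # level-2 SVMs

def predict(x):
    t = int(clf_top.predict(x.reshape(1, -1))[0])     # level-1 decision ...
    s = int(clf_sub[t].predict(x.reshape(1, -1))[0])  # ... fixes which level-2 SVM runs
    return t, s

print(predict(X[0]))  # a wrong level-1 decision can never be corrected at level 2
```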
Protein location is regulated by sorting mechanisms. For some of those, the "signals" on the sequence relevant for sorting are somehow known.
Two particular examples are
a) NLS (nuclear localization signal)
b) signal peptides.
Give two reasons (bullets) why a) is difficult to predict by ML and b) is not (2pts)
Compared to signal peptides, NLS are difficult to identify / predict because they are not conserved -> too different from each other
More data available for signal peptides
Describe the difference between per-protein and per-residue prediction (two bullets).
Embeddings from pLM are intrinsically either a) per-protein, b) per residue and c) both mixed. Pick your answer and explain
Different levels of granularity!
Per-protein predictions: made at the level of the entire protein; typically used to predict properties such as PPIs or protein function.
Per-residue predictions: made at the level of individual amino acid residues within a protein; typically used to predict properties such as residue-residue contacts.
Both mixed would mean generating an embedding for the whole protein sequence and also one for each residue position -> the embeddings can complement each other and capture more information (see the pooling sketch below)
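A shape-level sketch of how the two granularities relate (random vectors stand in for real pLM output; the dimensions are illustrative):

```python
import numpy as np

L, d = 120, 1024                        # sequence length, embedding dimension
per_residue = np.random.rand(L, d)      # one vector per amino acid position
per_protein = per_residue.mean(axis=0)  # mean-pooling -> one vector per protein

print(per_residue.shape, per_protein.shape)  # (120, 1024) (1024,)
```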
You develop a protein language model (pLM).
Argue why the loss function is NOT the best way to assess how valuable the information learned by the pLM (the embeddings) is.
Name two other ways (W1 and W2) of assessing embeddings and compare their value ("W1 more informative than W2, because", < 3 bullets, total 5pts).
The loss function does not measure the quality of the embeddings, only how well the model solves its pre-training objective -> not a direct measure of the embeddings' value.
W1: Evaluating performance on several different downstream tasks (probing sketch below)
W2: Visualize the embeddings using techniques such as t-SNE, UMAP
W1 is more informative than W2 because W2 is not directly linked to any performance metric and is hard to quantify, while W1 directly reflects the practical value and the biological meaning stored in the embeddings
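W1 sketched as a simple probing experiment: train a lightweight classifier on frozen embeddings for one downstream task and report accuracy; repeating this over several tasks gives the multi-task assessment described above (all data are random placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

emb = np.random.rand(500, 1024)      # frozen pLM embeddings (placeholder)
loc = np.random.randint(0, 10, 500)  # e.g. 10-state location labels

X_tr, X_te, y_tr, y_te = train_test_split(emb, loc, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(probe.score(X_te, y_te))       # higher = more task-relevant information
```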
You want to develop a method predicting physical protein-protein interactions (PPIs). In your data set you distinguish between three cases:
C1: test A-B (A interacts with B; A and B in training/validation set but NOT their interactions)
C2: test A-B (A interacts with B; either A or B in training/validation set but NOT their interactions)
C3: test A-B (A interacts with B; neither A or B in training/validation).
Give three reasons why it is so difficult to predict C3 (bullets, 3pts).
The model will inherently predict binding more often for proteins it has seen in training, and non-binding for any novel proteins
When trained and evaluated using C1, the model can focus on frequent binders, i.e. proteins with very many interactions; on C3 it is forced to rely on the features that are actually key for PPIs, which is much harder
C3 is similar to predicting an unseen class
What challenges may arise when creating a data set for the prediction of physical PPIs? (3 bullets, 3pts)
How could you address those challenges (shorter, 2pts)?
Why is the Big Fantastic Database (BFD) with 2.7 billion protein sequences not used for PPI prediction?
Data availability: There is often a scarcity of experimental data on PPIs, and the data that is available may be incomplete or biased.
Data preprocessing: there can be a lot of noise in the data, and it can be challenging to preprocess the data in a way that removes noise while preserving the true PPIs
Data imbalance: PPI datasets can be highly imbalanced, with a large number of non-interacting protein pairs compared to interacting pairs.
Solutions:
Data availability: gather data from as many sources as possible, including both experimental and computational data
Data preprocessing: use techniques such as feature selection and feature scaling to remove noise and improve the quality of the data
Data imbalance: oversampling the minority class, undersampling the majority class, or cost-sensitive learning (see the sketch below)
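Two of these fixes in a minimal scikit-learn sketch (random placeholder data; `class_weight="balanced"` is the cost-sensitive option):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))             # stand-in for protein-pair features
y = (rng.random(1000) < 0.05).astype(int)  # ~5% interacting pairs (imbalanced)

# Cost-sensitive learning: errors on the rare positive class cost more.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Undersampling: keep all positives, subsample the negatives to match.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
clf_balanced = LogisticRegression().fit(X[keep], y[keep])
```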
BFD: not used for PPI prediction because it is mostly unannotated data, and there is no experimental “truth” to validate results against
What is the main idea behind the Zipf law/the Zipf function (giving the function is not necessary)?
You measure the distribution of protein family sizes (how many families with N members; family = all proteins with the same enzymatic activity). You find that the function is a constant (flat) line (same number of families for most N): what may that suggest? (4-6pts)
Main idea:
Describes the tendency in many fields (e.g. biology, sociology, physics) that the rank-frequency distribution follows a power law if the rank is linked to a limited resource in a competitive system
One would expect the distribution of protein family sizes to follow Zipf’s law -> most families have only a few members, whereas a few families have a lot
If the distribution of family sizes is flat, the dataset probably contains many redundant sequences and was not redundancy-reduced, thus not following Zipf’s law
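A quick check, sketched with invented family sizes: fit a line in log-log rank-size space; a clearly negative slope is Zipf-like, a near-zero slope matches the suspicious flat distribution:

```python
import numpy as np

family_sizes = np.array([5000, 1200, 400, 150, 60, 25, 10, 4, 2, 1])  # toy data
sizes = np.sort(family_sizes)[::-1]  # largest family = rank 1
ranks = np.arange(1, len(sizes) + 1)

slope = np.polyfit(np.log(ranks), np.log(sizes), 1)[0]
print(f"log-log slope ~ {slope:.2f}")  # clearly negative -> Zipf-like power law
```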
How to rank methods [not formulated, but will be in exam probably]
Look at the standard error (StdErr): if the mean ± StdErr intervals of two methods overlap, the two methods are not significantly different.
For multiple methods: count how many times each method is significantly better (or worse) than the other methods (sketch below).
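The counting rule as a small sketch (the Q2 means and standard errors below are invented):

```python
def overlaps(m1, se1, m2, se2):
    # Mean +/- StdErr intervals overlap -> no significant difference.
    return not (m1 + se1 < m2 - se2 or m2 + se2 < m1 - se1)

def rank_methods(methods):  # methods: {name: (mean_Q2, std_err)}
    wins = {name: 0 for name in methods}
    for a, (ma, sa) in methods.items():
        for b, (mb, sb) in methods.items():
            if a != b and ma > mb and not overlaps(ma, sa, mb, sb):
                wins[a] += 1  # a is significantly better than b
    return sorted(wins, key=wins.get, reverse=True)

print(rank_methods({"N": (0.82, 0.01), "O1": (0.78, 0.01), "O2": (0.77, 0.02)}))
```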
List two problems for using per-residue predictions to predict sub cellular location (bullets, 2pts).
Per-residue predictions alone may not account for:
contextual information, such as interactions and dependencies between surrounding residues
structural information
both of which are often crucial for predicting sub-cellular location
You have three methods predicting the effects of SAVs.
For all SAVs experimentally known before DMS (Deep Mutational Scanning, large scale experimental study of SAVs), all three methods had exactly the same performance (within error bars).
Thus, the three perform similarly for predicting the effects in all human SAVs? (T/F + explain < 3 bullets, 3pts)
False.
The performance of the methods can depend on the specific dataset, features, and parameters used
They were probably trained on and optimized for a specific set of SAVs and it’s not guaranteed that they perform as well on different datasets
What is the main idea behind the concept of evolutionary couplings (best provide a sketch along a few bullet points; 3pts)
Main idea: evolutionary couplings are pairs of residues, not necessarily close in sequence, that evolve/mutate/change together. When one mutates, the other (almost) always does, too.
Use: evolutionary couplings between two residues predict that they are in contact in 3D space (toy sketch below)
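A toy illustration scoring co-evolution of two MSA columns by mutual information; real coupling methods (e.g. direct coupling analysis) additionally correct for phylogenetic bias and transitive effects:

```python
from collections import Counter
from math import log2

def mutual_information(col_i, col_j):
    # High MI = the two columns change together across the alignment.
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    return sum(c / n * log2((c / n) / (pi[a] / n * pj[b] / n))
               for (a, b), c in pij.items())

msa = ["ARNDC", "ARNEC", "GKNDC", "GKNEC"]   # 4 toy sequences, 5 columns
cols = list(zip(*msa))
print(mutual_information(cols[0], cols[1]))  # 1.0: columns 1 and 2 co-vary
print(mutual_information(cols[0], cols[3]))  # 0.0: columns vary independently
```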
EAT is an abbreviation for embedding-based annotation transfer.
What is the difference between HBI and EAT? (maximal one bullet, 1pt)
Why could EAT become a new baseline for prediction methods, e.g., to decide on what is redundancy?
Instead of sequence similarity, EAT is based on the Euclidean distance between single protein sequence representations (embeddings) from protein Language Models (pLMs) to infer protein similarity.
EAT should be less biased than HBI and could potentially capture more information
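A minimal sketch of the transfer step (random vectors stand in for real pLM embeddings such as ProtT5 per-protein vectors; the annotations are hypothetical):

```python
import numpy as np

def eat_transfer(query_emb, lookup_embs, lookup_labels):
    # Transfer the annotation of the nearest neighbour in embedding space.
    dists = np.linalg.norm(lookup_embs - query_emb, axis=1)  # Euclidean distance
    hit = int(np.argmin(dists))
    return lookup_labels[hit], float(dists[hit])

lookup = np.random.rand(100, 1024)                   # embeddings of annotated proteins
labels = [f"location_{i % 10}" for i in range(100)]  # hypothetical annotations
print(eat_transfer(np.random.rand(1024), lookup, labels))
```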
You compare two existing methods (O1 and O2) to a new method N. You find that N>O1 and N>O2 (">" means "statistically significantly higher Q2").
Nevertheless, there could be ways in which you could use O1 and O2.
Provide two possible examples explaining how O1 and O2 (or O1 OR O2) could complement N (4pts).
Averaging over all three predictions could improve performance, since each method might focus on different features (the predictions must not be strongly correlated!)
Use O1 and O2 as a complement to N for data points on which O1 and/or O2 predict correctly and with high confidence (sketch below)
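Both ideas as a tiny sketch (the probabilities and the confidence margin of 0.3 are invented assumptions):

```python
import numpy as np

p_n, p_o1, p_o2 = 0.55, 0.90, 0.20   # predicted interaction probabilities

# 1) Average the three (ideally weakly correlated) predictions.
p_ensemble = float(np.mean([p_n, p_o1, p_o2]))

# 2) Defer to an older method only where it is much more confident than N.
p_final = p_o1 if abs(p_o1 - 0.5) > abs(p_n - 0.5) + 0.3 else p_n

print(p_ensemble, p_final)  # 0.55 0.9
```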
Describe (or invent) a method predicting binding residues better than random. (Just naming an existing method is insufficient; instead describe input/output and the method in between. TOTAL 9 PTS)
Input: Known binding protein pairs and their binding residues
Output: Probability of binding for residues
Method:
Align the proteins using MSA to identify conserved regions likely to be involved in binding.
Use sequence-based prediction tools to identify binding residues on each protein
Use structure-based prediction tools to predict the binding interface of the two proteins.
Combine the results from steps 2 and 3 and train a machine-learning model on known binding protein pairs and their binding residues (sketch below)
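The pipeline condensed into a per-residue classifier sketch; the three feature columns only stand in for the conservation, sequence-based, and structure-based signals from the steps above, and the random data makes this purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 5000                   # residues from known binding protein pairs
X = np.column_stack([
    rng.random(n),         # feature 1: MSA conservation score
    rng.random(n),         # feature 2: sequence-based binding propensity
    rng.random(n),         # feature 3: structure-based interface score
])
y = rng.integers(0, 2, n)  # label: 1 = known binding residue

model = RandomForestClassifier(n_estimators=100).fit(X[:4000], y[:4000])
p_binding = model.predict_proba(X[4000:])[:, 1]  # output: P(binding) per residue
print(p_binding[:5])
```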
AlphaFold2 is arguably the biggest breakthrough in protein structure prediction ever. Why is it so difficult for AlphaFold2 to predict the effects of SAVs (max 2 bullet points, 2pts).
AlphaFold2 is too slow to predict SAV effects at scale.
Alphafold2 uses MSAs and thus is not sensitive to SAVs.
(A single altered sequence gets drowned out in the noise of a huge MSA table)