What is a domain in a protein?
= 3D unit that folds on its own, without other parts of the protein
Define protein structure & protein function (each max 3 bullets) 3 pts
Protein Structure:
amino acid sequence defines structure
= the shape into which a protein folds in its natural, 'unbothered' (native) state
-> structure defines function
there are 4 levels of protein structure with different levels of complexity
primary (amino acid sequence),
secondary (alpha helix, beta strand, etc.),
tertiary (3D fold of a single chain, stabilized by bonds),
quaternary (assembly of several chains/subunits)
Protein Function:
diverse: chemical, biochemical, cellular, developmental, physiological, genetic
depending on the perspective: the function of a protein is anything that happens to or through a protein
Protein structure in 1D, 2D, 3D: describe (each max 1 bullet) 3 pts
1D: amino acid sequence (e.g. MAYT…) or string of secondary structure states
-> (note: 1D-representable as a string of states, e.g. HHCCEEH… with H = helix, E = strand, C = coil; '2D' does NOT mean secondary structure)
2D: inter-residue distances, visualized as a matrix (distance/contact map; sketch below)
3D: coordinates of protein structure and visualization
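To make the 1D/2D/3D distinction concrete, here is a minimal sketch (assuming numpy; the coordinates are invented toy values, not taken from a real PDB file) deriving the 2D distance map and a contact map from 3D coordinates:

```python
import numpy as np

# Toy 3D coordinates (think C-alpha atoms) for a 5-residue protein.
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.5, 0.0],
    [9.0, 3.5, 1.0],
    [7.0, 6.5, 2.0],
])

# 2D representation: pairwise inter-residue distances (L x L matrix).
diff = coords[:, None, :] - coords[None, :, :]
dist_map = np.sqrt((diff ** 2).sum(axis=-1))

# A contact map thresholds the distance map (8 Angstrom is a common cutoff).
contact_map = dist_map < 8.0
print(np.round(dist_map, 1))
```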
Speculate: Why do most proteins have more than one structural domain?
enhanced functionality:
different domains within the protein can perform multiple tasks at the same time or sequentially
e.g. a binding domain and a cutting domain (SARS-CoV-2), or PPIs (protein-protein interactions) with different binding partners for the domains of the bound proteins
modular architecture allows evolutionary flexibility:
domains can evolve independently and be combined resulting in new functionality
enhanced stability: multiple independently folding domains can stabilize the overall protein
Experimental high-resolution structures are known for fewer than 10% of the ~20k human proteins. Give three reasons (3 bullets) 3pts.
Cost
-> resource limitations
Functional complexity & variability
-> leads to ‘Technical challenges’ depending on the chosen methods
further hindrances can be:
protein size
presence of transmembrane regions
presence of disorder or susceptibility to conformational change
time-consuming:
e.g. the protein must be produced in sufficient quantities and purified
Most methods predicting secondary structure in 3 states (helix, strand, other) predict strands much worse than helix.
Comment in fewer than 5 bullet points (telegram style; 5 pts).
much more data available on helices than on strands
sometimes leading to a larger proportion of helix data in the training data
this imbalance leads to helices being predicted better
balancing the classes with over-/undersampling makes both classes similarly well predicted; here, strands would have to be oversampled (see the sketch below)
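A minimal sketch of the oversampling idea (assuming numpy; labels and features are synthetic stand-ins for real residue data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: 0 = helix, 1 = strand (minority), 2 = other.
y = rng.choice([0, 1, 2], size=1000, p=[0.35, 0.20, 0.45])
X = rng.normal(size=(1000, 20))        # placeholder per-residue features

# Oversample strands (class 1) until they match the largest class.
counts = np.bincount(y)
extra = rng.choice(np.where(y == 1)[0], size=counts.max() - counts[1], replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y), "->", np.bincount(y_bal))   # strand count now equals the max
```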
Helices are stabilized by hydrogen-bond formation, typically between residues i and i+4.
Does this mean secondary structure is a 2D feature of structure? (≤2 bullets/sentences; 2 pts)
No, it is a 1D feature as the states can be represented as a sequence string
2D would be a distance/contact map (not the Ramachandran plot, which shows backbone angles); the i→i+4 hydrogen bond is local in sequence, so the string of states (1D) already encodes it
Q3 measures the percentage of residues correctly predicted in one of 3 states (H: helix, E: strand, O: other). Methods predicting H|E|O equally well may have a lower Q3 than those predicting O best.
Why? (1 bullet, 2 pts).
the Q3 score is plain per-residue accuracy and ignores class imbalance
-> O ('other') is by far the most frequent state (roughly half of all residues)
-> a method biased towards O therefore gets more residues correct in total (same denominator: all residues), even if H and E are predicted poorly (toy example below)
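A toy calculation illustrating the effect (hypothetical predictions; assumes numpy):

```python
import numpy as np

def q3(y_true, y_pred):
    """Q3 = fraction of all residues whose state (H/E/O) is predicted correctly."""
    return (np.asarray(y_true) == np.asarray(y_pred)).mean()

# Toy case: O dominates (6 of 10 residues), as it roughly does in real proteins.
y_true = list("OOOOOOHHEE")
pred_a = list("OOOEEEHOEO")   # predicts every class equally well (50% each)
pred_b = list("OOOOOOOOOO")   # 'predicts O best' (here: only O)

print(q3(y_true, pred_a))     # 0.5 -- balanced method
print(q3(y_true, pred_b))     # 0.6 -- O-biased method scores HIGHER
```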
Predicting secondary structure through one-hot-encoding implies that each residue position is represented by 21 binary units. Why not 20? What else could you do?
Speculate why your alternative has not been used in existing methods (5 pts).
Why not 20?
because the sliding window around a residue can extend beyond the protein's ends: the 21st unit marks such an 'empty' position at the N-/C-terminus
Alternative:
integer encoding: problem, it imposes an artificial numeric order, i.e. the amino acids are not encoded independently of each other
Why doesn’t it exist:
One hot encoding preserves categorical independence: it ensures that the encoded features for different amino acids are orthogonal or independent of each other.
Each amino acid is represented by a separate binary unit, allowing machine learning algorithms to treat each amino acid equally and avoid any unintended relationships or biases that could be introduced by integer encoding.
one-hot encoding is also simple to compute and natively suited to categorical data (a minimal encoding sketch follows below)
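A minimal sketch of the 21-unit window encoding (assuming numpy; the window size of 13 is an assumption for illustration, real methods vary):

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"              # 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AAS)}

def one_hot_window(seq, center, win=13):
    """Encode a window around `center` with 21 binary units per position;
    unit 20 marks positions outside the sequence (window over the termini)."""
    half = win // 2
    out = np.zeros((win, 21))
    for k, pos in enumerate(range(center - half, center + half + 1)):
        if 0 <= pos < len(seq) and seq[pos] in AA_INDEX:
            out[k, AA_INDEX[seq[pos]]] = 1.0
        else:
            out[k, 20] = 1.0              # the 'empty'/unknown slot
    return out.ravel()                    # fixed length: 13 * 21 = 273

vec = one_hot_window("MAYTKLVQE", center=0)  # window hangs over the N-terminus
print(vec.shape, int(vec[20::21].sum()))     # (273,) and 6 positions use unit 20
```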
Why do methods using evolutionary information perform better than those using one-hot encoding (≤2 bullets/sentences; 2 pts).
evolution conserves the regions critical for the correct structure: if important residues had been changed, the protein would not have survived
using this information reduces noise and focuses on the residues relevant for structure (compared to just using the raw amino acids in a nearby window)
What do you need to consider in the comparison of methods predicting, e.g. secondary structure (you get a table and have to choose the number of digits, rank methods, note issues, and so forth)?
accuracy metrics (and error rates): Q3 score, per-class Q_i scores with i ∈ {H, E, O}
Ensure that the methods are evaluated on the same dataset or comparable datasets to make fair comparisons.
Number of digits: report only as many digits as the estimated error (e.g. standard error over test proteins) justifies -> depends on the context and the available data
THE breakthrough in protein prediction originated from using evolutionary information (originally in 1992 to predict secondary structure).
Where do you get evolutionary information from (≤2 bullets/sentences; 2 pts).
multiple sequence alignments (MSAs), built by searching sequence databases such as UniProt with tools like BLAST/PSI-BLAST (a toy profile calculation follows below)
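A toy illustration of turning an MSA into a sequence profile, the simplest form of evolutionary information (the alignment is invented; real profiles/PSSMs additionally use pseudocounts and background frequencies):

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"

# Toy MSA: rows = homologous sequences, columns = aligned positions.
msa = ["MKVLA",
       "MKILA",
       "MRVLG",
       "MKVLA"]

def profile(msa):
    """Per-column amino-acid frequencies (a simple sequence profile)."""
    prof = np.zeros((len(msa[0]), 20))
    for seq in msa:
        for j, aa in enumerate(seq):
            prof[j, AAS.index(aa)] += 1
    return prof / len(msa)

p = profile(msa)
print(np.round(p[1], 2))   # column 2: K in 3/4 sequences, R in 1/4 -> conservation
```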
What should you do if your score on the test set is better than the score on the training set?
in theory the model cannot perform better on unseen data than on the data it was fitted to: the training score is the expected upper bound
a better test score therefore signals a major problem, e.g. leakage/redundancy between training and test set, or a too small/easy test set -> re-check the split (and redundancy reduction); in the worst case start from scratch
What are the major improvements/breakthroughs of AlphaFold?
higher accuracy due to the following improvements:
it also predicts the quality of its predictions
AlphaFold2: faster and more detailed (atomic-level accuracy)
AlphaFold incorporates evolutionary information derived from multiple sequence alignments (MSAs) to enhance its predictions
not one-hot encoding, but a weighted vector (how likely each amino acid is at each position)
the more conserved an amino acid is over evolution, the higher the weight!
pairs of amino acids that change together ('evolutionary couplings') tend to be close in space -> predict the contact map! (toy coupling calculation below)
Much computational power and data needed!
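A toy sketch of the 'evolutionary couplings' idea using mutual information between MSA columns (the alignment is invented; real methods such as direct-coupling analysis are far more sophisticated):

```python
import numpy as np
from collections import Counter

# Toy MSA in which columns 1 and 3 co-vary (K<->E, R<->D), mimicking a contact.
msa = ["MKVEA",
       "MKIEA",
       "MRVDG",
       "MRLDA"]

def column(msa, j):
    return [s[j] for s in msa]

def mutual_information(msa, i, j):
    """MI between alignment columns i and j; high MI ~ evolutionary coupling."""
    n = len(msa)
    pi, pj = Counter(column(msa, i)), Counter(column(msa, j))
    pij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

print(mutual_information(msa, 1, 3))   # coupled columns -> 1.0 bit
print(mutual_information(msa, 0, 3))   # fully conserved column 0 -> 0.0
```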
Describe an AI (Artificial Intelligence)/ML (machine learning) method that predicts sub-cellular location (or cellular compartment) in three classes (c: cytoplasmic, e: extra-cellular, n: nuclear).
Make sure to explain how to cope with the fact that proteins have different lengths and AI/ML models need fixed input. 6 pts.
Input: Use the amino acid composition of the protein and its size
how to cope with different lengths as input: use length-independent global features, e.g. the amino-acid composition (20 frequencies) plus the protein length -> a fixed-size vector that can be fed into the AI/ML model (see the sketch after this list)
neural network
SVM: support vector machine (works well with less training data)
but an SVM only separates 2 classes
for 3 or more classes: either a two-level/hierarchical SVM or one-vs-rest with confidence values
hierarchical classification (following cellular pathways); problem: the hierarchical ORDER (e.g. nucleus -> yes/no, next compartment -> yes/no, …) influences the result and errors propagate -> the most reliable (trivial) classification should come first
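A minimal end-to-end sketch (assuming numpy and scikit-learn; the sequences, class-specific composition biases, and SVC settings are illustrative stand-ins for real labelled data):

```python
import numpy as np
from sklearn.svm import SVC

AAS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)

def composition(seq):
    """Fixed-size input regardless of protein length:
    20 amino-acid frequencies plus a (scaled) length feature."""
    freq = np.array([seq.count(aa) for aa in AAS], float) / len(seq)
    return np.append(freq, len(seq) / 1000.0)

def fake_protein(bias_aa, n):
    """Synthetic stand-in for a real protein of one location class."""
    p = np.full(20, 1.0)
    p[AAS.index(bias_aa)] = 8.0            # class-specific composition bias
    return "".join(rng.choice(list(AAS), size=n, p=p / p.sum()))

X, y = [], []
for label, bias in [("c", "L"), ("e", "C"), ("n", "K")]:   # e.g. nuclear: K-rich
    for _ in range(30):
        X.append(composition(fake_protein(bias, int(rng.integers(80, 400)))))
        y.append(label)

# An SVM separates 2 classes; sklearn's SVC handles 3 via one-vs-one internally.
clf = SVC(kernel="rbf").fit(np.array(X), y)
print(clf.predict([composition(fake_protein("K", 200))]))  # -> ['n']
```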
How can sequencing mistakes challenge per-protein predictions? 3 pts.
frameshifts (e.g. caused by an insertion/deletion) and amino-acid changes (due to substitutions or InDels) result in altered sequences and hence wrong predictions
they can impact the identification and characterization of functional domains, motifs, or important features within a protein sequence -> errors in these regions can mislead predictions of protein function, subcellular localization, or interaction partners
You want to develop a method that predicts binding residues (e.g. enzymatic activity and DNA-binding). Your entire data set of proteins with experimentally known binding residues amounts to 500 sequence-unique proteins with a total of 5,000 binding and 45,000 non-binding residues. You can only use a simple artificial neural network (of the feed-forward style) with one hidden layer, but the complexity of your problem demands at least 100 hidden units. Thus, even a simple one-hot encoding (or evolutionary information) for a single residue with 20 units is not supported by the data.
Explain why. What could you do, instead?
Why: the limited data cannot constrain the number of free parameters: a window of w residues × 20 units × 100 hidden units means tens of thousands of weights (e.g. w = 13 -> ~26,000), yet only 5,000 binding residues are available (see the calculation below)
Instead:
SVM (support vector machine) works better with less input data
utilize sequence-based features that capture evolutionary information -> MSAs or PSSMs (position-specific scoring matrices) can be used to represent the residues
these capture the conservation patterns across related protein sequences, providing more informative and robust features than one-hot encoding
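A back-of-the-envelope calculation of why the data cannot support such a network (the window size of 13 is an assumption for illustration):

```python
# Free parameters vs. available positives for the binding-residue network.
window = 13                              # assumed input window size
inputs = window * 20                     # one-hot: 260 input units
hidden = 100
weights = inputs * hidden + hidden * 1   # input->hidden plus hidden->output
positives = 5_000
print(weights)                           # 26,100 free parameters
print(positives / weights)               # ~0.19 positive examples per parameter
```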
What is redundancy reduction? How do you do it?
redundancy reduction should happen without loss of information
too similar data (nearly duplicates) is removed
hard to define what to remove in practice
How: choose a similarity threshold suited to the problem (e.g. <40% pairwise sequence identity) and remove every sequence already covered by one kept in the set (e.g. keep one representative per protein family); a greedy sketch follows below
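A greedy sketch of the idea (difflib's ratio is only a rough stand-in for pairwise sequence identity; real pipelines use tools like CD-HIT or MMseqs2):

```python
from difflib import SequenceMatcher

def identity(a, b):
    """Rough pairwise similarity between two sequences (toy proxy)."""
    return SequenceMatcher(None, a, b).ratio()

def reduce_redundancy(seqs, threshold=0.4):
    """Greedy: keep a sequence only if it is below `threshold` similarity
    to everything already kept -> near-duplicates are removed."""
    kept = []
    for s in seqs:
        if all(identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

seqs = ["MKVLAEQT", "MKVLAEQS", "GGHPRWYF"]   # first two are near-duplicates
print(reduce_redundancy(seqs))                # -> ['MKVLAEQT', 'GGHPRWYF']
```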
Protein Language Models (pLMs) copy models from Natural Language Processing (NLP). Those learn grammar by putting words in their context in sentences.
Name the three analogies for grammar|word|sentence in pLMs? 3 pts
grammar: protein structure, protein class
word: amino acid
sentence: protein sequence
What problem do pLMs address? What is the meaning of embeddings from pLMs? 3 pts.
pLMs address the problems:
Protein Structure Prediction
Protein-Protein Interaction Prediction
Protein Sequence Analysis
Meaning of embeddings:
the pLM itself is trained to predict amino acids from their context (input: amino acid sequence of length L, output: the next/masked amino acid)
embeddings are the hidden-layer representations extracted from the trained pLM: context-aware vectors that transfer what the model learned to other prediction tasks
What is the difference between per-residue and per-protein embeddings?
per-residue embeddings: one vector for each residue in the protein -> leads to a matrix representation of sequence
goal: focus on capturing the local interactions and context of each residue in the protein sequence
useful for tasks that require residue-level predictions (e.g. secondary structure prediction, solvent accessibility prediction, or contact prediction)
per-protein embeddings: one vector for the entire protein -> leads to a vector representation of the sequence
goal: capture the global features and overall characteristics of the protein
useful for protein classification, function prediction, or protein-protein interaction prediction (focus on the entire protein rather than individual residues); see the pooling sketch below
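A minimal sketch of how per-protein embeddings are typically derived from per-residue ones by mean-pooling (the random matrix stands in for real pLM output; d = 1024 matches e.g. ProtT5):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for pLM output: one vector per residue -> (L x d) matrix.
L, d = 120, 1024
per_residue = rng.normal(size=(L, d))

# Per-protein embedding: mean-pool over the length dimension -> a fixed-size
# vector, independent of L (handy for per-protein tasks like localization).
per_protein = per_residue.mean(axis=0)
print(per_residue.shape, per_protein.shape)   # (120, 1024) (1024,)
```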
Bonus question: pLMs originate from CNNs that predict sequences from sequences. Does it matter whether those are over-trained or not?
Explain in <5 bullet points/short sentences. 4 pts
avoiding over-training is crucial for pLMs to ensure:
generalizability
robustness
unbiased predictions in protein sequence analysis
to prevent over-training and achieve reliable and accurate predictions: -> use
regularization techniques
appropriate dataset selection
careful training procedures
Protein Language Models (pLMs) generate embeddings that are used as input to methods predicting protein secondary structure. Speculate why those methods reach the performance of MSA- based (multiple sequence alignment) methods (≤3 reasons; 3 pts).
embeddings contain rich representations of amino acids in context, including patterns and relationships relevant for secondary structure prediction
although learned from sequences alone, they appear to implicitly capture structural features and evolutionary conservation
just like MSA-based methods, they can thus exploit evolutionary information and sequence conservation patterns
pre-training on huge unlabeled databases transfers knowledge that improves prediction performance
Describe one way to test whether or not pLM-based methods capture evolutionary information (≤3 bullets; 3 pts).
Comparison with MSA-based methods, since those explicitly include evolutionary information:
Compare the performance of pLM-based methods with traditional MSA-based methods on similar datasets.
If the pLM-based methods demonstrate comparable or superior performance, it suggests that the pLM is incorporating evolutionary information effectively.
How can pLM-based protein prediction save energy/ resources (≤2 bullets/sentences; 2 pts)?
no expensive per-protein database search/MSA generation is needed: one (parallelizable) forward pass yields the embedding -> reduced computational power and runtime
pLMs allow efficient follow-up work with the prediction results:
predicted regions of interest or functional sites let researchers focus only on those specific areas instead of examining the whole protein
Bonus question: are larger pLMs guaranteed to outperform smaller ones (Y/N and argue; ≤3 bullets; 3 pts).
NO! not in all scenarios
performance also depends on the quality and quantity of training and test data
larger models might be over-fitted
and they require large computational power
How can we profit from pLMs for protein prediction?
able to use (leverage) large, unlabeled datasets
find new representations automatically (data-driven) even for domains which were hard to formalize (NLP = Natural Language Processing, CB = Computational Biology)
outperform handcrafted features in many tasks
CONs of pLMs
hard to interpret (less true for CV, more true for CB)
pre-training is computationally expensive
bias from databases (skin color, sex, religion or in CB: model organisms) is picked up by the model