What is a domain in a protein?
= 3D unit that folds on its own, without other parts of the protein
Define protein structure & protein function (each max 3 bullets) 3 pts
Protein Structure:
amino acid sequence defines structure
= the shape into which a protein folds in its natural, 'unbothered' (native) state
-> structure defines function
there are 4 levels of protein structure with different levels of complexity
primary (amino acid sequence),
secondary (alpha helix, beta strand, etc.),
tertiary (3D fold of a single chain, stabilized by bonds),
quaternary (assembly of several chains/subunits)
Protein Function:
diverse: chemical, biochemical, cellular, developmental, physiological, genetic
depending on the perspective: the function of a protein is anything that happens to or through a protein
Protein structure in 1D, 2D, 3D: describe (each max 1 bullet) 3 pts
1D: amino acid sequence (e.g. MAYT…) or string of secondary structure states
-> (note: 1D-representable as a string of states, e.g. HHCCEEH… with H = helix, E = strand, C = coil; '2D' does NOT mean secondary structure)
2D: inter-residue distances, visualized as a matrix (distance/contact map; sketch below)
3D: coordinates of protein structure and visualization
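To make the 1D/2D/3D distinction concrete, here is a minimal sketch (assuming numpy; the coordinates are invented toy values, not taken from a real PDB file) deriving the 2D distance map and a contact map from 3D coordinates:

```python
import numpy as np

# Toy 3D coordinates (think C-alpha atoms) for a 5-residue protein.
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.5, 0.0],
    [9.0, 3.5, 1.0],
    [7.0, 6.5, 2.0],
])

# 2D representation: pairwise inter-residue distances (L x L matrix).
diff = coords[:, None, :] - coords[None, :, :]
dist_map = np.sqrt((diff ** 2).sum(axis=-1))

# A contact map thresholds the distance map (8 Angstrom is a common cutoff).
contact_map = dist_map < 8.0
print(np.round(dist_map, 1))
```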
Speculate: Why do most proteins have more than one structural domain?
enhanced functionality:
different domains within the protein can perform multiple tasks at the same time or sequentially
e.g. a binding domain and a cutting domain (SARS-CoV-2), or PPIs (protein-protein interactions) with different binding partners for the domains of the bound proteins
modular architecture allows evolutionary flexibility:
domains can evolve independently and be combined resulting in new functionality
enhanced stability: multiple independently folding domains can stabilize the overall protein
Experimental high-resolution structures are known for fewer than 10% of the ~20k human proteins. Give three reasons (3 bullets) 3pts.
Cost
-> resource limitations
Functional complexity & variability
-> leads to ‘Technical challenges’ depending on the chosen methods
further hindrances can be:
protein size
presence of transmembrane regions
presence of disorder or susceptibility to conformational change
time-consuming:
e.g. the protein must be produced in sufficient quantities and purified
Most methods predicting secondary structure in 3 states (helix, strand, other) predict strands much worse than helix.
Comment in fewer than 5 bullet points (telegram style; 5 pts).
much more data available on helices than on strands
sometimes leading to a larger proportion of helix data in the training data
this imbalance leads to helices being predicted better
balancing the classes with over-/undersampling makes both classes similarly well predicted; here, strands would have to be oversampled (see the sketch below)
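A minimal sketch of the oversampling idea (assuming numpy; labels and features are synthetic stand-ins for real residue data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: 0 = helix, 1 = strand (minority), 2 = other.
y = rng.choice([0, 1, 2], size=1000, p=[0.35, 0.20, 0.45])
X = rng.normal(size=(1000, 20))        # placeholder per-residue features

# Oversample strands (class 1) until they match the largest class.
counts = np.bincount(y)
extra = rng.choice(np.where(y == 1)[0], size=counts.max() - counts[1], replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y), "->", np.bincount(y_bal))   # strand count now equals the max
```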
Helices are stabilized by hydrogen-bond formation, typically between residues i and i+4.
Does this mean secondary structure is a 2D feature of structure? (≤2 bullets/sentences; 2 pts)
No, it is a 1D feature as the states can be represented as a sequence string
2D would be a distance/contact map (not the Ramachandran plot, which shows backbone angles); the i→i+4 hydrogen bond is local in sequence, so the string of states (1D) already encodes it
Q3 measures the percentage of residues correctly predicted in one of 3 states (H: helix, E: strand, O: other). Methods predicting H|E|O equally well may have a lower Q3 than those predicting O best.
Why? (1 bullet, 2 pts).
the Q3 score is plain per-residue accuracy and ignores class imbalance
-> O ('other') is by far the most frequent state (roughly half of all residues)
-> a method biased towards O therefore gets more residues correct in total (same denominator: all residues), even if H and E are predicted poorly (toy example below)
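A toy calculation illustrating the effect (hypothetical predictions; assumes numpy):

```python
import numpy as np

def q3(y_true, y_pred):
    """Q3 = fraction of all residues whose state (H/E/O) is predicted correctly."""
    return (np.asarray(y_true) == np.asarray(y_pred)).mean()

# Toy case: O dominates (6 of 10 residues), as it roughly does in real proteins.
y_true = list("OOOOOOHHEE")
pred_a = list("OOOEEEHOEO")   # predicts every class equally well (50% each)
pred_b = list("OOOOOOOOOO")   # 'predicts O best' (here: only O)

print(q3(y_true, pred_a))     # 0.5 -- balanced method
print(q3(y_true, pred_b))     # 0.6 -- O-biased method scores HIGHER
```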
Predicting secondary structure through one-hot-encoding implies that each residue position is represented by 21 binary units. Why not 20? What else could you do?
Speculate why your alternative has not been used in existing methods (5 pts).
Why not 20?
because the sliding window around a residue can extend beyond the protein's ends: the 21st unit marks such an 'empty' position at the N-/C-terminus
Alternative:
integer encoding: problem, it imposes an artificial numeric order, i.e. the amino acids are not encoded independently of each other
Why doesn’t it exist:
One hot encoding preserves categorical independence: it ensures that the encoded features for different amino acids are orthogonal or independent of each other.
Each amino acid is represented by a separate binary unit, allowing machine learning algorithms to treat each amino acid equally and avoid any unintended relationships or biases that could be introduced by integer encoding.
one-hot encoding is also simple to compute and natively suited to categorical data (a minimal encoding sketch follows below)
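A minimal sketch of the 21-unit window encoding (assuming numpy; the window size of 13 is an assumption for illustration, real methods vary):

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"              # 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AAS)}

def one_hot_window(seq, center, win=13):
    """Encode a window around `center` with 21 binary units per position;
    unit 20 marks positions outside the sequence (window over the termini)."""
    half = win // 2
    out = np.zeros((win, 21))
    for k, pos in enumerate(range(center - half, center + half + 1)):
        if 0 <= pos < len(seq) and seq[pos] in AA_INDEX:
            out[k, AA_INDEX[seq[pos]]] = 1.0
        else:
            out[k, 20] = 1.0              # the 'empty'/unknown slot
    return out.ravel()                    # fixed length: 13 * 21 = 273

vec = one_hot_window("MAYTKLVQE", center=0)  # window hangs over the N-terminus
print(vec.shape, int(vec[20::21].sum()))     # (273,) and 6 positions use unit 20
```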
Why do methods using evolutionary information perform better than those using one-hot encoding (≤2 bullets/sentences; 2 pts).
evolution conserves the regions critical for the correct structure: if important residues had been changed, the protein would not have survived
using this information reduces noise and focuses on the residues relevant for structure (compared to just using the raw amino acids in a nearby window)
What do you need to consider in the comparison of methods predicting, e.g. secondary structure (you get a table and have to choose the number of digits, rank methods, note issues, and so forth)?
accuracy metrics (and error rates): Q3 score, per-class Q_i scores with i ∈ {H, E, O}
Ensure that the methods are evaluated on the same dataset or comparable datasets to make fair comparisons.
Number of digits: report only as many digits as the estimated error (e.g. standard error over test proteins) justifies -> depends on the context and the available data
THE breakthrough in protein prediction originated from using evolutionary information (originally in 1992 to predict secondary structure).
Where do you get evolutionary information from (≤2 bullets/sentences; 2 pts).
multiple sequence alignments (MSAs), built by searching sequence databases such as UniProt with tools like BLAST/PSI-BLAST (a toy profile calculation follows below)
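A toy illustration of turning an MSA into a sequence profile, the simplest form of evolutionary information (the alignment is invented; real profiles/PSSMs additionally use pseudocounts and background frequencies):

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"

# Toy MSA: rows = homologous sequences, columns = aligned positions.
msa = ["MKVLA",
       "MKILA",
       "MRVLG",
       "MKVLA"]

def profile(msa):
    """Per-column amino-acid frequencies (a simple sequence profile)."""
    prof = np.zeros((len(msa[0]), 20))
    for seq in msa:
        for j, aa in enumerate(seq):
            prof[j, AAS.index(aa)] += 1
    return prof / len(msa)

p = profile(msa)
print(np.round(p[1], 2))   # column 2: K in 3/4 sequences, R in 1/4 -> conservation
```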
What should you do if your score on the test set is better than the score on the training set?
in theory the model cannot perform better on unseen data than on the data it was fitted to: the training score is the expected upper bound
a better test score therefore signals a major problem, e.g. leakage/redundancy between training and test set, or a too small/easy test set -> re-check the split (and redundancy reduction); in the worst case start from scratch
What are the major improvements/breakthroughs of AlphaFold?
higher accuracy due to the following improvements:
it also predicts the quality of its predictions
AlphaFold2: faster and more detailed (atomic-level accuracy)
AlphaFold incorporates evolutionary information derived from multiple sequence alignments (MSAs) to enhance its predictions
not one-hot encoding, but a weighted vector (how likely each amino acid is at each position)
the more conserved an amino acid is over evolution, the higher the weight!
pairs of amino acids that change together ('evolutionary couplings') tend to be close in space -> predict the contact map! (toy coupling calculation below)
Much computational power and data needed!
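A toy sketch of the 'evolutionary couplings' idea using mutual information between MSA columns (the alignment is invented; real methods such as direct-coupling analysis are far more sophisticated):

```python
import numpy as np
from collections import Counter

# Toy MSA in which columns 1 and 3 co-vary (K<->E, R<->D), mimicking a contact.
msa = ["MKVEA",
       "MKIEA",
       "MRVDG",
       "MRLDA"]

def column(msa, j):
    return [s[j] for s in msa]

def mutual_information(msa, i, j):
    """MI between alignment columns i and j; high MI ~ evolutionary coupling."""
    n = len(msa)
    pi, pj = Counter(column(msa, i)), Counter(column(msa, j))
    pij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

print(mutual_information(msa, 1, 3))   # coupled columns -> 1.0 bit
print(mutual_information(msa, 0, 3))   # fully conserved column 0 -> 0.0
```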
Describe an AI (Artificial Intelligence)/ML (machine learning) method that predicts sub-cellular location (or cellular compartment) in three classes (c: cytoplasmic, e: extra-cellular, n: nuclear).
Make sure to explain how to cope with the fact that proteins have different lengths and AI/ML models need fixed input. 6 pts.
Input: Use the amino acid composition of the protein and its size
how to cope with different lengths as input: use length-independent global features, e.g. the amino-acid composition (20 frequencies) plus the protein length -> a fixed-size vector that can be fed into the AI/ML model (see the sketch after this list)
neural network
SVM: support vector machine (works well with less training data)
but an SVM only separates 2 classes
for 3 or more classes: either a two-level/hierarchical SVM or one-vs-rest with confidence values
hierarchical classification (following cellular pathways); problem: the hierarchical ORDER (e.g. nucleus -> yes/no, next compartment -> yes/no, …) influences the result and errors propagate -> the most reliable (trivial) classification should come first
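A minimal end-to-end sketch (assuming numpy and scikit-learn; the sequences, class-specific composition biases, and SVC settings are illustrative stand-ins for real labelled data):

```python
import numpy as np
from sklearn.svm import SVC

AAS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)

def composition(seq):
    """Fixed-size input regardless of protein length:
    20 amino-acid frequencies plus a (scaled) length feature."""
    freq = np.array([seq.count(aa) for aa in AAS], float) / len(seq)
    return np.append(freq, len(seq) / 1000.0)

def fake_protein(bias_aa, n):
    """Synthetic stand-in for a real protein of one location class."""
    p = np.full(20, 1.0)
    p[AAS.index(bias_aa)] = 8.0            # class-specific composition bias
    return "".join(rng.choice(list(AAS), size=n, p=p / p.sum()))

X, y = [], []
for label, bias in [("c", "L"), ("e", "C"), ("n", "K")]:   # e.g. nuclear: K-rich
    for _ in range(30):
        X.append(composition(fake_protein(bias, int(rng.integers(80, 400)))))
        y.append(label)

# An SVM separates 2 classes; sklearn's SVC handles 3 via one-vs-one internally.
clf = SVC(kernel="rbf").fit(np.array(X), y)
print(clf.predict([composition(fake_protein("K", 200))]))  # -> ['n']
```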
How can sequencing mistakes challenge per-protein predictions? 3 pts.
frameshifts (e.g. caused by an insertion/deletion) and amino-acid changes (due to substitutions or InDels) result in altered sequences and hence wrong predictions
they can impact the identification and characterization of functional domains, motifs, or important features within a protein sequence -> errors in these regions can mislead predictions of protein function, subcellular localization, or interaction partners
You want to develop a method that predicts binding residues (e.g. enzymatic activity and DNA-binding). Your entire data set of proteins with experimentally known binding residues amounts to 500 sequence-unique proteins with a total of 5,000 binding and 45,000 non-binding residues. You can only use a simple artificial neural network (of the feed-forward style) with one hidden layer, but the complexity of your problem demands at least 100 hidden units. Thus, even a simple one-hot encoding (or evolutionary information) for a single residue with 20 units is not supported by the data.
Explain why. What could you do, instead?
Why: the limited data cannot constrain the number of free parameters: a window of w residues × 20 units × 100 hidden units means tens of thousands of weights (e.g. w = 13 -> ~26,000), yet only 5,000 binding residues are available (see the calculation below)
Instead:
SVM (support vector machine) works better with less input data
utilize sequence-based features that capture evolutionary information -> MSAs or PSSMs (position-specific scoring matrices) can be used to represent the residues
these capture the conservation patterns across related protein sequences, providing more informative and robust features than one-hot encoding
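A back-of-the-envelope calculation of why the data cannot support such a network (the window size of 13 is an assumption for illustration):

```python
# Free parameters vs. available positives for the binding-residue network.
window = 13                              # assumed input window size
inputs = window * 20                     # one-hot: 260 input units
hidden = 100
weights = inputs * hidden + hidden * 1   # input->hidden plus hidden->output
positives = 5_000
print(weights)                           # 26,100 free parameters
print(positives / weights)               # ~0.19 positive examples per parameter
```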
What is redundancy reduction? How do you do it?
redundancy reduction should happen without loss of information
too similar data (nearly duplicates) is removed
hard to define what to remove in practice
How: choose a similarity threshold suited to the problem (e.g. <40% pairwise sequence identity) and remove every sequence already covered by one kept in the set (e.g. keep one representative per protein family); a greedy sketch follows below
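A greedy sketch of the idea (difflib's ratio is only a rough stand-in for pairwise sequence identity; real pipelines use tools like CD-HIT or MMseqs2):

```python
from difflib import SequenceMatcher

def identity(a, b):
    """Rough pairwise similarity between two sequences (toy proxy)."""
    return SequenceMatcher(None, a, b).ratio()

def reduce_redundancy(seqs, threshold=0.4):
    """Greedy: keep a sequence only if it is below `threshold` similarity
    to everything already kept -> near-duplicates are removed."""
    kept = []
    for s in seqs:
        if all(identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

seqs = ["MKVLAEQT", "MKVLAEQS", "GGHPRWYF"]   # first two are near-duplicates
print(reduce_redundancy(seqs))                # -> ['MKVLAEQT', 'GGHPRWYF']
```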
Protein Language Models (pLMs) copy models from Natural Language Processing (NLP). Those learn grammar by putting words in their context in sentences.
Name the three analogies for grammar|word|sentence in pLMs? 3 pts
grammar: protein structure, protein class
word: amino acid
sentence: protein sequence
What problem do pLMs address? What is the meaning of embeddings from pLMs? 3 pts.
pLMs address the problems:
Protein Structure Prediction
Protein-Protein Interaction Prediction
Protein Sequence Analysis
Meaning of embeddings:
the pLM itself is trained to predict amino acids from their context (input: amino acid sequence of length L, output: the next/masked amino acid)
embeddings are the hidden-layer representations extracted from the trained pLM: context-aware vectors that transfer what the model learned to other prediction tasks
What is the difference between per-residue and per-protein embeddings?
per-residue embeddings: one vector for each residue in the protein -> leads to a matrix representation of sequence
goal: focus on capturing the local interactions and context of each residue in the protein sequence
useful for tasks that require residue-level predictions (e.g. secondary structure prediction, solvent accessibility prediction, or contact prediction)
per-protein embeddings: one vector for the entire protein -> leads to a vector representation of the sequence
goal: capture the global features and overall characteristics of the protein
useful for protein classification, function prediction, or protein-protein interaction prediction (focus on the entire protein rather than individual residues); see the pooling sketch below
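A minimal sketch of how per-protein embeddings are typically derived from per-residue ones by mean-pooling (the random matrix stands in for real pLM output; d = 1024 matches e.g. ProtT5):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for pLM output: one vector per residue -> (L x d) matrix.
L, d = 120, 1024
per_residue = rng.normal(size=(L, d))

# Per-protein embedding: mean-pool over the length dimension -> a fixed-size
# vector, independent of L (handy for per-protein tasks like localization).
per_protein = per_residue.mean(axis=0)
print(per_residue.shape, per_protein.shape)   # (120, 1024) (1024,)
```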
Bonus question: pLMs originate from CNNs that predict sequences from sequences. Does it matter whether those are over-trained or not?
Explain in <5 bullet points/short sentences. 4 pts
avoiding over-training is crucial for pLMs to ensure:
generalizability
robustness
unbiased predictions in protein sequence analysis
to prevent over-training and achieve reliable and accurate predictions: -> use
regularization techniques
appropriate dataset selection
careful training procedures
Protein Language Models (pLMs) generate embeddings that are used as input to methods predicting protein secondary structure. Speculate why those methods reach the performance of MSA- based (multiple sequence alignment) methods (≤3 reasons; 3 pts).
embeddings contain rich representations of amino acids in context, including patterns and relationships relevant for secondary structure prediction
although learned from sequences alone, they appear to implicitly capture structural features and evolutionary conservation
just like MSA-based methods, they can thus exploit evolutionary information and sequence conservation patterns
pre-training on huge unlabeled databases transfers knowledge that improves prediction performance
Describe one way to test whether or not pLM-based methods capture evolutionary information (≤3 bullets; 3 pts).
Comparison with MSA-based methods, since those explicitly include evolutionary information:
Compare the performance of pLM-based methods with traditional MSA-based methods on similar datasets.
If the pLM-based methods demonstrate comparable or superior performance, it suggests that the pLM is incorporating evolutionary information effectively.
How can pLM-based protein prediction save energy/ resources (≤2 bullets/sentences; 2 pts)?
no expensive per-protein database search/MSA generation is needed: one (parallelizable) forward pass yields the embedding -> reduced computational power and runtime
pLMs allow efficient follow-up work with the prediction results:
predicted regions of interest or functional sites let researchers focus only on those specific areas instead of examining the whole protein
Bonus question: are larger pLMs guaranteed to outperform smaller ones (Y/N and argue; ≤3 bullets; 3 pts).
NO! not in all scenarios
performance also depends on the quality and quantity of training and test data
larger models might be over-fitted
and they require large computational power
How can we profit from pLMs for protein prediction?
able to use (leverage) large, unlabeled datasets
find new representations automatically (data-driven) even for domains which were hard to formalize (NLP = Natural Language Processing, CB = Computational Biology)
outperform handcrafted features in many tasks
CONs of pLMs
hard to interpret (less true for CV, more true for CB)
pre-training is computationally expensive
bias from databases (skin color, sex, religion or in CB: model organisms) is picked up by the model