What is a domain in a protein?
A 3D unit that folds on its own
Define protein structure & protein function (each max 3 bullets) 3 pts
protein structure
3D arrangement
determined by the amino acid sequence, various forces and interactions between amino acids.
different categories: primary, secondary, tertiary, and quaternary structures.
protein function
specific role or activity carried out by a protein
determined by the protein's shape, active sites, binding capabilities, and interaction with other molecules.
key answers
sequence determines structure; protein structure is the shape into which a protein folds in its native, unperturbed state
structure determines function
Protein structure in 1D, 2D, 3D: describe (each max 1 bullet) 3pts
1D: sequence
2D: local folding patterns, such as alpha helices and beta sheets, within the protein chain
3D: three-dimensional arrangement of the protein molecule
Speculate: why do most proteins have more than one structural domain? Experimental high-resolution structures are known for fewer than 10% of the ~20k human proteins. Give three reasons (3 bullets)
Functional versatility:
multiple domains => more (complex) functions
Regulation and flexibility:
multiple domains allow for conformational changes => respond to environmental changes / switch between different functional states
Evolutionary adaptation: Proteins with multiple domains may have evolved through gene duplication, fusion, or recombination events. This evolutionary process allows for the creation of new protein functions by combining existing domains in different arrangements. Having multiple domains provides the opportunity for proteins to adapt and evolve in response to selective pressures, leading to increased functional diversity in biological systems.
Note: The limited availability of experimental high-resolution structures for human proteins is primarily due to technical challenges, time, and resource limitations associated with protein structure determination methods such as X-ray crystallography and cryo-electron microscopy.
Most methods predicting secondary structure in 3 states (helix, strand, other) predict strands much worse than helix. Comment in fewer than 5 bullet points (telegram style; 5 pts).
strands have greater structural diversity
various conformations => prediction more complex
Helices have more defined and recurring patterns => better prediction based on empirical rules and statistical methods.
More data available for helices => imbalanced dataset
balancing / redundancy reduction makes prediction quality more similar
Helices are stabilized by hydrogen-bond formation, typically between residues i and i+4. Does this mean secondary structure is a 2D feature of structure? (≤2 bullets/sentences)
No.
While hydrogen bonds play a crucial role in stabilizing helices, secondary structure encompasses the local folding patterns throughout the protein chain.
These folding patterns extend in three dimensions, contributing to the overall three-dimensional structure of the protein.
Q3 measures the percentage of residues correctly predicted in one of 3 states (H: helix, E: strand, O: other). Methods predicting H|E|O equally well may have a lower Q3 than those predicting O best. Why? (1 bullet)
Class imbalance: O is by far the most frequent state, so the per-residue accuracy Q3 is dominated by O => a method that excels on O reaches a higher Q3 than one performing equally well on all three states
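A toy calculation illustrates this point (the class counts and per-class accuracies below are made-up numbers, chosen only to show the effect): because O dominates the residue count, a method biased towards O can score a higher Q3 despite being worse on H and E.

```python
# Q3 = fraction of residues whose 3-state label (H/E/O) is predicted correctly.
counts = {"H": 300, "E": 100, "O": 600}  # imbalanced class sizes (hypothetical)

# per-class accuracies of two hypothetical methods
balanced = {"H": 0.75, "E": 0.75, "O": 0.75}  # equally good on all states
o_biased = {"H": 0.60, "E": 0.50, "O": 0.95}  # best on the majority class O

def q3(per_class_acc, counts):
    total = sum(counts.values())
    correct = sum(per_class_acc[s] * counts[s] for s in counts)
    return correct / total

print(f"balanced method Q3: {q3(balanced, counts):.2f}")  # 0.75
print(f"O-biased method Q3: {q3(o_biased, counts):.2f}")  # 0.80, despite worse H/E
```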
Predicting secondary structure through one-hot-encoding implies that each residue position is represented by 21 binary units. Why not 20? What else could you do? Speculate why your alternative has not been used in existing methods (5 pts).
21st unit marks special positions: window overhang at the N-/C-terminus, unknown or non-standard residues
Alternative approach: only use 20 binary units
existing methods, datasets, and evaluation pipelines are built around the 21-unit encoding scheme
performance loss <=> without the 21st unit, termini and unknown residues cannot be flagged
the 21st unit lets the model represent such special positions explicitly, improving its ability to generalize to different protein sequences
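A minimal sketch of such a 21-unit one-hot encoding (the alphabet ordering and the use of 'X' as the catch-all symbol are illustrative choices, not a fixed standard):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
ALPHABET = AMINO_ACIDS + "X"          # 21st unit: unknown/non-standard/terminus

def one_hot(seq):
    """Encode a sequence as a (length, 21) matrix; anything not among the
    20 standard residues maps to the 21st ('X') unit."""
    idx = {aa: i for i, aa in enumerate(ALPHABET)}
    enc = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        enc[pos, idx.get(aa, idx["X"])] = 1.0
    return enc

enc = one_hot("MKTX")  # 'X' (and any non-standard letter) hits unit 21
print(enc.shape)       # (4, 21)
```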
Why do methods using evolutionary information perform better than those using one-hot encoding (≤2 bullets/sentences; 2 pts).
Capturing residue correlations:
Evolutionary information reflects the co-evolutionary patterns among residues => better accuracy
Incorporating sequence conservation:
considering conserved residues => methods can better discriminate between different secondary structure elements => improve prediction accuracy.
What do you need to consider in the comparison of methods predicting, e.g. secondary structure (you get a table and have to choose the number of digits, rank methods, note issues, and so forth). THE breakthrough in protein prediction originated from using evolutionary information (originally in 1992 to predict secondary structure). Where do you get evolutionary information from (≤2 bullets/sentences).
error rate, confidence intervals / standard errors, whether score differences between methods are statistically significant
Databases of homologous sequences (e.g. UniProt)
Multiple sequence alignments (from search tools like BLAST/PSI-BLAST)
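In its simplest form, evolutionary information is a per-column frequency profile computed from an MSA. The sketch below uses a tiny hand-made toy alignment (the sequences are invented for illustration; real profiles come from database searches with tools like PSI-BLAST):

```python
from collections import Counter

# Toy MSA: four aligned, hypothetical sequences of equal length.
msa = ["MKVL", "MKIL", "MRVL", "MKVF"]

def profile(msa):
    """Per-column amino-acid frequencies: the simplest form of
    evolutionary information fed to secondary-structure predictors."""
    ncol = len(msa[0])
    prof = []
    for col in range(ncol):
        counts = Counter(seq[col] for seq in msa)
        total = sum(counts.values())
        prof.append({aa: c / total for aa, c in counts.items()})
    return prof

prof = profile(msa)
print(prof[1])  # column 2 conservation pattern: {'K': 0.75, 'R': 0.25}
```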
Describe an AI (Artificial Intelligence)/ML (machine learning) method that predicts sub-cellular location (or cellular compartment) in three classes (c: cytoplasmic, e: extra-cellular, n: nuclear). Make sure to explain how to cope with the fact that proteins have different lengths and AI/ML models need fixed input.
train one binary classifier per sub-cellular location -> combine into one prediction -> the strongest of the three outputs gives the sub-cellular location
Cope w/ different lengths: amino acid composition as input (use of intrinsic signals that govern the transport and localization in the cell)
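The composition trick above can be sketched in a few lines: proteins of any length map to one fixed-size vector of amino-acid frequencies (a minimal sketch; real methods often add further global features).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Map a protein of any length to a fixed 20-dim vector of
    amino-acid frequencies: a length-independent per-protein input."""
    counts = Counter(seq)
    n = len(seq)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

# Two proteins of very different lengths yield vectors of identical size:
v1 = composition("MKV")
v2 = composition("MKVLWAALLVTFLAGCQA")
print(len(v1), len(v2))  # 20 20
```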
How can sequencing mistakes challenge per-protein predictions? 3pts
False amino acid sequence => inaccuracies in subsequent analysis and prediction methods that rely on the sequence information.
Disrupted functional motifs => affect predictions related to protein folding, interaction partners, or sub-cellular localization.
Misinterpretation of evolutionary information =>impact methods that rely on evolutionary information for predictions, such as homology-based approaches or sequence alignments.
You want to develop a method that predicts binding residues (e.g. enzymatic activity and DNA-binding). Your entire data set of proteins with experimentally known binding residues amounts to 500 sequence- unique proteins with a total of 5,000 binding and 45,000 non-binding residues. You can only use a simple artificial neural network (of the feed-forward style) with one hidden layer, but the complexity of your problem demands at least 100 hidden units. Thus, even a simple one- hot encoding (or evolutionary information) for a single residue with 20 units is not supported by the data. Explain why. What could you do, instead?
number of available positive examples (binding residues) is significantly smaller compared to the number of negative examples (non-binding residues).
use feature representation techniques that capture relevant information beyond the one-hot encoding of individual residues: Local sequence-based features, Structural features, Incorporate evolutionary information
Why: with 100 hidden units and 20 input units per residue (times the window size), the network has far more free parameters than the 5,000 positive examples can constrain => over-fitting
Instead: group amino acids with similar properties (e.g. acidic, positively charged, hydrophobic) into classes => fewer input units per residue
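A back-of-envelope count makes the problem concrete (the window size of 13 and the six-class grouping below are hypothetical choices for illustration, not a standard):

```python
# Free parameters of a one-hidden-layer feed-forward net vs. the
# 5,000 positive residues available in the data set.
HIDDEN, POSITIVES, WINDOW = 100, 5_000, 13

def n_weights(units_per_residue):
    inputs = WINDOW * units_per_residue
    # input->hidden weights + hidden biases + hidden->output weights + output bias
    return inputs * HIDDEN + HIDDEN + HIDDEN * 1 + 1

print(n_weights(20))  # 26201 weights, far more than 5,000 positives

# Alternative: a reduced alphabet grouping residues by physico-chemical
# class (this particular grouping is one plausible choice, not canonical).
GROUPS = ["AVLIMC", "FWY", "STNQ", "KRH", "DE", "GP"]

def group_index(aa):
    for i, g in enumerate(GROUPS):
        if aa in g:
            return i
    return len(GROUPS)  # extra index for unknown residues

print(n_weights(len(GROUPS) + 1))  # far fewer weights per window
```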
Protein Language Models (pLMs) copy models from Natural Language Processing (NLP). Those learn grammar by putting words in their context in sentences. Name the three analogies for grammar|word|sentence in pLMs? 3 pts
What problem do pLMs address? What is the meaning of embeddings from pLMs? 3 pts.
Name the three analogies for grammar|word|sentence in pLMs
Amino Acid Ordering
Amino Acid Residues
Protein Sequence
Addressed problem
understanding and generation of meaningful representations of protein sequences
Meaning of embeddings
embeddings encode the contextual and semantic information of proteins, allowing for similarity comparisons, clustering, and downstream analysis tasks.
low-dimensional representation of protein sequences
How can we profit from pLMs for protein prediction?
Improved Accuracy
Broad Applicability
Accelerated Research
What is the difference between per-residue and per- protein embeddings?
per-residue
one vector for each individual residue
per-protein
one vector for the whole protein
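The relation between the two is commonly just pooling: averaging the per-residue vectors yields one fixed-size per-protein vector (a sketch with random stand-in numbers; real pLM embeddings have ~1024 dimensions).

```python
import numpy as np

# Toy per-residue embeddings standing in for real pLM output:
L, d = 7, 4                       # 7 residues, 4-dim embedding (illustrative)
per_residue = np.random.rand(L, d)

# Per-protein embedding via mean-pooling over the residue dimension:
per_protein = per_residue.mean(axis=0)

print(per_residue.shape)  # (7, 4): one vector per residue
print(per_protein.shape)  # (4,):   one vector for the whole protein
```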
pLMs originate from CNNs that predict sequences from sequences. Does it matter whether those are over-trained or not? Explain in <5 bullet points/short sentences.
Lack of generalization
Better performance on sequences similar to those in training
Loss of robustness to noise
Bias towards specific sequence motifs or patterns present in the training data
Protein Language Models (pLMs) generate embeddings that are used as input to methods predicting protein secondary structure. Speculate why those methods reach the performance of MSA-based (multiple sequence alignment) methods (≤3 reasons; 3 pts).
Capturing Long-Range Dependencies: pLMs capture long-range dependencies in protein sequences => identify global structural patterns
Learning from Large-Scale Data: pLMs are trained on large-scale protein sequence data => learn from a diverse range of sequences => capture common patterns and generalize well to unseen protein sequences, leading to competitive performance in secondary structure prediction.
Implicit Integration of Evolutionary Information: pLMs implicitly learn from evolutionary information encoded in the protein sequences during training => capture evolutionary constraints and sequence conservation, which are crucial for accurate secondary structure prediction. This enables pLMs to achieve comparable performance to MSA-based methods that explicitly utilize evolutionary information.
Describe one way to test whether or not pLM-based methods capture evolutionary information (≤3 bullets)
Conduct Transfer Learning
Compare with Evolutionary Methods
Analyze Embeddings
How can pLM-based protein prediction save energy/ resources (≤2 bullets)
Higher Computational Efficiency
After training, the computational power needed for inference is low (no per-query database search)
Reduce Experimental Work needed
FAIR?
Findable
Accessible
Interoperable
Reusable
Are larger pLMs guaranteed to outperform smaller ones (Y/N and argue; ≤3 bullets; 3 pts)
No.
model size alone is not the determining factor
training data and training time are at least as relevant