Some proteins have very similar sequences but different functions. Why can that be?
Multi-domain problem: a protein can consist of many domains that each have a different function. Annotation might be based on only one of those domains.
A change of a single amino acid (AA) can change function -> can lead to a disease
Moonlighting proteins: different functions under different conditions (different tissues, different pH levels, …)
Paralogy: Similar sequence but different protein function
Binding of cofactor
What are the problems with using function annotations?
Hard to define function (even for the same protein)
Ligands used for PDB are not identical to cognate ligands
wrong annotation due to multi-domain problem
Gene Ontology: not enough annotations -> small classes -> possible overfitting
Transient protein-protein interactions of different proteins:
Why “physical”?
Why “transient”?
What is meant by “different” proteins?
What is the difference between PP interactions and associations?
physical: contact between two proteins (<6.5 Å)
transient: Only temporary interactions count as PPI
different: No homodimers
Difference between PPI and association:
PPI: physical interaction
Association: non-physical -> e.g., coexpression of different proteins
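The "physical" criterion above can be sketched as a simple distance check. This is only an illustration: the coordinates are made up, the function name `in_contact` is hypothetical, and whether the 6.5 Å cutoff is applied to all atoms or only to selected atoms is an assumption not stated in these notes.

```python
import math

CONTACT_CUTOFF = 6.5  # Å, the cutoff quoted in the card above

def in_contact(atoms_a, atoms_b, cutoff=CONTACT_CUTOFF):
    """Return True if any atom pair across the two proteins is closer than cutoff.

    ASSUMPTION: every listed atom counts; real definitions may restrict
    the check to specific atoms (e.g., only heavy atoms).
    """
    return any(math.dist(a, b) < cutoff for a in atoms_a for b in atoms_b)

# Toy coordinates in Å (hypothetical, not from any real structure)
protein_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
protein_b = [(6.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
protein_c = [(30.0, 0.0, 0.0)]

print(in_contact(protein_a, protein_b))  # closest pair is 4.5 Å apart
print(in_contact(protein_a, protein_c))  # closest pair is 28.5 Å apart
```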
What are the 3 classes of Gene Ontology?
Biological process (signal transduction)
Molecular function (enzyme)
Cellular component (nuclear matrix)
Why is the data in molecular biology biased?
You can reduce redundancy using some threshold T. How to define T?
Biased:
sequence redundancy
Some things are easier to test experimentally (some model organisms or prokaryotes easier than eukaryotes)
Some things are more interesting to study experimentally
Already annotated proteins are easier to annotate again in other organisms
Threshold:
Test different thresholds: middle point between performance and generalization
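A minimal sketch of how redundancy reduction with a threshold T might look, so that several values of T can be compared. It assumes equal-length, already aligned toy sequences and a greedy filter; real pipelines would use alignment-based tools, and the function names here are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions.

    ASSUMPTION: sequences are pre-aligned and of equal length (toy setting).
    """
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def reduce_redundancy(seqs, t):
    """Greedy filter: keep a sequence only if its identity to every
    already kept sequence is at most t."""
    kept = []
    for s in seqs:
        if all(identity(s, k) <= t for k in kept):
            kept.append(s)
    return kept

# Made-up sequences for illustration only
seqs = ["MKTAYIAK", "MKTAYIAR", "MKTGYLAR", "GGSPLWQE"]
for t in (0.9, 0.7, 0.3):
    kept = reduce_redundancy(seqs, t)
    print(f"T={t}: {len(kept)} sequences kept -> {kept}")
```

A stricter (lower) T leaves fewer sequences, which is the trade-off the card describes: less data to train on, but a more honest estimate of generalization.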
List limitations of homology-based inferences.
need for an annotated homologous protein sequence (<25% have annotations)
due to bias or multidomain problem, annotations might be wrong
threshold for sequence similarity is open for discussion (only half of proteins with 60% sequence similarity have similar functions)
paralogs
Give reasons why most database annotations are homology-based inferences rather than experimental annotations.
less time
less money
fewer requirements (lab…)
Why is Gene Ontology better than Enzyme Commission numbers? And why worse?
Better:
More interpretable by humans since it has biological context
GO applies to every protein, EC only to enzymes
Worse:
GO changes very often
Too many very specific terms -> fewer proteins for every class
What is the meaning of date and party hubs in the context of PPIs?
If you had a method predicting which residue is involved in PPIs, which one of these two would you hope to capture?
Date hubs: one interaction at a time at a hotspot (binding site)
Party hubs: multiple interactions at the same time at different binding sites
We capture only party hubs -> the protein has a lot of binding sites
Name two advantages and two disadvantages of embedding-based over expert-crafted features as input to ML.
examples:
embeddings: created only from sequence by an NLP model
expert-crafted: nr of binding sites, pH, hydrophobicity
Pros:
Finds new representations
Non obvious correlations
Can learn from large unlabeled datasets
Cons:
Hard to interpret
Pre training is computationally expensive
Bias
Definition of paralogs and orthologs
Both are homologs, which means they share a common ancestor and usually have similar sequences.
Paralog: common ancestry (gene duplication) but diverged in function
Ortholog: the same protein with a similar function, but in different species
Differences between homology-based inference and machine learning?
HBI requires homologous annotated proteins -> ML does not
HBI needs an evolutionary relationship between proteins -> ML does not
HBI is based on homologs, ML on a representative set of annotated sequences
different biases
HBI usually outperforms ML
How can you solve a multi-class problem with a binary classification model?
Why would you do that?
Hierarchical SVM: in form of tree
SVMs form the nodes and predictions the leaves
SVM with lowest misclassification on the top level
Advantages:
Predicts intermediate classes and not just final class
tree can be extended
Biological processes can be modeled
Disadvantages:
Mistakes on top node cannot be corrected
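The tree idea above can be sketched with any binary classifier at each node. To stay dependency-free, this sketch swaps the SVM for a nearest-centroid classifier (clearly not an SVM), uses a fixed split order instead of choosing the top node by lowest misclassification, and uses made-up 2D points. All class and function names are hypothetical.

```python
class CentroidBinary:
    """Stand-in for a binary SVM: assigns a point to the nearer of two centroids."""

    def fit(self, left_points, right_points):
        self.left = self._centroid(left_points)
        self.right = self._centroid(right_points)
        return self

    @staticmethod
    def _centroid(points):
        dims = len(points[0])
        return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

    @staticmethod
    def _sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def goes_left(self, x):
        return self._sq_dist(x, self.left) <= self._sq_dist(x, self.right)


def build_tree(data, classes):
    """Recursively build the tree: each node is one binary decision,
    each leaf is a final class name.

    ASSUMPTION: the split is always 'first class vs. the rest'.
    """
    if len(classes) == 1:
        return classes[0]
    left, right = classes[:1], classes[1:]
    clf = CentroidBinary().fit(
        [p for c in left for p in data[c]],
        [p for c in right for p in data[c]],
    )
    return {"clf": clf, "left": build_tree(data, left), "right": build_tree(data, right)}


def predict(tree, x):
    """Walk from the root to a leaf; an early wrong turn cannot be undone."""
    while isinstance(tree, dict):
        tree = tree["left"] if tree["clf"].goes_left(x) else tree["right"]
    return tree


# Toy 2D feature vectors (hypothetical)
data = {
    "secreted": [(0, 0), (0, 1), (1, 0)],
    "nucleus": [(10, 0), (10, 1), (11, 0)],
    "cytoplasm": [(10, 10), (10, 11), (11, 10)],
}
tree = build_tree(data, ["secreted", "nucleus", "cytoplasm"])
for point in [(0.5, 0.5), (10, 0.5), (10.5, 10.5)]:
    print(point, "->", predict(tree, point))
```

Two binary decisions are enough for three classes here; adding a class means adding one node, which is the "tree can be extended" advantage from the card.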
You want to develop a new method that predicts GO terms.
Name 3 problems you might encounter and propose generic solutions
GO terms are too specific, classes too small, prediction methods biased to less specific terms
solution:
Weight according to level of GO term
Many terms with only a few annotated proteins
Solution:
Prune very specific terms
GO changes, if classes deleted, you need to retrain:
Solution: Hierarchical SVM (can add/change single components)
You want to predict subcellular localization in eukaryotes in ten states.
Name two methods
LocTree2 or LocTree3:
SVM based
hierarchical system of SVM
imitates cascading mechanism of cellular sorting
DeepLoc:
CNN for motif discovery
RNN to create embeddings
LSTM with attention mechanism: focuses on the parts of the protein that are most important for the localization decision
Own:
Neural Network with 10 nodes in last layer
Signal peptides and nuclear localization signals (NLS) are two examples for relatively short (<10% of entire protein) sequence motifs of subcellular location.
Name two reasons why some motifs are more easily machine-learnable (e.g. signal peptides) than others (e.g. NLS)?
Signal peptides occur in all kingdoms of life -> more data available
Signal peptides have clear architecture
Signal peptides are always contiguous sequences at the N-terminus of the protein, NLSs are position-insensitive and possibly segmented
Which are more relevant to predict subcellular localization: surface- or core residues?
Surface residues, because:
they interact with the environment
signal peptides on the surface -> contain more information about protein localization
You want to predict binding of small molecules and have a fairly small dataset.
How could you solve this using a simple feed-forward ANN?
What expert-selected features can you add?
ANN:
use sliding window with a fixed size, run it over protein, for each center residue predict if it’s binding
expert-selected features:
evolutionary couplings
surface and core residues
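The sliding-window input described above can be sketched as follows. The window size of 5, the one-hot encoding, and the zero padding at the termini are assumptions chosen for illustration; expert-selected features would be appended to each window's vector. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """20-dimensional one-hot vector; all zeros for padding or unknown residues."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]

def window_features(sequence, window=5, pad="-"):
    """Yield (center_index, feature_vector) for every residue in the sequence.

    The window must be odd so that there is exactly one center residue.
    Termini are padded so that every residue gets a full-size window.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    padded = pad * half + sequence + pad * half
    for i in range(len(sequence)):
        chunk = padded[i:i + window]
        features = [value for res in chunk for value in one_hot(res)]
        yield i, features

seq = "MKTAYIAKQR"  # made-up sequence
feats = list(window_features(seq, window=5))
# One window per residue; each window has 5 * 20 input values
print(len(feats), len(feats[0][1]))
```

Each feature vector would be one input row for the feed-forward network, with a binary label (binding or not) for the center residue. A small window keeps the number of free parameters low, which matters for a small dataset.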
You developed a method that predicts physical, transient protein-protein interactions (PPIs). Sketch three ways of reducing the false positives. Why would reducing FPs not necessarily improve performance?
predicting surface residues
predicting subcellular localization
You want to develop a method that predicts the effect of SAVs (single amino acid variants) upon molecular function. Name three challenges that you are likely to encounter in the experimental data.
hard and expensive to determine experimentally
bias in database
dataset imbalance -> more variants with an effect than neutral ones
Why wouldn’t the approach “a specific drug for each genotype” work?
It takes 14 years and 4 billion dollars to develop a new drug
currently only 40 drugs per year
How many sequences are in UniProt?
250 million
How many sequences are in SwissProt?
550 000
You have developed a method based on machine learning generating an ML model that predicts location in 10 states. On your test set performance was Q10=65%; you used that set to decide which model was best. Comment. What else do you need to do in order to ascertain that performance reaches a certain value.
Don’t use the test set for hyperparameter tuning and model selection; use a validation set for that
Separate the test and validation sets
Calculate more scores
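One way to make a reported Q10 more trustworthy is to attach an error estimate computed on a held-out test set that was never used for model selection. The sketch below uses bootstrap resampling; the labels and predictions are synthetic, and the function names are hypothetical.

```python
import random

def q_accuracy(y_true, y_pred):
    """Percentage of correctly predicted samples (Q10 when there are 10 classes)."""
    return 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_stderr(y_true, y_pred, n_boot=1000, seed=0):
    """Estimate the standard error of the accuracy by resampling
    the test set with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(q_accuracy([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    mean = sum(scores) / n_boot
    variance = sum((s - mean) ** 2 for s in scores) / (n_boot - 1)
    return variance ** 0.5

# Synthetic held-out test set: 200 proteins, 10 classes, 65% predicted correctly
y_true = [i % 10 for i in range(200)]
y_pred = [t if i % 20 < 13 else (t + 1) % 10 for i, t in enumerate(y_true)]

q10 = q_accuracy(y_true, y_pred)
stderr = bootstrap_stderr(y_true, y_pred)
print(f"Q10 = {q10:.1f}% +/- {stderr:.1f}")
```

With only 200 test proteins the uncertainty is several percentage points, so differences of one or two points between models would not be meaningful.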
You developed a method to predict location in 10 classes. You want to know what is the best possible method: how could you establish a “ceiling” for your prediction method? How could you get the “bottom”?
Ceiling: Calculate hypothetical best result based on experimental errors, background noise, label confidence
Bottom: Random model
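The "bottom" can be sketched with two trivial baselines: uniform random guessing and always predicting the most frequent class. The second one is an addition to the card's answer, included because it is often the harder baseline to beat on imbalanced data. The label distribution below is made up and the function names are hypothetical.

```python
import random
from collections import Counter

def accuracy(y_true, y_pred):
    """Percentage of correctly predicted samples."""
    return 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def random_baseline(y_true, classes, seed=0):
    """Predict a uniformly random class for every sample."""
    rng = random.Random(seed)
    return accuracy(y_true, [rng.choice(classes) for _ in y_true])

def majority_baseline(y_true):
    """Always predict the most frequent class.

    ASSUMPTION: for brevity the majority class is taken from the same labels;
    in practice it would come from the training set.
    """
    majority = Counter(y_true).most_common(1)[0][0]
    return accuracy(y_true, [majority] * len(y_true))

# Imbalanced toy labels: class 0 has 460 samples, classes 1-9 have 60 each
classes = list(range(10))
labels = [0] * 460 + [c for c in range(1, 10) for _ in range(60)]

rand_score = random_baseline(labels, classes)
major_score = majority_baseline(labels)
print(f"random: {rand_score:.1f}%  majority class: {major_score:.1f}%")
```

On this toy set a Q10 of 65% is far above random guessing but much closer to the majority-class baseline, which is why both numbers are worth reporting.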
Four problems for function annotation
paralogy problem
Moonlighting problem
multidomain problem
Database annotation problem
NLS
Nuclear Localization Signal
What is the genetic difference between all humans?
0.1%
20 000 amino acids
Fraction of known PPIs
13 500
If you do not have enough experimental data, how can you still build a machine learning method that works?
Get more data/Do more experiments
Oversampling
SMOTE (for oversampling)
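The oversampling idea can be sketched in a SMOTE-like way: new minority samples are created by interpolating between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the reference SMOTE implementation; the function name, the choice of k, and the toy points are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority sample and one of its k nearest minority neighbours.

    Requires at least two minority samples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position on the line segment between x and nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy minority class in a 2D feature space (hypothetical)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
print(len(new_points), "synthetic samples")
```

Oversampling should be applied to the training set only; synthetic points in the test set would inflate the measured performance.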