PP2

by zeynep K.

There are some proteins that have very similar sequences but different functions. Why can that be?

Multi-domain problem: a protein can consist of many domains that all have different functions. An annotation might be based on only one of those domains.
A change of a single amino acid (AA) can change function -> can lead to a disease
Moonlighting proteins: different functions under different conditions (different tissues, different pH levels…)
Paralogy: similar sequence but different protein function
Binding of a cofactor

What are the problems with using function annotations?

Function is hard to define (even for the same protein)
Ligands in PDB structures are not necessarily identical to the cognate ligands
Wrong annotations due to the multi-domain problem
Gene Ontology: not enough annotations -> small classes -> possible overfitting

Physical, transient protein-protein interactions (PPIs) of different proteins:

Why “physical”?

Why “transient”?

What is meant by “different” proteins?

What is the difference between PP interactions and associations?

Physical: direct contact between two proteins (< 6.5 Å)

Transient: only temporary interactions count as PPIs

Different: no homodimers

Difference between PPI and association:

A PPI is a physical interaction
An association is non-physical -> e.g. co-expression of different proteins

What are the 3 classes of Gene Ontology?

Biological process (e.g. signal transduction)
Molecular function (e.g. enzyme)
Cellular component, i.e. localization (e.g. nuclear matrix)

Why is the data in molecular biology biased?

You can reduce redundancy using some threshold T. How to define T?

Biased:

Sequence redundancy
Some things are easier to test experimentally (some model organisms, or prokaryotes rather than eukaryotes)
Some things are more interesting to study experimentally
Already annotated proteins are easier to annotate again in different organisms

Threshold:

Test different thresholds and pick a compromise between performance and generalization
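
A minimal sketch of what reducing redundancy at a threshold T can look like: a sequence is kept only if it is less than T identical to everything kept so far, and several values of T are tried. The `identity` function is a deliberately crude stand-in (position-wise identity without any alignment), and all names and toy sequences are made up for illustration; real pipelines rely on alignment-based tools.

```python
def identity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (assumption): fraction of
    identical positions over the shorter sequence, without alignment."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def reduce_redundancy(seqs: list[str], t: float) -> list[str]:
    """Greedy filter: keep a sequence only if it is less than t
    identical to every representative kept so far."""
    kept: list[str] = []
    for s in seqs:
        if all(identity(s, k) < t for k in kept):
            kept.append(s)
    return kept


# Toy sequences (hypothetical), scanned with a few candidate thresholds.
seqs = ["MKTAYIAK", "MKTAYIAR", "GSHMLEDP", "MKTGYIAR"]
for t in (0.5, 0.8, 0.95):
    print(t, len(reduce_redundancy(seqs, t)))
```

A stricter (lower) T leaves fewer, more diverse sequences; a looser T keeps more data but also more near-duplicates, which is the trade-off the answer above refers to.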

List limitations of homology-based inferences.

Need for an annotated homologous protein sequence (<25% have annotations)
Due to bias or the multi-domain problem, annotations might be wrong
The threshold for sequence similarity is open for discussion (only half of the proteins with 60% sequence similarity have similar functions)
Paralogs

Give reasons why most database annotations are homology-based inferences rather than experimental annotations

Less time
Less money
Fewer requirements (lab…)

Why is Gene Ontology (GO) better than Enzyme Commission (EC) numbers? And why worse?

Better:

More interpretable by humans since it has biological context
GO applies to every protein, EC only to enzymes

Worse:

GO changes very often
Too many very specific terms -> fewer proteins for every class

What is the meaning of date and party hubs in the context of PPIs?

If you had a method predicting which residues are involved in PPIs, which one of these two would you hope to capture?

Date hubs: one interaction at a time at a hotspot (binding site)

Party hubs: multiple interactions at the same time at different binding sites

Party hubs -> the protein has a lot of binding sites

Name two advantages and two disadvantages of embedding-based over expert-crafted features as input to ML

Examples:

Embeddings: created only from sequence by an NLP model
Expert-crafted: number of binding sites, pH, hydrophobicity

Pros:

Finds new representations
Finds non-obvious correlations
Can learn from large unlabeled datasets

Cons:

Hard to interpret
Pre-training is computationally expensive
Bias

Definition of paralogs and orthologs

Both are homologs, which means they share a common ancestor and usually have similar sequences.

Paralog: common ancestry through gene duplication, often diverged in function

Ortholog: the same protein with similar function in different species (diverged through speciation)

Differences between homology-based inference and machine learning?

HBI requires homologous annotated proteins -> ML does not

HBI needs an evolutionary relationship between proteins -> ML does not

HBI is based on homologs, ML on a representative set of annotated sequences

Different biases

HBI usually outperforms ML (when an annotated homolog is available)

How can you solve a multi-class problem with a binary classification model?

Why would you do that?

Hierarchical SVM: arranged in the form of a tree

SVMs form the nodes and the predicted classes form the leaves

The SVM with the lowest misclassification rate sits at the top level

Advantages:

Predicts intermediate classes and not just the final class
The tree can be extended
Biological processes can be modeled

Disadvantages:

Mistakes at the top node cannot be corrected further down
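
The tree idea can be sketched in a few lines: every internal node holds one binary decision, and walking from the root to a leaf yields the intermediate classes plus the final class. In this sketch the binary classifiers are toy threshold rules standing in for trained SVMs, and the node names, feature names and cutoffs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Node:
    """Internal node: one binary classifier choosing between two subtrees."""
    name: str
    decide: Callable[[dict], bool]   # True -> left branch, False -> right
    left: Union["Node", str]         # a plain str is a leaf (final class)
    right: Union["Node", str]


def predict(node: Union[Node, str], x: dict) -> list[str]:
    """Walk the tree; return the decisions taken plus the final class."""
    path = []
    while isinstance(node, Node):
        go_left = node.decide(x)
        path.append(f"{node.name}:{'left' if go_left else 'right'}")
        node = node.left if go_left else node.right
    path.append(node)
    return path


# Toy stand-ins for trained SVMs (hypothetical features and cutoffs).
tree = Node(
    name="secreted_vs_intracellular",
    decide=lambda x: x["signal_peptide_score"] > 0.5,
    left="extracellular",
    right=Node(
        name="nucleus_vs_cytoplasm",
        decide=lambda x: x["nls_score"] > 0.5,
        left="nucleus",
        right="cytoplasm",
    ),
)

print(predict(tree, {"signal_peptide_score": 0.1, "nls_score": 0.9}))
```

The sketch also shows the listed disadvantage: once the root sends a protein down the wrong branch, no node below it can recover the correct class.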

You want to develop a new method that predicts GO terms.

Name 3 problems you might encounter and propose generic solutions

Problem: GO terms are too specific and classes too small, so prediction methods are biased towards less specific terms

Solution:
- Weight according to the level of the GO term

Problem: Many terms have only a few annotated proteins

Solution:
- Prune very specific terms

Problem: GO changes; if classes are deleted, you need to retrain

Solution:
- Hierarchical SVM (single components can be added or changed)
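
The first two solutions can be illustrated with a small sketch. It assumes annotations are given as a mapping from protein to a set of terms; `prune_rare_terms` drops terms with too few annotated proteins, and `term_weight` is one hypothetical way to weight by level so that deeper, more specific terms count more. Protein and term names are placeholders, not real GO identifiers.

```python
from collections import Counter


def prune_rare_terms(annotations, min_proteins):
    """Drop every term annotated to fewer than min_proteins proteins."""
    counts = Counter(term for terms in annotations.values() for term in terms)
    keep = {term for term, n in counts.items() if n >= min_proteins}
    return {protein: terms & keep for protein, terms in annotations.items()}


def term_weight(depth, max_depth):
    """Hypothetical weighting: deeper (more specific) terms count more."""
    return (depth + 1) / (max_depth + 1)


# Placeholder protein and term names, not real GO identifiers.
annotations = {
    "protA": {"general_term", "specific_term_1"},
    "protB": {"general_term", "specific_term_2"},
    "protC": {"general_term", "specific_term_1"},
}
pruned = prune_rare_terms(annotations, min_proteins=2)
print(pruned["protB"])
print(term_weight(0, 4), term_weight(4, 4))
```

Where to set `min_proteins` and how steeply to weight by depth are design choices that would need to be checked on a validation set.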

You want to predict subcellular localization in eukaryotes in ten states.

Name two methods

LocTree2 or LocTree3:

SVM-based
Hierarchical system of SVMs
Imitates the cascading mechanism of cellular sorting

DeepLoc:

CNN for motif discovery
RNN (LSTM) to create embeddings
Attention mechanism focuses on what in the protein is most important for the localization decision

Own:

Neural network with 10 nodes in the last layer (one per localization state)

Signal peptides and nuclear localization signals (NLS) are two examples for relatively short (<10% of entire protein) sequence motifs of subcellular location.

Name two reasons why some motifs are more easily machine-learnable (e.g. signal peptides) than others (e.g. NLS)?

Signal peptides occur in all kingdoms of life -> more data available

Signal peptides have a clear architecture

Signal peptides are always contiguous sequences at the N-terminus of the protein; NLSs are position-insensitive and possibly segmented

Which are more relevant to predict subcellular localization: surface- or core residues?

Surface residues, because:

They interact with the environment
Signal peptides are on the surface -> surface residues contain more information about protein localization

You want to predict binding of small molecules and have a fairly small dataset.

How could you solve this using a simple feed-forward ANN?

What expert-selected features can you add?

ANN:

Use a sliding window of fixed size, run it over the protein, and for each center residue predict whether it is binding

Expert-selected features:

Evolutionary couplings
Whether a residue is on the surface or in the core
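
The sliding-window input described above can be sketched as follows: each residue gets one fixed-length feature vector built from its neighbours, which is what a simple feed-forward network needs. The one-hot encoding, the padding symbol and the window size of 7 are assumptions chosen for illustration; expert-selected features would be appended per position in the same way.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue: str) -> list[int]:
    """21 values: the 20 standard amino acids plus one slot shared by
    padding and unknown residues."""
    vec = [0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1
    return vec


def windows(seq: str, size: int = 7) -> list[list[int]]:
    """One fixed-length feature vector per residue, centred on that residue.
    Positions beyond the termini are padded with '-'."""
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    half = size // 2
    padded = "-" * half + seq + "-" * half
    features = []
    for i in range(len(seq)):
        window = padded[i:i + size]
        features.append([v for res in window for v in one_hot(res)])
    return features


# Toy sequence (hypothetical): 10 residues -> 10 vectors of 7 * 21 values.
X = windows("MKTAYIAKQR", size=7)
print(len(X), len(X[0]))
```

Because every residue becomes its own training sample, even a fairly small set of proteins yields many samples, which is one reason this setup suits small datasets.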

You developed a method that predicts physical, transient protein-protein interactions (PPIs). Sketch three ways of reducing the false positives. Why would reducing FPs not necessarily improve performance?

Predict surface residues (only residues on the surface can take part in an interaction)

Predict subcellular localization (proteins in different compartments are unlikely to interact)

You want to develop a method that predicts the effect of SAVs (single amino acid variants) upon molecular function. Name three challenges that you are likely to encounter in the experimental data.

Hard and expensive to determine experimentally
Bias in databases
Dataset imbalance -> more variants with an effect than neutral ones

Why wouldn’t the approach “a specific drug for each genotype” work?

It takes 14 years and 4 billion dollars to develop a new drug

Currently only 40 new drugs per year

How many sequences are in UniProt?

250 million

How many sequences are in Swiss-Prot?

550,000

You have developed a method based on machine learning generating an ML model that predicts location in 10 states. On your test set performance was Q10=65%; you used that set to decide which model was best. Comment. What else do you need to do in order to ascertain that performance reaches a certain value?

Don’t use the test set for hyperparameter tuning and model selection; use a validation set for that
Separate the test and validation sets
Calculate more scores
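
A minimal sketch of the points above: split once into training, validation and test sets, select the model on the validation set, and report the score on the test set together with an error estimate. The split fractions and the binomial standard error are assumptions for illustration; for proteins the split should additionally keep similar sequences out of different sets, which this plain random split does not do.

```python
import math
import random


def three_way_split(items, frac_val=0.1, frac_test=0.1, seed=0):
    """Shuffle once, then cut into train / validation / test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * frac_test)
    n_val = int(len(items) * frac_val)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


def q_accuracy(y_true, y_pred):
    """Q_N: fraction of correct predictions, with a binomial standard error."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    q = correct / n
    stderr = math.sqrt(q * (1 - q) / n)
    return q, stderr


# 1000 hypothetical protein indices.
train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))
```

Reporting the standard error alongside Q10 makes it visible whether a difference between two models is larger than the noise expected from the size of the test set.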

You developed a method to predict location in 10 classes. You want to know what the best possible method is: how could you establish a “ceiling” for your prediction method? How could you get the “bottom”?

Ceiling: Calculate hypothetical best result based on experimental errors, background noise, label confidence
Bottom: Random model
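
The "bottom" can be estimated with a few lines. The sketch below scores a uniformly random predictor and, as a second common baseline that the card does not mention, a predictor that always outputs the most frequent class. The label distribution is made up for illustration.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def random_baseline(y_true, classes, seed=0):
    """Predict a uniformly random class for every protein."""
    rng = random.Random(seed)
    return [rng.choice(classes) for _ in y_true]


def majority_baseline(y_train, n):
    """Always predict the most frequent training class."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * n


# Hypothetical, imbalanced labels; classes 3..9 happen to be absent here.
y_true = [0] * 50 + [1] * 30 + [2] * 20
classes = list(range(10))
print(accuracy(y_true, random_baseline(y_true, classes)))
print(accuracy(y_true, majority_baseline(y_true, len(y_true))))
```

With imbalanced classes the majority baseline can sit far above the uniform random one, so a new method should be compared against both.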

Four problems for function annotation

paralogy problem
Moonlighting problem
multidomain problem
Database annotation problem

NLS

Nuclear Localization Signal

What is the genetic difference between all humans?

0.1%

20 000 amino acids

Fraction of known PPIs

13 500

If you do not have enough experimental data, how can you still build a machine learning method that works?

Get more data/Do more experiments
Oversampling
SMOTE (for oversampling)
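
As a sketch of the oversampling idea: the simplest variant duplicates randomly chosen samples of the smaller classes until all classes are equally large. This is plain random oversampling, not SMOTE, which instead creates new synthetic samples by interpolating between neighbouring minority samples in feature space. Sample names and labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical toy data: one neutral variant, five variants with effect.
X = ["p1", "p2", "p3", "p4", "p5", "p6"]
y = ["neutral", "effect", "effect", "effect", "effect", "effect"]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling should be applied to the training set only, after the split, so that duplicated samples cannot leak into the validation or test set.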

Information

Last changed
3 years ago
