what is a domain in a protein?
a 3D unit that folds on its own
FAIR
○ Findable
○ Accessible
○ Interoperable
○ Reusable
Define protein structure & protein function (each max 3 bullets) 3 pts
○ Function = anything that happens to or through a protein
○ sequence defines structure; protein structure is the 3D shape into which a protein
folds in its natural ‘unbothered’ (native) state
○ structure defines function
○ protein function is diverse, can not be easily defined or classified
○ Functions: Defense (e.g. antibodies), Structure (e.g. collagen), Enzymes –
metabolism, catabolism, Communication / Signaling (e.g. insulin), Ligand
binding / Transport (e.g. hemoglobin), Storage (e.g. ferritin)
Protein structure in 1D, 2D, 3D: describe (each max 1 bullet) 3 pts
○ 1D: sequence or string of secondary structure states
○ 2D: inter-residue distances and visualization in a graph (distance map)
○ 3D: coordinates of protein structure and visualization as a picture
Speculate: why do most proteins have more than one structural domain?
Domains are parts of proteins that have specific functions or structures.
Proteins can have multiple domains to perform complex functions or create
new functions by recombining domains.
Experimental high-resolution structures are known for fewer than 10% of the ~20k
human proteins. Give three reasons (3 bullets) 3pts.
○ expensive
○ time intensive
○ hard to extract and/or analyse different proteins
Most methods predicting secondary structure in 3 states (helix, strand, other) predict
strands much worse than helix. Comment in fewer than 5 bullet points (telegram
style; 5 pts).
○ There is much more data on helices than on strands
○ This imbalance in the data leads to helices being predicted better
○ Balancing the classes with over/undersampling leads to both classes being
predicted similarly well
○ helices are local, sheets do not have to be
Helices are stabilized by hydrogen-bond formation, typically between residues i and i+4. Does this make secondary structure a 2D feature of structure? (≤2 bullets/sentences; 2 pts).
No, secondary structure is a 1D feature: each position is assigned one of three states (H: helix, E: strand, O: other), so the whole prediction is a string along the sequence, i.e. one-dimensional.
Q3 measures the percentage of residues correctly predicted in one of 3 states (H: helix, E: strand, O: other). Methods predicting H|E|O equally well may have a lower Q3 than those predicting O best. Why? (1 bullet, 2 pts)
Most residues are in state O, so a method that predicts O best gets more residues right overall and therefore reaches a higher Q3.
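A minimal sketch of the Q3 effect described above. The observed and predicted strings are made-up toy data, and the function name `q3` is only illustrative.

```python
def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Toy data (assumption): 6 residues in O, 2 in H, 2 in E.
observed = "OOOOOOHHEE"
always_o = "OOOOOOOOOO"  # perfect on O, wrong on every H and E
balanced = "OOOHHHHEEO"  # 50% correct in each of the three states

print(q3(observed, always_o))  # the O-biased predictor
print(q3(observed, balanced))  # the balanced predictor
```

In this toy case the predictor that only ever says O scores higher than the one that is equally good on all three states, simply because O is the majority class.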
Predicting secondary structure through one-hot-encoding implies that each residue
position is represented by 21 binary units. Why not 20? What else could you do?
Speculate why your alternative has not been used in existing methods (5 pts).
In one-hot encoding each amino acid is represented by a binary unit vector of
length n, containing a single one and n-1 zeros ([1,0,0,...,0] for one amino
acid). When predicting secondary structure through one-hot-encoding one
uses 21 binary units because the extra unit "." encodes "no residue at this position", which can
be used for padding
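A small sketch of the 21-unit encoding, assuming a sliding window around a central residue (the window idea and all names here are illustrative assumptions, not taken from a specific method).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "."                             # 21st symbol: "no residue at this position"
ALPHABET = AMINO_ACIDS + PAD


def one_hot(symbol: str) -> list:
    """Return a 21-unit vector with a single 1 at the symbol's index."""
    vec = [0] * len(ALPHABET)
    vec[ALPHABET.index(symbol)] = 1
    return vec


def encode_window(sequence: str, center: int, half_width: int) -> list:
    """Concatenate one-hot vectors for a window; positions outside the
    sequence are filled with the padding symbol."""
    window = []
    for pos in range(center - half_width, center + half_width + 1):
        symbol = sequence[pos] if 0 <= pos < len(sequence) else PAD
        window.extend(one_hot(symbol))
    return window


# Window of 3 around the first residue: the left neighbour does not exist,
# so its 21 units encode the padding symbol.
features = encode_window("MKT", center=0, half_width=1)
print(len(features))
```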
Why do methods using evolutionary information perform better than those using one-
hot encoding (≤2 bullets/sentences; 2 pts).
Evolution conserved the regions critical for the right structure, so using this
information avoids noise and focuses on the regions relevant for the structure
(compared to just using the amino acids in a window nearby)
○ MSAs can be used to generate Position-Specific Scoring Matrices (PSSMs),
which can be taken as input for training and have demonstrated great potential
in predicting protein structure and function.
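To illustrate how an MSA turns into per-position numbers, here is a simplified sketch that computes column-wise residue frequencies from a toy alignment. This is only a frequency profile: real PSSMs additionally use background frequencies, pseudocounts and log-odds scores, which are left out here. The alignment is made up.

```python
from collections import Counter


def column_profile(msa: list) -> list:
    """Per-column residue frequencies of an alignment (gaps '-' are ignored)."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        residues = [row[col] for row in msa if row[col] != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


toy_msa = ["MKTA", "MRTA", "MKSA", "MK-A"]  # hypothetical aligned sequences
profile = column_profile(toy_msa)
print(profile[1])  # column 2 is variable (K or R), column 1 is fully conserved
```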
What do you need to consider in the comparison of methods predicting, e.g.
secondary structure (you get a table and have to choose the number of digits, rank
methods, note issues, etc.).
you need to consider the error of each score, whether the scores of the methods
actually differ, and the confidence interval
○ dataset & testing
○ how good is a random prediction
○ runtime
THE breakthrough in protein prediction originated from using evolutionary information
(originally in 1992 to predict secondary structure). Where do you get evolutionary
information from (≤2 bullets/sentences; 2 pts).
Multiple sequence alignments (MSAs), built with tools like BLAST
○ MSAs are a way of encoding evolutionary information by aligning several
similar proteins
○ MSAs can be used to generate Position-Specific Scoring Matrices (PSSMs)
○ PSSMs can be taken as input for training and have demonstrated great
potential in predicting protein structure and function
What should you do if your score on the test set is better than the score on the
training set?
This is not possible, as the ‘top score possible’ on the training set is the holy
grail of knowledge for the model. There must be a mistake so start from
scratch.
What are the major improvements/breakthroughs of AlphaFold?
○ it also predicts the quality of its predictions
○ Feature preprocessing: The input protein sequence is used to search for
similar sequences and known structures in large databases, using tools like
MMseqs2 and HHsearch. The results are used to generate multiple sequence
alignments and templates that capture the evolutionary and structural
information of the protein.
○ The multiple sequence alignments and templates are fed into a deep neural
network that predicts the distances and angles between pairs of amino acids,
as well as the confidence of these predictions. These predictions are then
converted into 3D coordinates using a gradient descent algorithm.
○ The predicted 3D coordinates are refined by a molecular dynamics simulation
that minimizes the potential energy of the protein structure and makes it more
physically realistic.
Describe an AI (Artificial Intelligence)/ML (machine learning) method that predicts
sub-cellular location (or cellular compartment) in three classes (c: cytoplasmic, e:
extra-cellular, n: nuclear). Make sure to explain how to cope with the fact that
proteins have different lengths and AI/ML models need fixed input. 6 pts.
Train three binary classifiers (neural network) one for each sub-cellular
location (cytoplasmic c, extra-cellular e, nuclear n) and connect them into one
prediction scheme.
○ To cope with the fact that proteins have different lengths and AI/ML models
need fixed input we use the amino acid composition (a fixed-length vector of 20
frequencies) as input because proteins have intrinsic signals that govern the
transport and localization in the cell.
How can sequencing mistakes challenge per-protein predictions? 3 pts.
because a prediction tool can only be as good as the data on which it was
trained. If those data contain sequencing mistakes, the tool will make more
prediction mistakes due to misleading training data
You want to develop a method that predicts binding residues (e.g. enzymatic activity
and DNA-binding). Your entire data set of proteins with experimentally known binding
residues amounts to 500 sequence-unique proteins with a total of 5,000 binding and
45,000 non-binding residues. You can only use a simple artificial neural network (of
the feed-forward style) with one hidden layer, but the complexity of your problem
demands at least 100 hidden units. Thus, even a simple one-hot encoding (or
evolutionary information) for a single residue with 20 units is not supported by the
data. Explain why. What could you do, instead?
Rule of thumb: need 10 times more data points than free parameters
○ number of free parameters in this model: 20*100+100*2=2200
○ Why: because these input nodes gave the best performance?
○ Instead: group together amino acids with similar properties (like acidic or
positively charged…)
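The parameter count and the rule of thumb can be checked with a few lines. This sketch counts weights only, as in the note above; counting biases is shown as an option. Comparing the requirement against the 5,000 binding residues is one possible reading (an assumption, the notes do not say which count to compare against), and the grouping into 6 amino acid classes is a hypothetical example.

```python
def free_parameters(n_input: int, n_hidden: int, n_output: int,
                    with_bias: bool = False) -> int:
    """Free parameters of a feed-forward network with one hidden layer."""
    weights = n_input * n_hidden + n_hidden * n_output
    biases = (n_hidden + n_output) if with_bias else 0
    return weights + biases


params_one_hot = free_parameters(20, 100, 2)  # 20 input units per residue
needed_one_hot = 10 * params_one_hot          # rule of thumb: 10x data points

params_grouped = free_parameters(6, 100, 2)   # hypothetical: 6 amino acid groups
needed_grouped = 10 * params_grouped

binding_residues = 5_000                      # from the question
print(params_one_hot, needed_one_hot, needed_one_hot > binding_residues)
print(params_grouped, needed_grouped)
```

Grouping residues shrinks the input layer and with it the number of samples the rule of thumb asks for.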
What is redundancy reduction? How do you do it?
○ Generally removing too similar data
○ Hard to define what to remove in practice
○ Choose a threshold depending on the problem e.g. define protein families as
being similar
Protein Language Models (pLMs) copy models from Natural Language Processing
(NLP). Those learn grammar by putting words in their context in sentences. Name
the three analogies for grammar|word|sentence in pLMs? 3 pts
Protein Language Models (pLMs) apply methods from the field of natural
language processing (NLP) to capture the intrinsic language of proteins. While a single
word can have a different meaning depending on its context and sentence, an
amino acid can have a different effect depending on its surrounding residues.
○ word: amino acid
○ sentence: protein sequence
○ grammar: combination of words (amino acids) which results in meaning
What problem do pLMs address?
Can leverage large, unlabelled datasets
○ Can find new representations automatically (data-driven) even for domains
which were hard to formalize (NLP, CB)
○ Outperforms handcrafted features in many cases
○ pLMs use end-to-end transformers to go from sequence-to-sequence. The
second to last layer of the neural network can be read to obtain a so-called
embedding which is a great numerical representation of the protein. These
embeddings can be used in machine learning tasks to predict features of a
protein.
What is the meaning of embeddings from pLMs? 3 pts.
Embeddings are a machine-readable representation of protein sequences: the
sequence (text) is converted into vectors of numbers representing relevant features or
descriptors of the protein. This is an important first step to find out properties of the
protein with that sequence, e.g., what other proteins it resembles (sequence
comparisons through alignments), what it looks like (membrane or water-soluble,
regular globular or disordered), or what it does (enzyme or not,
process involved in, molecular function, interaction partners).
How can we profit from pLMs for protein prediction?
We can use the embeddings as input to prediction methods
○ They capture more features than expert-crafted input features
What is the difference between per-residue and per- protein embeddings?
Embeddings are a machine-readable representation from protein sequences
by converting text into vectors of numbers representing relevant features.
These come in two flavors: per-residue and per-protein. While the per-residue
embeddings are taken directly out of the LMs, per-protein embeddings are
generated by post-processing the information extracted by the LM through global average pooling over all per-residue embeddings of a sequence. Per-residue
embeddings are useful to analyze properties of residues in a protein
e.g., which residues bind ligands, while per-protein representations capture
annotations describing entire proteins e.g., native localization.
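Global average pooling as described above, sketched in plain Python. The 2-dimensional vectors are toy values; real pLM embeddings have hundreds to thousands of dimensions per residue.

```python
def mean_pool(per_residue: list) -> list:
    """Average L per-residue vectors (L x d) into one per-protein vector (d)."""
    if not per_residue:
        raise ValueError("need at least one residue embedding")
    n = len(per_residue)
    dim = len(per_residue[0])
    return [sum(vec[d] for vec in per_residue) / n for d in range(dim)]


# Toy per-residue embeddings for a protein of length 3 (made-up numbers).
toy_embeddings = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
per_protein = mean_pool(toy_embeddings)
print(per_protein)
```

The output has the same length regardless of how many residues the protein has, which is what makes it usable as fixed-size input.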
Bonus question: pLMs originate from CNNs that predict sequences from
sequences. Does it matter whether those are over-trained or not? Explain in <5 bullet
points/short sentences. 4 pts
Over-training can result in overfitting, where the model becomes too
specialized to the training data
○ Over-trained models may be more prone to memorizing specific examples or
noise in the training data
○ Over-training can increase the risk of biased or inaccurate predictions
Protein Language Models (pLMs) generate embeddings that are used as input to
methods predicting protein secondary structure. Speculate why those methods reach
the performance of MSA-based (multiple sequence alignment) methods (≤3 reasons;
3 pts).
pLMs are trained on large amounts of data and can learn complex patterns in
protein sequences
○ pLMs can capture long-range dependencies between amino acids in a protein
sequence
○ pLMs can be used to predict protein properties beyond secondary structure,
such as protein stability and binding affinity
Describe one way to test whether or not pLM-based methods capture evolutionary information
(≤3 bullets; 3 pts).
○ One possible way to test whether or not pLM-based methods capture evolutionary
information is to compare the pLM embeddings of protein sequences with
different levels of evolutionary relatedness.
○ Alternatively, one could use the pLM embeddings as input to a classifier that
predicts the evolutionary category of a protein sequence, such as family,
superfamily, fold, or class, and see how well the classifier performs on
different categories
How can pLM-based protein prediction save energy/ resources (≤2
bullets/sentences; 2 pts)?
The use of pLMs can help reduce the amount of experimental work needed to
determine protein structures and functions, which can save energy and
resources
Bonus question: are larger pLMs guaranteed to outperform smaller ones (Y/N and
argue; ≤3 bullets; 3 pts).
Size of the model does not seem to be the determining factor, so larger
Models are not guaranteed to outperform smaller ones
○ It is rather about training time
Pros & Cons scRNA-Seq over Bulk Seq
What does the ideal single cell transcriptomics method look like
Universal (can be applied to every cell)
○ in situ measurements (without removing the cell from, or changing, its original
condition)
○ no minimum number of cells and every cell is captured and assayed
○ transcripts are sequenced full-length
○ multimodal (measuring different things at once, e.g. gene accessibility and
transcriptomics)
○ no doublets, transcripts are assigned correctly to their cell
○ easy to use, open source, cost-effective
Name scRNA-Seq Method
○ InDrop (explained in lecture) for rare cells, only “common” sensitivity.
○ DropSeq for abundant cells
○ 10X
○ (Mars-seq, Split-seq …)
probability that droplet has k beads or k cells
What is the cell capture rate?
The cell capture rate is the probability that a droplet has at least one bead
What is the cell duplication rate?
The cell duplication rate is the rate at which captured single cells are
associated with two or more different barcodes
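The droplet probabilities asked about above are commonly modelled with a Poisson distribution (an assumption here, the notes do not name the model): if on average lambda beads (or cells) end up in a droplet, the probability of exactly k is exp(-lambda) * lambda^k / k!. The loading rates below are made-up values.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(droplet contains exactly k beads or cells) under Poisson loading."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def p_at_least_one(lam: float) -> float:
    """P(droplet contains one or more beads or cells)."""
    return 1.0 - math.exp(-lam)


bead_rate = 1.0  # hypothetical average number of beads per droplet
cell_rate = 0.1  # hypothetical average number of cells per droplet

print(p_at_least_one(bead_rate))   # chance that a droplet has a bead at all
print(poisson_pmf(1, cell_rate))   # exactly one cell in the droplet
print(1.0 - poisson_pmf(0, cell_rate) - poisson_pmf(1, cell_rate))  # 2+ cells
```

Keeping the cell rate low makes droplets with two or more cells rare, at the cost of many empty droplets.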
what are doublets?
○ When a barcode is associated with two or more cells
○ synthetic: Barcode Collisions
what are barcodes?
short sequences, delivered on beads, that identify the cells in droplets
what is CITE-Seq?
Barcode + antibody -> measure which cell surface proteins are present, together with the transcriptome
what's the advantage of UMIs? what are UMIs?
unique molecular identifiers -> identify transcripts in cell
○ help prevent PCR-bias (GC bias etc.)
what is the typical analysis workflow?
Start with raw data, which is converted to count matrices
○ Count matrices now go through quality control, correction and normalisation
○ Next step is visualisation and clustering
○ Downstream analysis can be trajectory inference, differential expression or
compositional analysis (e.g compare the abundance and composition of two
patients)
Barcodes with a low count depth, few detected genes, and a high fraction of
mitochondrial counts are indicative of cells
that are dead
Cells with unexpectedly high counts and a large number of detected genes may
represent
doublets.
Batch correction:
Check if data clusters by batch first and not by cell type
○ ComBat is a method for batch effect correction
PCA:
Orthogonal linear transformation
○ Captures as much variance as possible
○ Good at showing global structure
○ Poor at resolving local similarities
○ Sensitive to outliers
○ Not able to capture non-linear relations
tSNE (t-Distributed Stochastic Neighbor Embedding):
○ Idea:
The idea behind tSNE is to reduce the dimensions through a non-
linear transformation and thereby retain the existing clusters. In doing
so, tSNE itself does not perform any clustering and is only used for
visualization, although it is classified as unsupervised.
Steps:
1. select “neighbors” w.r.t. Gaussian distribution over points in high
dimensional space
■ 2. select “neighbors” w.r.t. t-distribution (1df = Cauchy) over points in
low dimensional space
■ 3. minimize Kullback-Leibler divergence between both distributions
using gradient descent
UMAP (Uniform Manifold Approximation and Projection):
Just like tSNE, UMAP is also a non linear algorithm for reducing the
dimensionality of data, while preserving local similarities as well as
global distances
■ Compared to tSNE, UMAP has the advantage of being more scalable
in terms of the computing power required
■ algorithm that is only used for visualization and not for clustering
1. Step: approximate manifold for data in high dimensional space
using simplicial complexes as neighborhood graph
■ 2. Step: approximate distances for data in low dimensional space
using spectral embedding
■ 3. Step: Optimize low dimensional fuzzy topology to be similar to high
dimensional fuzzy topology via fuzzy set cross entropy
Pitfalls of non-linear transformation
Cluster size is meaningless
○ Distances between clusters are meaningless
○ Patterns Can Be Misleading
Cluster analysis:
Organizing cells into clusters is typically the first result of any single-cell
analysis.
○ Clusters allow us to infer the identity of member cells.
○ Clusters group cells based on the similarity of their gene expression profiles.
○ Expression profile similarity is determined via distance metrics
○ Two approaches exist to generate cell clusters from these similarity scores:
■ clustering algorithms and
■ community detection methods
Trajectory Analysis
To capture transitions between cell identities, branching differentiation
processes, or gradual, unsynchronized changes in biological function, we
require dynamic models of gene expression.
○ Trajectory inference methods interpret single-cell data as a snapshot of a
continuous process.
○ This process is reconstructed by finding paths through cellular space that
minimize transcriptional changes between neighbouring cells.
○ The position of a cell along such a path is described by a pseudotime variable.
Because this variable is related to transcriptional distances from a root cell, it is often
interpreted as a proxy for developmental time.
Metastable States
dense regions along a trajectory indicate preferred transcriptomic state
○ can be found by plotting histograms of pseudo time coordinate
Cell level Analysis Unification
unification of cell clustering and trajectory analysis
○ representing single cell clusters as nodes, and trajectories between the
clusters as edges
RNA velocity:
The balance between unspliced and spliced mRNAs is predictive of cellular
state progression
Advanced RNA-seq Methods
Single-end
Reads are only produced from a single end
Paired-end
Reads are produced from one end of the fragment as well as another read
from the opposite end of the fragment
Multiplexing
○ Individual 'barcode' sequences are added to each sample
RPKM and FPKM
Count the total reads in a sample and divide it by 1M => "per million" scaling factor
○ Divide the read counts by the "per million" scaling factor to normalize for
sequencing depth, giving you reads per million (RPM)
○ Divide the RPM values by the length of the gene, in kilobases. This gives you
RPKM
○ If you have paired-end reads, count fragments (read pairs) instead of individual reads to get FPKM
TPM
Divide the read counts by the length of each gene in kilobases => reads per
kilobase (RPK)
○ Count up all the RPK values in a sample and divide it by 1M
○ Divide the RPK values by the "per million" scaling factor
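The RPKM and TPM recipes above differ only in the order of the two normalisation steps. A small sketch with made-up counts and gene lengths (gene names and numbers are purely illustrative):

```python
def rpkm(counts: dict, lengths_bp: dict) -> dict:
    """Depth first (reads per million), then gene length (per kilobase)."""
    per_million = sum(counts.values()) / 1e6
    return {g: (c / per_million) / (lengths_bp[g] / 1e3) for g, c in counts.items()}


def tpm(counts: dict, lengths_bp: dict) -> dict:
    """Gene length first (reads per kilobase), then depth (per million RPK)."""
    rpk = {g: c / (lengths_bp[g] / 1e3) for g, c in counts.items()}
    per_million = sum(rpk.values()) / 1e6
    return {g: v / per_million for g, v in rpk.items()}


toy_counts = {"geneA": 10, "geneB": 20, "geneC": 30}           # made-up reads
toy_lengths = {"geneA": 2000, "geneB": 4000, "geneC": 1000}    # made-up bp

print(rpkm(toy_counts, toy_lengths))
print(tpm(toy_counts, toy_lengths))
print(sum(tpm(toy_counts, toy_lengths).values()))  # TPM sums to one million
```

Because TPM values always sum to the same total in every sample, proportions are easier to compare than with RPKM, whose sum varies from sample to sample.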
Why are TPM and FPKM considered within-sample measures of gene expression?
Name two between-sample biases they do not capture
RNA composition: the relative abundance of different types of RNA molecules
in a sample
○ Library efficiency, protocol, tissue differences
Differential Expression analysis:
Differential expression analysis is a statistical method used to identify genes
that show significant changes in expression between different conditions or
groups in RNA-Seq data.
Why not use a t-test on the read counts?
Bias in the data (sequencing depth, library efficiency, etc. need to be
addressed)
■ TPM does not address differences in library efficiency, protocol, tissue
differences
■ sample size is commonly too low to estimate variance
○ Solution: Borrow information across genes! -> DESeq2
RNA-seq data follows a negative binomial distribution
Count processes in general are described by a Poisson distribution where
mean and variance are equal.
○ not the case for RNA-seq counts, a phenomenon known as overdispersion
○ The negative binomial distribution takes this into account through a dispersion
parameter alpha
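A short numeric illustration of overdispersion, assuming the parameterisation used by DESeq2, where the variance of a negative binomial count with mean mu and dispersion alpha is mu + alpha * mu^2. The mean and alpha values are arbitrary examples.

```python
def poisson_variance(mean: float) -> float:
    """Poisson: variance equals the mean."""
    return mean


def negative_binomial_variance(mean: float, alpha: float) -> float:
    """Negative binomial: extra variance grows with the square of the mean."""
    return mean + alpha * mean ** 2


mean_count = 100.0  # hypothetical mean read count of a gene
for alpha in (0.0, 0.01, 0.1):
    print(alpha, poisson_variance(mean_count),
          negative_binomial_variance(mean_count, alpha))
```

With alpha = 0 the negative binomial variance reduces to the Poisson case; any positive alpha gives a variance larger than the mean.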
What is the source of overdispersion?
overdispersion: higher variance than Poisson
○ transcript is present at slightly different levels in each sample
DESeq2
read count model: typically estimated by the median-of-ratios method to
account for compositionality and sequencing depth.
○ dispersion shrinkage
■ 1. Treat each gene separately and estimate
dispersion (black points)
■ 2. Fit a smoothing curve (red)
■ 3. Shrink gene specific dispersion towards the
expected dispersion to get more robust results using
the Bayes approach
DESeq2:
Strength of shrinkage depends on:
How close are dispersion values to the fit?
■ Degrees of freedom, shrinkage is reduced for larger samples numbers
log fold change (LFC):
common difficulty in the analysis of HTS data is the strong variance of
LFC estimates for genes with low read count
short read vs long reads
Short read RNA-Seq technology enables accurate quantification of
exons/events
○ Long read RNA-Seq technology captures the correct full-length isoforms
STAR
Seed searching
STAR aligns reads and searches for the longest sequence that
matches one or more locations on the reference genome, known as
Maximal Mappable Prefixes (MMPs)
STAR
extend MMPs
If STAR does not find an exact matching sequence for each part of the
read due to mismatches or indels, the previous MMPs will be
extended
soft clipping
■ If extension does not give a good alignment, then the poor quality or
adapter sequence will be soft clipped
cluster, stitching, scoring
separate seeds are stitched together to create a complete read by first
clustering the seeds together based on proximity
Why so fast?
uncompressed suffix arrays
splAdder
Source:
An initial graph G out of a genome annotation
■ A list of junctions from RNA-Seq
Add novel cassette exons into the graph G
■ Add novel intron retentions into the graph G
■ Add novel intron edges into the graph G
■ An augmented graph Ĝ
○ Don’t compare result values from different methods
Advantages of Pseudo alignment
allows for a faster estimation of transcript abundance than traditional
alignment-based methods, without losing too much accuracy
○ Can classify diseases, understand expression changes, track cancer
progression
kallisto
input: transcriptome, RNA-seq reads
○ output: Kallisto Index, Quantification
○ first step: build Kallisto Index
■ split transcriptome into k-mers
■ build de Bruijn Graph from k-mers, while remembering from which
transcripts each k-mer stems (k-compatibility class)
■ define linear section with the same k-compatibility class as a contig
■ build a Kallisto Index: Hashmap that has each k-mer as key and as
value the contig it is in and the position on the contig
○ second step: pseudo alignment of reads
■ each read is split into k-mers
■ look up first k-mer of read in Kallisto index
■ from position of k-mer in contig, calculate which k-mer in read would
be the last in same contig
■ look up this k-mer to double check
■ look up next k-mer in new contig
■ intersect the k-compatibility class of all k-mers, that were looked up
○ last step: quantification:
■ now we work with equivalence class instead of k-compatibility class
■ equivalence class is the set of transcripts a read could stem from
■ use Expectation-Maximization algorithm to find the best possible
transcript abundances that explains the equivalence class counts
■ not mentioned in lecture: EM is repeated with bootstrap which allows
an estimate of the mean standard error
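The core idea of pseudoalignment (intersecting k-compatibility classes) can be sketched in a few lines. This is a strongly simplified toy version and not how kallisto is implemented: there is no de Bruijn graph, no contigs and no k-mer skipping, every k-mer of the read is looked up, and a read with an unknown k-mer is simply discarded. Transcript names and sequences are made up.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> list:
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts: dict, k: int) -> dict:
    """Map each k-mer to the set of transcripts containing it
    (its k-compatibility class)."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def pseudoalign(read: str, index: dict, k: int) -> set:
    """Intersect the k-compatibility classes of all k-mers in the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer)
        if not hits:  # k-mer does not occur in any transcript
            return set()
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


toy_transcripts = {"t1": "ACGTACGG", "t2": "ACGTTTGG"}  # hypothetical
toy_index = build_index(toy_transcripts, k=3)

print(pseudoalign("ACGT", toy_index, k=3))  # shared prefix: both transcripts
print(pseudoalign("GTAC", toy_index, k=3))  # only in t1
print(pseudoalign("CGTT", toy_index, k=3))  # only in t2
```

The resulting sets are the equivalence classes whose counts would then go into the EM step.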
Techniques to quantify cells: flow cytometry
flow cytometry: characterize and define different cell types in a
heterogeneous cell population by analysing the expression of cell surface and
intracellular molecules
○ Immunohistochemistry (IHC staining): Immunohistochemistry uses antibodies
to bind to proteins in tissues. Using layers of coloured/fluorescent complexes,
it is possible to visualize location of specific cell types.
Deconvolution Problem
The deconvolution problem involves separating or extracting individual
components from a mixed signal or observation.
○ It requires estimating the contributions or proportions of each component to
recover the original sources or signals from the observed data
miRNA
microRNA
○ small (19-22 nt) non-coding RNA that downregulates gene expression by
binding in the 3’ UTR, acting in a complex with an Argonaute protein
siRNA
small interfering RNA
○ small (21-24 nt) non-coding RNA that downregulates gene expression by
binding to RNA with complementary sequence and preventing translation
○ show perfect complementarity and high specificity
how does microrna regulate gene expression
They bind to the 3’-UTR (untranslated region) of their target mRNAs and
repress protein production by destabilizing the mRNA and translational
silencing
What can be used to computationally predict miRNA binding
miRNA seed pairing, Hybridization energy, Conservation of miRNA binding
sites, Accessibility of binding sites
what is Competing endogenous RNAs
“In conclusion, we hypothesize that crosstalk between RNAs, both coding and
noncoding, through MREs, forms large-scale regulatory networks across the
transcriptome” (from: A ceRNA Hypothesis: The Rosetta Stone of a Hidden RNA
Language?, Cell)
○ RNAs with shared miRNA binding sites (MREs, miRNA response elements) are “competing for miRNA”
What is SPONGE
can be used to infer genome wide ceRNA interaction networks for biomarker
discovery
○ The competing endogenous RNA (ceRNA) hypothesis suggests that mRNAs
that possess binding sites for the same miRNAs are in competition. This
motivates the existence of so-called sponges, i.e., genes that exert strong
regulatory control via miRNA binding in a ceRNA interaction network
○ Difference to previous methods: considering combinatorial effect of several
miRNAs
Steps of SPONGE
Input: Gene Expression, miRNA Expression, TargetScan (db that predicts
targets of miRNA), miRcode (db for microRNA target sites)
○ Gene miRNA seed match and regression filter (many false positives are
filtered out by considering tissue specificity)
○ for every gene pair with shared miRNA calculate mscor (multiple
miRNA sensitivity correlation, represents how much their co-expression can
be explained by miRNA)
○ Null model based significance analysis, where the null hypothesis is that miRNAs do not
affect the correlation between two genes, i.e. mscor=0
○ Network Construction by connecting significant genes
Why are proteins of interest
Proteins execute and control the vast majority of biological processes
What is proteomics
large scale analysis of proteins
What are the types of questions proteomics may address
Identification of potential drug targets
○ Investigation of disease mechanisms
○ Comparison of normal and diseased tissues
○ Protein expression across the cell cycle
○ Characterisation of cell types and tissues
○ Isolation of protein complexes and interaction networks
○ Investigation of biochemical pathways
What are amino acids
Amino acids are organic compounds that contain amino (–NH2) and carboxyl
(–COOH) functional groups, along with a side chain (R group) specific to
each amino acid
○ encoded directly by triplet codons
Why and how are proteins digested
Digestion into peptides is required because the downstream separation and
mass spectrometry cannot handle complex mixtures of intact proteins!
○ protease Trypsin cleaves after K and R
○ Very stable and efficient
○ Modified trypsin to overcome autolysis problems and “proline problem”
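An in-silico version of the digestion step, as a minimal sketch. It assumes the commonly used rule that trypsin cleaves after K and R but not when the next residue is proline (taken here as the meaning of the “proline problem”, which is an assumption), and it ignores missed cleavages. The protein sequence is made up.

```python
def trypsin_digest(protein: str) -> list:
    """Cleave after K or R, except when the following residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        followed_by_proline = (not is_last) and protein[i + 1] == "P"
        if residue in "KR" and not followed_by_proline:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):  # C-terminal peptide without K/R at its end
        peptides.append(protein[start:])
    return peptides


toy_protein = "MKTAYRPGLKE"  # hypothetical sequence
print(trypsin_digest(toy_protein))
```

The R in "RP" is not cleaved, so the middle peptide keeps both residues.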
What is elution
washing out substances from a solid material using a liquid or a gas
Liquid vs. Gas Chromatography
What is the difference between online and offline chromatography
In offline chromatography, the sample is first loaded onto the column and
then the column is disconnected from the system to be eluted. The eluted
sample is then collected in fractions for further analysis.
○ in online chromatography, the sample is loaded onto the column and then
eluted while still connected to the system. The eluted sample is then analysed
in real-time.
How can you separate peptides with the same mass but different
sequences/structure
They will have a different retention time (the time the analyte needs to pass
through the chromatographic column)
○ The mass spec will measure the mass of the analytes at different retention
times
What are the most important parts of mass spectrometer
A mass spectrometer is an instrument that measures the m/z of individual
ions
○ ionizer: ionizing the peptides we will analyse
○ Mass analyser: measures moving charges and determines m/z by recording
the frequency with which ions are oscillating along the z axis
○ Detector: detecting quantity
What are the different kind of masses
Nominal (or integer) mass: the sum of the integer masses of all elements in a
molecule
○ Chemical or average mass: the sum of the masses of all stable isotopes of all
elements in a molecule weighted by their natural abundance
○ Monoisotopic mass: the sum of the masses of the most abundant stable
isotopes of all elements in a molecule
Calculate the mass of Protein TEST
○ First add up the masses of the amino acids
○ 't': 119.06, 'e': 147.05, 's': 105.04 -> 490.210
○ subtract one H2O for every peptide bond -> 490.210-3*(2*1.0078+15.9949)
○ neutral mass done: 436.1785
○ for charged mass add the mass of one proton per charge
○ for m/z divide by the charge
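The calculation above as a small script. The amino acid, H and O masses are the ones from the notes; the proton mass of 1.00728 is an added assumption since the notes do not state it. With these inputs the neutral mass of TEST evaluates to 490.21 - 3 * 18.0105 = 436.1785.

```python
# Masses of the free amino acids (values from the notes).
FREE_AMINO_ACID_MASS = {"T": 119.06, "E": 147.05, "S": 105.04}
H, O = 1.0078, 15.9949
WATER = 2 * H + O
PROTON = 1.00728  # assumption: not given in the notes


def neutral_mass(peptide: str) -> float:
    """Sum of free amino acid masses minus one water per peptide bond."""
    total = sum(FREE_AMINO_ACID_MASS[aa] for aa in peptide)
    return total - (len(peptide) - 1) * WATER


def mz(peptide: str, charge: int) -> float:
    """Add one proton per charge, then divide by the charge."""
    return (neutral_mass(peptide) + charge * PROTON) / charge


print(round(neutral_mass("TEST"), 4))
print(round(mz("TEST", 1), 4))
print(round(mz("TEST", 2), 4))
```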
calculating y and b fragments
y: reverse peptide, calculate mass, add one proton and one H2O
○ b: calculate mass, add one proton
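A sketch of the b- and y-ion recipe for singly charged fragments. It works on residue masses (amino acid minus water); the monoisotopic values for T, E and S as well as the water and proton masses are standard values added here, not taken from the notes. The full-length peptide is left out of both series, which is a choice made for this example.

```python
RESIDUE_MASS = {"T": 101.04768, "E": 129.04259, "S": 87.03203}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide: str) -> list:
    """m/z of singly charged b-ions: N-terminal prefixes plus one proton."""
    masses, running = [], 0.0
    for aa in peptide[:-1]:
        running += RESIDUE_MASS[aa]
        masses.append(running + PROTON)
    return masses


def y_ions(peptide: str) -> list:
    """m/z of singly charged y-ions: C-terminal suffixes (peptide read in
    reverse) plus one water and one proton."""
    masses, running = [], 0.0
    for aa in reversed(peptide[1:]):
        running += RESIDUE_MASS[aa]
        masses.append(running + WATER + PROTON)
    return masses


print([round(m, 4) for m in b_ions("TEST")])  # b1, b2, b3
print([round(m, 4) for m in y_ions("TEST")])  # y1, y2, y3
```

A useful sanity check: a b-ion and its complementary y-ion add up to the neutral peptide mass plus two protons.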
what are ms1 and ms2 scans
MS1 is the first stage of mass analysis, where the mass range of interest is
scanned and the precursor ions are selected.
○ MS2 is the second stage of mass analysis, where the selected precursor ions
are fragmented into smaller product ions and detected
What is the use of isotopes?
identify and quantify elements in sample by measuring m/z
○ identify biological and chemical origin of a protein
What is Tandem-MS?
peptides get fragmented, resulting in fragments that go either from the start or
the end of the peptide to the cut
○ the different fragments make it possible to reconstruct the sequence
Standard bottom-up workflow?
Extract protein and prepare it
○ Digest Protein with trypsin
○ Separate peptides by certain properties (e.g. hydrophobicity)
○ MS1 to get relative abundance from analytes
○ MS2 to get sequences of analytes
○ Peptide identification and quantification
○ Protein list or Matrix
○ Functional bioinformatics analysis
What is Mass Spectrum?
A 2-dimensional plot with the mass-to-charge ratio on the abscissa and the
abundance on the ordinate representing the abundance of chemical (ionized)
compounds in a mixture or sample solution as peaks
What is Mass spectrometry?
A mass spectrometer is an instrument that measures the masses of individual
molecules (or fragments thereof) that have been converted to ions
What is multiplexing?
Without multiplexing, only one sample can be measured per mass spectrometry run
○ Multiplexing is a method by which multiple signals are combined into one
signal over a shared medium
○ SILAC utilizes cell culture media containing lysines and arginines
which contain either exclusively light (normal) or exclusively heavy isotopes of C and N (13C, 15N), shifting the mass
spectrum of the labelled sample
What is an Orbitrap
○ mass analyser and detector in one
○ outer electrode and central spindle
○ Ions are injected and oscillate stably around spindle
○ 3-dimensional movement of ions within orbitrap
○ The right-to-left movement (its frequency) is used to determine the m/z (using the
Fourier transformation)
○ resolution depends on the number of oscillations
Deriving peptide masses from small graphs
Trypsin (largely) generates peptides ending in K and R -> look for the
corresponding mass
○ Check all following peaks if they have a mass difference of a known amino
acid residue
De Novo Peptide Sequencing
o Looks at the MS2 spectrum and computes all pairwise differences between
the peaks, builds a graph from them from which all possible sequences can
be read, and outputs a list of candidate sequences
o Problems:
▪ Peptide fragmentation does not always result in detectable ions ->
missing y- and b-ions in spectrum
▪ Intensity of ions is not “predictable” (but reproducible)
▪ There might be more than one interpretation of the spectrum
▪ Not really used anymore unless there is no proteomics data for the
organism
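A minimal sketch of the graph idea, assuming the peaks form one ion ladder read in the direction of increasing mass. The function names, the tolerance and the example peak list are hypothetical. Peaks are nodes, an edge is drawn whenever the difference between two peaks matches a residue mass, and every path from the first to the last peak is one candidate sequence.

```python
# Monoisotopic residue masses in Da (L also stands for the isobaric I).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def spectrum_graph(peaks, tol=0.02):
    """Connect peak i to peak j (i < j) if their difference is a residue mass."""
    peaks = sorted(peaks)
    edges = {i: [] for i in range(len(peaks))}
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            delta = peaks[j] - peaks[i]
            for aa, mass in RESIDUE_MASS.items():
                if abs(delta - mass) <= tol:
                    edges[i].append((j, aa))
    return peaks, edges


def candidate_sequences(peaks, tol=0.02):
    """Enumerate every path from the lowest to the highest peak."""
    if not peaks:
        return []
    peaks, edges = spectrum_graph(peaks, tol)
    last = len(peaks) - 1
    results = []

    def walk(node, seq):
        if node == last:
            results.append(seq)
            return
        for nxt, aa in edges[node]:
            walk(nxt, seq + aa)

    walk(0, "")
    return sorted(results)


# Hypothetical ladder with an arbitrary start offset of 100.0
print(candidate_sequences([100.0, 157.02146, 214.04292, 285.08003]))
# ['GGA', 'GQ', 'NA']
```

The example shows the ambiguity problem from the list above: G+G has the same mass as N, and G+A the same mass as Q, so one spectrum yields three interpretations.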
What is the difference between ion traps and Orbitraps?
Both are mass analysers that measure m/z
○ Orbitraps have a high resolution at low m/z, but low resolution at high m/z
○ ion traps have constant resolution
When and how do you apply tolerance in Da
constant resolution (ion traps)
○ simply subtract and add the tolerance to the neutral mass to get the lower
and upper bound
○ example 0.5 Da: 464.7347 (2+) -> neutral mass (464.7347 - 1.0073) * 2 =
927.45 -> 926.95-927.95
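The example above as a small sketch. The function names are made up for illustration; the proton mass of 1.007276 Da is the standard value.

```python
PROTON = 1.007276  # mass of a proton in Da


def neutral_mass(mz, charge):
    """Convert an observed m/z and charge state to the neutral mass."""
    return (mz - PROTON) * charge


def window_da(mass, tol_da):
    """Absolute tolerance: the same window width at every mass."""
    return mass - tol_da, mass + tol_da


mass = round(neutral_mass(464.7347, 2), 2)
low, high = window_da(mass, 0.5)
print(mass, round(low, 2), round(high, 2))  # 927.45 926.95 927.95
```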
When and how do you apply tolerance in ppm
non-constant resolution (Orbitraps)
○ divide the neutral mass by 1,000,000 and multiply by the ppm value (here 15).
The result is the tolerance in Da to be applied
○ example 15 ppm: 464.7347 (2+) -> 927.45 +/- 15 ppm -> 927.45 +/-
(927.45 / 1M) * 15 -> 927.45 +/- 0.014
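The same calculation as a sketch (function names are assumptions). Unlike a Da tolerance, the window grows with the mass.

```python
def ppm_to_da(mass, ppm):
    """Relative tolerance in ppm converted to an absolute tolerance in Da."""
    return mass / 1_000_000 * ppm


def within_ppm(observed, theoretical, ppm):
    """True if the observed mass lies inside the ppm window around the theoretical one."""
    return abs(observed - theoretical) <= ppm_to_da(theoretical, ppm)


print(round(ppm_to_da(927.45, 15), 3))   # 0.014
print(within_ppm(927.46, 927.45, 15))    # True  (0.01 Da off)
print(within_ppm(927.47, 927.45, 15))    # False (0.02 Da off)
```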
How does database search work? / How to build a database search engine to
identify proteins from mass spectra
Calculate the neutral peptide mass
○ Search the database of all in silico digested proteins for peptides with
matching mass (apply the resolution-dependent tolerance)
○ Create a theoretical spectrum for each candidate (fragmentation and
therefore intensity can be predicted)
○ Compare the theoretical spectrum against the experimental one and match
peaks
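A toy version of these steps, as a sketch under simplifying assumptions: trypsin cleaves after K/R but not before P, only singly charged b- and y-ions are generated, intensities are ignored, and the score is just the number of matched fragment peaks. All function names and the tiny protein database are hypothetical.

```python
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def digest(protein):
    """In silico tryptic digest: cut after K or R unless P follows."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragment_mzs(peptide):
    """Theoretical singly charged b- and y-ion m/z values (no intensities)."""
    frags = []
    for i in range(1, len(peptide)):
        b = sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON
        y = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
        frags += [b, y]
    return frags


def search(precursor_mass, observed_mzs, proteins, prec_ppm=15, frag_tol=0.02):
    """Return (matched peaks, peptide, protein) tuples, best match first."""
    hits = []
    for name, sequence in proteins.items():
        for pep in digest(sequence):
            mass = peptide_mass(pep)
            if abs(mass - precursor_mass) > mass * prec_ppm / 1e6:
                continue  # precursor mass outside the tolerance window
            matched = sum(
                any(abs(theo - obs) <= frag_tol for obs in observed_mzs)
                for theo in fragment_mzs(pep)
            )
            hits.append((matched, pep, name))
    return sorted(hits, reverse=True)


proteins = {"P1": "MKGAVLR", "P2": "SSTEK"}  # hypothetical database
observed = fragment_mzs("GAVLR")[:4]         # pretend only 4 ions were detected
print(search(peptide_mass("GAVLR"), observed, proteins))
# [(4, 'GAVLR', 'P1')]
```

A real engine would also compare intensities and handle missed cleavages, modifications and higher fragment charge states.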
How do you estimate the false discovery rate
What is protein inference
As input we have proteins that are digested and measured in a mass
spectrometer
○ Then we try to identify the peptides that make up the spectrum (find ~25%)
○ Try to infer proteins from peptides
what is spectral library searching
○ a technique used in mass spectrometry to identify unknown compounds by
comparing their spectra to a library of known spectra
What are the advantages and disadvantages of database (DB) searching and spectral library searching
How to compare spectra
each peak is a tuple (m/z, intensity), a spectrum is a list of tuples
○ m/z of two spectra are never completely the same (uncertainty because both
come from experiments) → use a tolerance to align m/z (we then have one m/z
list and two lists of intensities; if there is no peak at a certain m/z in
one spectrum, its intensity there is 0; now we can compare)
○ Different questions to ask: are spectra identical? Is one spectrum contained in
another?
○ Normalization: L2 norm (p = 2): all vectors have the same length of 1
○ Similarity measurement: spectral angle, Pearson correlation, many other
options though
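The comparison steps can be sketched as follows: align by m/z with a tolerance, L2-normalize, then compute the normalized spectral angle SA = 1 - 2*arccos(a·b)/π (1 = identical, 0 = nothing in common). The function names, the greedy alignment and the 0.02 tolerance are assumptions.

```python
import math


def align(spec_a, spec_b, tol=0.02):
    """Build two intensity vectors over a shared m/z axis; missing peaks get 0."""
    spec_b = sorted(spec_b)
    used = set()
    vec_a, vec_b = [], []
    for mz_a, int_a in sorted(spec_a):
        match = None
        for j, (mz_b, _) in enumerate(spec_b):
            if j not in used and abs(mz_b - mz_a) <= tol:
                match = j
                break
        vec_a.append(int_a)
        vec_b.append(spec_b[match][1] if match is not None else 0.0)
        if match is not None:
            used.add(match)
    for j, (_, int_b) in enumerate(spec_b):
        if j not in used:  # peak only present in spectrum b
            vec_a.append(0.0)
            vec_b.append(int_b)
    return vec_a, vec_b


def l2_normalize(vec):
    """Scale a vector to length 1 (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def spectral_angle(spec_a, spec_b, tol=0.02):
    """Normalized spectral angle between two peak lists."""
    vec_a, vec_b = align(spec_a, spec_b, tol)
    vec_a, vec_b = l2_normalize(vec_a), l2_normalize(vec_b)
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return 1 - 2 * math.acos(dot) / math.pi


a = [(100.0, 10.0), (200.0, 5.0)]
b = [(100.01, 20.0), (200.005, 10.0)]  # same pattern, twice the intensity
print(round(spectral_angle(a, b), 3))  # 1.0
```

Because of the normalization, only the relative intensity pattern matters, not the absolute signal.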
How to create decoys for spectral library searching
For peptides, one can use pre-identified spectra and “move” annotated peaks
(e.g. y and b fragment ions) around, keeping the intensity of the original
fragments and the retention time of the original peptide identification
○ not easy for metabolites
What is feature finding
Map: Three dimensional data set (RT, m/z, intensity) containing the MS signal
from one LC MS run
○ Feature: The sum of all the MS signals caused by the same analyte/peptide
○ Feature finding: Finding the set of features explaining as much of the signal in
a map as possible.
What are different ways to detect Features
finding local maxima + persistence (a persistence threshold reduces the
number of maxima; its value has to be guessed)
○ Fitting distributions: in m/z and intensity, peptide signals follow a normal
distribution → fit many normal distributions (guess how many distributions)
○ Convolutions: slide a normal distribution along a spectrum and check whether
a signal of this shape is present (computationally intensive), a bit like a
Fourier transform
○ Continuous wavelet transform: in several runs a wavelet (Mexican hat
wavelet) of a certain size slides over the spectrum and checks whether any
peaks fit it; the size is increased with each run to account for the
different shapes and heights of peaks (done in the retention time and m/z
dimensions)
○ Averagine model: (in RT dim) calculate the average amino acid; how many of
these average amino acids do we need to make up the mass of a peptide? →
what does the isotope distribution of a peptide of this mass look like? This
tells us where the peaks will probably be; then check whether matching peaks
can be found
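A minimal sketch of the first approach (local maxima plus a threshold that removes small maxima). Peak prominence is used here as a simple stand-in for persistence; that substitution, the function names and the example trace are assumptions.

```python
def peak_prominence(y, i):
    """Height of peak i above the higher of its two surrounding valleys.

    The valleys are searched outwards until a point higher than the peak
    (or the end of the trace) is reached.
    """
    left_min = y[i]
    j = i - 1
    while j >= 0 and y[j] <= y[i]:
        left_min = min(left_min, y[j])
        j -= 1
    right_min = y[i]
    j = i + 1
    while j < len(y) and y[j] <= y[i]:
        right_min = min(right_min, y[j])
        j += 1
    return y[i] - max(left_min, right_min)


def find_peaks(y, min_prominence=0.0):
    """Indices of local maxima whose prominence reaches the threshold."""
    peaks = []
    for i in range(1, len(y) - 1):
        if y[i - 1] < y[i] >= y[i + 1]:
            if peak_prominence(y, i) >= min_prominence:
                peaks.append(i)
    return peaks


trace = [0, 2, 10, 3, 4, 3, 0, 6, 0]  # hypothetical intensity trace
print(find_peaks(trace))                    # [2, 4, 7]
print(find_peaks(trace, min_prominence=2))  # [2, 7]
```

The small bump at index 4 sits on the shoulder of the main peak and disappears once the threshold is raised, which is the effect described above: fewer maxima, but the threshold value has to be guessed.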
What is indexed retention time
What is the goal of linking features
Goal: finding out whether some peptides go up or down in different conditions
(linking several different experiments)
○ Find features in healthy and disease maps
○ Align maps
○ link corresponding features
■ Problem: Features might not be identified in all measurements
■ With perfect measurement we could transfer identifications, by
matching m/z, isotope envelope, charge, retention time and filling up
missing values -> match between runs
■ But difficult because of fluctuating retention time
■ normalize with indexed retention time
■ Alternatively: fitting a line or curve to retention time between different
measurements
■ Dynamic time warping, which aligns two or more time series (retention
times)
■ Bidirectional best hits: if feature 1 is the most similar to feature 2 and
the other way around, they are bidirectional best hits
○ Identify features
○ Quantify
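The bidirectional best hit rule from the list above can be sketched like this. A feature is reduced to a hypothetical (m/z, retention time) tuple, and the distance function with its two scale factors is an assumption. A real tool would also use charge and isotope envelope and align retention times first.

```python
def distance(f, g, mz_scale=0.01, rt_scale=30.0):
    """Dissimilarity of two features, each given as (m/z, retention time in s)."""
    return abs(f[0] - g[0]) / mz_scale + abs(f[1] - g[1]) / rt_scale


def best_hit(feature, candidates):
    """Index of the candidate most similar to the given feature."""
    return min(range(len(candidates)), key=lambda j: distance(feature, candidates[j]))


def bidirectional_best_hits(map_a, map_b):
    """Pairs (i, j) where a[i] and b[j] are each other's best hit."""
    if not map_a or not map_b:
        return []
    pairs = []
    for i, feature in enumerate(map_a):
        j = best_hit(feature, map_b)
        if best_hit(map_b[j], map_a) == i:
            pairs.append((i, j))
    return pairs


healthy = [(500.250, 1200.0), (612.800, 1500.0), (700.100, 1800.0)]
disease = [(500.251, 1215.0), (612.801, 1490.0)]
print(bidirectional_best_hits(healthy, disease))  # [(0, 0), (1, 1)]
```

The third feature of the first map has no partner: its best hit in the second map prefers another feature, so it stays unlinked and would be a candidate for match between runs.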
Computational problem in mass spectrometry when having many proteins
Scalability: As the number of proteins increases, so does the size and
complexity of the search space and the computational resources required to
perform the analysis.
○ Ambiguity: Many proteins share similar or identical peptide sequences, which
can make it difficult to assign a unique protein identity to a given mass
spectrum
○ Combinatorial complexity: When analysing complex mixtures of proteins,
such as whole proteomes or protein complexes, there can be many possible
combinations of peptides that explain a given mass spectrum
Deep learning in MS
PointIso - Detecting peptide features
○ Prosit – peptide fragment intensity prediction
○ DeepLC – peptide retention time prediction
○ Metabolite retention time prediction
○ MS2DeepScore – spectrum similarity learning
○ Casanovo – peptide de novo sequencing
Q&A session