Exam questions

Buffl

BioInfo

by Katerina M.

what is a domain in a protein?

a 3D unit that folds on its own

FAIR

○ Findable

○ Accessible

○ Interoperable

○ Reusable

Define protein structure & protein function (each max 3 bullets) 3 pts

○ Function = anything that happens to or through a protein

○ sequence defines structure; protein structure is the shape into which a protein folds in its natural ‘unbothered’ state

○ structure defines function

○ protein function is diverse, can not be easily defined or classified

○ Functions: Defense (e.g. antibodies), Structure (e.g. collagen), Enzymes –

metabolism, catabolism, Communication / Signaling (e.g. insulin), Ligand

binding / Transport (e.g. hemoglobin), Storage (e.g. ferritin)

Protein structure in 1D, 2D, 3D: describe (each max 1 bullet) 3 pts

○ 1D: sequence or string of secondary structure states

○ 2D: inter-residue distances, visualized as a graph (distance map)

○ 3D: coordinates of protein structure and visualization as a picture

Speculate: why do most proteins have more than one structural domain?

Domains are parts of proteins that have specific functions or structures.

Proteins can have multiple domains to perform complex functions or create

new functions by recombining domains.

Experimental high-resolution structures are known for fewer than 10% of the ~20k

human proteins. Give three reasons (3 bullets) 3pts.

○ expensive

○ time intensive

○ many proteins are hard to extract and/or analyse

Most methods predicting secondary structure in 3 states (helix, strand, other) predict

strands much worse than helix. Comment in fewer than 5 bullet points (telegram

style; 5 pts).

○ There is much more data on helices than on strands

○ This imbalance in the data leads to helices being predicted better

○ Balancing the classes with over/undersampling leads to both classes being

predicted similarly well

○ helices are local, sheets do not have to be

Helices are stabilized by hydrogen-bond formation, typically between residues i and i+4. Does this mean that secondary structure is a 2D feature of structure? (≤2 bullets/sentences; 2 pts)

Secondary structure is not a 2D but a 1D feature of structure: each residue position is assigned one of three states (H: helix, E: strand, O: other), so the whole protein is described by a string, which is one-dimensional.

Q3 measures the percentage of residues correctly predicted in one of 3 states (H: helix, E: strand, O: other). Methods predicting H|E|O equally well may have a lower Q3 than those predicting O best. Why? (1 bullet, 2 pts)

The majority of residues are in state O, so being better at O means more correct predictions overall.
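
The effect of class imbalance on Q3 can be sketched with a toy example. The function name `q3` and the two ten-residue prediction strings are made up for illustration; they are not from the lecture.

```python
def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted state (H, E or O) is correct."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

# Hypothetical toy protein: 6 residues in O, 2 in H, 2 in E.
observed = "OOOOOOHHEE"

# Method A predicts only O: perfect on O, useless on H and E.
pred_o_best = "OOOOOOOOOO"

# Method B gets exactly half of each state right (3/6 O, 1/2 H, 1/2 E).
pred_balanced = "OOOHEEHOEO"

print(q3(observed, pred_o_best))    # 60.0
print(q3(observed, pred_balanced))  # 50.0
```

Although method B is equally good on all three states, its Q3 is lower, because the most frequent state dominates the per-residue average.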

Predicting secondary structure through one-hot-encoding implies that each residue

position is represented by 21 binary units. Why not 20? What else could you do?

Speculate why your alternative has not been used in existing methods (5 pts).

In one-hot encoding each amino acid is represented by a binary unit vector of length n, containing a single one and n-1 zeros ([1,0,0,...,0] for one amino acid). When predicting secondary structure through one-hot encoding one uses 21 binary units because the additional symbol "." encodes "non protein unit", which can be used for padding
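
A minimal sketch of this encoding, assuming a sliding input window around the residue of interest. The alphabet order and the helper names `one_hot` and `encode_window` are chosen here for illustration only.

```python
# 20 standard amino acids plus "." as the 21st symbol for "non protein unit".
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "."

def one_hot(symbol: str) -> list:
    """Binary vector of length 21 with a single 1 at the symbol's index."""
    vec = [0] * len(ALPHABET)
    vec[ALPHABET.index(symbol)] = 1
    return vec

def encode_window(sequence: str, center: int, half_width: int) -> list:
    """Concatenate one-hot vectors for a window; pad with '.' past the ends."""
    encoded = []
    for pos in range(center - half_width, center + half_width + 1):
        symbol = sequence[pos] if 0 <= pos < len(sequence) else "."
        encoded.extend(one_hot(symbol))
    return encoded

# Window of 3 residues around the first residue of "TEST":
# the left neighbour does not exist and is encoded as padding.
window = encode_window("TEST", center=0, half_width=1)
```

With 20 units there would be no way to tell the network that a window position lies outside the protein.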

Why do methods using evolutionary information perform better than those using one-

hot encoding (≤2 bullets/sentences; 2 pts).

Evolution conserved the regions critical for the right structure, so using this information avoids noise and focuses on the regions relevant for the structure (compared to just using the amino acids in a nearby window)

○ MSAs can be used to generate Position-Specific Scoring Matrices (PSSMs), which can be taken as input for training and have demonstrated great potential in predicting protein structure and function.

What do you need to consider in the comparison of methods predicting, e.g.

secondary structure (you get a table and have to choose the number of digits, rank

methods, note issues, etc.).

You need to consider the error rate, how much the scores of the methods actually differ, and the confidence interval

○ dataset & testing

○ how good is a random prediction

○ runtime

THE breakthrough in protein prediction originated from using evolutionary information

(originally in 1992 to predict secondary structure). Where do you get evolutionary

information from (≤2 bullets/sentences; 2 pts).

Multiple sequence alignments (MSAs), built e.g. from database searches with tools like BLAST

○ MSAs are a way of encoding evolutionary information by aligning several

similar proteins

○ MSAs can be used to generate Position-Specific Scoring Matrices (PSSMs)

○ PSSMs can be taken as input for training and have demonstrated great potential in predicting protein structure and function

What should you do if your score on the test set is better than the score on the

training set?

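This should not happen, as the ‘top score possible’ on the training set is the holy grail of knowledge for the model. Most likely there is a mistake (e.g. swapped or overlapping sets, or a test set that is too small or too easy), so check the data split and the evaluation and, if in doubt, start from scratch.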

What are the major improvements/breakthroughs of AlphaFold?

○ it also predicts the quality of its predictions

○ Feature preprocessing: The input protein sequence is used to search for

similar sequences and known structures in large databases, using tools like

MMseqs2 and HHsearch. The results are used to generate multiple sequence

alignments and templates that capture the evolutionary and structural

information of the protein.

○ The multiple sequence alignments and templates are fed into a deep neural

network that predicts the distances and angles between pairs of amino acids,

as well as the confidence of these predictions. These predictions are then

converted into 3D coordinates using a gradient descent algorithm.

○ The predicted 3D coordinates are refined by a molecular dynamics simulation

that minimizes the potential energy of the protein structure and makes it more

physically realistic.

Describe an AI (Artificial Intelligence)/ML (machine learning) method that predicts

sub-cellular location (or cellular compartment) in three classes (c: cytoplasmic, e:

extra-cellular, n: nuclear). Make sure to explain how to cope with the fact that

proteins have different lengths and AI/ML models need fixed input. 6 pts.

Train three binary classifiers (neural network) one for each sub-cellular

location (cytoplasmic c, extra-cellular e, nuclear n) and connect them into one

prediction scheme.

○ To cope with the fact that proteins have different lengths and AI/ML models

need fixed input we use the amino acid composition as input because

proteins have intrinsic signals that govern the transport and localization in the

cell.

How can sequencing mistakes challenge per-protein predictions? 3 pts.

because a prediction tool can only be as good as the data on which it was trained. If those data contain sequencing mistakes, the tool will make more prediction mistakes due to misleading training data

You want to develop a method that predicts binding residues (e.g. enzymatic activity

and DNA-binding). Your entire data set of proteins with experimentally known binding

residues amounts to 500 sequence-unique proteins with a total of 5,000 binding and

45,000 non-binding residues. You can only use a simple artificial neural network (of

the feed-forward style) with one hidden layer, but the complexity of your problem

demands at least 100 hidden units. Thus, even a simple one-hot encoding (or

evolutionary information) for a single residue with 20 units is not supported by the

data. Explain why. What could you do, instead?

Rule of thumb: need 10 times more data points than free parameters

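○ number of free parameters in this model: 20*100+100*2=2200, so roughly 22,000 data points would be needed, which the 5,000 binding residues do not cover

○ Why: because these input nodes gave the best performance?

○ Instead: group amino acids with similar properties (e.g. acidic or positively charged) to reduce the number of input units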
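
The parameter count can be written out as a small sketch. The function name `free_parameters` and the example with 5 amino acid groups are hypothetical, and biases are ignored as in the count above. Comparing against the 5,000 binding residues (the minority class) is one possible reading of the question.

```python
def free_parameters(n_input: int, n_hidden: int, n_output: int) -> int:
    """Weights of a feed-forward network with one hidden layer (no biases)."""
    return n_input * n_hidden + n_hidden * n_output

RULE_OF_THUMB = 10          # data points needed per free parameter
BINDING_RESIDUES = 5_000    # minority class in the data set

# One-hot input for a single residue: 20 units, 100 hidden, 2 outputs.
params_one_hot = free_parameters(20, 100, 2)
needed_one_hot = RULE_OF_THUMB * params_one_hot

# Hypothetical grouped encoding with 5 physico-chemical groups.
params_grouped = free_parameters(5, 100, 2)
needed_grouped = RULE_OF_THUMB * params_grouped

print(params_one_hot, needed_one_hot)   # 2200 22000
print(params_grouped, needed_grouped)   # 700 7000
```

Grouping shrinks the input layer and therefore the number of samples the rule of thumb asks for, although with 100 hidden units the demand stays high.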

What is redundancy reduction? How do you do it?

○ Generally removing too similar data

○ Hard to define what to remove in practice

○ Choose a threshold depending on the problem e.g. define protein families as

being similar

Protein Language Models (pLMs) copy models from Natural Language Processing

(NLP). Those learn grammar by putting words in their context in sentences. Name

the three analogies for grammar|word|sentence in pLMs? 3 pts

Protein Language Models (pLMs) apply methods from natural language processing to capture the intrinsic language of proteins (the ‘language of life’). While a single word can have a different meaning depending on its context in a sentence, an amino acid can have a different effect depending on its surrounding residues.

○ word: amino acid

○ sentence: protein sequence

○ grammar: combinations of words (amino acids) which result in meaning

What problem do pLMs address?

Can leverage large, unlabelled datasets

○ Can find new representations automatically (data-driven) even for domains

which were hard to formalize (NLP, CB)

○ Outperforms handcrafted features in many cases

○ pLMs use end-to-end transformers to go from sequence-to-sequence. The

second to last layer of the neural network can be read to obtain a so-called

embedding which is a great numerical representation of the protein. These

embeddings can be used in machine learning tasks to predict features of a

protein.

What is the meaning of embeddings from pLMs? 3 pts.

Embeddings are a machine-readable representation of protein sequences: the sequence (text) is converted into vectors of numbers that represent relevant features or descriptors of proteins. This is an important first step to find out properties of the protein with that sequence, e.g., what other proteins it resembles (sequence comparisons through alignments), what it looks like (membrane or water-soluble, regular globular or disordered), or what it does (enzyme or not, process involved in, molecular function, interaction partners).

How can we profit from pLMs for protein prediction?

We can use them as input

○ They capture more features than expert-crafted models

What is the difference between per-residue and per-protein embeddings?

Embeddings are a machine-readable representation from protein sequences

by converting text into vectors of numbers representing relevant features.

These come in two flavors: per-residue and per-protein. While the per-residue

embeddings are taken directly out of the LMs, per-protein embeddings are

generated by post-processing the information extracted by the LM through global average pooling over all per-residue embeddings of a sequence. Per-residue embeddings are useful to analyze properties of residues in a protein

e.g., which residues bind ligands, while per-protein representations capture

annotations describing entire proteins e.g., native localization.
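
Global average pooling can be sketched in a few lines. The function name `per_protein_embedding` and the tiny 3-dimensional toy vectors are illustrative assumptions; real pLM embeddings have far more dimensions.

```python
def per_protein_embedding(per_residue: list) -> list:
    """Average the per-residue vectors (L x d) into one vector of length d."""
    length = len(per_residue)
    dim = len(per_residue[0])
    return [sum(residue[d] for residue in per_residue) / length
            for d in range(dim)]

# Hypothetical per-residue embeddings for a protein with 2 residues.
toy_per_residue = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
toy_per_protein = per_protein_embedding(toy_per_residue)
print(toy_per_protein)  # [2.0, 1.0, 1.0]
```

The pooled vector has the same length for every protein, which is what makes it usable as fixed-size input for per-protein predictions.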

Bonus question: pLMs originate from CNNs that predict sequences from

sequences. Does it matter whether those are over-trained or not? Explain in <5 bullet

points/short sentences. 4 pts

Over-training can result in overfitting, where the model becomes too

specialized to the training data

○ Over-trained models may be more prone to memorizing specific examples or

noise in the training data

○ Over-training can increase the risk of biased or inaccurate predictions

Protein Language Models (pLMs) generate embeddings that are used as input to

methods predicting protein secondary structure. Speculate why those methods reach

the performance of MSA-based (multiple sequence alignment) methods (≤3 reasons;

3 pts).

pLMs are trained on large amounts of data and can learn complex patterns in

protein sequences

○ pLMs can capture long-range dependencies between amino acids in a protein

sequence

○ pLMs can be used to predict protein properties beyond secondary structure,

such as protein stability and binding affinity

Describe one way to test whether or not pLM-based embeddings capture evolutionary information

(≤3 bullets; 3 pts).

○ One possible way to test whether or not pLM-based embeddings capture evolutionary

information is to compare the pLM embeddings of protein sequences with

different levels of evolutionary relatedness.

○ Alternatively, one could use the pLM embeddings as input to a classifier that

predicts the evolutionary category of a protein sequence, such as family,

superfamily, fold, or class, and see how well the classifier performs on

different categories

How can pLM-based protein prediction save energy/resources (≤2

bullets/sentences; 2 pts)?

The use of pLMs can help reduce the amount of experimental work needed to

determine protein structures and functions, which can save energy and

resources

Bonus question: are larger pLMs guaranteed to outperform smaller ones (Y/N and

argue; ≤3 bullets; 3 pts).

N: the size of the model does not seem to be the determining factor, so larger models are not guaranteed to outperform smaller ones

○ It is rather about training time

Pros & Cons scRNA-Seq over Bulk Seq

What does the ideal single cell transcriptomics method look like

Universal (can be applied to every cell)

○ in situ measurements (without removing the cell from its original environment or changing its condition)

○ no minimum number of cells and every cell is captured and assayed

○ transcripts have full-length sequence

○ multimodal (measuring different things at once, e.g. gene accessibility and

transcriptomics)

○ no doublets, transcripts are assigned correctly to cell

○ easy to use, open source, cost effective

Name scRNA-Seq Method

○ InDrop (explained in lecture) for rare cells, only “common” sensitivity.

○ DropSeq for abundant cells

○ 10X

○ (Mars-seq, Split-seq …)

probability that droplet has k beads or k cells

What is the cell capture rate?

The cell capture rate is the probability that a droplet has at least one bead
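
A minimal sketch of these probabilities, assuming that beads (or cells) are loaded into droplets independently, so that the number per droplet follows a Poisson distribution. The mean loading rate `lam` and the function names are assumptions for illustration.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a droplet holds exactly k beads (or k cells)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def capture_rate(lam: float) -> float:
    """Probability that a droplet holds at least one bead."""
    return 1.0 - poisson_pmf(0, lam)

def multiple_rate(lam: float) -> float:
    """Probability that a droplet holds two or more beads (or cells)."""
    return 1.0 - poisson_pmf(0, lam) - poisson_pmf(1, lam)

# Hypothetical loading rate of 0.1 beads per droplet on average.
lam = 0.1
print(capture_rate(lam), multiple_rate(lam))
```

Under this assumption, lowering the loading rate reduces droplets with several beads or cells, but it also lowers the capture rate.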

What is the cell duplication rate?

The cell duplication rate is the rate at which captured single cells are

associated with two or more different barcodes

what are doublets?

○ When a barcode is associated with two or more cells

○ synthetic: Barcode Collisions

what are barcodes?

short nucleotide sequences, delivered on beads, that identify the cell (droplet) a read came from

what is CITE-Seq?

Barcode + antibody: antibodies tagged with oligonucleotide barcodes -> measure cell surface proteins together with the transcriptome

what's the advantage of UMIs? what are UMIs?

unique molecular identifiers -> identify individual transcript molecules in a cell

○ help correct for PCR amplification bias (GC bias etc.)

what is the typical analysis workflow?

Start with raw data, that is transferred to count matrices

○ Count matrices then go through quality control, correction and normalisation

○ Next step is visualisation and clustering

○ Downstream analysis can be trajectory inference, differential expression or

compositional analysis (e.g. compare the abundance and composition of two

patients)

Barcodes with a low count depth, few detected genes, and a high fraction of

mitochondrial counts are indicative of cells

that are dead

Cells with unexpectedly high counts and a large number of detected genes may

represent

doublets.

Batch correction:

Check if data clusters by batch first and not by cell type

○ ComBat is a method for batch effect correction

PCA:

Orthogonal linear transformation

○ Captures as much variance as possible

○ Good at showing global structure

○ Poor at resolving local similarities

○ Sensitive to outliers

○ Not able to capture non-linear relations

tSNE (t-Distributed Stochastic Neighbor Embedding):

○ Idea:

The idea behind tSNE is to reduce the dimensions through a non-

linear transformation and thereby retain the existing clusters. In doing

so, tSNE itself does not perform any clustering and is only used for

visualization, although it is classified as unsupervised.

tSNE (t-Distributed Stochastic Neighbor Embedding):

Steps:

1. select “neighbors” w.r.t. Gaussian distribution over points in high

dimensional space

■ 2. select “neighbors” w.r.t. t-distribution (1df = Cauchy) over points in

low dimensional space

■ 3. minimize Kullback-Leibler divergence between both distributions

using gradient descent

UMAP (Uniform Manifold Approximation and Projection):

○ Idea:

Just like tSNE, UMAP is also a non-linear algorithm for reducing the

dimensionality of data, while preserving local similarities as well as

global distances

■ Compared to tSNE, UMAP has the advantage of being more scalable

in terms of the computing power required

■ algorithm that is only used for visualization and not for clustering

UMAP (Uniform Manifold Approximation and Projection):

Steps:

1. Step: approximate manifold for data in high dimensional space

using simplicial complexes as neighborhood graph

■ 2. Step: approximate distances for data in low dimensional space

using spectral embedding

■ 3. Step: Optimize low dimensional fuzzy topology to be similar to high

dimensional fuzzy topology via fuzzy set cross entropy

Pitfalls of non-linear transformation

Cluster size is meaningless

○ Distances between clusters are meaningless

○ Patterns Can Be Misleading

Cluster analysis:

Organizing cells into clusters is typically the first result of any single-cell

analysis.

○ Clusters allow us to infer the identity of member cells.

○ Clusters group cells based on the similarity of their gene expression profiles.

○ Expression profile similarity is determined via distance metrics

○ Two approaches exist to generate cell clusters from these similarity scores:

■ clustering algorithms and

■ community detection methods

Trajectory Analysis

To capture transitions between cell identities, branching differentiation

processes, or gradual, unsynchronized changes in biological function, we

require dynamic models of gene expression.

○ Trajectory inference methods interpret single-cell data as a snapshot of a

continuous process.

○ This process is reconstructed by finding paths through cellular space that

minimize transcriptional changes between neighbouring cells.

○ The position of a cell along such a path is its pseudotime. Because this variable is related to transcriptional distances from a root cell, it is often interpreted as a proxy for developmental time.

Metastable States

dense regions along a trajectory indicate preferred transcriptomic state

○ can be found by plotting histograms of pseudo time coordinate

Cell level Analysis Unification

unification of cell clustering and trajectory analysis

○ representing single cell clusters as nodes, and trajectories between the

clusters as edges

RNA velocity:

The balance between unspliced and spliced mRNAs is predictive of cellular

state progression

Advanced RNA-seq Methods

Single-end

Reads are only produced from a single end

Advanced RNA-seq Methods

Paired-end

Reads are produced from one end of the fragment as well as another read

from the opposite end of the fragment

Advanced RNA-seq Methods

Multiplexing

○ Individual 'barcode' sequences are added to each sample

RPKM and FPKM

Count the total reads in a sample and divide it by 1M => "per million" scaling factor

○ Divide the read counts by the "per million" scaling factor to normalize for sequencing depth, giving you reads per million (RPM)

○ Divide the RPM values by the length of the gene, in kilobases. This gives you

RPKM

○ If you have paired-end reads, count fragments instead of reads (the two reads of a pair count as one fragment) to get FPKM

TPM

Divide the read counts by the length of each gene in kilobases => reads per

kilobase (RPK)

○ Sum up all the RPK values in a sample and divide the sum by 1M => "per million" scaling factor

○ Divide the RPK values by the "per million" scaling factor
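
The two recipes differ only in the order of the normalisation steps, which a short sketch makes visible. The function names and the toy counts and gene lengths are made up for illustration.

```python
def rpkm(counts: list, lengths_bp: list) -> list:
    """Depth first, then length: reads per kilobase per million reads."""
    per_million = sum(counts) / 1e6
    return [c / per_million / (l / 1000.0) for c, l in zip(counts, lengths_bp)]

def tpm(counts: list, lengths_bp: list) -> list:
    """Length first, then depth: transcripts per million."""
    rpk = [c / (l / 1000.0) for c, l in zip(counts, lengths_bp)]
    per_million = sum(rpk) / 1e6
    return [r / per_million for r in rpk]

# Hypothetical sample with three genes.
toy_counts = [10, 20, 30]
toy_lengths = [1000, 2000, 3000]   # gene lengths in base pairs

toy_rpkm = rpkm(toy_counts, toy_lengths)
toy_tpm = tpm(toy_counts, toy_lengths)
```

Because the scaling factor in TPM is computed after the length correction, the TPM values of a sample always sum to one million, which is not the case for RPKM.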

Why are TPM and FPKM considered within-sample measures of gene expression?

Name two between-sample biases they do not capture

RNA composition: the relative abundance of different types of RNA molecules

in a sample

○ Library efficiency, protocol, tissue differences

Differential Expression analysis:

Differential expression analysis is a statistical method used to identify genes

that show significant changes in expression between different conditions or

groups in RNA-Seq data.

Why not use a t-test on the read counts?

Bias in the data (sequencing depth, library efficiency, etc. need to be

addressed)

■ TPM does not address differences in library efficiency, protocol, tissue

differences

■ sample size is commonly too low to estimate variance

○ Solution: Borrow information across genes! -> DESeq2

RNA-seq data follows a negative binomial distribution

Count processes in general are described by a Poisson distribution where

mean and variance are equal.

○ not the case for RNA-seq counts, a phenomenon known as overdispersion

○ The negative binomial distribution takes this into account through a dispersion

parameter alpha
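
As a hedged illustration of the dispersion parameter: in the parametrization commonly used for RNA-seq counts (for example in DESeq2), the variance of a count K with mean μ is written as below. The symbols K and μ are introduced here for illustration; α is the dispersion parameter named above.

```latex
\operatorname{Var}(K) = \mu + \alpha \mu^{2}
```

For α = 0 this reduces to the Poisson case, where the variance equals the mean.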

What is the source of overdispersion?

overdispersion: higher variance than Poisson

○ transcript is present at slightly different levels in each sample

DESeq2

read count model: size factors are typically estimated by the median-of-ratios method to account for compositionality and sequencing depth.

○ dispersion shrinkage

■ 1. Treat each gene separately and estimate

dispersion (black points)

■ 2. Fit a smoothing curve (red)

■ 3. Shrink gene specific dispersion towards the

expected dispersion to get more robust results using

the Bayes approach
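
The median-of-ratios idea mentioned above can be sketched as follows. This is a simplified illustration, not DESeq2's code; the function name `size_factors` and the toy count matrix are assumptions.

```python
import math
from statistics import median

def size_factors(counts: list) -> list:
    """counts[g][s] is the raw count of gene g in sample s.

    For each gene, take the geometric mean across samples as a pseudo
    reference; a sample's size factor is the median ratio of its counts
    to that reference.
    """
    n_samples = len(counts[0])
    # Genes with a zero count in any sample have no finite log and are skipped.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for s in range(n_samples):
        log_ratios = [math.log(row[s]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical data: sample 2 was sequenced exactly twice as deep as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
toy_factors = size_factors(toy_counts)
```

Dividing each sample's counts by its size factor puts the samples on a common scale; using the median makes the estimate robust against a few strongly changing genes.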

DESeq2:

Strength of shrinkage depends on:

How close are dispersion values to the fit?

■ Degrees of freedom, shrinkage is reduced for larger samples numbers

log fold change (LFC):

common difficulty in the analysis of HTS data is the strong variance of

LFC estimates for genes with low read count

short read vs long reads

Short read RNA-Seq technology enables accurate quantification of

exons/events

○ Long read RNA-Seq technology captures the correct full-length isoforms

STAR

Seed searching

STAR aligns reads and searches for the longest sequence that

matches one or more locations on the reference genome, known as

Maximal Mappable Prefixes (MMPs)

STAR

extend MMPs

If STAR does not find an exact matching sequence for each part of the

read due to mismatches or indels, the previous MMPs will be

extended

STAR

soft clipping

■ If extension does not give a good alignment, then the poor quality or

adapter sequence will be soft clipped

STAR

cluster, stitching, scoring

separate seeds are stitched together to create a complete read by first

clustering the seeds together based on proximity

STAR

Why so fast?

uncompressed suffix arrays

splAdder

Source:

An initial graph G out of a genome annotation

■ A list of junctions from RNA-Seq

splAdder

Steps:

Add novel cassette exons into the graph G

■ Add novel intron retentions into the graph G

■ Add novel intron edges into the graph G

■ An augmented graph Ĝ

○ Don’t compare result values from different methods

Advantages of Pseudo alignment

allows for a faster estimation of transcript abundance than traditional alignment-based methods, without losing too much accuracy

○ Can classify diseases, understand expression changes, track cancer progression

Kallisto

input: transcriptome, RNA-seq reads

○ output: Kallisto Index, Quantification

○ first step: build Kallisto Index

■ split transcriptome into k-mers

■ build de Bruijn Graph from k-mers, while remembering from which

transcript each k-mer is (k-compatibility class)

■ define a linear section with the same k-compatibility class as a contig

■ build a Kallisto Index: Hashmap that has each k-mer as key and as value the contig it is in and the position on the contig

○ second step: pseudo alignment of reads

■ each read is split into k-mers

■ look up first k-mer of read in Kallisto index

■ from position of k-mer in contig, calculate which k-mer in read would

be the last in same contig

■ look up this k-mer to double check

■ look up next k-mer in new contig

■ intersect the k-compatibility class of all k-mers, that were looked up

○ last step: quantification:

■ now we work with equivalence class instead of k-compatibility class

■ equivalence class is the set of transcripts a read could stem from

■ use Expectation-Maximization algorithm to find the best possible

transcript abundances that explains the equivalence class counts

■ not mentioned in lecture: EM is repeated with bootstrap which allows

an estimate of the mean standard error
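
The core of the pseudo alignment step, intersecting k-compatibility classes, can be sketched with a plain k-mer dictionary. This toy version skips the de Bruijn graph, the contigs and the k-mer skipping described above, and it is not Kallisto's implementation. The function names and the two toy transcripts are assumptions.

```python
def build_index(transcripts: dict, k: int) -> dict:
    """Map every k-mer to the set of transcripts containing it."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index

def pseudoalign(read: str, index: dict, k: int) -> set:
    """Intersect the k-compatibility classes of all indexed k-mers of a read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # k-mers missing from the index are ignored in this sketch
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible if compatible else set()

# Two hypothetical transcripts sharing the prefix ACGTAC.
toy_transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACCTT"}
toy_index = build_index(toy_transcripts, k=4)

shared = pseudoalign("ACGTAC", toy_index, k=4)   # {'t1', 't2'}
unique = pseudoalign("GTACGG", toy_index, k=4)   # {'t1'}
```

The read from the shared prefix stays ambiguous, and this ambiguity is what the Expectation-Maximization step later resolves at the level of equivalence class counts.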

Techniques to quantify cells: flow cytometry

flow cytometry: characterize and define different cell types in an

heterogeneous cell population by analysing the expression of cell surface and

intracellular molecules

○ Immunohistochemistry (IHC staining): Immunohistochemistry uses antibodies

to bind to proteins in tissues. Using layers of coloured/fluorescent complexes,

it is possible to visualize location of specific cell types.

Deconvolution Problem

The deconvolution problem involves separating or extracting individual

components from a mixed signal or observation.

○ It requires estimating the contributions or proportions of each component to

recover the original sources or signals from the observed data

miRNA

microRNA

○ small (19-22 nt) non-coding RNA that downregulates gene expression by binding in the 3’ UTR of target mRNAs, acting in a complex with an Argonaute protein

siRNA

small interfering RNA

○ small (21-24 nt) non-coding RNA that downregulates gene expression by

binding to RNA with complementary sequence and preventing translation

○ show perfect complementarity and high specificity

how does microRNA regulate gene expression

They bind to the 3’-UTR (untranslated region) of their target mRNAs and

repress protein production by destabilizing the mRNA and translational

silencing

What can be used to computationally predict miRNA binding

miRNA seed pairing, Hybridization energy, Conservation of miRNA binding

sites, Accessibility of binding sites

what is Competing endogenous RNAs

“In conclusion, we hypothesize that crosstalk between RNAs, both coding and noncoding, through MREs, forms large-scale regulatory networks across the transcriptome” (A ceRNA Hypothesis: The Rosetta Stone of a Hidden RNA Language?, Cell)

○ miRNA binding sites “competing for miRNA”

What is SPONGE

can be used to infer genome wide ceRNA interaction networks for biomarker

discovery

○ The competing endogenous RNA (ceRNA) hypothesis suggests that mRNAs

that possess binding sites for the same miRNAs are in competition. This

motivates the existence of so-called sponges, i.e., genes that exert strong

regulatory control via miRNA binding in a ceRNA interaction network

○ Difference to previous methods: considering combinatorial effect of several

miRNAs

Steps of SPONGE

Input: gene expression, miRNA expression, TargetScan (db that predicts targets of miRNAs), miRcode (db for microRNA target sites)

○ Gene miRNA seed match and regression filter (many false positives are

filtered out by considering tissue specificity)

○ for every gene pair with shared miRNAs calculate mscor (multiple miRNA sensitivity correlation, which represents how much of their co-expression can be explained by miRNAs)

○ Null-model-based significance analysis, where the null hypothesis is that miRNAs do not affect the correlation between two genes, i.e. mscor=0

○ Network Construction by connecting significant genes

Why are proteins of interest

Proteins execute and control the vast majority of biological processes

What is proteomics

large scale analysis of proteins

What are the types of questions proteomics may address

Identification of potential drug targets

○ Investigation of disease mechanisms

○ Comparison of normal and diseased tissues

○ Protein expression across the cell cycle

○ Characterisation of cell types and tissues

○ Isolation of protein complexes and interaction networks

○ Investigation of biochemical pathways

What are amino acids

Amino acids are organic compounds that contain amino (–NH2) and carboxyl

(– COOH) functional groups, along with a side chain (R group) specific to

each amino acid

○ encoded directly by triplet codons

Why and how are proteins digested

Digestion into peptides is required because the downstream separation and

mass spectrometry cannot handle complex mixtures of intact proteins!

○ protease Trypsin cleaves after K and R

○ Very stable and efficient

○ Modified trypsin to overcome autolysis problems and “proline problem”

What is elution

washing out substances from a solid material using a liquid or a gas

Liquid vs. Gas Chromatography

What is the difference between online and offline chromatography

In offline chromatography, the sample is first loaded onto the column and

then the column is disconnected from the system to be eluted. The eluted

sample is then collected in fractions for further analysis.

○ in online chromatography, the sample is loaded onto the column and then

eluted while still connected to the system. The eluted sample is then analysed

in real-time.

How can you separate peptides with the same mass but different

sequences/structure

They will have a different retention time (the time the analyte needs to pass

through the chromatographic column)

○ The mass spec will measure the mass of the analytes at different retention

times

What are the most important parts of mass spectrometer

A mass spectrometer is an instrument that measures the m/z of individual

ions

○ ionizer: ionizing the peptides we will analyse

○ Mass analyser: measures moving charges and determines m/z by recording

the frequency with which ions are oscillating along the z axis

○ Detector: detecting quantity

What are the different kinds of masses

Nominal (or integer) mass: the sum of the integer masses of all elements in a molecule

○ Chemical or average mass: the sum of the masses of all stable isotopes of all elements in a molecule, weighted by their natural abundance

○ Monoisotopic mass: the sum of the masses of the most abundant stable isotopes of all elements in a molecule

Calculate the mass of Protein TEST

○ First add up the masses of the amino acids

○ 't': 119.06, 'e': 147.05, 's': 105.04 -> 2*119.06 + 147.05 + 105.04 = 490.210

○ subtract one H2O for every peptide bond -> 490.210 - 3*(2*1.0078+15.9949) = 490.210 - 54.0315

○ neutral mass done: 436.1785

○ for charged mass add the mass of one proton per charge

○ for m/z divide by the charge
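
The calculation can be checked with a short sketch that uses the rounded amino acid masses from the card. The proton mass of 1.00728 Da and the function names are additions for illustration.

```python
# Monoisotopic masses of the free amino acids, rounded as on the card.
AA_MASS = {"T": 119.06, "E": 147.05, "S": 105.04}
WATER = 2 * 1.0078 + 15.9949   # H2O = 18.0105
PROTON = 1.00728               # assumed proton mass in Da

def neutral_mass(peptide: str) -> float:
    """Sum of amino acid masses minus one water per peptide bond."""
    total = sum(AA_MASS[aa] for aa in peptide)
    return total - (len(peptide) - 1) * WATER

def mz(peptide: str, charge: int) -> float:
    """Add one proton per charge, then divide by the charge."""
    return (neutral_mass(peptide) + charge * PROTON) / charge

print(round(neutral_mass("TEST"), 4))   # 436.1785
print(round(mz("TEST", 1), 4))          # 437.1858
print(round(mz("TEST", 2), 4))          # 219.0965
```

Extending `AA_MASS` with the remaining amino acids would make the same functions work for arbitrary peptides.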

calculating y and b fragments

y: reverse the peptide, sum the residue masses of the fragment, add one proton and one H2O

○ b: sum the residue masses of the fragment, add one proton
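
A minimal sketch of both ion series for singly charged fragments. It uses monoisotopic residue masses (amino acid mass minus one water) for the three amino acids in TEST; the constants and function names are assumptions for illustration.

```python
# Monoisotopic residue masses (amino acid minus H2O) in Da.
RESIDUE_MASS = {"T": 101.04768, "E": 129.04259, "S": 87.03203}
WATER = 18.01056
PROTON = 1.00728

def b_ions(peptide: str) -> list:
    """Singly charged N-terminal fragments: residue sum plus one proton."""
    masses, running = [], 0.0
    for aa in peptide[:-1]:
        running += RESIDUE_MASS[aa]
        masses.append(running + PROTON)
    return masses

def y_ions(peptide: str) -> list:
    """Singly charged C-terminal fragments: residue sum plus H2O and a proton."""
    masses, running = [], 0.0
    for aa in reversed(peptide[1:]):
        running += RESIDUE_MASS[aa]
        masses.append(running + WATER + PROTON)
    return masses

toy_b = b_ions("TEST")   # b1, b2, b3
toy_y = y_ions("TEST")   # y1, y2, y3
```

A useful sanity check is that each complementary pair (b1 with y3, b2 with y2, b3 with y1) adds up to the neutral peptide mass plus two protons.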

what are ms1 and ms2 scans

MS1 is the first stage of mass analysis, where the mass range of interest is

scanned and the precursor ions are selected.

○ MS2 is the second stage of mass analysis, where the selected precursor ions

are fragmented into smaller product ions and detected

What is the use of isotopes?

identify and quantify elements in sample by measuring m/z

○ identify biological and chemical origin of a protein

What is Tandem-MS?

peptides get fragmented, resulting in fragments that go either from the start or

the end of the peptide to the cut

○ the different fragments make it possible to reconstruct the sequence

Standard bottom-up workflow?

Extract protein and prepare it

○ Digest Protein with trypsin

○ Separate peptides after certain attributes (e.g. hydrophobic)

○ MS1 to get relative abundance from analytes

○ MS2 to get sequences of analytes

○ Peptide identification and quantification

○ Protein list or Matrix

○ Functional bioinformatics analysis

What is Mass Spectrum?

A 2-dimensional plot with the mass-to-charge ratio on the abscissa and the

abundance on the ordinate representing the abundance of chemical (ionized)

compounds in a mixture or sample solution as peaks

What is Mass spectrometry?

A mass spectrometer is an instrument that measures the masses of individual

molecules (or fragments thereof) that have been converted to ions

What is multiplexing?

Without multiplexing, only one sample can be measured per mass spectrometry run

○ Multiplexing is a method by which multiple signals are combined into one signal over a shared medium

○ SILAC utilizes heavy cell culture media containing lysines and arginines that carry exclusively heavy isotopes of C and N, shifting the peptide masses in the mass spectrum

What is an Orbitrap

○ mass analyser and detector in one

○ outer electrode and central spindle

○ Ions are injected and oscillate stably around spindle

○ 3-dimensional movement of ions within orbitrap

○ The frequency of the right-to-left (axial) oscillation along the spindle is used to determine the m/z (using the Fourier transformation)

○ resolution depends on the number of oscillations

Deriving peptide masses from small graphs

○ Trypsin (largely) generates peptides ending in K and R -> look for the corresponding mass

○ Check for all following peaks whether they have a mass difference matching a known amino acid residue

De Novo Peptide Sequencing

○ Looks at the MS2 spectrum and checks all the pairwise differences between the peaks, builds a graph of them from which all possible sequences can be read, and outputs a list of the sequences

○ Problems:

▪ Peptide fragmentation does not always result in detectable ions -> missing y- and b-ions in the spectrum

▪ Intensity of ions is not “predictable” (but reproducible)

▪ There might be more than one interpretation of the spectrum

▪ Not really used anymore unless there is no proteomics data for the organism
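The pairwise-difference graph described above can be sketched as follows. This is a simplified illustration, not a real de novo engine: it assumes all peaks belong to one singly charged ion series, uses a reduced residue table, and the names `build_graph` and `read_sequences` are made up for this example.

```python
# Monoisotopic residue masses in Da (subset chosen for the example).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "Q": 128.05858, "K": 128.09496,
}


def build_graph(peaks, tol=0.01):
    """Connect peak i to peak j if their m/z difference matches a residue."""
    peaks = sorted(peaks)
    edges = {i: [] for i in range(len(peaks))}
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            diff = peaks[j] - peaks[i]
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges[i].append((j, aa))
    return peaks, edges


def read_sequences(peaks, tol=0.01):
    """Enumerate every residue path from the lowest to the highest peak."""
    peaks, edges = build_graph(peaks, tol)
    last = len(peaks) - 1
    results = []

    def walk(node, path):
        if node == last:
            results.append("".join(path))
            return
        for nxt, aa in edges[node]:
            walk(nxt, path + [aa])

    walk(0, [])
    return results
```

For a y-ion ladder such as `[147.1128, 234.14483, 305.18194, 362.2034]` the sketch returns two readings, because the combined mass of A and G is nearly identical to the mass of Q. This is one concrete case of the "more than one interpretation" problem listed above. Note that a y-ion ladder is read from the C-terminus towards the N-terminus.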

What is the difference between ion traps and Orbitraps?

○ Both are mass analysers that measure m/z

○ Orbitraps have high resolution at low m/z, but lower resolution at high m/z

○ Ion traps have constant resolution

When and how do you apply tolerance in Da?

○ For constant resolution (ion traps)

○ Simply subtract the tolerance from and add it to the neutral mass to get the lower and upper bound

○ Example 0.5 Da: 464.7347++ -> 927.45 -> 926.95-927.95

When and how do you apply tolerance in ppm?

○ For non-constant resolution (Orbitraps)

○ Divide the neutral mass by 1,000,000 and multiply by the tolerance in ppm (here 15). The result is the tolerance in Da to be applied

○ Example 15 ppm: 464.7347++ -> 927.45 +/- 15 ppm -> 927.45 +/- (927.45 / 1,000,000) * 15 -> 927.45 +/- 0.014
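Both tolerance calculations can be reproduced with a short sketch. It assumes a proton mass of 1.00728 Da per charge when converting the observed m/z to a neutral mass; the function names are illustrative.

```python
PROTON = 1.00728  # mass of one charge carrier in Da


def neutral_mass(mz, charge):
    """Convert an observed m/z and charge state to the neutral mass."""
    return (mz - PROTON) * charge


def window_da(mass, tol_da):
    """Absolute tolerance: same width at every mass (ion trap case)."""
    return mass - tol_da, mass + tol_da


def window_ppm(mass, tol_ppm):
    """Relative tolerance: width grows with the mass (Orbitrap case)."""
    delta = mass / 1_000_000 * tol_ppm
    return mass - delta, mass + delta


mass = neutral_mass(464.7347, 2)   # doubly charged precursor from the example
print(window_da(mass, 0.5))
print(window_ppm(mass, 15))
```

The ppm window is much narrower here (about 0.014 Da on each side) than the 0.5 Da window, which is why the choice of unit matters for the number of candidate peptides.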

How does database search work? / How to build a database search engine to identify proteins in a mass spectrum

○ Calculate the neutral peptide mass

○ Search the database of digested proteins for all peptides with matching mass (apply the resolution tolerance)

○ Create a theoretical spectrum for each candidate (fragmentation and therefore intensity can be predicted)

○ Compare the generated spectrum against the experimental one and match peaks
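The four steps above can be put together in a toy search engine. This is a sketch under strong simplifications: trypsin cleaves after every K or R with no missed cleavages, only singly charged b- and y-ions are generated, intensities are ignored, and the score is simply the number of matched peaks. All function names are hypothetical.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def digest_trypsin(protein):
    """Cleave after K or R (no proline rule, no missed cleavages)."""
    peptides, current = [], ""
    for aa in protein:
        current += aa
        if aa in "KR":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def theoretical_peaks(peptide):
    """Singly charged b- and y-ion m/z values, without intensities."""
    peaks = []
    for i in range(1, len(peptide)):
        b = sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON
        y = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
        peaks.extend([b, y])
    return peaks


def count_matches(theoretical, observed, tol=0.02):
    """Number of theoretical peaks that have an observed peak within tol."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


def search(precursor_mz, charge, observed, proteins, tol_da=0.5):
    """Return (score, peptide) pairs for all mass-matching candidates."""
    mass = (precursor_mz - PROTON) * charge
    candidates = {
        pep for prot in proteins for pep in digest_trypsin(prot)
        if abs(peptide_mass(pep) - mass) <= tol_da
    }
    scored = [(count_matches(theoretical_peaks(p), observed), p)
              for p in candidates]
    return sorted(scored, reverse=True)
```

With the proteins `"GASKVLR"` and `"SAGKAAR"`, the peptides GASK and SAGK have the same mass and both pass the precursor filter; only the fragment matching separates them.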

How do you estimate the false discovery rate

What is protein inference?

○ As input we have proteins that are digested and measured in a mass spectrometer

○ Then we try to identify the peptides that make up the spectrum (find ~25%)

○ Try to infer the proteins from the peptides

What is spectral library searching?

○ A technique used in mass spectrometry to identify unknown compounds by comparing their spectra to a library of known spectra

What are the advantages and disadvantages of database (DB) searching and spectral library searching?

How to compare spectra?

○ Each peak is a tuple (m/z, intensity); a spectrum is a list of tuples

○ The m/z values of two spectra are never completely the same (uncertainty because both come from experiments) → use a tolerance to align m/z. This gives one m/z list and two lists of intensities (if there is no peak at a certain m/z in one intensity list, its value is 0), so now we can compare

○ Different questions to ask: are the spectra identical? Is one spectrum contained in another?

○ Normalization: L2 norm (p = 2), so all vectors have the same length of 1

○ Similarity measurement: spectral angle, Pearson correlation, and many other options
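The alignment, normalization and similarity steps above can be sketched as below. The greedy alignment and the 0.02 tolerance are assumptions of this example, and the spectral angle is computed with one commonly used definition, 1 - 2·arccos(cosine)/π, so that identical spectra score 1 and spectra without shared peaks score 0.

```python
import math


def align(spec_a, spec_b, tol=0.02):
    """Merge two peak lists [(mz, intensity), ...] onto one shared m/z axis."""
    b_left = sorted(spec_b)
    mz_axis, int_a, int_b = [], [], []
    for mz, inten in sorted(spec_a):
        # Greedy choice: first unused peak of spec_b within the tolerance.
        match = next((p for p in b_left if abs(p[0] - mz) <= tol), None)
        if match is not None:
            b_left.remove(match)
        mz_axis.append(mz)
        int_a.append(inten)
        int_b.append(match[1] if match is not None else 0.0)
    for mz, inten in b_left:          # peaks that only exist in spec_b
        mz_axis.append(mz)
        int_a.append(0.0)
        int_b.append(inten)
    return mz_axis, int_a, int_b


def l2_normalize(vec):
    """Scale a vector to length 1 (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def spectral_angle(spec_a, spec_b, tol=0.02):
    """Normalized spectral angle between two spectra, in the range 0 to 1."""
    _, a, b = align(spec_a, spec_b, tol)
    a, b = l2_normalize(a), l2_normalize(b)
    cosine = sum(x * y for x, y in zip(a, b))
    cosine = max(-1.0, min(1.0, cosine))   # guard against rounding errors
    return 1 - 2 * math.acos(cosine) / math.pi
```

Pearson correlation or any other measure could be swapped in after `align`, since the two intensity lists then have the same length and order.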

How to create decoys for spectral library searching?

○ For peptides, one can use previously identified spectra and “move” the annotated peaks (e.g. y and b fragment ions) around, keeping the intensities of the original fragments and the retention time of the original peptide identification

○ Not easy for metabolites

What is feature finding?

○ Map: three-dimensional data set (RT, m/z, intensity) containing the MS signal from one LC-MS run

○ Feature: the sum of all the MS signals caused by the same analyte/peptide

○ Feature finding: finding the set of features explaining as much of the signal in a map as possible

What are different ways to detect features?

○ Finding local maxima + persistence (reduces the number of maxima; the persistence value has to be guessed)

○ Fitting distributions: in m/z and intensity, peptide signals follow a normal distribution → fit many normal distributions (the number of distributions has to be guessed)

○ Convolutions: slide normal distributions along a spectrum and check whether a signal of this shape is present (computationally intensive), a bit like a Fourier transform

○ Continuous wavelet transform: in several runs, a wavelet (Mexican hat wavelet) of a certain size slides over the spectrum and checks whether any peaks fit it; the size is increased with each run to account for the different shapes and heights of peaks (done in both the retention time and m/z dimensions)

○ Averagine model (average amino acid): (in RT dim) calculate the average amino acid; how many of these average amino acids do we need to make up the numeric mass of a peptide? → what does the isotope distribution of a peptide of this mass look like? This tells us where the peaks will probably be; check whether we can find the matching ones
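Of the approaches above, the convolution idea is the easiest to sketch. The example slides a normalized Gaussian template over a 1D intensity trace and reports positions where the response is a local maximum above a threshold. The kernel width and the `min_response` threshold are guesses that would need tuning on real data, and the function names are illustrative.

```python
import math


def gaussian_kernel(sigma, half_width):
    """Discrete normal distribution, normalized so the weights sum to 1."""
    raw = [math.exp(-0.5 * (i / sigma) ** 2)
           for i in range(-half_width, half_width + 1)]
    total = sum(raw)
    return [v / total for v in raw]


def convolve(signal, kernel):
    """Slide the kernel along the signal (zero padding at the borders)."""
    half = len(kernel) // 2
    out = []
    for center in range(len(signal)):
        acc = 0.0
        for offset, weight in enumerate(kernel):
            idx = center + offset - half
            if 0 <= idx < len(signal):
                acc += signal[idx] * weight
        out.append(acc)
    return out


def find_peaks(signal, sigma=1.0, half_width=3, min_response=0.5):
    """Indices where the Gaussian response is a local maximum >= threshold."""
    response = convolve(signal, gaussian_kernel(sigma, half_width))
    return [
        i for i in range(1, len(response) - 1)
        if response[i] > response[i - 1]
        and response[i] >= response[i + 1]
        and response[i] >= min_response
    ]


trace = [0, 0, 0, 1, 3, 1, 0, 0, 0, 0, 0.3, 0, 0, 0, 0, 2, 5, 2, 0, 0, 0]
print(find_peaks(trace))
```

The small spike at index 10 is still a local maximum of the response, but it falls below the threshold and is dropped, which is the same trade-off as choosing a persistence value for plain local maxima.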

What is indexed retention time

What is the goal of linking features?

○ Goal: finding out if some peptides go up or down in different conditions (linking several different experiments)

○ Find features in healthy and disease maps

○ Align maps

○ Link corresponding features

■ Problem: features might not be identified in all measurements

■ With perfect measurements we could transfer identifications by matching m/z, isotope envelope, charge and retention time, and fill in missing values -> match between runs

■ But this is difficult because of fluctuating retention times

■ Normalize with indexed retention time

■ Alternatively: fit a line or curve to the retention times of different measurements

■ Dynamic time warping, which aligns two or more time series (retention times)

■ Bidirectional best hits: if feature 1 is the most similar to feature 2 and the other way around, they are bidirectional best hits

○ Identify features

○ Quantify
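The bidirectional best hit criterion from the list above can be illustrated with a small sketch. Features are reduced to (m/z, retention time) pairs, and the distance function with its two scaling constants is an assumption made for this example; real tools also compare charge and isotope envelope.

```python
def distance(f, g, mz_scale=0.01, rt_scale=30.0):
    """Dissimilarity of two features (mz, rt); the scales are assumed values."""
    return abs(f[0] - g[0]) / mz_scale + abs(f[1] - g[1]) / rt_scale


def best_hit(feature, others):
    """Index of the most similar feature in the other run."""
    return min(range(len(others)), key=lambda j: distance(feature, others[j]))


def bidirectional_best_hits(run_a, run_b):
    """Link features that pick each other as their best hit."""
    links = []
    for i, feature in enumerate(run_a):
        j = best_hit(feature, run_b)
        if best_hit(run_b[j], run_a) == i:
            links.append((i, j))
    return links


healthy = [(500.25, 1200.0), (620.31, 1500.0), (700.40, 1800.0)]
disease = [(620.31, 1512.0), (500.25, 1195.0)]
print(bidirectional_best_hits(healthy, disease))
```

The third feature of the first run has no counterpart: its best hit in the second run prefers a different partner, so no link is created and the value stays missing unless it is filled in by match between runs.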

Computational problem in mass spectrometry when having many proteins

Scalability: As the number of proteins increases, so does the size and

complexity of the search space and the computational resources required to

perform the analysis.

○ Ambiguity: Many proteins share similar or identical peptide sequences, which

can make it difficult to assign a unique protein identity to a given mass

spectrum

○ Combinatorial complexity: When analysing complex mixtures of proteins,

such as whole proteomes or protein complexes, there can be many possible

combinations of peptides that explain a given mass spectrum

Deep learning in MS

○ PointIso – detecting peptide features

○ Prosit – peptide fragment intensity prediction

○ DeepLC – peptide retention time prediction

○ Metabolite retention time prediction

○ MS2DeepScore – spectrum similarity learning

○ Casanovo – peptide de novo sequencing

○ Q&A session
