Lecture 5 "Functional description and annotation transfer"

Bioinformatics

by Sina E.

what happens if you sequence a lot?

the more you sequence, the higher the chance of picking up your gene of interest

what are protein motifs?

short linear sequences that serve a specific function for the protein, but are not stable and do not fold independently of the rest of the chain
often represented as a regular expression, for example, C-x-(C)-(DN)-x(5)-C-C
typical roles: protein-protein interactions, ligand interactions, cleavage sites, targeting
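
A minimal sketch of how such a pattern can be matched against a sequence. It assumes PROSITE-style syntax (elements separated by `-`, `x` for any residue, `[DN]` for alternatives, `{P}` for excluded residues, `x(4)` for repeats). The function name `prosite_to_regex`, the example pattern and the example sequence are hypothetical and are not taken from the lecture; the pattern on the card uses parentheses where this sketch expects square brackets, so it would be rejected as written.

```python
import re

# One pattern element: a residue, x, [alternatives] or {exclusions},
# optionally followed by a repeat count such as (4) or (2,5).
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

# Hypothetical pattern and sequence, for illustration only.
regex = prosite_to_regex("C-x-[DN]-x(4)-C")
hit = re.search(regex, "MKCADGTRSCW")
print(regex, hit.start(), hit.group())
```

Because the motif is short, a match on its own is weak evidence: the same pattern can occur by chance in unrelated sequences.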

what happens to a function over the years?

what motifs are often used for identifying proteins?

often used are upstream sequences that contain transcription factor binding sites
ROX1 binding sites are known from prior experiments; they form the training data

what is the count matrix?

created from the training data
four rows, one per nucleotide
counts how often each nucleotide occurs at each position of the training sequences

the Kullback-Leibler distance measures how strongly the observed distribution at a position deviates from the expected one

if you change the letter at a position of the sequence, the score changes
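
A small sketch of how a count matrix could be built from aligned training sites of equal length. The function name `count_matrix` and the four example sites are made up for illustration; they are not real ROX1 binding sites.

```python
ALPHABET = "ACGT"

def count_matrix(sites):
    """Count each nucleotide at each position of equally long training sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all training sites must have the same length")
    # One row per nucleotide, one column per motif position.
    matrix = {base: [0] * length for base in ALPHABET}
    for site in sites:
        for position, base in enumerate(site):
            matrix[base][position] += 1
    return matrix

# Hypothetical training sites, not real ROX1 binding sites.
sites = ["ACGT", "ACGA", "TCGT", "ACTT"]
counts = count_matrix(sites)
for base in ALPHABET:
    print(base, counts[base])
```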

what is the Kullback-Leibler distance?

Measures the difference between two probability distributions

Distribution 1: observed nucleotide frequency at position i (f_{b,i})
Distribution 2: expected nucleotide frequency at position i given the overall (position-independent) nucleotide composition of the sequences
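
A sketch of the per-position calculation, assuming the usual relative-entropy form: the sum over all nucleotides b of f_{b,i} * log2(f_{b,i} / p_b). The symbol p_b for the expected (background) frequency and the choice of log base 2 are assumptions; the card names only f_{b,i}. Terms with an observed frequency of 0 contribute nothing.

```python
import math

def kl_distance(observed, expected):
    """Kullback-Leibler distance (in bits) between two nucleotide distributions.

    observed: dict nucleotide -> observed frequency at one position (f_{b,i})
    expected: dict nucleotide -> background frequency (p_b, assumed symbol)
    """
    return sum(
        f * math.log2(f / expected[base])
        for base, f in observed.items()
        if f > 0  # 0 * log(0) is treated as 0
    )

# Uniform background as a simple assumption.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
half = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}

print(kl_distance(conserved, background))   # fully conserved position
print(kl_distance(half, background))        # two equally frequent letters
print(kl_distance(background, background))  # identical distributions
```

The more a position deviates from the background composition, the larger the value; a position that looks like the background scores 0.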

how do you get from training data to PSSMs?

what are motifs?

represent extrinsic information in the analysis of protein sequences
typically short sequence stretches that are linked to a particular function
represented by regular expressions or position-specific scoring matrices (PSSMs)
PSSMs typically contain prior information to account for non-observed data
fixed length, i.e. gap-free
motif identification suffers from a high false positive rate. Reason: short motifs have a high chance of occurring at random
often used for the prediction of binding sites, e.g. transcription factor binding sites, and for the prediction of modification sites
PSSMs can also be used for fast domain annotation
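
One common way to go from training sites to a PSSM, sketched under stated assumptions: add a pseudocount to every count (the "prior information" for non-observed data), convert to frequencies, and take the log2 ratio against a background distribution. The pseudocount of 1, the uniform background, the function names and the example sequences are all illustrative choices, not taken from the lecture.

```python
import math

ALPHABET = "ACGT"

def build_pssm(sites, background, pseudocount=1.0):
    """Build a log-odds PSSM (list of dicts, one per position) from training sites."""
    pssm = []
    total = len(sites) + pseudocount * len(ALPHABET)
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        scores = {}
        for base in ALPHABET:
            # The pseudocount keeps unseen letters from getting probability 0.
            frequency = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(frequency / background[base])
        pssm.append(scores)
    return pssm

def scan(sequence, pssm):
    """Score every gap-free window of the sequence against the PSSM."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][base] for i, base in enumerate(window))
        hits.append((start, window, score))
    return hits

# Hypothetical training sites and target sequence.
sites = ["ACGT", "ACGA", "TCGT", "ACTT"]
background = {base: 0.25 for base in ALPHABET}
pssm = build_pssm(sites, background)
best = max(scan("GGACGTTT", pssm), key=lambda hit: hit[2])
print(best)
```

In practice a score threshold decides which windows count as hits; where to set it is a trade-off against the high false positive rate of short motifs.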

how can you transform profiles into a (hidden) Markov model?

each column in an alignment is represented by one column in the PSSM, and by 0 to 1 states in a pHMM
The occurrence of each letter in the alphabet of the analysed sequence type is associated with a column-specific score in the PSSM, and with an emission probability in the pHMM
The order of columns in the PSSM is fixed and there is no alternative to visiting the columns from left to right. In pHMMs there is a transition probability <1 between connected states

what is a match state?

position in the alignment for which all or at least most of the sequences are represented by an amino acid. In evolutionary terms, a match state is an ancestral position with a very low deletion probability
the idea is to generate an HMM with a repetitive structure of states that differ in their emission probabilities for the 20 aa

How do you predict the sequences?

D = deletion states
M = match states
I = insertion states

captures loops, since sequences are variable there
captures variability among sequences with the same function
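
A sketch of how alignment columns could be assigned to match states and how each aligned sequence then reads as a path through M, D and I states. The occupancy threshold of 0.5 ("most sequences have a residue"), the function names and the toy alignment are assumptions for illustration. This only labels states; it does not estimate emission or transition probabilities.

```python
def assign_match_columns(alignment, min_occupancy=0.5):
    """Mark a column as a match column if enough sequences have a residue there."""
    n_sequences = len(alignment)
    return [
        sum(residue != "-" for residue in column) / n_sequences >= min_occupancy
        for column in zip(*alignment)
    ]

def state_path(aligned_sequence, is_match):
    """Translate one aligned sequence into its M/D/I state path."""
    path = []
    for residue, match_column in zip(aligned_sequence, is_match):
        if match_column:
            # Residue in a match column: match state; gap: deletion state.
            path.append("M" if residue != "-" else "D")
        elif residue != "-":
            # Residue in a non-match column: insertion state.
            path.append("I")
        # A gap in a non-match column uses no state at all.
    return "".join(path)

# Toy alignment, hypothetical sequences.
alignment = ["ACD-EF", "AC--EF", "A-DGEF", "ACD-EF"]
is_match = assign_match_columns(alignment)
paths = [state_path(sequence, is_match) for sequence in alignment]
for sequence, path in zip(alignment, paths):
    print(sequence, path)
```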

how do you plan pHMMs?

what is the workflow to create PSSMs?

what are the main aspects of profile hidden Markov models?

represent extrinsic information in the analysis of protein sequences
model typically longer sequence stretches that are linked to a particular function, or are simply evolutionarily conserved (cf. Domains of Unknown Function (DUF) in Pfam)
contain prior information to account for non-observed data
variable length, i.e. explicit modeling of insertions and deletions
typically highly specific and allow identification even in highly diverged sequences
commonly used for the prediction of functional domains
can be arranged in domain architectures capturing the context in which individual domains occur

what describes the function of a protein?

RRM = RNA recognition motif
- subcellular localisation
- interaction partners
- physiological context, e.g. involved in retinal function
not individual protein domains, but the complete feature architecture (i.e. the combination of multiple protein domains and features) defines the function of a protein

what is a major challenge with protein function?

How to describe protein function, or more generally biological knowledge, in a way that makes it accessible to automated analysis?

what helps to standardize the functional annotation?

Controlled vocabulary

definition: an organized list of standardized terminology
takes the guesswork out of searching
allows categorization
example: yellow pages
Different levels of complexity:
- equivalence relationships
- hierarchical relationships (taxonomy)
- associative relationships (thesaurus)

what is the cellular component (CC)?

describes the localisation within the cell: an anatomical structure or a macromolecular complex (e.g. ribosome)

what is the molecular function (MF)?

describes molecular activities, such as catalytic or binding activity (e.g. kinase); does not specify the context of action

what describes the biological process (BP)?

describes a series of events in a physiological process, e.g. signal transduction, not a pathway

how is the Gene Ontology organized?

parent & child terms
`is_a`, `part_of` and `regulates` relationships
directed acyclic graph (DAG)
more flexible than a strict hierarchy: a term can have more than one parent
each gene is annotated with the set of most specific terms that describe its functionality
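
A toy sketch of the graph structure: each term stores its parents together with the relationship type, and a gene annotated with a specific term is implicitly also described by all ancestors of that term. The term names, gene name and function names are hypothetical placeholders, not real GO identifiers.

```python
# Toy ontology with hypothetical term names; each term lists (parent, relationship).
PARENTS = {
    "term_root": [],
    "term_B": [("term_root", "is_a")],
    "term_C": [("term_root", "is_a")],
    # Two parents: this is what makes the structure a DAG rather than a tree.
    "term_D": [("term_B", "is_a"), ("term_C", "part_of")],
}

def ancestors(term, parents):
    """Collect all terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent, _relationship in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(annotations, parents):
    """Expand each gene's most specific terms to include all ancestor terms."""
    return {
        gene: set(terms).union(*(ancestors(term, parents) for term in terms))
        for gene, terms in annotations.items()
    }

annotations = {"gene_1": ["term_D"]}
expanded = propagate(annotations, PARENTS)
print(sorted(expanded["gene_1"]))
```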

what is GO term overrepresentation analysis?

transcriptional profiling (microarray/RNA-seq) to compare conditions, e.g. healthy vs. diseased tissue
search for terms that are significantly overrepresented among the differentially expressed genes
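
The card does not name a statistical test. One common choice, shown here as an assumption, is the one-sided hypergeometric test: given how many genes in the whole gene set carry a term, how likely is it to see at least the observed number among the differentially expressed genes by chance? The function name and all numbers in the example are made up.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, annotated_in_sample):
    """One-sided hypergeometric p-value for overrepresentation of one term.

    population: number of genes in the background set
    annotated: background genes annotated with the term
    sample: number of differentially expressed genes
    annotated_in_sample: differentially expressed genes annotated with the term
    """
    upper = min(annotated, sample)
    # Probability of drawing at least `annotated_in_sample` annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / comb(population, sample)

# Hypothetical numbers: 20000 genes, 200 carry the term,
# 100 genes are differentially expressed, 8 of those carry the term.
p = enrichment_p_value(20000, 200, 100, 8)
print(p)
```

Since many terms are tested at once, the resulting p-values need a multiple-testing correction before terms are called significant.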

what is important about the Gene Ontology?

developed to describe gene function such that it is accessible to computer-aided analysis
based on a restricted vocabulary (the describing entities are referred to as `terms`)
terms are hierarchically organized and are `related` to each other
term relationships are organized in a directed acyclic graph
GO term-gene associations are based on different types of evidence
annotations with the evidence code `inferred from electronic annotation` (IEA) are better than expected
GO term enrichment analyses help to identify terms that occur more (or less) often than expected by chance
graphical display of the GO enrichment analysis outcome helps in interpreting the result
GO slim: a reduced version of the GO, used when not all information in the GO is relevant
