

Seminar: Large scale annotation of proteins with labelling methods 


Prof. Rita Casadio, Friday 11 May -- h: 15.00 -- Sala Seminari Ovest
As a result of large-scale sequencing projects, data banks of protein sequences and structures are growing rapidly. The number of sequences, however, is orders of magnitude larger than the number of structures known at atomic resolution, despite ongoing efforts to accelerate protein structure determination.

Tools have been developed to bridge the gap between sequence and protein 3D structure, based on the notion that information can be retrieved from the data bases and that knowledge-based methods can help in approaching a solution to the protein folding problem. In this way several features can be predicted starting from a protein sequence, such as structural and functional motifs and domains, including the topological organisation of a protein inside the membrane phase and the formation of disulfide bonds in a folded protein structure (1). Our group has been contributing to the field with different computational methods, mainly based on machine learning (neural networks (NNs), hidden Markov models (HMMs), support vector machines (SVMs), hidden neural networks (HNNs) and extreme learning machines (ELMs)), capable of computing the likelihood of a given feature starting from the protein sequence. Our methods can contribute to large-scale proteome annotation, endowing sequences with functional and structural features.
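The sliding-window encoding that typically feeds such per-residue predictors can be sketched as follows (an illustrative example, not the group's actual code; the sequence and window size are invented):

```python
# Minimal sketch: each residue is represented by the concatenated one-hot
# vectors of the amino acids in a (2*w + 1)-residue window centred on it,
# the standard input representation for window-based neural predictors.

AA = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """20-dimensional one-hot encoding of a single amino acid."""
    v = [0.0] * len(AA)
    if residue in AA:
        v[AA.index(residue)] = 1.0
    return v

def sliding_windows(sequence, w=2):
    """Encode each position as the concatenated one-hot vectors of its
    window; positions past the sequence ends are zero-padded ('-')."""
    padded = ["-"] * w + list(sequence) + ["-"] * w
    return [
        sum((one_hot(r) for r in padded[i:i + 2 * w + 1]), [])
        for i in range(len(sequence))
    ]

windows = sliding_windows("MKTAYIAK", w=2)
print(len(windows), len(windows[0]))  # → 8 100
```

Each of the resulting vectors would then be scored by the trained model (NN, SVM, etc.) to yield the likelihood of the feature at that residue.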

Recently, Conditional Random Fields (CRFs) have been introduced as a promising framework for sequence labelling problems, in that they offer several advantages over Hidden Markov Models (HMMs), including the ability to relax the strong independence assumptions made in HMMs. However, several sequence analysis problems can be successfully addressed only by designing a grammar that constrains the output to meaningful results. We therefore introduced Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs) as an extension of Hidden Conditional Random Fields (HCRFs). GRHCRFs, while preserving the discriminative character of HCRFs, can assign labels in agreement with the production rules of a defined grammar (2). The main novelty of GRHCRFs is the possibility of including prior knowledge of the problem in HCRFs by means of a defined grammar. Our current implementation allows regular grammar rules. We tested our GRHCRF on two typical biosequence labelling problems: the prediction of the topology of prokaryotic outer-membrane proteins and the prediction of the bonding states of cysteine residues in proteins (3-5), showing that the separation of state names and labels makes it possible to model a huge number of concurrent paths compatible with both the grammar and the experimental labels without increasing the time and space computational complexity.


Created at 5/3/2012 11:09 AM  by Antonio Cisternino 
Last modified at 5/3/2012 11:24 AM  by Antonio Cisternino