Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

Alaa Abi-Haidar1,6, Jasleen Kaur1, Ana G. Maguitman2, Predrag Radivojac1, Andreas Rechtsteiner3, Karin Verspoor4, Zhiping Wang5, Luis M. Rocha1,6,*

1School of Informatics, Indiana University, 1900 East Tenth Street, Bloomington IN 47408, USA
2Universidad Nacional del Sur, Bahia Blanca, Argentina
3Center for Genomics and Bioinformatics, Indiana University, USA
4Information Sciences Group, Los Alamos National Laboratory, USA
5Biostatistics, School of Medicine, Indiana University, USA
6FLAD Computational Biology Collaboratorium, Instituto Gulbenkian de Ciencia, Portugal
*To whom correspondence should be addressed: rocha@indiana.edu

Citation: A. Abi-Haidar, J. Kaur, A. Maguitman, P. Radivojac, A. Rechtsteiner, K. Verspoor, Z. Wang, and L.M. Rocha [2008]. "Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks". Genome Biology, 9(Suppl 2):S11. doi:10.1186/gb-2008-9-s2-s11

The full text and PDF reprint are available from the Genome Biology open access site. Supplemental materials are also available. Because the paper relies on mathematical notation and graphics, only the abstract is presented here.


Background: We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (IAS), discovery of protein pairs (IPS) and text passages characterizing protein interaction (ISS) in full text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam-detection techniques, as well as an uncertainty-based integration scheme. We also used a Support Vector Machine and the Singular Value Decomposition on the same features for comparison purposes. Our approach to the full text subtasks (protein pair and passage identification) includes a feature expansion method based on word-proximity networks.

Results: Our approach to the abstract classification task (IAS) was among the top submissions on the performance measures used in the challenge evaluation (accuracy, F-score and AUC). We also report on a web tool built on this approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full text tasks achieved one of the highest recall rates and one of the highest mean reciprocal ranks of correct passages.

Conclusions: Our approach to abstract classification shows that a simple linear model, using relatively few features, is capable of generalizing and uncovering the conceptual nature of protein-protein interaction from the bibliome. Since the novel approach is based on a very lightweight linear model, it can be easily ported and applied to similar problems. In full text problems, the expansion of word features with word-proximity networks is shown to be useful, though the need for some improvements is discussed.

Keywords: Protein interaction, text mining, bibliome informatics, support vector machines, singular value decomposition, spam detection, uncertainty measures, proximity graphs, complex networks.
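
Illustration: the abstract does not specify how the word-proximity networks are built, so the following Python fragment is only a minimal sketch of the general idea of expanding word features through a co-occurrence network. It assumes a Jaccard-style proximity computed over passages and a fixed threshold; the function names (proximity_network, expand_features), the whitespace tokenization and the threshold value are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def proximity_network(passages):
    """Map each co-occurring word pair to a Jaccard-style proximity.

    Assumption: proximity = (passages containing both words) /
    (passages containing either word). The paper may use another measure.
    """
    occurrences = defaultdict(set)  # word -> indices of passages containing it
    for index, passage in enumerate(passages):
        for word in set(passage.lower().split()):
            occurrences[word].add(index)

    network = {}
    for w1, w2 in combinations(sorted(occurrences), 2):
        shared = occurrences[w1] & occurrences[w2]
        if shared:  # keep only pairs that co-occur at least once
            either = occurrences[w1] | occurrences[w2]
            network[(w1, w2)] = len(shared) / len(either)
    return network


def expand_features(seed_words, network, threshold=0.5):
    """Add every word whose proximity to a seed word meets the threshold."""
    seeds = set(seed_words)
    expanded = set(seeds)
    for (w1, w2), weight in network.items():
        if weight >= threshold:
            if w1 in seeds:
                expanded.add(w2)
            if w2 in seeds:
                expanded.add(w1)
    return expanded


if __name__ == "__main__":
    passages = ["a binds b", "a binds c", "d e"]
    network = proximity_network(passages)
    print(expand_features({"binds"}, network, threshold=0.6))
```

In this toy example, the seed word "binds" is expanded only with words that co-occur with it in most passages; lowering the threshold pulls in more loosely associated words, which trades precision for recall.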

Last Modified: October 27, 2008