Classification of protein-protein interaction full-text documents using text and citation network features

Artemy Kolchinsky1,2, Alaa Abi-Haidar1,2, Jasleen Kaur1, Ahmed Abdeen Hamed, and Luis M. Rocha1,2,*

1School of Informatics and Computing, Indiana University, 1900 East Tenth Street, Bloomington IN 47408, USA
2FLAD Computational Biology Collaboratorium, Instituto Gulbenkian de Ciencia, Portugal
*To whom correspondence should be addressed:

Citation: A. Kolchinsky, A. Abi-Haidar, J. Kaur, A.A. Hamed, and L.M. Rocha [2010]. "Classification of protein-protein interaction full-text documents using text and citation network features". IEEE/ACM Transactions On Computational Biology And Bioinformatics, 7(3):400-411.

The full text and pdf re-print are available from the TCBB site. Due to mathematical notation and graphics, only the abstract is presented here. Our pdf pre-print is also available.


We participated (as Team 9) in the Article Classification Task of the BioCreative II.5 Challenge: binary classification of full-text documents relevant for protein-protein interaction. We used two distinct classifiers for the online and offline challenges: (1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier that we successfully introduced in BioCreative 2 for binary classification of abstracts, and (2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top-performing submission in this task, taking into account the rank product of the Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthews Correlation Coefficient performance measures. The novel citation network classifier for the biomedical text mining domain, while not a top-performing classifier in the challenge, performed above the central tendency of all submissions and therefore indicates a promising new avenue to investigate further in bibliome informatics.
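As an illustration of the evaluation described in the abstract, the following minimal Python sketch computes Accuracy, the Balanced F-Score, and the Matthews Correlation Coefficient from a confusion matrix and then combines per-measure ranks into a rank product. The submission names and counts are hypothetical and are not challenge results; the function names are our own, ties are broken arbitrarily, and the Area Under the interpolated precision and recall Curve is left out because it requires ranked classifier outputs. This is a sketch under those assumptions, not the official challenge scoring code.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, balanced F-score (F1) and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # F1 written directly in terms of counts: 2TP / (2TP + FP + FN).
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


def rank_product(scores_by_submission):
    """Multiply each submission's rank (1 = best) over all measures.

    scores_by_submission maps a name to a dict of measure -> score, where
    higher scores are better. A lower rank product is better. Ties are
    broken by sort order, which is a simplifying assumption.
    """
    names = list(scores_by_submission)
    measures = list(next(iter(scores_by_submission.values())))
    products = {name: 1 for name in names}
    for measure in measures:
        ordered = sorted(
            names, key=lambda n: scores_by_submission[n][measure], reverse=True
        )
        for rank, name in enumerate(ordered, start=1):
            products[name] *= rank
    return products


# Hypothetical (tp, fp, tn, fn) counts on 45 positive and 55 negative documents.
counts = {
    "A": (40, 10, 45, 5),
    "B": (30, 5, 50, 15),
    "C": (45, 22, 33, 0),
}
scores = {name: binary_metrics(*c) for name, c in counts.items()}
products = rank_product(scores)

for name in sorted(products, key=products.get):
    print(name, products[name], scores[name])
```

In this toy example, submission B beats C on Accuracy but loses to it on the F-Score and the Matthews Correlation Coefficient, so the rank product places C ahead of B. This is the kind of disagreement between measures that a combined ranking is meant to resolve.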

Keywords: protein-protein interaction, text mining, bibliome informatics, support vector machines, citation network, complex networks, literature mining, binary classification.

For more information, contact Luis Rocha.
Last Modified: July 29, 2010