Protein Annotation as Term Categorization in the Gene Ontology

Karin Verspoor, Judith Cohn, Cliff Joslyn, Sue Mniszewski, Andreas Rechtsteiner, Luis M. Rocha, Tiago Simas
Modeling, Algorithms, and Informatics Group (CCS-3)
Los Alamos National Laboratory, MS B256
Los Alamos, New Mexico 87545, USA

Citation: Verspoor, K., J. Cohn, C. Joslyn, S. Mniszewski, A. Rechtsteiner, L.M. Rocha, T. Simas [2004]. "Protein Annotation as Term Categorization in the Gene Ontology". EMBO Workshop: A critical assessment of text mining methods in molecular biology, Granada, Spain, March 28-31, 2004. Los Alamos National Laboratory Internal Report Number: LAUR 04-1460.

The full paper is available in Adobe Acrobat (PDF) format only. Because of its mathematical notation and graphics, only the abstract is presented here.


We addressed BioCreAtIvE Task 2: annotating a protein with a node in the Gene Ontology (GO). We treated the task as a categorization problem. Terms drawn from the document neighborhood of the given protein in the given document are categorized into GO nodes on the basis of their lexical overlap with terms on those nodes and with terms identified as related to those nodes. The system incorporates NLP components such as a morphological normalizer, a named entity recognizer, a statistical term frequency analyzer, and an unsupervised method for expanding the words associated with GO ids based on a probability measure that captures word proximity (Rocha, 2002). Categorization uses our novel Gene Ontology Categorizer (GOC) methodology (Joslyn et al., 2004), which selects GO nodes as cluster heads for the terms in the input set based on the structure of the GO.
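
As a rough illustration of the lexical-overlap idea only, the following minimal Python sketch scores GO nodes by how many normalized context words they share with the words on each node. It is not the system described in the paper: the toy node table, the `normalize` and `score_nodes` functions, and the crude plural stripping are all assumptions made for this example, and the sketch uses neither the word-proximity expansion nor the GO structure on which GOC relies.

```python
from collections import Counter

# Hypothetical toy table: GO ids mapped to words from their labels.
# A real system would load these from the Gene Ontology itself.
GO_NODE_TERMS = {
    "GO:0016301": {"kinase", "activity"},
    "GO:0006468": {"protein", "phosphorylation"},
    "GO:0005634": {"nucleus"},
}


def normalize(word):
    """Crude stand-in for a morphological normalizer (an assumption):
    lowercase the word and strip one trailing 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def score_nodes(context_words, node_terms=GO_NODE_TERMS):
    """Rank GO ids by lexical overlap with the words near a protein mention.

    The score is the number of context-word occurrences that match a
    (normalized) word on the node. Nodes with no overlap are omitted.
    """
    counts = Counter(normalize(w) for w in context_words)
    scores = {}
    for go_id, terms in node_terms.items():
        normalized_terms = {normalize(t) for t in terms}
        overlap = sum(counts[t] for t in normalized_terms)
        if overlap:
            scores[go_id] = overlap
    # Highest score first; ties broken by GO id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    neighborhood = ["kinases", "phosphorylation", "Protein", "nucleus"]
    for go_id, score in score_nodes(neighborhood):
        print(go_id, score)
```

Counting raw overlap is the simplest possible scoring choice; weighting words by term frequency or by a proximity-based association measure, as the abstract mentions, would replace the plain count inside `score_nodes`.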

Keywords: Text Mining, Information Retrieval, Computational Biology, Bioinformatics, Genomics, Proteomics, Gene Ontology, Protein Function, Function, Annotations.

For the full paper, please download the PDF version.

For more information contact Luis Rocha at
Last Modified: September 02, 2004