Much of the research presently conducted in the biomedical domain relies on the induction of correlations and interactions from data. Because we ultimately want to increase our knowledge of the biochemical and functional roles of genes, proteins and treatments in organisms, there is a clear need to integrate the associations and interactions among biological entities that have been reported and that accumulate in the literature, databases, the web and social media. Biomedical literature mining is an important informatics methodology for large-scale information extraction from repositories of textual documents, as well as for integrating information available in various domain-specific databases and ontologies, ultimately leading to knowledge discovery. It helps uncover relationships and interactions buried in the literature and other media, from experimental to phenomenological data. We have long worked to tap into the vast biomedical collective knowledge available in various data sources, which we can think of as the "bibliome" or the "digital phenotype" in health and disease. Our approach is based on bottom-up, network science, data-driven or bio-inspired methods, which we have applied to the automatic discovery, classification and annotation of protein-protein and drug-drug interactions, pharmacokinetic data, protein sequence family and structure prediction, functional annotation of transcription data, enzyme annotation publications, and so on. Examples of these are shown below, together with links to additional resources and publications.

Subnetwork of word co-occurrence proximity (with 34 words) for a specific document from the first BioCreative competition. The red nodes denote the words retrieved from a specific GO annotation (0007266: Rho, protein, signal, transduce). The blue nodes denote the words that co-occur very frequently with at least one of the red nodes: the co-occurrence neighborhood of the GO words. The green nodes denote the additional words discovered by our network algorithm as described in (Verspoor et al, 2005).

The Social Network of Healthcare - How Instagram and Twitter are Providing New Insights. Luis Rocha explains the new software-driven approach to medical research. Big data generated through social media such as Twitter and Instagram can lead to actionable insights that improve the efficacy of prevention and treatment.

Public health monitoring of drug interactions from Social Media

Social media and mobile application data enable population-level observation tools with the potential to speed translational research. Our recent work demonstrates Instagram’s importance for public health surveillance of drug interactions. Our methodology is based on the longitudinal analysis of social media user timelines at different timescales: day, week and month. Weighted graphs are built from the co-occurrence of terms from various biomedical dictionaries (drugs, symptoms, natural products, side-effects, and sentiment) at each timescale. We showed that spectral methods, shortest paths, and distance closures reveal relevant drug-drug and drug-symptom pairs, as well as clusters of terms and drugs associated with the complex pathology of depression. We validate inferences about drug interactions and adverse reactions against curated bioinformatics databases (e.g. DrugBank and SIDER), and develop demo tools to share our analysis with the community. We currently analyze various social media sources such as Twitter, Facebook, ChaCha and the Epilepsy Foundation public forums, and have focused on studying depression, epilepsy, and opioid abuse.
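As a rough illustration of the graph construction described above, the sketch below builds a weighted co-occurrence graph from time windows of dictionary terms and then computes shortest-path distances from one term. Several details are assumptions made for illustration and are not taken from the project: the Jaccard-style proximity measure, the conversion d = 1/p - 1 from proximity to distance, the function names, and the toy windows (which are made-up data, not results).

```python
from collections import defaultdict
from itertools import combinations
import heapq


def cooccurrence_proximity(windows):
    """Build a weighted graph where the edge weight between two terms is
    (windows containing both) / (windows containing either)."""
    count = defaultdict(int)
    pair_count = defaultdict(int)
    for terms in windows:
        unique = sorted(set(terms))
        for term in unique:
            count[term] += 1
        for a, b in combinations(unique, 2):
            pair_count[(a, b)] += 1
    graph = defaultdict(dict)
    for (a, b), both in pair_count.items():
        proximity = both / (count[a] + count[b] - both)
        graph[a][b] = proximity
        graph[b][a] = proximity
    return graph


def proximity_to_distance(proximity):
    """Assumed conversion: proximity 1 maps to distance 0, small proximity to large distance."""
    return 1.0 / proximity - 1.0


def shortest_distances(graph, source):
    """Dijkstra's algorithm over the distance graph; the result is one row
    of the shortest-path (metric) distance closure."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, proximity in graph[node].items():
            candidate = d + proximity_to_distance(proximity)
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical toy data: each list holds the dictionary terms matched in
# one user timeline window (e.g. one week of posts).
windows = [
    ["fluoxetine", "insomnia"],
    ["fluoxetine", "insomnia", "alcohol"],
    ["alcohol", "headache"],
    ["fluoxetine", "anxiety"],
]

graph = cooccurrence_proximity(windows)
closure_row = shortest_distances(graph, "fluoxetine")
direct = proximity_to_distance(graph["fluoxetine"]["alcohol"])
```

In this toy example the indirect path from "fluoxetine" to "alcohol" through "insomnia" is shorter than the direct edge, which is the kind of latent association a distance closure is meant to surface. Running the same construction on daily, weekly and monthly windows would give one graph per timescale.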

Drug-Drug interaction extraction from Literature

Drug-drug interactions (DDIs) are major causes of morbidity and mortality and a subject of intense scientific interest. Biomedical literature mining can aid DDI research by extracting evidence for large numbers of potential interactions from published literature and clinical databases. We started by estimating numerical pharmacokinetic (PK) data, mining drug-specific clearance data (systemic and oral) from the literature, e.g. for midazolam (MDZ). We obtained 88% precision and 92% recall, for an F-score of 90%, outperforming a support vector machine (F-score of 68.1%). Further investigation on 7 other drugs showed comparable performance [Wang et al, 2009].

Recently, we received a four-year ($1.7M) R01 grant from NIH/NLM to study the large-scale extraction of drug interactions from medical text. This is a collaboration with Prof. Lang Li from IUPUI Medical School, and Prof. Hagit Shatkay from the University of Delaware. While evidence for DDIs ranges in scale from intracellular biochemistry to human populations, literature mining methods have not been used to extract the specific types of experimental evidence, which are reported differently for distinct experimental goals. We have used the team's manually curated corpora [Wu et al, 2013] of PubMed abstracts and annotated sentences with three types of experimental DDI evidence: in vitro, in vivo, and clinical. The goal is to produce a text mining pipeline using several linear classifiers and a variety of feature transformation methods. Preliminary results [Kolchinsky et al, 2015] on pharmacokinetic DDI experimental evidence in PubMed have yielded excellent classification performance in distinguishing relevant from irrelevant abstracts (reaching F1 ~= 0.93, MCC ~= 0.74, iAUC ~= 0.99) and sentences (F1 ~= 0.76, MCC ~= 0.65, iAUC ~= 0.83).
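For readers less familiar with the evaluation measures quoted above, the following sketch shows how an F-score follows from precision and recall, and how the Matthews correlation coefficient (MCC) is computed from a confusion matrix. The function names are illustrative; only the 88% precision and 92% recall figures come from the text.

```python
import math


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Precision and recall reported for the PK clearance extraction task.
pk_f_score = f_score(0.88, 0.92)
print(round(pk_f_score, 2))  # 0.9
```

The harmonic mean of 0.88 and 0.92 is about 0.8996, consistent with the reported F-score of 90%. MCC is often reported alongside F1 because, unlike F1, it also accounts for true negatives.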

Estimated PK clearance parameter data from literature. Wang, Z., et al (2009)

PPI task - Decision structure on the protein-protein interaction article test data of BioCreative II, as produced by our Variable Trigonometric Threshold model. Abi-Haidar, A. et al (2008)

Protein-Protein Interaction Discovery (PPI)

Until now, literature mining has been applied essentially to help annotate and characterize molecular entities such as genes and proteins. In the next few years the field is expected to move toward aiding the discovery and automatic annotation of relationships among such entities, e.g. protein-protein and gene-disease interactions. Indeed, the BioCreative challenges II, II.5, and III, in which we participated ([Abi-Haidar et al, 2008], [Kolchinsky et al, 2010], [Lourenco et al, 2011]), include a series of tasks on the extraction of protein-protein interaction information from the literature. As the field moves to uncovering relations rather than entities, our complex network approach to biomedical literature mining [Verspoor et al, 2005], which we tried on the first BioCreative competition, makes all the more sense. Additionally, since literature mining hinges on the quality of available sources of literature as well as their linkage to other electronic sources of biological knowledge, it is particularly important to study the quality of the inferences it can provide. We were among the most competitive teams in the PPI tasks of BioCreative II, II.5 and III. See our PIARE (Protein Interaction Abstract Relevance Evaluator) web tool for classification of documents relevant for protein-protein interaction, as well as supplementary materials for publications.

Protein Family Prediction (PFP)

Since literature mining hinges on the quality of available sources of literature as well as their linkage to other electronic sources of biological knowledge, it is particularly important to study the quality of the inferences it can provide. We have been working on the large-scale validation of bibliome algorithms, and proposed a method that predicts a protein’s Pfam family correctly 76% of the time and, 89% of the time, issues a prediction in which the correct family is among the top 5 [Maguitman et al, 2006].

Proteins voting in proportion to their cosine similarity to the target protein. Maguitman, A. et al (2006)
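The voting scheme named in the caption above can be sketched as follows: each protein with a known family casts a vote for that family, weighted by its cosine similarity to the target protein, and families are ranked by total vote. This is a minimal sketch under assumptions, not the published implementation. The sparse term-weight vectors, the family labels, the function names and the top-5 cutoff handling are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_families(target, annotated, top_k=5):
    """Each annotated protein votes for its family in proportion to its
    cosine similarity to the target; return the top_k families by vote."""
    votes = defaultdict(float)
    for vector, family in annotated:
        votes[family] += cosine(target, vector)
    ranked = sorted(votes.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical literature-derived term weights for proteins of known family.
annotated = [
    ({"kinase": 2.0, "phosphorylation": 1.0}, "family_A"),
    ({"kinase": 1.0, "atp": 1.0}, "family_A"),
    ({"membrane": 2.0, "transport": 1.0}, "family_B"),
]
target = {"kinase": 1.0, "phosphorylation": 1.0}

ranking = rank_families(target, annotated)
best_family = ranking[0][0]
```

Returning a ranked list rather than a single label matches the two ways the result is reported: whether the top prediction is correct, and whether the correct family appears among the top 5.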

PSP task - Our combined method performs significantly better than either the original structure prediction or keyword-based prediction methods alone. Rechtsteiner, A., et al (2006)

Protein Structure Prediction (PSP)

Linking information from different data sources, specifically literature, becomes increasingly important for annotating the growing number of new genome sequences. For the large percentage of genes with no known sequence homologs, new, possibly integrative, methods need to be developed. Ab-initio structure prediction and comparison is a method some of us pursued previously for the functional annotation of sequences with no known homologs. We used a large set of sequences of known structure to evaluate a literature-based method against previously used ab-initio structure prediction methods. In the absence of sequence homology, the literature-mining prediction is comparable to the best ab-initio methods. Combining text mining with the ab-initio method leads to a 35% improvement over the ab-initio method alone. See [Rechtsteiner et al, 2006].

Characterizing gene regulation

Spectral methods such as Singular Value Decomposition (SVD) are very useful for tasks ranging from gene expression analysis [Wall, Rechtsteiner and Rocha, 2003] to automatic functional annotation of genes and proteins from the literature [Rechtsteiner, 2005; Maguitman et al, 2006; Abi-Haidar et al, 2008]. We have studied SVD-based methods for visualization of gene expression data, representation of the data using a smaller number of variables, and detection of patterns in noisy gene expression data. SVD (“eigen-clustering”) of microarray data produces sets of co-expressed genes, which are then characterized with annotations automatically extracted from the literature.
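To make the idea of grouping co-expressed genes by their loading on a singular vector concrete, the sketch below extracts the dominant singular triplet of a small genes-by-conditions matrix and selects the genes with large loadings. It is only a sketch under assumptions: power iteration is used here as a dependency-free stand-in for a full SVD routine, and the gene names, the expression values and the 0.3 loading threshold are made up for illustration.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Power iteration on A^T A. Returns (sigma, u, v) for the largest
    singular value of a genes-by-conditions matrix A."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Non-uniform start, so that it is not orthogonal to patterns of
    # row-centered data (whose rows sum to zero).
    v = [float(j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    v_norm = math.sqrt(sum(x * x for x in v))
    v = [x / v_norm for x in v]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0:
        u = [x / sigma for x in u]
    return sigma, u, v


# Hypothetical expression matrix: rows are genes, columns are conditions.
genes = ["geneA", "geneB", "geneC"]
expression = [
    [3.0, 3.0, 0.0, 0.0],   # geneA and geneB share one expression pattern
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -1.0],  # geneC follows an unrelated pattern
]

sigma, gene_loadings, condition_pattern = top_singular_triplet(expression)

# Genes with a large absolute loading on the dominant pattern are treated
# as one co-expressed set (threshold chosen arbitrarily).
coexpressed = [g for g, x in zip(genes, gene_loadings) if abs(x) > 0.3]
```

Here `coexpressed` contains geneA and geneB, the two genes that share the dominant pattern. Such a gene set could then be handed to a literature-mining step for annotation; further components would be obtained by deflating the matrix and repeating, or by using a full SVD routine.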

Rechtsteiner, A. (2005). PhD Dissertation.

Funding

Project partially funded by




Project Members

Luis Rocha

John Duke

Lang Li

Predrag Radivojac

Hagit Shatkay

Analia Lourenco

Ana Maguitman

Al Abi-Haidar

Michael Conover

Mohsen JafariAsbagh

Jasleen Kaur

Artemy Kolchinsky

Azadeh Nematzadeh

Andreas Rechtsteiner

Tiago Simas

Zhiping (Paul) Wang

Rion Brattig Correia




Selected Project Publications