The life sciences have relied on the so-called scientific method—discovery through a
process of observation, hypothesis formulation, data generation, more observation and so
on. In well-established domains such as biology, data play an ancillary role to
hypotheses, and the research is often described as “hypothesis driven.” There are long-standing
traditional and practical reasons why data have been subordinate, and we
present only a few here: data need not be present during the observation and hypothesis
formulation phases of discovery; a surplus of data usually does little to enhance this
process of discovery; data are often prohibitively expensive to produce or gather
once, let alone many times; and data, once available and used, are seldom used again.
But the life sciences have been experiencing fundamental changes in the last decade
brought about by the recent and rapid advancements both in information technology and
technology in the broader sense. These changes have culminated in scores of massive
information science projects tied together through the internet that share life science data.
From the general public's perspective, the most recognizable of these projects is the
Human Genome Project, led by the National Human Genome Research Institute (NHGRI),
which has produced drafts of the human genome with its promise, for example, of
understanding, anticipating, and treating diseases.
A consequence of these changes has been the emergence of new interdisciplinary
sciences that form from a confluence of existing parent disciplines and technology—the
most notable being ‘bioinformatics,’ which combines biology, computer science, statistics,
databases, and the internet. Like traditional biology, bioinformatics seeks to make
discoveries about life, but it is remarkable for several reasons.
First, bioinformatics brings with it new challenges that seem to cut across all emerging,
technologically driven disciplines—how to cohesively bring together mature disciplines
that have no history of deep connections. Second, bioinformatics has had
profound effects on its parent disciplines: from making reductionists of biologists, to
moving from algorithmic design to problem formalization in computer science, to
rethinking of the database as a mix of quantitative methods and logic, rather than only the
latter. Third, there are pressing problems having to do with the data themselves: the sheer
volume of data, their heterogeneity (in both structure and kind), provenance,
noise, integration, resolution, rate of generation and collection, meaning and
usefulness, and management. But perhaps the most profound reason has to do with how
science itself is being conducted—bioinformatics has turned the scientific method on its
head; data are generated and collected without any explicit preceding hypotheses (often
called “technologically driven” data). In fact, even the “flow” breaks down: it is no longer
linear, but runs from data, to observation, to hypothesis, to data, and so forth. And yet
significant biological discoveries are being made, and data have become paramount.
What we believe is occurring is that a life science, biology, is being transformed into an
information science, ushering in wonderful possibilities and equally difficult challenges.
For a scientist, and an information scientist in particular, it is certainly an exciting time.