The prime example of a Document Network (DN) is the World Wide Web (WWW). But many other types of such networks exist: bibliographic databases containing scientific publications, social networking services, as well as databases of datasets used in scientific endeavors. Each of these databases possesses several distinct relationships among documents and between documents and semantic tags or indices that classify documents appropriately. For instance, documents in the WWW are related via a hyperlink network, while documents in bibliographic databases are related by citation and collaboration networks. Furthermore, documents can be related to semantic tags such as keywords used to describe their content. Given these relations, we can compute distance functions (typically via co-occurrence measures) amongst documents and/or semantic tags, thus creating associative, weighted networks between these items—which denote stronger or weaker co-associations. The figure to the right represents such an associative network of people names extracted from co-occurrence in documents in a database as described in an internal report. You can also see a 3D Video (Real Video) of this network.
We have used distance and proximity graphs to uncover modules or clusters in the network that may be associated with a particular topic or community of interest. We have applied this clustering methodology to social networks (terrorist networks), keyword networks, scientific journal networks, and citations networks (see our bibliome informatics and adaptive web projects for more details),etc. Recently, we have also used an information-theoretic approach to classify documents of interest in probabilistic graphs of citation co-occurrence in scienticfic citation networks [Kolchinsky et al, 2010].
We study the non-metric network topologies that arise in weighted graphs obtained from real-world data (e.g. co-occurrence statistics). In particular, we have developed measures to extract the graph nodes which most violate the triangle inequality: semi-metric associations. Our working hypothesis is that strong semi-metric associations can be used to identify trends, items with a higher probability of co-occurring in the future, as well the dynamics of such networks in general. This methodology has been successfully applied to networks of published documents, recommender systems for digital libraries at the Los Alamos National Laboratory, web search and recommendation by the givealink.org project, networks of felons obtained from intelligence records, and gene networks (see publications below). This work has been pursued in the Identification of Interests, Trends and Dynamics in Document Networks Project as well as in a Los Alamos Homeland Security LDRD DR project, “Advanced Knowledge Integration (LDRD Reserve)” to discover latent associations in social networks (internal report available). Finally, we are also studying the effects of the various transitive closures that are possible in weighted graphs; namely which closures preserve scale free structure and lead to highest performance in information retrieval. Some of this work has been funded by NSF grant from the Human and Social Dynamics program, With Eliot Smith and Rob Goldstone—this project recently received some attention in the media.
We have developed a stochastic model of vertex aging in networks, to better predict network growth [Simas and Rocha, 2008]. Real world networks display a cut-off in the power-law node degree distributions of complex networks, not expected by the canonical Barabasi-Albert Model. Amaral et al had shown that this cut-off behavior can be computationally modeled with vertex aging. We produced a mathematical model of vertex aging, which allows accurate predictions of the equilibrium point of active vertices and relate network growth with probability of aging.