Citation: Rocha, Luis M. . "Semi-metric Behavior in Document Networks and its Application to Recommendation Systems". In: Soft Computing Agents: A New Perspective for Dynamic Information Systems. V. Loia (Ed.) International Series Frontiers in Artificial Intelligence and Applications. IOS Press, pp. 137-163
The full paper is available in Adobe Acrobat (.pdf) format only. Due to mathematical notation and graphics, only the abstract and introduction are presented here.
Recommendation systems for different Document Networks (DN) such as the World Wide Web (WWW), Digital Libraries, or Scientific Databases, often make use of distance functions extracted from relationships among documents and between documents and semantic tags. For instance, documents in the WWW are related via a hyperlink network, while documents in bibliographic databases are related by citation and collaboration networks. Furthermore, documents can be related to semantic tags such as keywords used to describe their content. The distance functions computed from these relations establish associative networks among items of the DN, referred to as Distance Graphs, which allow recommendation systems to identify relevant associations for individual users. However, modern recommendation systems need to integrate associative data (defined by distance graphs) generated from multiple sources such as different databases, web sites, and even other users. Thus, we are presented with a problem of combining evidence (about associations between items) from different sources characterized by distance functions. In this paper we describe our work on (1) inferring relevant associations from, as well as characterizing, semi-metric distance graphs and (2) combining evidence from different distance graphs in a recommendation system. Regarding (1), we present the idea of semi-metric distance graphs, and introduce ratios to measure semi-metric behavior. We compute these ratios for several DN such as digital libraries and web sites and show that they are useful to identify implicit associations. We also propose a model based on semi-metric behavior that allow us to quantify the amount of important latent associations in a DN. Regarding (2), we describe an algorithm to combine evidence from distance graphs that uses Evidence Sets, a set structure based on Interval Valued Fuzzy Sets and Dempster-Shafer Theory of Evidence. This algorithm has been developed for a recommendation system named TalkMine.
The prime example of a Document Network (DN) is the World Wide Web (WWW). But many other types of such networks exist: bibliographic databases containing scientific publications, preprints, internal reports, as well as databases of datasets used in scientific endeavors (Such as MEDLINE, the e-Print Arxiv, and the GenBank for Nucleic Acid Sequences.). Each of these databases possesses several distinct relationships among documents and between documents and semantic tags or indices that classify documents appropriately.
DN typically function as information resources for communities of users who query them to obtain relevant information for their activities. We often refer to collections of document networks and communities of users as Distributed Information Systems (DIS) (Rocha 2001b). DIS such as the Internet, Digital Libraries, and the like have become ubiquitous in the past decade, demanding the development of new techniques to both cater to the information needs of communities of users as well as to understand several aspects of the structure and dynamics of DN. The first set of techniques, needed to respond to engineering needs, come from the field of Information Retrieval, and are typically known as Recommender Systems (Krulwich & Burkey 1996, Konstan et al. 1997, Herlocker et al. 1999, Rocha 2001b). The second set of techniques, needed to respond to scientific interest in DN, come from the desire to analyze networks of documents and/or communities and is more related to Graph theory, Algebra, Complex Systems, as well as Linguistics (Kleinberg 1999, Chakrabarti et al. 1999, Berry, Dumais, & Obrien 1995, Li, Burgess, & Lund 2000, Newman 2001). Clearly, the scientific and engineering techniques complement and influence each other. We expect to use scientific knowledge about DN to improve recommendation algorithms, namely by better understanding users and information resources, and by producing adaptive DN, or adaptive webs (Bollen & Heylighen 1998, Bollen & Rocha 2000, Rocha 2001a).
The information retrieval and recommendation systems we have developed in this area are based on Multi-Agent algorithms which integrate knowledge about the association amongst elements of DN, amongst users, and about the interests of individual users and their communities. In particular, a soft computing algorithm (TalkMine) has been created to integrate such evidence and in so doing adapt DN to the expectations of their users (Rocha 2001b). The process of integration of knowledge in TalkMine requires the construction of distance graphs from DN that characterize the associations amongst their components. We summarize in sections 8 and 9 how TalkMine uses evidence theory and fuzzy set constructs to integrate such distance graphs to obtain measurements of relevance and produce recommendations. Before that, we detail the main goal of the present work, which is to study distance graphs extracted from DN. We show that their metric behavior can be used (1) as an indicator of the relevance of collections of documents and the interests of users who have selected certain sets of documents, (2) to identify the trends in communities associated with sets of documents, and (3) to study the characteristics of such DN in general, particularly, the amount of important latent associations.
Our approach to the third of these goals, in particular, is based on empirical evidence accumulated from several real and artificial DN. It is also independent of the particularities of TalkMine or any other specific recommendation algorithm. Indeed, our approach aims at a general characterization of how associative knowledge is stored in DN. Goals 1 and 2 are detailed in sections 5 and 6. Goal 3 is detailed in section 7. Sections 2 to 4 outline the foundations of semi-metric behavior required for later sections