############## README.txt file ############################

A GENERAL EVALUATION FRAMEWORK FOR TOPICAL CRAWLERS:
SUPPORT DATA AND SCRIPT

by

Filippo Menczer
Indiana University
fil at indiana.edu

and

Gautam Pant, Padmini Srinivasan
University of Iowa
{gautam-pant,padmini-srinivasan} at uiowa.edu

February 2003
(Revised March 2004)

###########################################################

FILES IN ARCHIVE

1. README.txt: this file
2. DMOZ.pl: script to create targets and seeds from ODP data
   (http://dmoz.org) to be used to evaluate topical crawlers
3. GoogleSearch.wsdl: SOAP interface for Google Web API
4. DMOZSeeds_D2.txt: example seeds file with DIST = 2
5. DMOZTargets_D2.txt: example targets file with MAX_DEPTH = 2
6. COPYING: GNU General Public License

###########################################################

The script and data files are released in association with, and
implement/illustrate algorithms described in, the following paper:

@article{Srinivasan02evaluation,
  author = {Srinivasan, P and Pant, G and Menczer, F},
  title = {A General Evaluation Framework for Topical Crawlers},
  url = {http://www.informatics.indiana.edu/fil/Papers/crawl_framework.pdf},
  journal = {Information Retrieval},
  year = {2004},
}

Please refer to the paper for a detailed illustration of the
procedures implemented in the script, and of the data files.

###########################################################

TERMS OF USE FOR DMOZ.pl

Copyright (C) 2002-2003 Filippo Menczer, Gautam Pant and the
University of Iowa

DMOZ.pl is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

DMOZ.pl is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with DMOZ.pl; if not, write to Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

###########################################################

WHAT YOU NEED TO RUN DMOZ.pl

0. Perl

1. RDF dump of the "structure" file from the Open Directory Project
   (a.k.a. ODP or DMOZ, http://dmoz.org). The file will be obtained
   by the script if so indicated by the user (see usage). Note that
   this file is large (over 300 MB) so sufficient space must be
   available on disk. The file should be available at
   http://dmoz.org/rdf/ (gzipped version is 42 MB as of this writing).

2. Google Web API
   - account key -- to obtain a key you must register by creating a
     Google account (see http://www.google.com/apis/ ), then get your
     license key and write it into the $key parameter in the script
   - GoogleSearch.wsdl -- file released with this script or available
     as part of the Google Web API download
   - SOAP::Lite -- Perl module for Google API SOAP interface.
     Install, eg, using the CPAN module.

###########################################################

DATA FILES FORMATS

For the target file, the script considers a hierarchy of targets
rooted at an internal node T such that depth(T) = TOPIC_LEVEL.
Then for each distance d = 0, 1 ... MaxDepth from T we have a set
of targets. The MaxDepth parameter loosely specifies topic
GENERALITY. So TOPIC_LEVEL and GENERALITY are two independent
parameters that describe the topic.

Format of DMOZTargets.txt:

START(N):
KEYWORDS(N):
DESCRIPTION(N_0): ...
URLs(N_0): ...
DESCRIPTION(N_1): ...
URLs(N_1): ...
...
DESCRIPTION(N_MaxDepth): ...
URLs(N_MaxDepth): ...
START(N+1):
...

For the seed file, the script does a back-link walk from the
depth d=0 targets, using the Google API. A set of seeds is chosen at
distance $dist from the targets, such that from each seed there is
at least one path of $dist links (or less) to at least one of the
targets.
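
The back-link walk described above can be pictured with a small
sketch. This is not the code in DMOZ.pl: it is written in Python
for brevity, and it assumes the back-links are already available in
a hypothetical in-memory table (backlinks) instead of being fetched
through the Google API. The function name seeds_at_distance and the
toy URLs are likewise made up for illustration.

```python
def seeds_at_distance(targets, backlinks, dist):
    """Walk `dist` back-link steps from the targets.

    `backlinks` is a hypothetical table: backlinks[p] lists the pages
    that link to page p. Every page returned has a path of exactly
    `dist` links to some target, and may also have a shorter one.
    """
    frontier = set(targets)
    for _ in range(dist):
        # Replace the frontier with all pages linking into it.
        frontier = {src for page in frontier
                    for src in backlinks.get(page, ())}
    return frontier


# Toy link graph (assumed data, not taken from the ODP or Google).
backlinks = {
    "t1": ["a", "b"],   # a and b link to target t1
    "a": ["s1"],        # s1 links to a
    "b": ["s1", "s2"],  # s1 and s2 link to b
}

seeds = seeds_at_distance(["t1"], backlinks, dist=2)
print(sorted(seeds))
```

With dist=0 the sketch returns the targets themselves; larger values
move the seeds further from the targets, and if no page is left to
walk back to, the result is empty.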
The $dist parameter is a measure of difficulty.

Format of DMOZSeeds.txt:

START(N):
KEYWORDS(N):
DESCRIPTION(N): ...
URLs(N): ...
START(N+1):
...

###########################################################

USAGE

perl DMOZ.pl followed by three arguments, in order:

1. the name of a file with the ODP structure (if gzipped, it will
   be gunzipped), or '-' if this is to be downloaded from the ODP
   site

2. a filename where the targets will be saved (if it exists, it
   will not be overwritten; the targets in it will be used to
   produce seeds; the first argument will be ignored)

3. a filename where the seeds will be saved (if it exists, it will
   not be overwritten; seeds will not be redetermined)

############## END OF README ################################