Computing, storage and network |
Software and data
For your class project, you may need significant CPU, storage, and/or
network bandwidth resources. If you are not a CS student, see the
instructor to discuss your options.
If you are a CS student, the following FAQs describe the CS computer systems
for CPU-intensive processing and the storage facilities available to students:
Please follow the guidelines for the use of these facilities
described in these documents. In addition, observe the following rules
when running jobs that are likely to generate high volumes of disk or
network traffic:
- Processing that will generate sustained periods of high disk activity
should be limited to a single process. If you need to have multiple
computers and/or processes doing simultaneous, high-bandwidth I/O to
central storage facilities, please get approval first.
- Processing that will generate high volumes of network traffic to non-IU
systems should be limited to no more than 200 Kbps of sustained traffic.
Please get approval before running any job that will exceed this
level for more than 1 hour.
- Running any process that will systematically scan ranges of IP addresses
or TCP port numbers is prohibited. For example, using a utility like
nmap to scan a remote system for open ports is prohibited. Likewise,
scanning ranges of IP addresses for accessible systems is prohibited.
This list is not intended to include all possible activities that are
prohibited or likely to cause system disruptions. If you are unsure whether
your intended activities are within these acceptable use policies, please ask
before you proceed. Also, if you feel that your project requires resources
beyond those available via the above facilities and policies, please see the
instructor so that we can discuss a suitable course of action.
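One way to stay under the 200 Kbps limit on traffic to non-IU systems is to throttle transfers in your own code. The following is a minimal sketch in Python, not an official tool. It assumes that "Kbps" means 1000 bits per second (so the cap is 25,000 bytes per second), and the names `Throttle` and `throttled_copy` are hypothetical helpers introduced here for illustration.

```python
import time

# Assumption: 200 Kbps is read as 200,000 bits per second.
LIMIT_BITS_PER_SEC = 200_000
LIMIT_BYTES_PER_SEC = LIMIT_BITS_PER_SEC // 8  # 25,000 bytes per second


class Throttle:
    """Keep the average transfer rate since creation at or below a cap."""

    def __init__(self, bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.bytes_per_sec = bytes_per_sec
        self.clock = clock      # injectable so the logic can be tested
        self.sleep = sleep
        self.start = clock()
        self.sent = 0

    def consume(self, nbytes):
        """Record nbytes transferred and sleep if we are ahead of the cap.

        Returns the number of seconds slept.
        """
        self.sent += nbytes
        # Earliest time at which `sent` bytes are allowed to have moved.
        earliest = self.start + self.sent / self.bytes_per_sec
        delay = earliest - self.clock()
        if delay > 0:
            self.sleep(delay)
            return delay
        return 0.0


def throttled_copy(src, dst, throttle, chunk_size=4096):
    """Copy from one file-like object to another, pausing between chunks."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        throttle.consume(len(chunk))
        total += len(chunk)
    return total
```

Any object with a `read` method can serve as `src`, for example the response returned by `urllib.request.urlopen`. Note that this sketch caps only the average rate since the `Throttle` was created, so a long idle period permits a later burst; a token bucket would bound bursts as well. If several processes share the link, each one needs a proportionally smaller cap.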
- GiveALink: donate your bookmarks
to science -- could be a great source of projects
- JavaCrawlers:
A Java library for topical crawlers
- Nutch: an open-source web search
engine
- Jakarta Lucene: a
high-performance, full-featured text search engine written in Java
- Lemur: a Toolkit for
Language Modeling and Information Retrieval
- Clair Library:
intended to simplify a number of generic tasks in Natural Language Processing
(NLP) and Information Retrieval (IR)
- WIRE: Web IR Environment
including a simple format for storing a collection of web documents, a
crawler, and tools for generating stats and reports
- Terrier: modular software
platform for the rapid development of large-scale Web IR applications,
providing indexing and retrieval functionalities;
Labrador
is a distributed web crawler designed to be integrated with Terrier
- Search APIs: bundle Google, Yahoo, or Alexa
Web services into your app
- LETOR:
Benchmark Datasets for Learning to Rank from Microsoft Research Asia
- Alexa Web Search Platform:
public access to Alexa's crawler (not free)
- The Boost Graph
Library (BGL): a generic C++ library of graph algorithms developed
at the Open Systems Lab in the IU CS department. It handles large graphs
nicely and integrates (fairly) easily with existing code.
- WebGraph: a Java framework to
study the web graph;
WebGraph++
is a C++ port that bypasses some limitations imposed by the JVM
- Weka:
Data Mining Software in Java
- WebBase:
The Stanford WebBase project investigates various issues in crawling,
storage, indexing, and querying of large collections of Web pages
- LWP: The World-Wide Web
library for Perl (doc)
- Libwww: the W3C Protocol
Library
- Blog posts: a collection of 10M posts from 1M weblogs (how to get the
data on a DVD)
- Bow: a toolkit
for statistical language modeling, text retrieval, classification and
clustering
- MG: an open-source
indexing and retrieval system for text, images, and textual images
- WebGlimpse: search engine
software including a web administration interface, remote link spider,
and the powerful Glimpse file indexing and query system
- ht://Dig: a complete world wide
web indexing and searching system for a domain or intranet
- SWISH-E: a fast, powerful,
flexible, free, and easy to use system for indexing collections of Web
pages or other files
- Internet Archive: a digital
library of Internet sites and other cultural artifacts in digital form,
providing free access to researchers and scholars (see also
Heritrix, the Internet Archive's
open-source, extensible, web-scale, archival-quality web crawler project)
- Classifier Code (download):
a collection of example classifier code written in Matlab, donated by Mark Meiss
- WebIR: more resources