The volume of information available in electronic form is growing exponentially, and this trend is expected to continue. Although traditional search engines like AltaVista can address common information needs, they ignore the often valuable information that is “hidden” behind search interfaces, the so-called “hidden web”.
Automating the classification of “hidden web” resources is challenging because the contents of these collections are available only by querying, not by traditional crawling. For example, consider the PubMed medical database from the National Library of Medicine, which stores medical bibliographic information and links to full-text journals accessible through the web. This database is accessible only through a query interface. A query to PubMed with the keyword “cancer” returns 1,313,266 matches, which are high-quality citations to medical articles stored locally at the PubMed site. The contents of PubMed are not “crawlable” by traditional search engines: a query on AltaVista for all the pages in the PubMed site with the keyword “cancer” returns only 16,380 matches, about 1% of the PubMed total. Hence, techniques that require the documents to be available for inspection cannot be applied to analyze and classify “hidden web” resources.
The ability to access these resources and organize them for subsequent use is a central component of the Digital Libraries Initiative – Phase 2 (DLI2) project at Columbia University. The project, named PERSIVAL, aims to provide personalized access to a distributed patient care digital library comprising a wide variety of collections. Because manual inspection and classification of these resources does not scale, we developed a novel technique to automate this task.