Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection

Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. To process a query efficiently and effectively, a metasearcher must select the most promising databases for that query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from “uncooperative” databases by using “focused query probes”, which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries.

Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases.  Our experiments indicate that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies.  Also, our hierarchical database selection algorithm exhibits significantly higher precision than its flat counterparts.
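
To make the intuition concrete, the following is a minimal Python sketch of how a category hierarchy can compensate for an incomplete content summary. It is an illustration under stated assumptions, not the algorithm evaluated in this paper: the function names (`build_summary`, `merge_summaries`, `score`, `select_hierarchical`), the linear scaling of sample frequencies to absolute document frequencies, the product-of-fractions scoring rule, the two-level hierarchy, and the toy databases are all hypothetical.

```python
from collections import Counter

def build_summary(sample_docs, estimated_size):
    """Approximate content summary: word -> estimated absolute document frequency.
    Assumption: sample frequencies are scaled linearly to the estimated database size."""
    df = Counter()
    for doc in sample_docs:
        df.update(set(doc.lower().split()))  # count each word once per document
    scale = estimated_size / len(sample_docs)
    return {"size": estimated_size, "df": {w: n * scale for w, n in df.items()}}

def merge_summaries(summaries):
    """Category summary: aggregate the summaries of the databases under a category."""
    total = Counter()
    for s in summaries:
        total.update(s["df"])
    return {"size": sum(s["size"] for s in summaries), "df": dict(total)}

def score(query, summary):
    """Hypothetical flat score: product of the fraction of documents containing each query word."""
    result = 1.0
    for word in query.lower().split():
        result *= summary["df"].get(word, 0.0) / summary["size"]
    return result

def select_hierarchical(query, categories, summaries, k):
    """Rank categories by their aggregated summaries, then rank databases inside each category."""
    cat_scores = {
        cat: score(query, merge_summaries([summaries[d] for d in dbs]))
        for cat, dbs in categories.items()
    }
    ranked = []
    for cat in sorted(categories, key=lambda c: cat_scores[c], reverse=True):
        ranked.extend(sorted(categories[cat],
                             key=lambda d: score(query, summaries[d]),
                             reverse=True))
    return ranked[:k]

# Toy data: database B's small sample happens to miss the word "treatment".
summaries = {
    "A": build_summary(["cancer treatment options", "cancer research"], 1000),
    "B": build_summary(["cancer screening"], 500),
    "C": build_summary(["football scores", "football league"], 800),
}
categories = {"Health": ["A", "B"], "Sports": ["C"]}

print(select_hierarchical("cancer treatment", categories, summaries, k=2))
```

In this toy setting a flat scorer gives database B a score of zero for the query "cancer treatment", because its sample never saw the word "treatment". The hierarchical ranking still places B ahead of the unrelated database C, since B's category as a whole matches the query.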