Classifying and Searching Hidden-Web Text Databases

The World-Wide Web continues to grow rapidly, which makes exploiting all available information a challenge.  Search engines such as Google index an unprecedented amount of information, but still do not provide access to valuable content in text databases “hidden” behind search interfaces.  For example, current search engines largely ignore the contents of the Library of Congress, the US Patent and Trademark database, newspaper archives, and many other valuable sources of information because their contents are not “crawlable.”  However, users should be able to find the information that they need with as little effort as possible, regardless of whether this information is crawlable or not.  As a significant step towards this goal, we have designed algorithms that support browsing and searching, the two dominant ways of finding information on the web, over “hidden-web” text databases.


To support browsing, we have developed QProber, a system that automatically categorizes hidden-web text databases in a classification scheme, according to their topical focus.  QProber categorizes databases without retrieving any documents.  Instead, QProber uses just the number of matches generated by a small number of topically focused query probes.  The query probes are automatically generated using state-of-the-art supervised machine learning techniques and are typically short.  QProber’s classification approach is sometimes orders of magnitude faster than approaches that require document retrieval.
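The counting-based classification above can be sketched roughly as follows.  The `db.count_matches` method is a hypothetical stand-in for submitting a probe query through the database’s search interface and reading off the reported number of matches, and the probes themselves would come from a trained classifier; the fixed specificity threshold is likewise a simplification:

```python
def classify(db, probes_by_category, specificity_threshold=0.5):
    """Pick the category whose probes attract the largest share of matches.

    probes_by_category maps each category name to a list of short keyword
    queries (the automatically generated query probes).
    """
    totals = {
        category: sum(db.count_matches(probe) for probe in probes)
        for category, probes in probes_by_category.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return None  # no probe matched; the database stays unclassified
    best = max(totals, key=totals.get)
    # Only classify when the winning category is sufficiently dominant.
    return best if totals[best] / grand_total >= specificity_threshold else None
```

Note that only match counts cross the wire: the classifier never downloads a single document, which is what makes the approach so much cheaper than retrieval-based classification.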


To support searching, we have developed crucial building blocks for constructing sophisticated metasearchers, which search over many text databases at once through a unified query interface.  For scalability and effectiveness, it is crucial for a metasearcher to have a good database selection component that sends queries only to databases with relevant content.  Database selection algorithms typically rely on statistics that characterize the contents of each database.
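As a concrete (and much simplified) illustration, a selection component might rank databases by how often the query words appear in each database’s content summary.  The summary format here, a per-database map from word to document frequency, is an assumption made for the sketch rather than the exact statistics used by real selection algorithms such as CORI or GlOSS:

```python
def select_databases(query, summaries, k=2):
    """Rank databases by summed document frequencies of the query terms.

    summaries maps a database name to a {word: document_frequency} dict.
    Returns the k highest-scoring databases that matched at least one term.
    """
    terms = query.lower().split()
    scores = {
        db: sum(summary.get(t, 0) for t in terms)
        for db, summary in summaries.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [db for db in ranked if scores[db] > 0][:k]
```

The quality of such decisions clearly hinges on how faithfully the summaries describe the databases, which motivates the sampling and shrinkage techniques discussed next.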


Unfortunately, many hidden-web text databases are completely autonomous and do not report any summaries of their contents.  To build content summaries for such databases, we extract a small, topically focused document sample from each database during categorization and use it to build the respective content summaries.  A potential problem with content summaries derived from document samples is that any reasonably small sample will suffer from data sparseness and will not contain many words that appear in the database.  To enhance the sparse samples and improve the database selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with similar topical focus can complement each other.  We have developed two database selection algorithms that exploit this observation.  The first algorithm proceeds hierarchically, first selecting the best category for a query and then sending the query to the appropriate databases in the chosen category.  The second database selection algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words.  The shrinkage-enhanced summaries characterize the database contents better than their “unshrunk” counterparts do, and in turn help produce significantly more relevant database selection decisions and overall search results.
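The shrinkage step can be illustrated with a small sketch: the sparse word-probability estimate from a database’s document sample is linearly mixed with the estimate for the database’s category, so category-specific words absent from the sample still receive non-zero probability.  The dictionary representation and the fixed mixing weight are simplifying assumptions; the actual algorithm chooses the weights adaptively per database:

```python
def shrink_summary(sample_probs, category_probs, lam=0.7):
    """Mix a sparse sample-based language model with its category model.

    Both arguments map words to probabilities; `lam` is the weight kept
    on the (sparse) sample estimate, with 1 - lam going to the category.
    """
    vocabulary = set(sample_probs) | set(category_probs)
    return {
        w: lam * sample_probs.get(w, 0.0) + (1 - lam) * category_probs.get(w, 0.0)
        for w in vocabulary
    }
```

Because the category model is itself built by pooling samples from topically similar databases, each database effectively borrows vocabulary evidence from its neighbors.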


Content summaries of static databases do not need to change over time.  However, databases are rarely static, and the statistical summaries that describe their contents need to be updated periodically to reflect content changes.  To understand how real-world databases change over time and how these changes propagate to the database content summaries, we studied how the content summaries of 152 real web databases changed weekly over a period of 52 weeks.  Then, we used “survival analysis” techniques to examine which parameters can help predict when the content summaries need to be updated.  Based on the results of this study, we designed algorithms that analyze various characteristics of the databases and their update history to predict when the content summaries need to be modified, thus avoiding unnecessary load on the databases.
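Under the simplest change model, one might assume that content changes arrive as a Poisson process, so the probability that a summary is still fresh after time t is exp(-rate * t); the update interval then follows directly from the freshness level one is willing to tolerate.  This sketch, with its single per-database change rate, is a deliberate simplification of the survival-analysis models studied in the thesis, which condition the rate on database characteristics and update history:

```python
import math

def next_update_interval(change_rate, freshness=0.5):
    """Time until P(summary still fresh) = exp(-change_rate * t) drops
    to `freshness`, under an assumed Poisson model of content change."""
    if change_rate <= 0:
        raise ValueError("change_rate must be positive")
    return -math.log(freshness) / change_rate
```

For example, a database estimated to change 0.1 times per week would be rescheduled for refresh roughly every seven weeks at the 50% freshness level, while a fast-changing database would be revisited far more often.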


In summary, this thesis presents building blocks that are critical for enabling access to the often-valuable contents of hidden-web text databases, bringing us closer to the goal of making access to these databases as easy and efficient as access to regular web pages.