Text is ubiquitous and, not surprisingly, many applications rely on textual data for a variety of tasks. For example, information extraction applications retrieve documents and extract structured relations from the unstructured text in the documents. Reputation management systems download Web pages to track the “buzz” around companies and products. Comparison shopping agents locate e-commerce Web sites and add the products offered in the pages to their own index.
To process a text-centric task over a text database (or the Web), we can retrieve the relevant database documents in different ways. One approach is to scan or crawl the database to retrieve its documents and process them as required by the task. While such an approach guarantees that we cover all documents that are potentially relevant for the task, this method might be unnecessarily expensive in terms of execution time. For example, consider the task of extracting information on disease outbreaks (e.g., the name of the disease, the location and date of the outbreak, and the number of affected people) as reported in news articles. This task does not require that we scan and process, say, the articles about sports in a newspaper archive. In fact, only a small fraction of the archive is of relevance to the task. For tasks such as this one, a natural alternative to crawling is to exploit a search engine index on the database to retrieve, via careful querying, the useful documents. In our example, we can use keywords that are strongly associated with disease outbreaks (e.g., World Health Organization, case fatality rate) and turn these keywords into queries to find news articles that are appropriate for the task.
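The contrast between the two strategies can be made concrete with a toy example. The sketch below (our own illustration, not code from the article) runs both plans over a tiny in-memory "database": the documents, the keyword queries, and the stand-in relevance test are all invented for the demonstration. The crawl-based plan processes every document; the query-based plan processes only documents matched by task-specific keyword queries against an inverted index, trading some recall for a much smaller number of documents processed.

```python
# Toy database: ids mapped to (lowercased) document text. All content
# here is fabricated purely to illustrate the two execution strategies.
docs = {
    1: "who reports a cholera outbreak with a high case fatality rate",
    2: "local team wins the championship game in overtime",
    3: "officials confirm an ebola outbreak in the northern province",
    4: "stock markets rally as tech shares surge",
    5: "world health organization tracks a new flu outbreak",
}

def is_relevant(text):
    # Stand-in for the (expensive) task-specific processing step,
    # e.g., an information extraction system.
    return "outbreak" in text

# Crawl-based plan: process every document; perfect recall, maximal cost.
crawl_processed = set(docs)
crawl_hits = {d for d in crawl_processed if is_relevant(docs[d])}

# Query-based plan: build a toy inverted index, then retrieve only the
# documents matching keywords associated with the task.
index = {}
for d, text in docs.items():
    for word in set(text.split()):
        index.setdefault(word, set()).add(d)

queries = ["fatality", "organization"]  # task-specific keywords
query_processed = set().union(*(index.get(q, set()) for q in queries))
query_hits = {d for d in query_processed if is_relevant(docs[d])}

relevant = {d for d in docs if is_relevant(docs[d])}
crawl_recall = len(crawl_hits) / len(relevant)  # 1.0 by construction
query_recall = len(query_hits) / len(relevant)  # may fall short of 1.0
```

Here the query-based plan processes 2 of 5 documents but misses the relevant document that matches neither query, yielding recall 2/3 versus the crawl's guaranteed 1.0; this is exactly the time-versus-recall trade-off discussed below.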
The choice between a crawl- and a query-based execution strategy for a text-centric task is analogous to the choice between a scan- and an index-based execution plan for a selection query over a relation. Just as in the relational model, the choice of execution strategy can substantially affect the execution time of the task. In contrast to the relational world, however, this choice might also affect the quality of the output that is produced: while a crawl-based execution of a text-centric task guarantees that all documents are processed, a query-based execution might miss some relevant documents, hence producing potentially incomplete output, with less-than-perfect recall. The choice between crawl- and query-based execution plans can then have a substantial impact on both execution time and output recall. Nevertheless, this important choice is typically left to simplistic heuristics or plain intuition.
In this article, we introduce fundamental building blocks for the optimization of text-centric tasks. Towards this goal, we show how to rigorously analyze query- and crawl-based plans for a task in terms of both execution time and output recall. To analyze crawl-based plans, we apply techniques from statistics to model crawling as a document sampling process; to analyze query-based plans, we first abstract the querying process as a random walk on a querying graph, and then apply results from the theory of random graphs to discover relevant properties of the querying process. Our cost model reflects the fact that the performance of the execution plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model.
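To give intuition for the sampling view of crawling, the following minimal sketch (our own illustration, with invented parameters; it is not the article's actual model) treats a crawl as drawing documents uniformly at random without replacement and estimates, by Monte Carlo simulation, how recall grows with the number of documents processed.

```python
import random

def expected_recall_after(n, N, R, trials=2000, seed=0):
    """Monte Carlo estimate of recall after crawling n of N documents,
    R of which are relevant, under a uniformly random crawl order.
    All parameter values used below are illustrative, not measured."""
    rng = random.Random(seed)
    relevant = set(range(R))  # label the first R document ids as relevant
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(range(N), n)  # crawl n docs without replacement
        total += sum(1 for d in sample if d in relevant) / R
    return total / trials

# Under uniform sampling the expected recall after n documents is simply
# n / N, independent of R, so the simulated estimate should track that line.
est = expected_recall_after(n=250, N=1000, R=50)
```

The simulation confirms the linear growth one would expect from uniform sampling; the article's point is precisely that such statistical modeling lets an optimizer predict recall (and cost) for a crawl-based plan without executing it.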