
Towards a Content-Provider-Friendly Web Page Crawler

DocUID: 2007-010

Authors: Jie Xu, Qinglan Li, Huiming Qu, Alexandros Labrinidis

Abstract: Search engine quality is impacted by two factors: the quality of the ranking/matching algorithm used and the freshness of the search engine's index, which maintains a "snapshot" of the Web. Web crawlers capture web pages and refresh the index, but this is a never-ending quest, as web pages get updated frequently (and thus have to be re-crawled). Knowing when to re-crawl a web page is fundamentally linked to the freshness of the index, given the size of the Web today and the inherent resource constraints: re-crawling too frequently wastes bandwidth, while re-crawling too infrequently brings down the quality of the search engine. In this work, we address the scheduling problem for web crawlers, with the objective of optimizing the quality of the index (i.e., maximizing the freshness probability of the local repository as well as of the index). Towards this, we utilize feedback from the users (content providers) on when their web pages are updated and consider the entire spectrum of collaboration, from no feedback to explicit update schedules. We propose a unified online scheduling algorithm that utilizes different levels of collaboration from content providers. Extensive experiments with real web traces demonstrate that cooperation from users plays a major role in improving search engine index quality.
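
Illustrative sketch (not the paper's algorithm): the abstract describes scheduling re-crawls under a resource budget using whatever update information content providers supply. The Python below is a minimal, hypothetical illustration of that idea only. It assumes a Poisson change model for pages with no provider feedback (a common modeling assumption in crawler scheduling, not stated in the abstract) and a greedy pick of the pages most likely to be stale. All names (Page, stale_probability, pick_pages) and parameters are invented for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    url: str
    last_crawl: float                      # time of the most recent crawl
    change_rate: Optional[float] = None    # assumed updates per time unit (no-feedback case)
    next_update: Optional[float] = None    # provider-announced next update time (explicit schedule)


def stale_probability(page: Page, now: float) -> float:
    """Probability that the local copy of `page` is out of date at time `now`."""
    if page.next_update is not None:
        # Explicit schedule from the provider: staleness is known exactly.
        return 1.0 if now >= page.next_update else 0.0
    if page.change_rate is not None:
        # No feedback: assume Poisson updates, so P(changed) = 1 - exp(-rate * elapsed).
        elapsed = max(0.0, now - page.last_crawl)
        return 1.0 - math.exp(-page.change_rate * elapsed)
    return 0.0  # nothing known about this page; this sketch never selects it


def pick_pages(pages: List[Page], now: float, budget: int) -> List[str]:
    """Greedy choice: crawl up to `budget` pages with the highest staleness probability."""
    ranked = sorted(pages, key=lambda p: stale_probability(p, now), reverse=True)
    return [p.url for p in ranked[:budget] if stale_probability(p, now) > 0.0]


if __name__ == "__main__":
    pages = [
        Page("a", last_crawl=0.0, next_update=5.0),
        Page("b", last_crawl=0.0, change_rate=0.1),
        Page("c", last_crawl=0.0, next_update=20.0),
    ]
    print(pick_pages(pages, now=10.0, budget=2))
```

In this sketch, a page with an announced schedule is never crawled before its update is due, so the budget goes to pages that are known or likely to be stale. This is one simple way to see why provider feedback can reduce wasted crawls.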

Published In: Proc. of the Tenth International ACM Workshop on the Web and Databases

Pages: pp. 1-10

Place Published: Beijing, China

Year Published: 2007

Note: held in conjunction with the SIGMOD 2007 Conference

Project: UserCentric

Subject Area: Others

Publication Type: Workshop Paper

Sponsor: NSF ITR ANI-0325353

Citation: Jie Xu, Qinglan Li, Huiming Qu, and Alexandros Labrinidis. Towards a Content-Provider-Friendly Web Page Crawler, Proc. of the Tenth International ACM Workshop on the Web and Databases (WebDB'07), pp. 1-10, Beijing, China, June 2007. (Held in conjunction with the SIGMOD 2007 Conference.)