On the Research and Design of Deep Web Crawler

ZHENG Dongdong, ZHAO Pengpeng, CUI Zhiming
DOI: https://doi.org/10.3321/j.issn:1000-0054.2005.09.037
2005-01-01
Abstract: An ever-increasing amount of information on the web is available only through search interfaces: users must key in a set of keywords to access pages from certain web sites, which are often referred to as the hidden web or the deep web. Since there are no static links to hidden web pages, search engines cannot discover and index them. However, according to recent studies, the content provided by many hidden web sites is often of very high quality and can be extremely valuable to many users. This paper studies how to build an effective hidden web crawler that can autonomously discover and download pages from the hidden web. Since the only entry point to a hidden web site is a query interface, the main challenge for a hidden web crawler is how to automatically generate meaningful queries to issue to the site. A theoretical framework for investigating the query generation problem for the hidden web is provided, and effective policies for generating queries automatically are proposed. Experiments show that these policies are effective.
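The query generation problem described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the proposed policies, so the greedy rule below (issue the most frequent not-yet-issued term found in the pages downloaded so far) is only an assumed stand-in, not the authors' method. The names `crawl_hidden_site`, `toy_search`, and `PAGES` are hypothetical, and the in-memory "site" replaces a real query interface.

```python
from collections import Counter

# Hypothetical in-memory "hidden site": pages are reachable only by keyword search.
PAGES = {
    1: "deep web crawler",
    2: "web search engine",
    3: "search interface query",
    4: "cooking pasta recipe",
}

def toy_search(keyword):
    """Stand-in for a site's query interface: return pages containing the keyword."""
    return {pid: text for pid, text in PAGES.items() if keyword in text.split()}

def crawl_hidden_site(search, seed, max_queries=10):
    """Greedy loop: issue a query, harvest terms from newly downloaded pages, repeat."""
    downloaded = {}          # page id -> page text
    issued = []              # queries in the order they were sent
    term_counts = Counter()  # number of downloaded pages containing each term
    query = seed
    while query is not None and len(issued) < max_queries:
        issued.append(query)
        for page_id, text in search(query).items():
            if page_id not in downloaded:
                downloaded[page_id] = text
                term_counts.update(set(text.lower().split()))
        # Assumed policy: pick the most frequent term that has not been issued yet.
        candidates = [(n, t) for t, n in term_counts.items() if t not in issued]
        query = max(candidates)[1] if candidates else None
    return downloaded, issued

pages, queries = crawl_hidden_site(toy_search, "crawler")
print(sorted(pages), queries)
```

In this toy run, page 4 is never retrieved because it shares no term with any page reachable from the seed, which shows why the choice of queries determines how much of a hidden site a crawler can cover.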