Schema driven and topic specific web crawling

Qi Guo, Hang Guo, Zhiqiang Zhang, Jing Sun, Jianhua Feng
DOI: https://doi.org/10.1007/11408079_55
2005-01-01
Abstract: We propose a new approach for discovering and extracting topic-specific hypertext resources from the WWW. The method, called schema-driven and topical crawling, allows a user to define a schema and extraction rules for a specific domain of interest. It supports automatically searching for and extracting schema-relevant web pages from the web. Unlike common approaches that surf solely on web pages, our approach lets the crawler surf on a virtual network composed of concept instances and relationships. To achieve this goal, we design an architecture that integrates several techniques, including a web extractor, a meta-search engine, and query expansion, and we provide a toolkit to support it.
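
The idea of crawling over concept instances rather than over hyperlinks can be illustrated with a minimal Python sketch. It is not the authors' toolkit: the schema layout, the regex extraction rules, the in-memory `PAGES` corpus, and the `search` function (a stand-in for a meta-search engine) are all assumptions made for illustration, and query expansion is left out.

```python
import re
from collections import deque

# Hypothetical schema: one regex extraction rule per concept, plus the
# relationships allowed between concepts.
SCHEMA = {
    "concepts": {
        "Author": re.compile(r"Author:\s*([A-Z][\w ]+)"),
        "Paper": re.compile(r'Paper:\s*"([^"]+)"'),
    },
    "relationships": [("Author", "wrote", "Paper")],
}

# Toy corpus standing in for the web (page id -> page text).
PAGES = {
    "p1": 'Author: Ann Lee\nPaper: "Topical Crawling"',
    "p2": 'Paper: "Topical Crawling"\nAuthor: Bo Chen',
    "p3": 'Author: Bo Chen\nPaper: "Query Expansion"',
    "p4": "Cooking recipes and travel tips",
    "p5": 'Author: Cy Diaz\nPaper: "Unrelated Work"',
}


def search(term, pages):
    """Stand-in for a meta-search engine: ids of pages that mention term."""
    return sorted(pid for pid, text in pages.items() if term in text)


def extract_instances(text, schema):
    """Apply every extraction rule and return (concept, value) pairs."""
    found = []
    for concept, rule in schema["concepts"].items():
        for match in rule.finditer(text):
            found.append((concept, match.group(1).strip()))
    return found


def crawl(seed, pages, schema, max_instances=20):
    """Breadth-first walk whose frontier holds concept instances, not URLs."""
    frontier = deque([seed])
    seen = {seed}
    edges = set()
    while frontier and len(seen) < max_instances:
        _, value = frontier.popleft()
        for pid in search(value, pages):
            found = extract_instances(pages[pid], schema)
            # Link co-occurring instances whose concepts match a relationship.
            for src, rel, dst in schema["relationships"]:
                for c1, v1 in found:
                    for c2, v2 in found:
                        if (c1, c2) == (src, dst):
                            edges.add((v1, rel, v2))
            # Newly discovered instances become the next nodes to visit.
            for inst in found:
                if inst not in seen:
                    seen.add(inst)
                    frontier.append(inst)
    return seen, edges


instances, edges = crawl(("Author", "Ann Lee"), PAGES, SCHEMA)
```

Starting from a single seed instance, the walk reaches other authors and papers only through shared instances, so pages that match no rule, or that are connected to nothing already found, are never visited.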