Research on Indexing Page Collection Selection Method for Search Engine.
Liyun Ru, Zhichao Li, Yingying Wu, Shaoping Ma
DOI: https://doi.org/10.1007/978-1-4614-6880-6_30
2014-01-01
Journal of Computer Research and Development
Abstract: With the rapid development of the Internet, the number of web pages has grown explosively, and many of these pages have similar content or are of low quality. For a search engine, indexing such pages does little to improve retrieval results but increases the indexing and retrieval burden. This paper presents a page selection algorithm that builds the indexing page collection for a search engine from massive web data. On the one hand, a web signature-based clustering algorithm filters similar pages to reduce the size of the indexing page collection; on the other hand, the algorithm combines a variety of page-dimension and user-dimension features to ensure the quality of the collection. Experiments show that the indexing page collection selected by the proposed algorithm is only one-third the size of the entire page collection, yet it can meet the vast majority of user click needs, which indicates strong practical value.
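
The two stages named in the abstract (signature-based filtering of similar pages, then quality selection from page and user features) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the signature function, the features, or how they are combined. Here the signature is assumed to be an MD5 hash of the normalized word sequence, `page_score` and `user_score` are hypothetical pre-computed features in [0, 1], and the equal weights and the `min_quality` threshold are illustrative values only.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    page_score: float  # hypothetical page-dimension feature (e.g. content quality)
    user_score: float  # hypothetical user-dimension feature (e.g. normalized clicks)


def signature(text: str) -> str:
    """Assumed stand-in for a web signature: hash of the lowercased word sequence."""
    words = re.findall(r"\w+", text.lower())
    return hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()


def quality(page: Page, w_page: float = 0.5, w_user: float = 0.5) -> float:
    """Illustrative linear combination of the two feature dimensions."""
    return w_page * page.page_score + w_user * page.user_score


def select_index_pages(pages, min_quality: float = 0.3):
    """Keep the best page per signature cluster, then drop low-quality pages."""
    best = {}
    for page in pages:
        sig = signature(page.text)
        # Pages sharing a signature form one cluster; keep the highest-quality one.
        if sig not in best or quality(page) > quality(best[sig]):
            best[sig] = page
    return [p for p in best.values() if quality(p) >= min_quality]


pages = [
    Page("a", "Hello, World!", 0.9, 0.8),
    Page("b", "hello world", 0.2, 0.1),   # same signature as "a" after normalization
    Page("c", "Other text", 0.1, 0.1),    # unique, but below the quality threshold
]
selected = select_index_pages(pages)
print([p.url for p in selected])
```

An exact hash only groups pages whose normalized text is identical; a production system would more plausibly use a near-duplicate signature such as shingling or SimHash, which is left out here to keep the sketch short.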