A Focused Crawler Based on Correlation Analysis

Qiuli Qin, Xin Peng
DOI: https://doi.org/10.14257/ijfgcn.2014.7.6.02
2014-01-01
International Journal of Future Generation Communication and Networking
Abstract: With the rapid development of network and information technology, a huge amount of data is now available on the internet. A major problem facing many researchers is how to effectively extract information on a particular subject or field from this data. In this paper, we try to build a focused crawler based on the vector space model (VSM) and TF-IDF text correlation analysis. We take the seed URL as the collection entry point and fetch web pages from the internet. We then analyze page information using techniques such as web content extraction and page link analysis to obtain the main content of each page. Using the correlation analysis method based on VSM and TF-IDF, we calculate the correlation between pages and the predefined topics, so that the crawler focuses on the topic-relevant areas of the web.
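The correlation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `relevance` are assumptions made for illustration, and the abstract does not specify how the paper weights terms or thresholds the score. The sketch represents the topic description and each page's extracted text as TF-IDF vectors in a shared vector space and scores each page by cosine similarity to the topic.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption) so terms found in every document keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relevance(topic, pages):
    """Score each page's main content against the topic description."""
    vectors = tfidf_vectors([topic] + pages)
    return [cosine(vectors[0], v) for v in vectors[1:]]


# Hypothetical topic description and page texts, for illustration only.
topic = "focused crawler web page topic relevance"
pages = [
    "A focused crawler fetches web pages relevant to a topic",
    "Recipe for tomato soup with basil",
]
scores = relevance(topic, pages)
```

In a crawler built along these lines, pages whose score exceeds some chosen threshold would be kept and their outgoing links queued for fetching; the threshold value is a tuning decision that the abstract does not state.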