The Design and Implementation of Configurable News Collection System Based on Web Crawler

Mengmeng Lu, Shuhong Wen, Yan Xiao, Pei Tian, Fang Wang
DOI: https://doi.org/10.1109/compcomm.2017.8323045
2017-01-01
Abstract: The rapid development of Internet technology has brought explosive growth in online news, and capturing news from target news websites has become a major challenge. Traditional crawlers are not highly customizable. This paper uses web crawling techniques such as regular expressions and XPath, web page analysis, and the WebMagic crawler framework to implement a configurable, Java-based news data collection system. The system supports data capture, information extraction, and news storage. It is highly configurable and can crawl news data from multiple sources.
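The idea of configurable extraction with XPath and regular expressions can be illustrated with a minimal sketch. It is not the authors' implementation and does not use WebMagic; it relies only on the JDK, and it assumes the page is well-formed XHTML (real news pages usually need a tolerant HTML parser first). The class name ConfigurableExtractor, the FieldRule type, and the sample rules and page are all hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ConfigurableExtractor {

    /** Hypothetical per-site rule: an XPath locates a field, an optional regex refines it. */
    public static final class FieldRule {
        final String xpath;
        final Pattern regex; // null when no refinement is needed

        public FieldRule(String xpath, String regex) {
            this.xpath = xpath;
            this.regex = (regex == null) ? null : Pattern.compile(regex);
        }
    }

    /** Applies every configured rule to one page and returns field name -> extracted text. */
    public static Map<String, String> extract(String xhtml, Map<String, FieldRule> rules)
            throws Exception {
        // Assumption: the input is well-formed XML/XHTML.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xp = XPathFactory.newInstance().newXPath();

        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, FieldRule> entry : rules.entrySet()) {
            FieldRule rule = entry.getValue();
            // String value of the first node matched by the XPath (empty if none).
            String value = xp.evaluate(rule.xpath, doc).trim();
            if (rule.regex != null) {
                Matcher m = rule.regex.matcher(value);
                // Use capture group 1 when the pattern defines one, else the whole match.
                value = m.find() ? m.group(m.groupCount() >= 1 ? 1 : 0) : "";
            }
            fields.put(entry.getKey(), value);
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical page and rules; in a configurable system the rules
        // would be loaded per news site from a file or database.
        String page = "<html><body><h1>Sample headline</h1>"
                + "<span class=\"time\">Published: 2017-01-01 10:00</span>"
                + "<div id=\"content\">Body text.</div></body></html>";

        Map<String, FieldRule> rules = new LinkedHashMap<>();
        rules.put("title", new FieldRule("//h1", null));
        rules.put("date", new FieldRule("//span[@class='time']", "(\\d{4}-\\d{2}-\\d{2})"));
        rules.put("body", new FieldRule("//div[@id='content']", null));

        System.out.println(extract(page, rules));
    }
}
```

Keeping the rules as data rather than code is what makes such a collector configurable: supporting another news source means adding a new rule set, not a new crawler class.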