Parallel Approach and Platform for Large-Scale Web Data Extraction

Yi Shen, Shengsheng Shi, Haitao Wang, Wu Wei, Chunfeng Yuan, Yihua Huang
DOI: https://doi.org/10.1109/cbd.2013.24
2013-01-01
Abstract: As the most popular information publishing platform, the Web contains a great deal of valuable information of interest to users or applications. Although many data extraction techniques have been studied over the last decade, they still fall far short of meeting the needs of real-world data extraction. On the one hand, most of them cannot support the whole web information extraction process, which involves three stages: web page navigation, data extraction, and data integration. On the other hand, they cannot support parallel data extraction over large-scale collections of web pages. In this paper, we propose a parallel approach and platform based on Hadoop MapReduce for large-scale web data extraction. Our approach can perform the whole three-stage web data extraction process in parallel. Experimental results show that our approach is efficient and can achieve linear speedup.