Micro-blogs Data Collection Based on MapReduce

YU Liu-bao, HU Chang-jun, SU Lin-han
DOI: https://doi.org/10.3969/j.issn.1002-137x.2012.z3.040
2012-01-01
Computer Science
Abstract: Micro-blog data is both large in volume and highly time-sensitive, so it is difficult to collect enough micro-blogs in a short period using traditional Web text crawling methods. To address this data collection problem in micro-blog research, this paper presents a data collection platform based on MapReduce and built on Hadoop. The platform takes advantage of Hadoop's distributed framework to crawl micro-blogs on multiple nodes simultaneously, greatly improving the crawling rate. Because the input data for micro-blog collection is too small for Hadoop to balance the load effectively, the paper proposes supplying the input as a number of small files. The system is tested using Sina micro-blogs as an example. The results show that the system is low-cost, scalable, and high-performing. It can serve as the basic data collection platform for public opinion analysis, communication research, and social network research based on micro-blog data.
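The small-files idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the input files are produced, so everything here is an assumption for illustration: the crawl input is taken to be a list of seed identifiers (for example user IDs), and the helper names `partition_seeds` and `write_seed_files` are hypothetical. The reasoning is that Hadoop's default file-based input formats create at least one input split per non-empty file, so spreading a tiny seed list over several files lets several map tasks (and therefore several nodes) crawl at once.

```python
import os
import tempfile


def partition_seeds(seeds, num_files):
    """Distribute seeds round-robin into at most num_files non-empty buckets."""
    if num_files <= 0:
        raise ValueError("num_files must be positive")
    buckets = [seeds[i::num_files] for i in range(num_files)]
    # Drop empty buckets so no empty input file is written.
    return [bucket for bucket in buckets if bucket]


def write_seed_files(seeds, out_dir, num_files):
    """Write each bucket to its own small file and return the file paths.

    The directory would then be uploaded to HDFS and used as the job input
    (an assumption; the paper does not describe this step).
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, bucket in enumerate(partition_seeds(seeds, num_files)):
        path = os.path.join(out_dir, "seeds-%05d.txt" % index)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(bucket) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical seed identifiers, one per micro-blog account to crawl.
    demo_seeds = ["user%03d" % n for n in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        for p in write_seed_files(demo_seeds, tmp, num_files=4):
            print(os.path.basename(p))
```

Round-robin assignment keeps the file sizes within one seed of each other, which is a simple stand-in for whatever balancing strategy the authors actually used.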