Parallelizing the Extraction of Fresh Information from Online Social Networks

Rui Guo, Hongzhi Wang, Mengwen Chen, Jianzhong Li, Hong Gao
DOI: https://doi.org/10.1016/j.future.2015.11.021
IF: 7.307
2016-01-01
Future Generation Computer Systems
Abstract: Online social networks (OSNs) have been among the most popular new services of recent years. OSNs maintain records of the lives of users, thereby providing potential resources for journalists, sociologists, and business analysts. Crawling data from social networks is a basic step in the processing and analysis of social network information. However, as OSNs grow larger and their content is updated faster than ordinary web pages, crawling becomes more difficult due to limitations in bandwidth, politeness or etiquette, and computational power. To extract fresh information from OSNs efficiently and effectively, we propose a novel crawling method and discuss a parallelization architecture for OSNs. To identify the features of OSNs, we collected data from real OSNs, analyzed them, and built a model to describe the behavior of users. Based on this model, we developed methods to predict the behavior of users. Using these predictions, we can schedule our crawler in a more reasonable manner and extract more fresh information using parallelization techniques. Our experimental results demonstrate that the proposed strategies can extract information from OSNs efficiently and effectively.
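The abstract does not spell out the scheduling rule, so the following is only a minimal sketch of the general idea of prediction-driven crawl scheduling, not the authors' method. It assumes, purely for illustration, that each user posts at a roughly constant estimated rate (a Poisson-style model), that a crawl round has a fixed request budget, and that users are split across workers by a simple round-robin partition. All names (UserState, expected_new_posts, pick_users_to_crawl, partition) are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class UserState:
    user_id: str
    rate_per_hour: float    # assumed estimate of how often this user posts
    last_crawl_hour: float  # time of the last crawl, in hours


def expected_new_posts(user: UserState, now_hour: float) -> float:
    """Expected posts since the last crawl under the assumed constant-rate model."""
    return user.rate_per_hour * max(0.0, now_hour - user.last_crawl_hour)


def pick_users_to_crawl(users, now_hour, budget):
    """Return up to `budget` users with the most expected unseen posts."""
    return heapq.nlargest(
        budget, users, key=lambda u: expected_new_posts(u, now_hour)
    )


def partition(users, n_workers):
    """Split users across workers round-robin, in sorted id order."""
    shards = [[] for _ in range(n_workers)]
    for i, u in enumerate(sorted(users, key=lambda u: u.user_id)):
        shards[i % n_workers].append(u)
    return shards


# Illustrative data only; these rates are made up, not taken from the paper.
users = [
    UserState("a", 2.0, 0.0),
    UserState("b", 0.5, 0.0),
    UserState("c", 1.0, 3.0),
    UserState("d", 0.1, 0.0),
]
picked = pick_users_to_crawl(users, now_hour=4.0, budget=2)
shards = partition(users, n_workers=2)
```

In this sketch each worker could run pick_users_to_crawl on its own shard, so the limited request budget goes to the accounts most likely to have new content since their last visit.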