Abstract: The rise of online crowdsourcing services has prompted an evolution from traditional spamming accounts, which spread unwanted advertisements and fraudulent content, into novel spammers whose accounts resemble those of normal users. Prior research has mainly studied machine accounts and spam messages separately, but the characteristics of these new types of spammers and spam make it difficult for traditional methods to perform well. In this paper, we integrate the study of these new types of spammers with the study of crowdturfing microblogs, investigating the mechanism of crowdsourcing and the close relationship between crowdturfing spammers and microblogs in order to detect both more precisely. We propose a novel semi-supervised learning framework for co-detecting crowdturfing microblogs and spammers by comprehensively modeling user behavior, message content, and users' following and retweeting networks. To meet the challenge of sparsely labeled datasets, we design a co-detection objective function that minimizes empirical error and propagates the sparse labels to unlabeled samples. The advantage of this framework is threefold. First, through in-depth mining of new-type spammers, we aggregate a number of newly found features that help distinguish normal users from new-type spammers. Second, by modeling both following and retweeting networks, we characterize the crowdsourcing mechanism that spammers abuse in crowdturfing microblog diffusion, which markedly increases detection performance. Third, through our semi-supervised objective function, we overcome the problem of label sparseness and can therefore deal more reliably with large, sparsely labeled data.
Extensive experiments on real datasets demonstrate that our method outperforms four baselines on various metrics, including Precision-Recall, AUC, and Precision@K. We also develop a robust system whose functions include data collection and availability analysis, spam and spammer detection, and visualization. To make our experiments replicable, we have made our dataset and code openly available at https://github.com/sunxiangguo/Crowdturfing.
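
The idea of spreading a few known labels to unlabeled samples over a social graph can be illustrated with a minimal sketch. This is not the objective function proposed in the paper; it is the classic label-spreading iteration F = alpha * S * F + (1 - alpha) * Y on a symmetrically normalized graph, shown only to make the general mechanism concrete. The function name `propagate_labels`, the toy graph, the edge weights, and the value of `alpha` are all assumptions chosen for illustration.

```python
def propagate_labels(adjacency, seed_labels, alpha=0.8, iterations=50):
    """Generic label spreading (illustrative; not the paper's objective).

    adjacency:   dict node -> {neighbor: weight}, assumed symmetric
    seed_labels: dict node -> +1.0 (spammer) or -1.0 (normal user);
                 unlabeled nodes are simply absent
    """
    nodes = list(adjacency)
    degree = {u: sum(adjacency[u].values()) for u in nodes}
    scores = {u: seed_labels.get(u, 0.0) for u in nodes}

    for _ in range(iterations):
        updated = {}
        for u in nodes:
            # Neighbor scores weighted by w_uv / sqrt(d_u * d_v)
            neighbor_term = sum(
                w * scores[v] / (degree[u] * degree[v]) ** 0.5
                for v, w in adjacency[u].items()
            )
            # Blend the propagated signal with the original seed label
            updated[u] = alpha * neighbor_term + (1 - alpha) * seed_labels.get(u, 0.0)
        scores = updated
    return scores


# Hypothetical toy graph: two tight groups joined by one weak edge (c-d).
graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 0.1},
    "d": {"c": 0.1, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"d": 1.0, "e": 1.0},
}
seeds = {"a": 1.0, "f": -1.0}  # only two of six accounts are labeled

scores = propagate_labels(graph, seeds)
for user in sorted(scores):
    print(user, "spammer-like" if scores[user] > 0 else "normal-like")
```

In this toy example the unlabeled accounts take the sign of the seed in their own group, because the weak edge between the groups carries little of the propagated signal. A co-detection method would additionally couple user scores with message scores, which this sketch does not attempt.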

A Social Spam Detection Framework via Semi-supervised Learning

A Semi-Supervised Framework for Social Spammer Detection

Bio-Inspired Algorithm Based Undersampling Approach and Ensemble Learning for Twitter Spam Detection

Online Social Spammer Detection

An Adaptive Social Spammer Detection Model With Semi-Supervised Broad Learning

Social Spammer and Spam Message Co-Detection in Microblogging with Social Context Regularization

Co-Detection of Crowdturfing Microblogs and Spammers in Online Social Networks

Spammer Detection On Online Social Networks Based On Logistic Regression

Semi-Supervised Spam Detection in Twitter Stream

LSSL-SSD: Social Spammer Detection with Laplacian Score and Semi-supervised Learning

Co-detecting Social Spammers and Spam Messages in Microblogging Via Exploiting Social Contexts

Detecting Spammers on Twitter Based on Content and Social Interaction

Detecting Spam on Sina Weibo

A Semantics and Behaviors-Collaboratively Driven Spammer Detection Method

Combating the Evolving Spammers in Online Social Networks

A cascading framework for uncovering spammers in social networks

Community Based Spammer Detection In Social Networks

Social Spammer Detection Via Convex Nonnegative Matrix Factorization

SSDMV: Semi-Supervised Deep Social Spammer Detection by Multi-view Data Fusion

Two-layer Sampling Active Learning Algorithm for Social Spammer Detection

An Ensemble Learning Approach for Addressing the Class Imbalance Problem in Twitter Spam Detection