Abstract: Multi-label text classification (MLTC) has a wide range of real-world applications. Neural networks have recently improved the performance of MLTC models, but training them relies on a sufficient amount of accurately labelled data. Manually annotating large-scale MLTC datasets is expensive and impractical for many applications. Weak supervision techniques have therefore been developed to reduce the cost of annotating text corpora. However, these techniques introduce noisy labels into the training data and may degrade model performance. This paper addresses such noisy-label problems in MLTC in both single-instance and multi-instance settings. We build a novel Neural Expectation-Maximization Framework (nEM) that combines neural networks with probabilistic modelling. The nEM framework produces text representations using neural-network text encoders and is optimized with the Expectation-Maximization algorithm. It naturally accounts for noisy labels during learning by iteratively updating the model parameters and estimating the distribution of the ground-truth labels. We evaluate nEM in multi-instance noisy MLTC on a benchmark relation extraction dataset constructed by distant supervision, and in single-instance noisy MLTC on synthetic noisy datasets constructed by keyword supervision and label flipping. The experimental results demonstrate that nEM significantly improves upon baseline models in both single-instance and multi-instance noisy MLTC tasks. The experimental analysis suggests that nEM efficiently reduces the noisy labels in MLTC datasets and significantly improves model performance.
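The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea (alternating between estimating a distribution over the true labels and updating the model), not the authors' nEM implementation. It assumes a linear logistic classifier in place of the neural text encoder, random feature vectors in place of text, and a simple per-label noise model with a false-positive rate and a false-negative rate. The function names `flip_labels`, `e_step`, `m_step`, and `run_em`, and all hyperparameters, are hypothetical.

```python
import numpy as np

def sigmoid(t):
    # Clip logits to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

def flip_labels(Y, rate, rng):
    """Synthetic noise: flip each binary label independently with probability `rate`."""
    mask = rng.random(Y.shape) < rate
    return np.where(mask, 1 - Y, Y)

def e_step(prior, Y_noisy, fp, fn):
    """Posterior q = p(true label = 1 | x, noisy label) per (instance, label) pair."""
    like1 = np.where(Y_noisy == 1, 1.0 - fn, fn)  # p(noisy y | true = 1)
    like0 = np.where(Y_noisy == 1, fp, 1.0 - fp)  # p(noisy y | true = 0)
    num = prior * like1
    return num / (num + (1.0 - prior) * like0 + 1e-12)

def m_step(X, q, Y_noisy, W, b, lr=0.5, steps=50):
    """Fit the classifier to soft targets q, then re-estimate the noise rates."""
    n = X.shape[0]
    for _ in range(steps):
        grad = sigmoid(X @ W + b) - q  # gradient of soft cross-entropy w.r.t. logits
        W = W - lr * (X.T @ grad) / n
        b = b - lr * grad.mean(axis=0)
    fn = (q * (1 - Y_noisy)).sum(axis=0) / (q.sum(axis=0) + 1e-12)
    fp = ((1 - q) * Y_noisy).sum(axis=0) / ((1 - q).sum(axis=0) + 1e-12)
    # Assumption: keep rates below 0.5 so "true" and "flipped" cannot swap roles.
    return W, b, np.clip(fp, 1e-3, 0.49), np.clip(fn, 1e-3, 0.49)

def run_em(X, Y_noisy, iters=10, init_noise=0.1):
    d, k = X.shape[1], Y_noisy.shape[1]
    W, b = np.zeros((d, k)), np.zeros(k)
    # Warm start: train once on the noisy labels as if they were clean.
    W, b, _, _ = m_step(X, Y_noisy.astype(float), Y_noisy, W, b)
    fp, fn = np.full(k, init_noise), np.full(k, init_noise)
    for _ in range(iters):
        q = e_step(sigmoid(X @ W + b), Y_noisy, fp, fn)
        W, b, fp, fn = m_step(X, q, Y_noisy, W, b)
    return W, b, fp, fn, q

# Toy data (assumed): 500 instances, 5 features, 3 labels, 20% label flipping.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
Y_clean = (X @ rng.normal(size=(5, 3)) > 0).astype(int)
Y_noisy = flip_labels(Y_clean, 0.2, rng)

W, b, fp, fn, q = run_em(X, Y_noisy)
print("estimated false-positive rates:", np.round(fp, 2))
print("estimated false-negative rates:", np.round(fn, 2))
print("agreement of noisy labels with clean labels:", (Y_noisy == Y_clean).mean())
print("agreement of EM posterior with clean labels:", ((q > 0.5) == Y_clean).mean())
```

The two printed agreement scores allow a quick comparison between the raw noisy labels and the posterior estimate on this toy data. Results on real text, with a neural encoder and a weak-supervision noise process, may differ substantially.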

Combining Labeled And Unlabeled Data For Spam Classification

Large-margin Classification for Combating Disguise Attacks on Spam Filters

Camouflaged Chinese Spam Content Detection with Semi-supervised Generative Active Learning

An Ensemble Learning Approach for Addressing the Class Imbalance Problem in Twitter Spam Detection

An enhanced EM method of semi-supervised classification based on Naive Bayesian

Conditional Semi-Supervised Data Augmentation for Spam Message Detection with Low Resource Data

A Late Multi-Modal Fusion Model for Detecting Hybrid Spam E-mail

Training SpamAssassin with Active Semi-supervised Learning

Efficient Modeling of Spam Images

Ensemble Decision for Spam Detection Using Term Space Partition Approach

A Spam Filtering Method Based on Multi-Modal Fusion

Boosting label weighted extreme learning machine for classifying multi-label imbalanced data

Combining Active Learning and Semi-Supervised Learning Based on Extreme Learning Machine for Multi-class Image Classification

Combining Neural Networks and Semantic Feature Space for Email Classification

A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts

Classification of Spam Emails through Hierarchical Clustering and Supervised Learning

A counting-based method for massive spam mail classification

Detecting Spam E-mails with Content and Weight-based Binomial Logistic Model

A Neural Expectation-Maximization Framework for Noisy Multi-Label Text Classification

Email Classification Using Behavior and Time Features

Extracting discriminative information from e-mail for spam detection inspired by Immune System