Abstract: Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign to an input a subset of the most relevant labels from millions of label choices. Recent works in this domain have increasingly focused on a symmetric problem setting where both input instances and label features are short-text in nature. Short-text XMC with label features has found numerous applications in areas such as query-to-ad-phrase matching in search ads, title-based product recommendation, and prediction of related searches. In this paper, we propose Gandalf, a novel approach which makes use of a label co-occurrence graph to leverage label features as additional data points to supplement the training distribution. By exploiting the characteristics of the short-text XMC problem, it leverages the label features to construct valid training instances, and uses the label graph to generate the corresponding soft-label targets, hence effectively capturing the label-label correlations. Surprisingly, models trained on these new training instances, although they amount to less than half the size of the original dataset, can outperform models trained on the original dataset, particularly on the PSP@k metric for tail labels. With this insight, we train existing XMC algorithms on both the original and the new training instances, leading to an average relative improvement of 5% for 6 state-of-the-art algorithms across 4 benchmark datasets consisting of up to 1.3M labels. Gandalf can be applied in a plug-and-play manner to various methods and thus advances the state of the art in the domain without incurring any additional computational overhead.
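The augmentation idea described in the abstract can be illustrated with a minimal sketch. It treats each label's text as an extra training input and derives its soft-label target from label co-occurrence counts. The function names (`cooccurrence_soft_targets`, `augment`) and the normalization (co-occurrence count divided by the label's own frequency, with the label itself set to 1.0) are assumptions made for illustration; the abstract does not specify how the label graph is built or normalized, so this should not be read as the authors' implementation.

```python
def cooccurrence_soft_targets(label_sets, num_labels):
    """Build one soft-label target row per label from co-occurrence counts.

    label_sets: list of label-index lists, one per training instance.
    Row i, column j holds count(i and j together) / count(i); this
    normalization is an illustrative assumption, not the paper's.
    """
    counts = [0] * num_labels
    co = [[0] * num_labels for _ in range(num_labels)]
    for labels in label_sets:
        uniq = sorted(set(labels))
        for i in uniq:
            counts[i] += 1
            for j in uniq:
                if i != j:
                    co[i][j] += 1
    targets = []
    for i in range(num_labels):
        row = [co[i][j] / counts[i] if counts[i] else 0.0
               for j in range(num_labels)]
        row[i] = 1.0  # a label's own text is always relevant to that label
        targets.append(row)
    return targets


def augment(texts, label_sets, label_texts):
    """Append label texts as new inputs, paired with their soft targets."""
    num_labels = len(label_texts)
    hard = [[1.0 if j in set(ls) else 0.0 for j in range(num_labels)]
            for ls in label_sets]
    soft = cooccurrence_soft_targets(label_sets, num_labels)
    return list(texts) + list(label_texts), hard + soft
```

With three instances and three labels, `augment` returns six inputs: the three original texts with their binary targets, followed by the three label texts with soft targets. Any XMC model that accepts real-valued targets could then be trained on the combined set.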

On Data Augmentation for Extreme Multi-label Classification
