Taming Prompt-Based Data Augmentation for Long-Tailed Extreme Multi-Label Text Classification.

Pengyu Xu, Mingyang Song, Ziyi Li, Sijin Lu, Liping Jing, Jian Yu
DOI: https://doi.org/10.1109/ICASSP48485.2024.10446315
2024-01-01
Abstract: In extreme multi-label text classification (XMC), labels usually follow a long-tailed distribution: most labels are associated with only a small number of documents, which limits the performance of XMC. Data augmentation (DA) is a simple but effective strategy for addressing such low-resource problems. In this paper, we propose a prompt-based DA method called XDA, which is specifically designed for XMC. First, we employ a soft prompt while fine-tuning the T5 model for label-conditional DA, enabling T5 to augment samples while preserving label compatibility. Subsequently, XDA filters the augmented samples according to the diversity of the text and the consistency of the labels, which improves the quality of the DA. In contrast to traditional sample-level DA, we propose a pair-level DA method that masks the augmented sample-label pairs of head labels during training, effectively mitigating the long-tailed problem. Comprehensive experiments on benchmark datasets show that the proposed XDA outperforms state-of-the-art counterparts.
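The pair-level masking idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss, the head/tail criterion, or which pairs are masked, so the function name `pair_masked_bce`, the frequency cutoff `head_threshold`, and the choice of binary cross-entropy are all assumptions made for illustration. The sketch assumes that every (augmented sample, head label) pair is dropped from the loss, so augmented samples contribute only through tail labels.

```python
import numpy as np

def pair_masked_bce(logits, targets, is_augmented, label_freq, head_threshold):
    """Binary cross-entropy that ignores (augmented sample, head label) pairs.

    All names and the frequency-based head/tail split are hypothetical;
    the abstract does not define them.
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    is_augmented = np.asarray(is_augmented, dtype=bool)

    # Assumption: a label is a "head" label if its training frequency
    # reaches the cutoff.
    head = np.asarray(label_freq) >= head_threshold

    # mask[i, j] is 0 when sample i is augmented and label j is a head label,
    # and 1 otherwise, so original samples keep all of their pairs.
    mask = 1.0 - np.outer(is_augmented, head).astype(float)

    # Numerically stable BCE computed directly from logits, one term per pair.
    per_pair = (np.maximum(logits, 0.0) - logits * targets
                + np.log1p(np.exp(-np.abs(logits))))

    # Average over the pairs that remain after masking.
    loss = float((per_pair * mask).sum() / max(mask.sum(), 1.0))
    return loss, mask

# Two samples (the second is augmented) and three labels (the first is a head label).
loss, mask = pair_masked_bce(
    logits=np.zeros((2, 3)),
    targets=[[1, 0, 0], [1, 1, 0]],
    is_augmented=[False, True],
    label_freq=[100, 2, 1],
    head_threshold=10,
)
```

In this toy call only the pair formed by the augmented sample and the frequent first label is masked, so the augmented document adds training signal for the two rare labels without further inflating the head label.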