Style Transfer as Data Augmentation: A Case Study on Named Entity Recognition

Shuguang Chen, Leonardo Neves, Thamar Solorio
DOI: https://doi.org/10.48550/arXiv.2210.07916
2022-10-15
Abstract: In this work, we take the named entity recognition task in the English language as a case study and explore style transfer as a data augmentation method to increase the size and diversity of training data in low-resource scenarios. We propose a new method to effectively transform the text from a high-resource domain to a low-resource domain by changing its style-related attributes to generate synthetic data for training. Moreover, we design a constrained decoding algorithm along with a set of key ingredients for data selection to guarantee the generation of valid and coherent data. Experiments and analysis on five different domain pairs under different data regimes demonstrate that our approach can significantly improve results compared to current state-of-the-art data augmentation methods. Our approach is a practical solution to data scarcity, and we expect it to be applicable to other NLP tasks.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how to use style transfer as a data augmentation method to increase the size and diversity of training data in low-resource scenarios, and thereby improve performance on named entity recognition (NER). Specifically, it focuses on the cross-domain setting: using data from a high-resource domain to generate synthetic training data for a low-resource domain. Traditional augmentation methods may not significantly improve performance when applied to low-resource domains, because the data available there does not match high-resource data in size or diversity. Using the abundant high-resource data directly is also problematic, because of differences in data distribution (such as language shift) and inconsistent features (such as category mismatch). The paper therefore proposes a method that transforms text from a high-resource domain to a low-resource domain by changing its style-related attributes, producing synthetic data for training. To keep the generated data valid and coherent, it also designs a constrained decoding algorithm and a set of key ingredients for data selection. Experimental results show that the approach significantly outperforms existing state-of-the-art data augmentation methods under different data settings.
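To make the idea of data selection concrete, here is a minimal sketch of a filter that could sit after a style-transfer generator. It is not the paper's algorithm: the summary above does not say which selection criteria the authors use, so the three checks below (well-formed BIO tags, entities preserved from the source sentence, and text that actually changed) are illustrative assumptions, and the names `is_valid_bio`, `extract_entities`, and `keep_synthetic` are hypothetical.

```python
from collections import Counter


def is_valid_bio(tags):
    """Return True if every I-X tag continues a B-X or I-X span of the same type."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # An I- tag must follow a B- or I- tag with the same entity type.
            if prev == "O" or prev[2:] != tag[2:]:
                return False
        elif tag != "O" and not tag.startswith("B-"):
            return False  # unknown tag format
        prev = tag
    return True


def extract_entities(tokens, tags):
    """Collect (type, surface text) pairs from a BIO-tagged sentence."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((label, " ".join(current)))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append((label, " ".join(current)))
            current, label = [], None
    if current:
        entities.append((label, " ".join(current)))
    return entities


def keep_synthetic(src, gen):
    """Assumed selection rule: keep a generated sentence only if it is
    well-formed, preserves the source entities, and differs from the source."""
    src_tokens, src_tags = src
    gen_tokens, gen_tags = gen
    if len(gen_tokens) != len(gen_tags) or not is_valid_bio(gen_tags):
        return False
    if gen_tokens == src_tokens:
        return False  # nothing was transferred, so nothing is gained
    return Counter(extract_entities(src_tokens, src_tags)) == Counter(
        extract_entities(gen_tokens, gen_tags)
    )


# Hypothetical example: a formal source sentence and three candidate rewrites.
src = (["John", "Smith", "visited", "Paris", "."],
       ["B-PER", "I-PER", "O", "B-LOC", "O"])
good = (["omg", "John", "Smith", "went", "2", "Paris", "!!"],
        ["O", "B-PER", "I-PER", "O", "O", "B-LOC", "O"])
broken_tags = (["omg", "John", "went", "2", "Paris"],
               ["O", "I-PER", "O", "O", "B-LOC"])
lost_entity = (["omg", "he", "went", "2", "Paris"],
               ["O", "O", "O", "O", "B-LOC"])

kept = [c for c in (good, broken_tags, lost_entity) if keep_synthetic(src, c)]
```

In this sketch only the first candidate survives: the second has an I-PER tag with no preceding B-PER, and the third drops the person entity. A real pipeline would likely add fluency or style-strength checks on top of these structural ones.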