Performance of Data Augmentation Methods for Brazilian Portuguese Text Classification

Marcellus Amadeus, Paulo Branco
2023-04-06
Abstract: Improving machine learning performance while increasing model generalization has long been a goal pursued by AI researchers. Data augmentation techniques are often used towards achieving this target, and most of their evaluation is done on English corpora. In this work, we took advantage of different existing data augmentation methods to analyze their performance when applied to text classification problems using Brazilian Portuguese corpora. As a result, our analysis shows some putative improvements in using some of these techniques; however, it also suggests further exploitation of language bias and non-English text data scarcity.
Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to address is how to improve the performance of Brazilian Portuguese text classification tasks through data augmentation techniques and increase the model's generalization ability. Specifically, the authors focus on the following aspects:
1. **Data Scarcity**: For non-English languages (such as Brazilian Portuguese), high-quality training data is often scarce, which limits the model's performance.
2. **Applicability of Existing Methods**: Most existing data augmentation techniques are developed based on English corpora, and it is unclear how effective these techniques are for other languages.
3. **Language Specificity**: Different languages may require specific data augmentation methods to ensure significant performance improvements.
To address these issues, the authors re-examined and applied various existing data augmentation methods, conducting experiments using Brazilian Portuguese text datasets to verify the effectiveness and generalizability of these methods.
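To make the idea of text data augmentation concrete, here is a minimal sketch of two generic token-level perturbations, random swap and random deletion. This summary does not say which methods the authors evaluated, so the choice of operations, the function names (`random_swap`, `random_deletion`, `augment`), the whitespace tokenization, and the default parameters are all illustrative assumptions rather than the paper's setup.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random position pairs exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(text, n_copies=2, p_delete=0.1, n_swaps=1, seed=0):
    """Produce n_copies perturbed variants of a whitespace-tokenized sentence.

    Hypothetical helper for illustration only; not the paper's pipeline.
    """
    rng = random.Random(seed)
    tokens = text.split()
    variants = []
    for _ in range(n_copies):
        perturbed = random_swap(tokens, n_swaps, rng)
        perturbed = random_deletion(perturbed, p_delete, rng)
        variants.append(" ".join(perturbed))
    return variants


if __name__ == "__main__":
    # Example Brazilian Portuguese sentence; each variant keeps the original label.
    for variant in augment("o atendimento da loja foi muito bom", n_copies=3):
        print(variant)
```

Each variant would be added to the training set with the same class label as its source sentence. Perturbations like these are language-agnostic, whereas methods that rely on synonym lexicons or pretrained models depend on the resources available for the target language, which is where the scarcity noted above could matter.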