Abstract: The exponential growth of digital data in recent years has spurred significant interest in natural language processing (NLP) and sentiment analysis. However, the effectiveness of NLP models relies heavily on large annotated datasets, which are scarce or entirely absent for many languages, including Turkish. This scarcity is a major obstacle to developing NLP models for Turkish. Various techniques have been proposed to enlarge annotated datasets, with text data augmentation emerging as a promising solution. Text data augmentation generates synthetic data by transforming existing data, thereby expanding the diversity and volume of the annotated dataset. While this technique has improved the performance of NLP models, it has received limited exploration for Turkish and other low-resource languages. This paper introduces a novel ensemble approach to text data augmentation tailored for Turkish text sentiment classification. Our approach integrates both task-specific and universal transformations, capitalizing on the strengths of each to enrich the training dataset. We evaluate the proposed approach on the TRSAv1 dataset and compare it with established data augmentation techniques. The experimental results show that our ensemble method achieves higher sentiment classification accuracy than conventional techniques. Additionally, we analyze the impact of individual transformation functions on classification performance. Our contribution lies in bridging the gap in research on data augmentation techniques tailored to Turkish NLP, emphasizing the need for more advanced ensemble methods, and offering benchmarking results that support the development of accurate NLP models not only for Turkish but also for other low-resource languages.
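As a rough illustration of the general idea described in the abstract (combining universal and task-specific transformations to enlarge a labeled dataset), a minimal Python sketch might look like the following. It is not the paper's method: the function names, the tiny `SENTIMENT_SYNONYMS` lexicon, the choice of random swap and random deletion as "universal" transformations, and the uniform random selection among transforms are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

# A transform takes a token list and a random generator, and returns new tokens.
Transform = Callable[[List[str], random.Random], List[str]]

# Hypothetical, hand-made lexicon of same-polarity Turkish words (illustrative only).
SENTIMENT_SYNONYMS: Dict[str, List[str]] = {
    "güzel": ["harika", "hoş"],
    "kötü": ["berbat", "fena"],
}


def random_swap(tokens: List[str], rng: random.Random) -> List[str]:
    """Universal transformation (assumed): swap two randomly chosen tokens."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens: List[str], rng: random.Random, p: float = 0.1) -> List[str]:
    """Universal transformation (assumed): drop each token with probability p."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # Keep at least one token so the augmented text is never empty.
    return kept if kept else [rng.choice(tokens)]


def sentiment_synonym_replace(tokens: List[str], rng: random.Random) -> List[str]:
    """Task-specific transformation (assumed): replace one sentiment word
    with a same-polarity word from the illustrative lexicon."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t.lower() in SENTIMENT_SYNONYMS]
    if candidates:
        i = rng.choice(candidates)
        out[i] = rng.choice(SENTIMENT_SYNONYMS[out[i].lower()])
    return out


def augment(
    dataset: List[Tuple[str, str]],
    transforms: List[Transform],
    n_per_example: int = 2,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return the original examples followed by n_per_example synthetic
    copies of each, every copy produced by one randomly chosen transform.
    Labels are carried over unchanged."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for text, label in dataset:
        tokens = text.split()
        for _ in range(n_per_example):
            transform = rng.choice(transforms)
            augmented.append((" ".join(transform(tokens, rng)), label))
    return augmented


if __name__ == "__main__":
    data = [("film çok güzel", "pos"), ("yemek kötü", "neg")]
    ensemble = [random_swap, random_deletion, sentiment_synonym_replace]
    for text, label in augment(data, ensemble):
        print(label, text)
```

In this sketch the label of each synthetic example is copied from its source example, which only makes sense when every transformation preserves sentiment; a real pipeline would need to check that assumption for each transformation it includes.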

Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages

To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets

Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification

Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers

Exploring Data Augmentation Methods on Social Media Corpora

LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task

Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection

Data Augmentation for Low-Resource Named Entity Recognition Using Backtranslation

Low Resource Text Classification with ULMFit and Backtranslation

Data augmentation strategies to improve text classification: a use case in smart cities

Classification of Scientific Documents in the Kazakh Language Using Deep Neural Networks and a Fusion of Images and Text

Leveraging Language Identification to Enhance Code-Mixed Text Classification

Not Enough Data? Deep Learning to the Rescue!

Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

Evaluating the Effectiveness of Data Augmentation for Emotion Classification in Low-Resource Settings

Improving Turkish Text Sentiment Classification Through Task-Specific and Universal Transformations: An Ensemble Data Augmentation Approach

An Experimental Study on Data Augmentation Techniques for Named Entity Recognition on Low-Resource Domains

Open foundation models for Azerbaijani language

Exploring Speech Enhancement for Low-resource Speech Synthesis

Data Augmentation Methods for Enhancing Robustness in Text Classification Tasks