Abstract: The exponential growth of digital data in recent years has spurred significant interest in natural language processing (NLP) and sentiment analysis. However, the effectiveness of NLP models depends heavily on large annotated datasets, which are scarce or entirely absent for many languages, including Turkish. This scarcity is a major obstacle to developing NLP models for Turkish. Several techniques have been proposed to enlarge annotated datasets, and text data augmentation has emerged as a promising one. Text data augmentation generates synthetic examples by transforming existing ones, which increases both the diversity and the volume of the annotated dataset. Although the technique has substantially improved the performance of NLP models, it has seen limited exploration for Turkish and other low-resource languages. This paper introduces a novel ensemble approach to text data augmentation tailored to Turkish text sentiment classification. Our approach integrates task-specific and universal transformations, drawing on the strengths of each to enrich the training dataset. We evaluate the proposed approach on the TRSAv1 dataset and compare it with established data augmentation techniques. The experimental results show that our ensemble method achieves higher sentiment classification accuracy than conventional techniques. We also analyze in depth how each individual transformation function affects classification performance. Our contribution is threefold: we address the gap in research on data augmentation techniques for Turkish NLP, we highlight the need for more advanced ensemble methods, and we provide benchmarking results that support the development of accurate NLP models for Turkish and for other low-resource languages.
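To make the idea of combining universal and task-specific transformations concrete, the following is a minimal sketch, not the method evaluated in the paper. The two universal transformations (random swap and random deletion) are generic token-level operations, and the task-specific one relies on a tiny sentiment lexicon invented purely for illustration. All function names, the lexicon, and the parameters are assumptions.

```python
import random

def random_swap(tokens, rng):
    """Universal transformation: swap two randomly chosen token positions."""
    if len(tokens) < 2:
        return list(tokens)
    i, j = rng.sample(range(len(tokens)), 2)
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, rng, p=0.2):
    """Universal transformation: drop each token with probability p,
    always keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

# Hypothetical toy lexicon of same-polarity substitutes (illustration only).
SENTIMENT_SYNONYMS = {
    "güzel": ["harika", "hoş"],
    "kötü": ["berbat", "fena"],
}

def sentiment_synonym_swap(tokens, rng):
    """Task-specific transformation: replace sentiment-bearing words with a
    same-polarity substitute so the label is expected to stay valid."""
    return [
        rng.choice(SENTIMENT_SYNONYMS[t]) if t in SENTIMENT_SYNONYMS else t
        for t in tokens
    ]

def augment(dataset, transforms, n_per_example=2, seed=0):
    """Return the original (text, label) pairs plus n_per_example synthetic
    pairs per original, each made by one randomly chosen transformation."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for text, label in dataset:
        tokens = text.split()
        if not tokens:
            continue  # nothing to transform
        for _ in range(n_per_example):
            transform = rng.choice(transforms)
            augmented.append((" ".join(transform(tokens, rng)), label))
    return augmented

data = [("film çok güzel", "positive"), ("yemek çok kötü", "negative")]
ensemble = [random_swap, random_deletion, sentiment_synonym_swap]
for text, label in augment(data, ensemble):
    print(label, "|", text)
```

Each synthetic example inherits the label of its source sentence. This is only safe when the transformation preserves sentiment, which is why a task-specific transformation would be restricted to same-polarity substitutions.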

Improving Turkish Text Sentiment Classification Through Task-Specific and Universal Transformations: An Ensemble Data Augmentation Approach
