KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes

Rustem Yeshpanov, Huseyin Atakan Varol
2024-04-10
Abstract: This paper presents KazSAnDRA, a dataset developed for Kazakh sentiment analysis that is the first and largest publicly available dataset of its kind. KazSAnDRA comprises an extensive collection of 180,064 reviews obtained from various sources and includes numerical ratings ranging from 1 to 5, providing a quantitative representation of customer attitudes. The study also pursued the automation of Kazakh sentiment classification through the development and evaluation of four machine learning models trained for both polarity classification and score classification. Experimental analysis included evaluation of the results considering both balanced and imbalanced scenarios. The most successful model attained an F1-score of 0.81 for polarity classification and 0.39 for score classification on the test sets. The dataset and fine-tuned models are open access and available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.
Computation and Language
What problem does this paper attempt to address?
The paper addresses data scarcity and weak automated classification in Kazakh sentiment analysis. Specifically:

1. **Data Scarcity**: Publicly available datasets for Kazakh sentiment analysis are lacking, which limits research in the area. KazSAnDRA aims to fill this gap with a large-scale dataset of 180,064 reviews, each carrying a rating from 1 to 5.
2. **Insufficient Automated Classification Capabilities**: Automated sentiment classification for Kazakh remains underdeveloped. The paper develops and evaluates four machine learning models (mBERT, XLM-R, RemBERT, and mBART-50) on two tasks: polarity classification (positive or negative) and score classification (ratings 1 to 5).
3. **Data Imbalance Issue**: The ratings in the dataset are heavily imbalanced, with some rating categories containing far more samples than others. The paper explores Random Over Sampling (ROS) and Random Under Sampling (RUS) as ways to balance the data and improve the models' generalization and classification performance.

In summary, the paper's main goal is to advance Kazakh sentiment analysis by constructing the KazSAnDRA dataset and evaluating several machine learning models on it.
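The two resampling strategies named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `(text, rating)` tuple layout, and the toy reviews are all assumptions made for illustration. ROS duplicates randomly chosen samples from the smaller classes until every class matches the largest one, while RUS randomly discards samples from the larger classes until every class matches the smallest one.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(samples):
    """Group (text, label) pairs by label."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append((text, label))
    return groups


def random_over_sample(samples, seed=0):
    """ROS: grow every class to the size of the largest class
    by drawing extra samples with replacement."""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


def random_under_sample(samples, seed=0):
    """RUS: shrink every class to the size of the smallest class
    by drawing samples without replacement."""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: six 5-star, two 1-star, and three 3-star reviews.
toy_reviews = (
    [(f"review {i}", 5) for i in range(6)]
    + [(f"review {i}", 1) for i in range(6, 8)]
    + [(f"review {i}", 3) for i in range(8, 11)]
)

over_sampled = random_over_sample(toy_reviews)
under_sampled = random_under_sample(toy_reviews)

print(Counter(label for _, label in over_sampled))
print(Counter(label for _, label in under_sampled))
```

The trade-off is visible even at this scale: ROS keeps every original review but repeats minority-class samples, which can encourage overfitting, whereas RUS avoids duplicates at the cost of throwing data away. Resampling is typically applied to the training split only; whether the paper does so is not stated in this summary.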