A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts

Gokcen Gokceoglu, Devrim Cavusoglu, Emre Akbas, Özen Nergis Dolcerocca

2024-07-21

Abstract: This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts in 19th-century Ottoman Turkish and Russian, sourced from prominent literary periodicals of the era, and this is the first study to apply large language models (LLMs) to it. The texts have been organized and labeled according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We find that in certain cases the BoW model outperforms the LLMs, which underlines the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available.

Computation and Language

What problem does this paper attempt to address?

The paper addresses multi-level, multi-label text classification with large language models (LLMs) in low-resource language settings. Specifically:

1. **Dataset Construction**: The paper introduces the first multi-level, multi-label text classification dataset of 19th-century Ottoman Turkish and Russian texts. The dataset contains over 3,000 literary and critical documents that have been organized and annotated.
2. **Challenges of Low-Resource Languages**: The paper highlights the obstacles that low-resource languages pose for natural language processing, including data scarcity, insufficient semantic representation caused by tokenization, and thematic biases in digitized texts. It applies large language models to these historical documents to examine how well they cope with these obstacles.
3. **Baseline Model Performance Evaluation**: The paper runs baseline classification experiments with a traditional Bag of Words (BoW) naive Bayes model and three modern LLMs (multilingual BERT, Falcon, and Llama-v2). In some cases the BoW model outperforms the LLMs, which indicates that further research is needed in low-resource language settings.
4. **Cross-Language Research**: The paper focuses on historical texts in 19th-century Ottoman Turkish and Russian, languages of significant historical importance that are often overlooked in modern natural language processing. By creating this dataset, the paper aims to advance research on historical and low-resource languages.

In summary, the main goal of the paper is to provide a high-quality dataset and to evaluate how different models perform in low-resource language settings, in order to support further research in the field.
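To make the BoW naive Bayes baseline more concrete, here is a minimal sketch of one common way to handle multi-label classification with it: train an independent binary multinomial naive Bayes classifier per label (often called binary relevance). This is an illustration under stated assumptions, not the authors' implementation. The class name `BinaryRelevanceNB`, the whitespace tokenizer, the English toy documents, and the labels `poetry` and `criticism` are all hypothetical and are not taken from the dataset or its taxonomy.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class BinaryRelevanceNB:
    """Hypothetical helper: one binary multinomial naive Bayes model per label."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, docs, label_sets):
        bags = [Counter(tokenize(d)) for d in docs]
        self.vocab = set().union(*bags)
        self.labels = sorted(set().union(*label_sets))
        self.models = {}
        for label in self.labels:
            # Word counts and document counts for "has label" vs "lacks label".
            counts = {True: Counter(), False: Counter()}
            n_docs = {True: 0, False: 0}
            for bag, labels in zip(bags, label_sets):
                side = label in labels
                counts[side].update(bag)
                n_docs[side] += 1
            self.models[label] = (counts, n_docs)
        return self

    def _log_score(self, bag, counts, n_docs, side):
        total_docs = n_docs[True] + n_docs[False]
        # Smoothed prior, so a side with no documents never yields log(0).
        score = math.log((n_docs[side] + self.alpha) / (total_docs + 2 * self.alpha))
        denom = sum(counts[side].values()) + self.alpha * len(self.vocab)
        for word, freq in bag.items():
            if word in self.vocab:  # words unseen in training are ignored
                score += freq * math.log((counts[side][word] + self.alpha) / denom)
        return score

    def predict(self, doc):
        bag = Counter(tokenize(doc))
        predicted = set()
        for label in self.labels:
            counts, n_docs = self.models[label]
            has = self._log_score(bag, counts, n_docs, True)
            lacks = self._log_score(bag, counts, n_docs, False)
            if has > lacks:
                predicted.add(label)
        return predicted


# Toy training data (hypothetical; not drawn from the dataset).
docs = [
    "moon sea poem",
    "moon love poem",
    "review novel critic",
    "review poem critic",
]
label_sets = [{"poetry"}, {"poetry"}, {"criticism"}, {"criticism", "poetry"}]

model = BinaryRelevanceNB().fit(docs, label_sets)
print(sorted(model.predict("moon poem")))           # ['poetry']
print(sorted(model.predict("review poem critic")))  # ['criticism', 'poetry']
```

Because each label is scored independently, a document can receive zero, one, or several labels. A multi-level setup could apply the same idea once per level of the taxonomy, although how the paper combines levels is not specified in this summary.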

Multi-label Sequential Sentence Classification via Large Language Model

TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish

Adaptable and Reliable Text Classification using Large Language Models

Severity Prediction in Mental Health: LLM-based Creation, Analysis, Evaluation of a Novel Multilingual Dataset

Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking

An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels

On Classification with Large Language Models in Cultural Analytics

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Datasets for Large Language Models: A Comprehensive Survey

Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

Universal Cross-Lingual Text Classification

Can Large Language Models Serve as Effective Classifiers for Hierarchical Multi-Label Classification of Scientific Documents at Industrial Scale?

Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models

Fine-Tuning Large Language Models for Scientific Text Classification: A Comparative Study

Evaluating Large Language Models for Health-Related Text Classification Tasks with Public Social Media Data

Shadows of wisdom: Classifying meta-cognitive and morally grounded narrative content via large language models

Automated Category and Trend Analysis of Scientific Articles on Ophthalmology Using Large Language Models: Development and Usability Study

LAION-5B: An open large-scale dataset for training next generation image-text models