Abstract: Recognizing textual entailment (RTE) is an essential task in natural language processing (NLP). It is the task of determining the inference relationship between two text fragments, a premise and a hypothesis, where the relationship is entailment (true), contradiction (false), or neutral (undetermined). Neural networks are the most popular approach to RTE and have produced the best-performing RTE models. Neural network approaches, in particular deep learning, are data-driven; consequently, the quantity and quality of the data significantly influence their performance. We therefore introduce SNLI Indo, a large-scale RTE dataset in the Indonesian language, derived from the Stanford Natural Language Inference (SNLI) corpus by translating the original sentence pairs. SNLI is a large-scale dataset of premise-hypothesis pairs that were generated using a crowdsourcing framework. The SNLI dataset comprises a total of 569,027 sentence pairs, distributed as follows: 549,365 pairs for training, 9,840 pairs for model validation, and 9,822 pairs for testing. We translated the original sentence pairs of the SNLI dataset from English to Indonesian using the Google Cloud Translation API. SNLI Indo addresses the resource gap in NLP for the Indonesian language. Although large datasets are available in other languages, in particular English, the SNLI Indo dataset enables more effective development of deep learning models for RTE in the Indonesian language.
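The translation step described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the `Pair` record, the `translate_pairs` helper, and the toy `glossary_translate` function are hypothetical names introduced here, and the translator is passed in as a plain callable so that a real machine translation client (for example, a wrapper around the Google Cloud Translation API) could be substituted. Only the three labels and the split sizes come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# The three inference labels named in the abstract.
LABELS = ("entailment", "contradiction", "neutral")

# Split sizes as reported in the abstract.
SPLIT_SIZES = {"train": 549_365, "validation": 9_840, "test": 9_822}


@dataclass(frozen=True)
class Pair:
    """One premise-hypothesis pair with its label (hypothetical record type)."""
    premise: str
    hypothesis: str
    label: str


def translate_pairs(
    pairs: Iterable[Pair],
    translate: Callable[[List[str]], List[str]],
) -> List[Pair]:
    """Translate premises and hypotheses in one batch, keeping labels unchanged.

    `translate` is an assumed interface: it takes a list of source sentences
    and returns the translations in the same order.
    """
    items = list(pairs)
    for pair in items:
        if pair.label not in LABELS:
            raise ValueError(f"unknown label: {pair.label!r}")

    # Send all premises first, then all hypotheses, as a single batch.
    texts = [p.premise for p in items] + [p.hypothesis for p in items]
    translated = translate(texts)
    if len(translated) != len(texts):
        raise ValueError("translator returned a different number of sentences")

    n = len(items)
    return [
        Pair(translated[i], translated[n + i], items[i].label)
        for i in range(n)
    ]


def glossary_translate(texts: List[str]) -> List[str]:
    """Toy stand-in for a real translation service (illustrative only)."""
    glossary = {
        "A man is sleeping.": "Seorang pria sedang tidur.",
        "A person is awake.": "Seseorang sedang terjaga.",
    }
    # Unknown sentences are returned unchanged.
    return [glossary.get(text, text) for text in texts]


english = [Pair("A man is sleeping.", "A person is awake.", "contradiction")]
indonesian = translate_pairs(english, glossary_translate)
total_pairs = sum(SPLIT_SIZES.values())
```

Because only the sentence text is translated, each translated pair inherits the label of its English source. This sketch makes no attempt to check whether the inference relationship survives translation.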

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding

IndoNLI: A Natural Language Inference Dataset for Indonesian

IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation

NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural

Domain-Specific Language Model Post-Training for Indonesian Financial NLP

IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization

SNLI Indo: A recognizing textual entailment dataset in Indonesian derived from the Stanford Natural Language Inference dataset

Hybrid Models for Emotion Classification and Sentiment Analysis in Indonesian Language

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch

Komodo: A Linguistic Expedition into Indonesia's Regional Languages

Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages

IndoCulture: Exploring Geographically-Influenced Cultural Commonsense Reasoning Across Eleven Indonesian Provinces

Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language Archives

Utilizing Weak Supervision To Generate Indonesian Conservation Dataset

BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models