ThaiCoref: Thai Coreference Resolution Dataset

Pontakorn Trakuekul, Wei Qi Leong, Charin Polpanumas, Jitkapat Sawatphol, William Chandra Tjhi, Attapol T. Rutherford
2024-06-10
Abstract: While coreference resolution is a well-established research area in Natural Language Processing (NLP), research focusing on the Thai language remains limited due to the lack of large annotated corpora. In this work, we introduce ThaiCoref, a dataset for Thai coreference resolution. Our dataset comprises 777,271 tokens, 44,082 mentions, and 10,429 entities across four text genres: university essays, newspapers, speeches, and Wikipedia. Our annotation scheme is built upon the OntoNotes benchmark, with adjustments to address Thai-specific phenomena. Using ThaiCoref, we train models that employ a multilingual encoder and cross-lingual transfer techniques, achieving a best F1 score of 67.88% on the test set. Error analysis reveals challenges posed by Thai's unique linguistic features. To benefit the NLP community, we make the dataset and the model publicly available at http://www.github.com/nlp-chula/thai-coref.
Computation and Language
What problem does this paper attempt to address?
The main goal of this paper is to address the scarcity of Thai coreference resolution datasets. Specifically:

- **Insufficient Datasets**: Existing Thai coreference resolution datasets are few, small, and narrow in coverage, which hinders the development of high-quality Thai coreference resolution models.
- **Introduction of the ThaiCoref Dataset**: The authors developed a large-scale Thai coreference resolution dataset called ThaiCoref, which contains 777,271 tokens, 44,082 mentions, and 10,429 entities across four text genres (university essays, newspapers, speeches, and Wikipedia articles). It is substantially larger than existing Thai coreference resolution datasets and covers a wider range of text.
- **Model Training and Evaluation**: Using ThaiCoref, the authors trained models built on a multilingual encoder and explored cross-lingual transfer techniques. The best model reached an F1 score of 67.88% on the test set. The authors also conducted quantitative and qualitative analyses of model errors, which revealed challenges posed by the unique linguistic features of Thai.

In summary, this paper aims to promote Thai coreference research by constructing a larger, high-quality coreference resolution dataset and by validating its usefulness for training and evaluating models.
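To make the reported F1 figure more concrete, the sketch below shows how a coreference F1 score can be computed from gold and predicted entity clusters. The text above does not say which F1 variant the authors report, so this uses B-cubed (B³), one widely used coreference metric, purely as an illustration and not as the paper's evaluation procedure. The `(start, end)` mention representation and the function names are assumptions made for this example.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Assumption: a mention is a token span (start, end), inclusive.
Mention = Tuple[int, int]


def _cluster_of(clusters: Iterable[Set[Mention]]) -> Dict[Mention, FrozenSet[Mention]]:
    """Map every mention to the (frozen) cluster that contains it."""
    return {m: frozenset(c) for c in clusters for m in c}


def _avg_overlap(
    source: Dict[Mention, FrozenSet[Mention]],
    other: Dict[Mention, FrozenSet[Mention]],
) -> float:
    """Average, over mentions in `source`, of the fraction of the mention's
    source cluster that also appears in its cluster in `other`."""
    if not source:
        return 0.0
    total = 0.0
    for mention, cluster in source.items():
        other_cluster = other.get(mention, frozenset())
        total += len(cluster & other_cluster) / len(cluster)
    return total / len(source)


def b_cubed(
    gold: Iterable[Set[Mention]], pred: Iterable[Set[Mention]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) under the B-cubed metric."""
    gold_of = _cluster_of(gold)
    pred_of = _cluster_of(pred)
    recall = _avg_overlap(gold_of, pred_of)      # averaged over gold mentions
    precision = _avg_overlap(pred_of, gold_of)   # averaged over predicted mentions
    denom = precision + recall
    f1 = 0.0 if denom == 0 else 2 * precision * recall / denom
    return precision, recall, f1


# Toy example: two gold entities; the system puts one mention in the wrong cluster.
gold_clusters = [{(0, 0), (5, 5), (9, 9)}, {(2, 3), (7, 7)}]
pred_clusters = [{(0, 0), (5, 5)}, {(9, 9), (2, 3), (7, 7)}]
precision, recall, f1 = b_cubed(gold_clusters, pred_clusters)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this toy case a single misplaced mention lowers both precision and recall, because B³ scores every mention against the cluster it ends up in. Benchmark numbers in the OntoNotes tradition are often an average over several such metrics, so a score computed this way is not directly comparable to the 67.88% reported above.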