Ancient Korean Archive Translation: Comparison Analysis on Statistical phrase alignment, LLM in-context learning, and inter-methodological approach

Sojung Lucia Kim, Taehong Jang, Joonmo Ahn
2024-07-16
Abstract: This study compares three methods for translating ancient texts with sparse corpora: (1) the traditional statistical translation method of phrase alignment, (2) LLM in-context learning, and (3) a proposed inter-methodological approach: statistical machine translation using SentencePiece tokens derived from a unified source-target corpus. The proposed approach achieves a BLEU score of 36.71, surpassing both SOLAR-10.7B in-context learning and the best existing Seq2Seq model. Further analysis and discussion are presented.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem of translating ancient texts when the available corpus is sparse. Specifically, the study compares three translation methods:

1. **Traditional Statistical Translation Method** (Phrase Alignment): This method relies on an existing aligned corpus and translates by statistically modeling the correspondence between phrases in the source language and the target language.
2. **LLM In-Context Learning**: This method uses large language models (such as XGLM and SOLAR) to generate translations from the context supplied in the prompt.
3. **Inter-Methodological Approach**: This method combines statistical machine translation with SentencePiece tokenization, deriving the tokens from a unified source-target corpus to improve translation quality.

The main contribution of the paper is the proposed inter-methodological approach, which improves the accuracy of ancient text translation by using SentencePiece tokenization. Experimental results show that it achieves a BLEU score of 36.71, surpassing SOLAR-10.7B in-context learning and the best existing Seq2Seq model. The paper also analyzes how each method performs on ancient texts of different lengths and provides a detailed comparative analysis.
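Since the comparison rests on BLEU, a minimal sketch of how a corpus-level BLEU score can be computed may help in reading the 36.71 figure. It makes several assumptions not taken from the paper: whitespace tokenization, one reference per hypothesis, n-gram orders up to 4, and no smoothing. The function names `ngrams` and `corpus_bleu` are illustrative, and the paper's actual scoring tool and settings are not stated here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale.

    Assumptions (illustrative only): one reference per hypothesis,
    pre-tokenized input, uniform n-gram weights, no smoothing.
    """
    matched = [0] * max_n  # clipped n-gram matches, per order
    total = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    # Without smoothing, BLEU is zero if any n-gram order has no match.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyp = ["the cat sat on the mat".split()]
    ref = ["the cat sat on a mat".split()]
    print(round(corpus_bleu(hyp, ref), 2))
```

Scores from a sketch like this are only comparable across systems when tokenization and reference sets are held fixed, which is why published comparisons usually rely on a standard scoring tool.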