Abstract: Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks. Recent research indicates that moderately-sized LLMs often outperform larger ones after task-specific fine-tuning. This study focuses on adapting LLMs for document-level machine translation (DocMT) for specific language pairs. We first investigate the impact of prompt strategies on translation performance and then conduct extensive experiments using two fine-tuning methods, three LLM backbones, and 18 translation tasks across nine language pairs. Our results show that specialized models can sometimes surpass GPT-4 in translation performance but still face issues like off-target translation due to error propagation in decoding. We provide an in-depth analysis of these LLMs tailored for DocMT, examining translation errors, discourse phenomena, strategies for training and inference, the data efficiency of parallel documents, recent test set evaluations, and zero-shot crosslingual transfer. Our findings highlight the strengths and limitations of LLM-based DocMT models and provide a foundation for future research.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address several key issues in Document-Level Machine Translation (DOCMT), particularly how to leverage Large Language Models (LLMs) to improve translation performance for specific language pairs. Specifically, the research focuses on the following aspects:

1. **Adaptability Research**: Exploring how different fine-tuning methods (such as Parameter-Efficient Fine-Tuning, PEFT, and Full Fine-Tuning, FFT) and various prompting strategies can enable medium-sized LLMs to excel in document-level translation tasks.
2. **Performance Evaluation**: Conducting extensive experiments to evaluate the performance of different LLM backbone models (such as LLAMA 2-7B, BLOOM-7B, and VICUNA-7B) across 18 translation tasks involving 9 language pairs.
3. **Error Analysis**: Conducting an in-depth analysis of the types of errors made by LLMs in document-level translation, particularly the issue of "off-target translation," which frequently occurs due to error propagation during the decoding process.
4. **Cross-Language Zero-Shot Transfer**: Investigating the zero-shot cross-language transfer capabilities of LLMs on unseen language pairs to enhance their effectiveness and understanding in document-level translation tasks.
5. **Data Efficiency**: Exploring the data efficiency of parallel documents, i.e., the effectiveness of fine-tuning on limited datasets and the data requirements of different fine-tuning strategies.

### Main Findings

1. **Selective Excellence**: The study found that fine-tuned medium-sized LLMs can outperform GPT-4-TURBO in certain translation tasks, but still face off-target translation issues in other tasks, mainly due to error propagation during the decoding process.
2. **Fine-Tuning Strategies**: The PEFT method generally outperforms the FFT method, but the FFT method shows better data efficiency, requiring only about 1% of the total dataset to achieve performance comparable to models trained on the full dataset.
3. **Latest Test Set Evaluation**: When evaluated on the WMT2023 test set, LLM-based DOCMT models demonstrated better generalization capabilities on out-of-domain texts compared to traditional DOCMT models.
4. **Advantages of Base LLMs**: The research shows that base LLMs perform better in task-specific supervised fine-tuning compared to instruction-tuned LLMs and are more effective in zero-shot cross-language transfer.

### Conclusion

This research demonstrates the potential and limitations of LLMs in document-level machine translation through extensive experiments and provides a crucial foundation for future research. The study emphasizes the importance of prompting strategies, fine-tuning methods, and data efficiency in improving the translation performance of LLMs.
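As an illustration of the data-efficiency setting described in the findings (fine-tuning on roughly 1% of the parallel documents), the sketch below draws a reproducible random subset of document pairs. It is a minimal example under stated assumptions, not the paper's actual sampling procedure: the function name `subsample_documents`, the toy corpus, and the fixed seed are all hypothetical.

```python
import random


def subsample_documents(documents, fraction, seed=0):
    """Return a reproducible random subset of parallel documents.

    `documents` is a list of (source_document, target_document) pairs.
    `fraction` is the share of the corpus to keep, e.g. 0.01 for about 1%.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not documents:
        return []
    # Keep at least one document so tiny corpora still yield a usable subset.
    k = max(1, round(len(documents) * fraction))
    # A dedicated Random instance keeps the draw independent of global state.
    rng = random.Random(seed)
    return rng.sample(documents, k)


# Hypothetical toy corpus standing in for real parallel documents.
corpus = [(f"source doc {i}", f"target doc {i}") for i in range(1000)]
subset = subsample_documents(corpus, 0.01)
print(len(subset))  # 10
```

Sampling whole documents rather than individual sentences keeps each document's discourse context intact, which matters when the model is meant to translate at the document level.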

Adapting Large Language Models for Document-Level Machine Translation
