Abstract: This paper studies strategies for enhancing the translation capabilities of large language models (LLMs) on machine translation (MT) tasks. We propose a novel paradigm consisting of three stages: (1) Secondary Pre-training using Extensive Monolingual Data, (2) Continual Pre-training with Interlinear Text Format Documents, and (3) Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs has focused on various strategies for supervised fine-tuning (SFT), but their effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs' cross-lingual alignment abilities during pre-training rather than relying solely on extensive bilingual data during SFT. Experiments using the Llama2 model, particularly Chinese-Llama2 after monolingual augmentation, demonstrate improved translation capabilities. A significant contribution of our approach lies in Stage 2, Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B training data, making our method highly efficient. Additionally, in Stage 3, we observe that setting instructions consistent with the source language benefits the supervised fine-tuning process. Our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.
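
To make Stages 2 and 3 concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that an "interlinear text format" document alternates each source sentence with its aligned translation line by line, and that a "source-language consistent instruction" means the SFT instruction is written in the same language as the text to be translated. The function names, the instruction templates, and the language-name table are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def make_interlinear(src_sents: List[str], tgt_sents: List[str]) -> str:
    """Interleave aligned sentences: source line, then its translation.

    Assumption: this alternating layout is one plausible reading of
    "interlinear text format"; the abstract does not specify the exact format.
    """
    if len(src_sents) != len(tgt_sents):
        raise ValueError("source and target must be sentence-aligned")
    lines: List[str] = []
    for src, tgt in zip(src_sents, tgt_sents):
        lines.append(src.strip())
        lines.append(tgt.strip())
    return "\n".join(lines)


# Hypothetical instruction templates, keyed by the SOURCE language code.
INSTRUCTIONS: Dict[str, str] = {
    "en": "Translate the following text into {target}:",
    "zh": "请将下面的文本翻译成{target}：",
}

# Hypothetical target-language names, written in the source language.
LANGUAGE_NAMES: Dict[str, Dict[str, str]] = {
    "en": {"en": "English", "zh": "Chinese"},
    "zh": {"en": "英文", "zh": "中文"},
}


def build_sft_prompt(src_lang: str, tgt_lang: str, text: str) -> str:
    """Build an SFT prompt whose instruction is in the source language."""
    template = INSTRUCTIONS[src_lang]
    target_name = LANGUAGE_NAMES[src_lang][tgt_lang]
    return template.format(target=target_name) + "\n" + text.strip()


if __name__ == "__main__":
    # Stage 2 style document built from two aligned sentence pairs.
    print(make_interlinear(["Hello.", "Thanks."], ["你好。", "谢谢。"]))
    # Stage 3 style prompt: Chinese source text gets a Chinese instruction.
    print(build_sft_prompt("zh", "en", "你好。"))
```

Under these assumptions, a Chinese source sentence receives a Chinese instruction and an English source sentence receives an English one, while the Stage 2 documents expose the model to sentence-level alignment without any instruction wrapper.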

QueEn: A Large Language Model for Quechua-English Translation

GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models

Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions

Teaching Large Language Models an Unseen Language on the Fly

QUILL: Quotation Generation Enhancement of Large Language Models

Machine Translation with Large Language Models: Prompting, Few-shot Learning, and Fine-tuning with QLoRA

Qwen Technical Report

RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

SeqGPT: An Out-of-the-box Large Language Model for Open Domain Sequence Understanding

Refining Translations with LLMs: A Constraint-Aware Iterative Prompting Approach

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Tuning Large language model for End-to-end Speech Translation

A Novel Paradigm Boosting Translation Capabilities of Large Language Models

Enhancing Document-level Translation of Large Language Model via Translation Mixed-instructions

OMGEval: an Open Multilingual Generative Evaluation Benchmark for Large Language Models

BigTranslate: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages

Research on Methods to Enhance Machine Translation Quality Between Low-Resource Languages and Chinese Based on ChatGPT

Towards Making the Most of LLM for Translation Quality Estimation

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis