Abstract: In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.

What problem does this paper attempt to address?

This paper investigates how well large language models (LLMs) can perform machine translation when they are trained only on parallel data. Specifically, the researchers pose the following key questions:

1. **How to train an LLM on parallel data only**: Traditional neural machine translation (NMT) systems usually rely on an encoder-decoder architecture, while previous LLM-based approaches adapt a model through iterative processes such as instruction fine-tuning or continual pre-training. Neither line of work has fully explored the challenges of training an LLM using only parallel data.
2. **How LLMs perform in zero-shot translation**: The researchers want to understand how well LLMs trained only on parallel data translate between language pairs that never appeared together during training, that is, their zero-shot translation ability.
3. **How the model uses prompt information**: The researchers also examine how LLMs use different parts of the input prompt (such as the source language tag and the target language tag) to generate accurate translations.

To answer these questions, the researchers introduce PLUME (Parallel Language Model), a collection of three multilingual LLMs with 2 billion parameters each. The models have vocabulary sizes of 32,000, 128,000, and 256,000 respectively, and they are trained from scratch only on Catalan-centric parallel data. The study finds that these models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot translation directions.

In addition, the researchers analyze in detail how the model uses context information in different layers to perform translation, and how performance varies across languages depending on the source language tag. These analyses not only reveal the internal working mechanism of the model but also provide a method to remove attention heads without significantly affecting performance. Finally, the researchers study the cross-lingual representation space learned by the model and observe how this space changes across attention blocks.

Overall, this paper aims to explore and understand the performance and working mechanism of LLMs trained only on parallel data for machine translation, providing insights for further optimizing and improving NMT models.
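The counts of 16 supervised and 56 zero-shot directions are consistent with a Catalan-centric setup of Catalan plus eight other languages: every ordered pair that includes Catalan has training data, and every other ordered pair is zero-shot. The sketch below checks that arithmetic under this assumption. The language codes `x1` to `x8` are placeholders, since the summary does not list the languages, and `format_example` shows one hypothetical way to lay out source and target language tags in a prompt, not the template used in the paper.

```python
from itertools import permutations

PIVOT = "ca"  # Catalan, the pivot language of the training data
# Placeholder codes (assumption): eight non-Catalan languages, not named in the summary.
OTHERS = [f"x{i}" for i in range(1, 9)]


def split_directions(pivot, others):
    """Split all ordered language pairs into supervised and zero-shot directions."""
    supervised, zero_shot = [], []
    for src, tgt in permutations([pivot, *others], 2):
        # Parallel data exists only when the pivot is the source or the target.
        if pivot in (src, tgt):
            supervised.append((src, tgt))
        else:
            zero_shot.append((src, tgt))
    return supervised, zero_shot


def format_example(src_tag, tgt_tag, source, target=""):
    """Hypothetical tagged layout: source tag and sentence, then target tag and translation."""
    return f"[{src_tag}] {source}\n[{tgt_tag}] {target}"


supervised, zero_shot = split_directions(PIVOT, OTHERS)
print(len(supervised), len(zero_shot))  # prints: 16 56
```

At inference time, a layout like this would leave `target` empty so that the model continues the text after the target language tag.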

Investigating the translation capabilities of Large Language Models trained on parallel data only
