Investigating the translation capabilities of Large Language Models trained on parallel data only

Javier García Gilabert, Carlos Escolano, Aleix Sant Savall, Francesca De Luca Fornaciari, Audrey Mash, Xixian Liao, Maite Melero
2024-06-13
Abstract: In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.
Computation and Language
What problem does this paper attempt to address?
The paper investigates how well large language models (LLMs) can perform machine translation when trained only on parallel data. Specifically, the researchers pose the following key questions:

1. **How to train an LLM on parallel data only**: Traditional neural machine translation (NMT) systems usually rely on an encoder-decoder architecture, whereas previous LLM-based approaches to translation have relied predominantly on iterative processes such as instruction fine-tuning or continual pre-training. The challenges of training an LLM solely on parallel data have therefore remained unexplored.
2. **How LLMs perform in zero-shot translation**: The researchers want to understand how well an LLM trained only on parallel data translates between language pairs it never saw during training, that is, its zero-shot translation ability.
3. **How the model uses prompt information**: The researchers also explore how the LLM uses the different parts of the input prompt (such as the source language label and the target language label) to generate accurate translations.

To answer these questions, the researchers introduce PLUME (Parallel Language Model), a collection of three multilingual LLMs with 2 billion parameters each. The models have vocabulary sizes of 32k, 128k, and 256k respectively, and are trained from scratch on Catalan-centric parallel data only. The study finds that these models perform comparably to previous encoder-decoder architectures of similar scale on 16 supervised translation directions and 56 zero-shot ones. In addition, the researchers analyze experimentally how the model uses context information in different layers when translating, and how performance differs across languages depending on the source language label.
These analyses not only shed light on the model's internal working mechanism but also suggest a way to remove attention heads without significantly affecting performance. Finally, the researchers study the cross-lingual representation space learned by the model and observe how this space changes across attention blocks. Overall, the paper aims to understand the performance and working mechanism of LLMs trained only on parallel data for machine translation, providing insights for further improving NMT models.
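
To make the prompt elements mentioned above concrete (source language label, source sentence, target language label), here is a minimal Python sketch of how a parallel sentence pair could be serialized for a decoder-only model, and how a prompt can be built with or without the source label to probe how much the model depends on it. The tag format (`<cat>`, `<eng>`), the special tokens, and the function names are all hypothetical and chosen for illustration; this summary does not specify the exact format used by PLUME.

```python
from typing import Optional

# Hypothetical special tokens; the real tokenizer may use different ones.
BOS, EOS = "<s>", "</s>"


def format_training_example(src_lang: str, tgt_lang: str,
                            src_text: str, tgt_text: str) -> str:
    """Serialize one parallel pair as a single training string.

    Assumed layout: BOS, source label, source sentence, newline,
    target label, target sentence, EOS.
    """
    return f"{BOS} <{src_lang}> {src_text}\n<{tgt_lang}> {tgt_text} {EOS}"


def build_prompt(src_text: str, tgt_lang: str,
                 src_lang: Optional[str] = None) -> str:
    """Build an inference prompt that ends right after the target label.

    Passing src_lang=None omits the source label, which is one simple way
    to test how much a model relies on that part of the prompt.
    """
    src_tag = f"<{src_lang}> " if src_lang else ""
    return f"{BOS} {src_tag}{src_text}\n<{tgt_lang}>"


if __name__ == "__main__":
    print(format_training_example("cat", "eng", "Bon dia.", "Good morning."))
    print(build_prompt("Bon dia.", "eng", src_lang="cat"))
    print(build_prompt("Bon dia.", "eng"))  # source label ablated
```

In this sketch the inference prompt is a prefix of the training string, so the model continues the sequence with the translation. Comparing translation quality between the labeled and unlabeled prompts would be one way to measure the contribution of the source label for a given language.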