The Case for Evaluating Multimodal Translation Models on Text Datasets

Vipin Vijayan,Braeden Bowen,Scott Grigsby,Timothy Anderson,Jeremy Gwinnup
2024-03-05
Abstract: A good evaluation framework should evaluate multimodal machine translation (MMT) models by measuring 1) their use of visual information to aid in the translation task and 2) their ability to translate complex sentences, as is done for text-only machine translation. However, most current work in MMT is evaluated against the Multi30k testing sets, which do not measure these properties. Namely, the use of visual information by the MMT model cannot be shown directly from the Multi30k test set results, and the sentences in Multi30k are image captions, i.e., short, descriptive sentences, as opposed to the complex sentences that typical text-only machine translation models are evaluated against.
Computation and Language
What problem does this paper attempt to address?
The paper focuses on evaluation methods in Multimodal Machine Translation (MMT), and in particular on the shortcomings of existing evaluation frameworks in measuring model performance. Specifically, it highlights the limitations of the Multi30k dataset, which most current MMT research relies on for evaluation:

1. **Difficulty in directly evaluating the use of visual information**: Results on the Multi30k test sets cannot directly show whether and how MMT models use image information to assist in the translation task.
2. **Insufficient sentence complexity**: The sentences in the Multi30k dataset are mainly image captions, usually short and structurally simple, unlike the complex sentences on which text-only machine translation models are typically evaluated.
3. **Limitations of the dataset itself**: Many sentences in the Multi30k dataset can be translated correctly even without images, because the sentences themselves are not ambiguous.

To address these issues, the paper proposes a new evaluation framework aimed at better assessing two key aspects of MMT models:

- **Ability to utilize visual information**: Using the CoMMuTE evaluation framework to measure whether the model can effectively use image information to resolve ambiguities in language.
- **Ability to handle complex sentences**: Using the WMT news translation task test sets to evaluate the model's ability to translate complex sentences.

Additionally, the paper evaluates two Transformer-based MMT models (Gated Fusion and RMMT) under the proposed evaluation framework and compares them with a text-only translation model (FAIR-WMT19). The results show that although these two MMT models perform well on the Multi30k test set, their performance drops significantly on the more complex WMT news translation task test sets.

This indicates that currently trained MMT models may face challenges in practical applications, especially when dealing with complex sentences. The paper therefore emphasizes the importance of improving evaluation methods and suggests that future research should include designing MMT models that can better handle the fundamental task of text translation.
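The visual-information criterion described above can be illustrated with a small sketch. It assumes a contrastive setup in which each ambiguous source sentence is paired with an image, a translation that matches the image, and a translation that does not, and the model counts as correct when it scores the matching translation higher. The names `ContrastiveExample`, `contrastive_accuracy`, and both toy scoring functions, as well as the example sentences and image file names, are hypothetical and chosen for illustration only; they are not taken from the paper or from the CoMMuTE data format.

```python
from typing import Callable, NamedTuple, Sequence


class ContrastiveExample(NamedTuple):
    source: str     # ambiguous source sentence
    image: str      # hypothetical image identifier
    correct: str    # translation that matches the image
    incorrect: str  # translation that matches the other reading


# A scorer maps (source, image, translation) to a number; higher means preferred.
Scorer = Callable[[str, str, str], float]


def contrastive_accuracy(examples: Sequence[ContrastiveExample], score: Scorer) -> float:
    """Fraction of examples where the matching translation gets the strictly higher score."""
    if not examples:
        raise ValueError("at least one example is required")
    hits = sum(
        1
        for ex in examples
        if score(ex.source, ex.image, ex.correct) > score(ex.source, ex.image, ex.incorrect)
    )
    return hits / len(examples)


# Toy data: one ambiguous sentence shown with two images, with the roles of
# the two translations swapped between the images.
examples = [
    ContrastiveExample("He is at the bank.", "river.jpg", "Er ist am Ufer", "Er ist in der Bank"),
    ContrastiveExample("He is at the bank.", "money.jpg", "Er ist in der Bank", "Er ist am Ufer"),
]


def image_blind_score(source: str, image: str, translation: str) -> float:
    """Stand-in for a model that ignores the image (here it simply prefers shorter output)."""
    return -float(len(translation))


# Stand-in for a model that actually conditions on the image.
_MATCHES = {("river.jpg", "Er ist am Ufer"), ("money.jpg", "Er ist in der Bank")}


def image_aware_score(source: str, image: str, translation: str) -> float:
    return 1.0 if (image, translation) in _MATCHES else 0.0


print(contrastive_accuracy(examples, image_blind_score))  # 0.5
print(contrastive_accuracy(examples, image_aware_score))  # 1.0
```

Because the two translations swap roles across the paired images, a scorer that ignores the image prefers the same translation both times and so cannot exceed 0.5 on such balanced pairs. That is the sense in which this kind of test isolates the use of visual information, which BLEU-style scores on unambiguous captions do not.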