Abstract: Recent advancements in NLP have resulted in models with specialized strengths, such as processing multimodal inputs or excelling in specific domains. However, real-world tasks, like multimodal translation, often require a combination of these strengths, such as handling both translation and image processing. While individual translation and vision models are powerful, they typically lack the ability to perform both tasks in a single system. Combining these models poses challenges, particularly due to differences in their vocabularies, which limit the effectiveness of traditional ensemble methods to post-generation techniques like N-best list re-ranking. In this work, we propose a novel zero-shot ensembling strategy that allows for the integration of different models during the decoding phase without the need for additional training. Our approach re-ranks beams during decoding by combining scores at the word level, using heuristics to predict when a word is completed. We demonstrate the effectiveness of this method in machine translation scenarios, showing that it enables the generation of translations that are both speech- and image-aware while also improving overall translation quality. We will release the code upon paper acceptance.

What problem does this paper attempt to address?

The paper addresses how to combine the strengths of different models in multimodal tasks, especially translation tasks that must handle text together with image or audio information. Most current natural language processing (NLP) models perform well on specific tasks but usually cannot process multimodal inputs. For example, a strong translation model may not be able to handle tasks that contain visual information, and vice versa. The paper therefore proposes a zero-shot ensembling strategy that integrates different models during the decoding stage without additional training, and improves the quality of multimodal translation through word-level re-ranking.

### Main problems

1. **Model integration in multimodal tasks**: Existing models are usually specialized in a single task, such as translation or image processing, and perform poorly on tasks that require processing multimodal inputs.
2. **Vocabulary mismatch**: Different models use different vocabularies, which limits the effectiveness of traditional ensemble methods, especially during decoding.
3. **Online re-ranking**: Existing re-ranking methods (such as N-best list re-ranking) can only be applied after generation and cannot influence the model's decisions in real time during decoding.

### Solutions

The paper proposes an online re-ranking algorithm that dynamically combines the outputs of multiple models during decoding. Its key elements are:

1. **Word-level re-ranking**: During decoding, scores from different models are combined at the word level, using heuristics to predict when a word is completed, so that each word can be evaluated once it has been generated.
2. **Plug-in method**: The method requires no additional training or task-specific data, so different models can be combined flexibly.
3. **Context-aware translation**: Experiments show that the method combines the strengths of different models and improves translation quality, especially in cases where multimodal information is required.

### Experimental verification

The paper evaluates the method on multiple test sets:

- **Unimodal translation**: The WMT 2022 English-to-German test set is used to evaluate translation quality.
- **Multimodal translation**: The MuST-SHE test set is used to evaluate the handling of gender ambiguity, and the CoMMuTE test set is used to evaluate image-assisted translation.

### Results

- **Improvement in translation quality**: The online re-ranking method outperforms the individual models and traditional offline re-ranking methods on multiple metrics.
- **Improvement in multimodal tasks**: On gender-ambiguity and image-assisted translation tasks, the method generates more accurate translations and improves overall translation quality.

Through these methods and experiments, the paper shows how to combine the strengths of different models in multimodal tasks, thereby improving translation quality and robustness.
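The idea of re-ranking beams at the word level can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hypothesis` class, the `completed_prefix` heuristic, the `toy_ranker`, and the interpolation `weight` are all hypothetical names and choices made for illustration. The sketch assumes a SentencePiece-style tokenization in which a token starting with `▁` begins a new word, and it treats a word as complete once a later token starts a new word or the hypothesis ends. Because the second model scores plain text rather than token IDs, the two models do not need to share a vocabulary.

```python
from dataclasses import dataclass
from typing import Callable, List

BOUNDARY = "\u2581"  # SentencePiece-style word-start marker (assumed tokenization)
EOS = "</s>"         # end-of-sequence token (assumed)


@dataclass
class Hypothesis:
    tokens: List[str]  # subword tokens produced by the generator model
    gen_score: float   # generator log-probability of the tokens so far


def completed_prefix(tokens: List[str]) -> str:
    """Return the text of all words assumed to be complete.

    Heuristic (an assumption of this sketch): the last word-start token
    opens a word that may still be extended, so only the words before it
    count as complete. If the hypothesis ended with EOS, every word counts.
    """
    if tokens and tokens[-1] == EOS:
        cut = len(tokens) - 1
    else:
        starts = [i for i, t in enumerate(tokens) if t.startswith(BOUNDARY)]
        cut = starts[-1] if starts else 0
    return "".join(tokens[:cut]).replace(BOUNDARY, " ").strip()


def rerank(beams: List[Hypothesis],
           ranker_logprob: Callable[[str], float],
           weight: float = 0.5) -> List[Hypothesis]:
    """Sort beams by an interpolation of generator and ranker scores.

    The ranker only sees the detokenized completed words, so it can use
    any vocabulary of its own.
    """
    def combined(h: Hypothesis) -> float:
        text = completed_prefix(h.tokens)
        rank_score = ranker_logprob(text) if text else 0.0
        return (1 - weight) * h.gen_score + weight * rank_score

    return sorted(beams, key=combined, reverse=True)


def toy_ranker(text: str) -> float:
    """Stand-in for a second model, e.g. one that saw an image of a river."""
    return -0.2 if "Ufer" in text else -3.0


# Two partial translations of an ambiguous English "bank".
beams = [
    Hypothesis([BOUNDARY + "Die", BOUNDARY + "Ban", "k", BOUNDARY + "ist"], -1.0),
    Hypothesis([BOUNDARY + "Das", BOUNDARY + "Ufer", BOUNDARY + "ist"], -1.2),
]
best = rerank(beams, toy_ranker)[0]
print(completed_prefix(best.tokens))
```

In a real decoder this re-ranking would run inside the beam search loop at each step, so that the second model can steer which hypotheses survive, rather than only reordering finished translations as N-best list re-ranking does.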

Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies
