Abstract: The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages, and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT outperforms single model counterparts consistently, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
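
To make the "unified interface" idea concrete, here is a minimal sketch of one common way to place discrete tokens from several modalities in a single shared vocabulary: each modality gets its own contiguous ID range. This is an illustration only. The codebook sizes in `VOCAB_SIZES` and the helper names `build_offsets`, `to_shared`, and `from_shared` are hypothetical, and the abstract does not say whether TMT lays out its vocabulary this way.

```python
from typing import Dict, List, Tuple

# Hypothetical codebook sizes; the source does not state these values.
VOCAB_SIZES: Dict[str, int] = {"text": 1000, "speech": 200, "image": 8192}


def build_offsets(vocab_sizes: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Assign each modality a contiguous ID range in one shared vocabulary."""
    offsets: Dict[str, int] = {}
    start = 0
    for modality, size in vocab_sizes.items():
        offsets[modality] = start
        start += size
    return offsets, start  # start is now the total shared vocabulary size


def to_shared(modality: str, token_ids: List[int],
              offsets: Dict[str, int], vocab_sizes: Dict[str, int]) -> List[int]:
    """Map modality-local token IDs into the shared ID space."""
    size = vocab_sizes[modality]
    for t in token_ids:
        if not 0 <= t < size:
            raise ValueError(f"token {t} out of range for {modality}")
    return [offsets[modality] + t for t in token_ids]


def from_shared(shared_ids: List[int], offsets: Dict[str, int],
                vocab_sizes: Dict[str, int]) -> Tuple[str, List[int]]:
    """Recover the modality and local IDs of a single-modality sequence."""
    for modality, start in offsets.items():
        end = start + vocab_sizes[modality]
        if all(start <= s < end for s in shared_ids):
            return modality, [s - start for s in shared_ids]
    raise ValueError("sequence mixes modalities or is out of range")


offsets, total_size = build_offsets(VOCAB_SIZES)
speech_shared = to_shared("speech", [0, 5], offsets, VOCAB_SIZES)
```

With a layout like this, a single encoder-decoder can read and emit any modality through one embedding table, and only the tokenizers and detokenizers need to know anything modality-specific.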

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses Multi-Modal Translation (MMT) in cross-modal information processing: arbitrary conversion between the three modalities of image, speech, and text. It proposes a new Tri-Modal Translation (TMT) framework that can freely translate between images, speech, and text. The main contributions of the paper include:

1. **Tri-Modal Translation**: For the first time, it explores the translation between images, speech, and text by discretizing all modalities.
2. **Modalities as Languages**: It proposes a novel perspective of treating different modalities as different languages, thereby transforming the multi-modal translation problem into a classic Neural Machine Translation (NMT) problem.
3. **Data Efficiency**: It significantly reduces the computational burden through discretization techniques, allowing the model to be efficiently trained on large-scale data.
4. **Task Unification**: It demonstrates that, with data augmentation strategies such as Back Translation (BT), single-modal data can be used to train multi-modal models, thereby improving performance.

### Background and Challenges

- **Complexity of Multi-Modal Systems**: Because modalities differ in their characteristics, multi-modal systems must handle both modality-specific features and inter-modal conversion, which increases the computational burden.
- **Data Scarcity**: Paired multi-modal data (e.g., speech-image-text) is scarcer than single-modal data, limiting the development of multi-modal models.

### Method Overview

1. **Modality Discretization**: Pre-trained modality-specific tokenizers are used to discretize images, speech, and text into discrete tokens.
2. **Multi-Modal Encoder-Decoder Architecture**: A shared-parameter multi-modal encoder-decoder architecture is adopted, taking the discretized tokens as input and output.
3. **Modality Type Embedding**: Modality type embeddings are added in the encoder and decoder to distinguish different input and output modalities.
4. **Back Translation**: Pseudo-modal data is generated through back translation to further enrich the training data and improve model performance.

### Experimental Results

- **Image-to-Text Translation**: TMT outperforms single-task models on the COCO and Flickr8k datasets.
- **Image-to-Speech Translation**: TMT performs well in the direct synthesis task from image to speech, even approaching methods based on original images.
- **Text-Driven Image Synthesis**: TMT achieves the best CLIP similarity in the text-to-image synthesis task, indicating that the generated images are highly consistent with the given text descriptions.
- **Speech-Driven Image Synthesis**: TMT achieves the task of directly generating high-resolution images (768 × 768) from speech for the first time, with performance close to text-to-image synthesis tasks.
- **Automatic Speech Recognition (ASR) and Text-to-Speech Synthesis (TTS)**: TMT also performs well in ASR and TTS tasks, significantly outperforming existing methods.

### Conclusion

The TMT framework proposed in the paper not only performs well across multiple multi-modal tasks but also improves training efficiency through task unification and data augmentation strategies. This approach provides a new direction for the future development of multi-modal models.
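
The back-translation step in the method overview can be sketched as follows, in the generic form used in machine translation: a reverse-direction model turns unpaired target data into pseudo sources, and the resulting pairs are added to the training set. This is a minimal sketch under stated assumptions, not the paper's implementation. `make_pseudo_pairs` and `toy_reverse_model` are hypothetical names, and the toy model is a deterministic stand-in for a trained translator.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[int]
Pair = Tuple[Tokens, Tokens]  # (source tokens, target tokens)


def make_pseudo_pairs(
    target_only: Sequence[Tokens],
    reverse_translate: Callable[[Tokens], Tokens],
) -> List[Pair]:
    """Build (pseudo source, real target) pairs from unpaired target data.

    reverse_translate maps a target sequence back to the source modality.
    The real target stays on the output side, so the forward model still
    learns to produce genuine data from (possibly noisy) generated inputs.
    """
    pairs: List[Pair] = []
    for target in target_only:
        pseudo_source = reverse_translate(target)
        pairs.append((pseudo_source, list(target)))
    return pairs


def toy_reverse_model(tokens: Tokens) -> Tokens:
    """Hypothetical stand-in for a trained reverse-direction model."""
    return [t + 100 for t in reversed(tokens)]


# Unpaired data that exists only in the target modality (assumed example).
unpaired_targets = [[1, 2, 3], [7, 8]]
pseudo_pairs = make_pseudo_pairs(unpaired_targets, toy_reverse_model)

# Real paired data and pseudo pairs are then mixed for training.
real_pairs: List[Pair] = [([101, 102], [2, 1])]
training_set = real_pairs + pseudo_pairs
```

Keeping the real data on the target side is the usual design choice in back translation: errors made by the reverse model only corrupt the inputs, not the outputs the forward model is trained to produce.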

TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
