TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro
2024-02-25
Abstract: The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, where we interpret different modalities as different languages, and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder conducts the core translation, whereas modality-specific processing is conducted only within the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT outperforms single model counterparts consistently, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses Multi-Modal Translation (MMT) in cross-modal information processing: arbitrary conversion between the three modalities of image, speech, and text. It proposes a new Tri-Modal Translation (TMT) framework that can translate freely between images, speech, and text. The main contributions of the paper include:

1. **Tri-Modal Translation**: For the first time, it explores translation between images, speech, and text by discretizing all modalities.
2. **Modalities as Languages**: It proposes a novel perspective of treating different modalities as different languages, thereby transforming the multi-modal translation problem into a classic Neural Machine Translation (NMT) problem.
3. **Data Efficiency**: It significantly reduces the computational burden through discretization, allowing the model to be trained efficiently on large-scale data.
4. **Task Unification**: It demonstrates that, with data augmentation strategies such as Back Translation (BT), single-modal data can be used to train multi-modal models, thereby improving performance.

### Background and Challenges

- **Complexity of Multi-Modal Systems**: Because modalities differ in their characteristics, a multi-modal system must handle both modality-specific features and inter-modal conversion, which increases the computational burden.
- **Data Scarcity**: Multi-modal paired data (e.g., speech-image-text) is scarcer than single-modal data, limiting the development of multi-modal models.

### Method Overview

1. **Modality Discretization**: Pre-trained modality-specific tokenizers are used to convert images, speech, and text into discrete tokens.
2. **Multi-Modal Encoder-Decoder Architecture**: A shared-parameter multi-modal encoder-decoder architecture is adopted, taking the discretized tokens as input and output.
3. **Modality Type Embedding**: Modality type embeddings are added in the encoder and decoder to distinguish different input and output modalities.
4. **Back Translation**: Pseudo-modal data is generated through back translation to further enrich the training data and improve model performance.

### Experimental Results

- **Image-to-Text Translation**: TMT outperforms single-task models on the COCO and Flickr8k datasets.
- **Image-to-Speech Translation**: TMT performs well in the direct synthesis task from image to speech, even approaching methods based on original images.
- **Text-Driven Image Synthesis**: TMT achieves the best CLIP similarity in the text-to-image synthesis task, indicating that the generated images are highly consistent with the given text descriptions.
- **Speech-Driven Image Synthesis**: TMT achieves the task of directly generating high-resolution images (768 × 768) from speech for the first time, with performance close to text-to-image synthesis tasks.
- **Automatic Speech Recognition (ASR) and Text-to-Speech Synthesis (TTS)**: TMT also performs well in ASR and TTS tasks, significantly outperforming existing methods.

### Conclusion

The TMT framework proposed in the paper not only performs well across multiple multi-modal tasks but also improves training efficiency through task unification and data augmentation strategies. This approach provides a new direction for the future development of multi-modal models.
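
To make the method overview above more concrete, here is a minimal Python sketch of the "modalities as languages" idea: each modality's discrete tokens are mapped into one shared id space, tagged with their modality, and paired as source/target sequences, as in machine translation. It is an illustration under stated assumptions, not the authors' implementation. The codebook sizes, the function names (`to_shared`, `from_shared`, `make_example`, `back_translate`), and the use of a leading tag token as a stand-in for the modality type embedding are all hypothetical choices made for this example.

```python
# Hypothetical per-modality codebook sizes (illustrative values only).
VOCAB_SIZES = {"text": 1000, "speech": 200, "image": 8192}
MODALITIES = list(VOCAB_SIZES)

# Reserve one id per modality as a type tag, then lay out each modality's
# token range back to back so that ids from different modalities never collide.
TAG_ID = {m: i for i, m in enumerate(MODALITIES)}
OFFSET = {}
_next = len(MODALITIES)
for m in MODALITIES:
    OFFSET[m] = _next
    _next += VOCAB_SIZES[m]
TOTAL_VOCAB = _next  # size of the shared vocabulary


def to_shared(modality, local_ids):
    """Map modality-local token ids into the shared id space.

    The leading tag token is a simple stand-in for a modality type embedding.
    """
    size = VOCAB_SIZES[modality]
    if any(not 0 <= t < size for t in local_ids):
        raise ValueError(f"token id out of range for modality {modality!r}")
    return [TAG_ID[modality]] + [OFFSET[modality] + t for t in local_ids]


def from_shared(shared_ids):
    """Invert to_shared: recover the modality name and its local token ids."""
    modality = MODALITIES[shared_ids[0]]
    return modality, [t - OFFSET[modality] for t in shared_ids[1:]]


def make_example(src_mod, src_ids, tgt_mod, tgt_ids):
    """Build one translation pair, exactly as for a pair of languages."""
    return {
        "source": to_shared(src_mod, src_ids),
        "target": to_shared(tgt_mod, tgt_ids),
    }


def back_translate(translate, tgt_mod, tgt_ids, src_mod):
    """Create a pseudo pair from unpaired target-side data.

    `translate` is any callable (target modality, target ids, source modality)
    -> source ids; in practice it would be a trained model run in the reverse
    direction. The pseudo source is paired with the real target.
    """
    pseudo_src = translate(tgt_mod, tgt_ids, src_mod)
    return make_example(src_mod, pseudo_src, tgt_mod, tgt_ids)


if __name__ == "__main__":
    pair = make_example("speech", [0, 5], "text", [7, 8])
    print(pair)
    print(from_shared(pair["source"]))
```

Because every sequence lives in the same id space, a single encoder-decoder with one embedding table of size `TOTAL_VOCAB` could in principle serve all six translation directions; only the tokenizers and detokenizers at the edges would need to be modality-specific.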