Meta-Transformer: A Unified Framework for Multimodal Learning

Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue

2023-07-20

Abstract: Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it still remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a **frozen** encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components: a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks, Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks including fundamental perception (text, image, point cloud, audio, video), practical application (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, Multimedia

What problem does this paper attempt to address?

This paper addresses the difficulty of designing a single network for multimodal learning that can process many kinds of data, such as natural language, 2D images, 3D point clouds, audio, video, time series, and tabular data. Because these modalities differ substantially from one another, building one network that handles all of them is challenging. Existing methods usually require paired multimodal training data, and most research focuses on the vision and language modalities, so these methods cannot share the parameters of a whole encoder across other modalities. To address these limitations, the paper proposes the **Meta-Transformer** framework, which uses a frozen encoder to perform multimodal perception without paired multimodal training data.

The main contributions of Meta-Transformer are:

1. **Proposing a new framework**: Meta-Transformer uses the same set of parameters to extract representations from multiple modalities. According to the paper, it is the first framework to perform unified learning across 12 modalities with unpaired data.
2. **Comprehensively analyzing the functions of Transformer components**: The paper examines in detail the roles of embeddings, tokenization, and encoders in processing different modalities, providing insights for developing modality-independent frameworks.
3. **Experimentally verifying the effectiveness of the framework**: Meta-Transformer is evaluated on benchmarks covering all 12 modalities, and the reported results support its potential for unified multimodal learning.

Through these contributions, Meta-Transformer points to new directions for the future development of unified multimodal intelligence.
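To make the three-component design concrete (modality-specific tokenizers, one shared encoder with frozen parameters, and task-specific heads), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the names `D`, `Tokenizer`, `FrozenEncoder`, `LinearHead`, and `predict` are hypothetical, the chunk-and-project tokenizer is a simplification, and a single self-attention layer with random weights stands in for a pretrained transformer encoder.

```python
import math
import random

D = 8  # width of the shared token space (illustrative value, not from the paper)


def make_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class Tokenizer:
    """Modality-specific: cut a non-empty 1-D signal into chunks and project each to D dims."""

    def __init__(self, chunk_size, rng):
        self.chunk_size = chunk_size
        self.proj = make_matrix(D, chunk_size, rng)  # would be trained per modality

    def __call__(self, raw):
        raw = list(raw)
        size = self.chunk_size
        chunks = [raw[i:i + size] for i in range(0, len(raw), size)]
        chunks[-1] = chunks[-1] + [0.0] * (size - len(chunks[-1]))  # zero-pad the tail
        return [matvec(self.proj, chunk) for chunk in chunks]


class FrozenEncoder:
    """Modality-shared: one self-attention layer whose weights are never updated.

    Assumption: random weights are placeholders for a pretrained encoder.
    """

    def __init__(self, rng):
        self.wq = make_matrix(D, D, rng)
        self.wk = make_matrix(D, D, rng)
        self.wv = make_matrix(D, D, rng)

    def __call__(self, tokens):
        queries = [matvec(self.wq, t) for t in tokens]
        keys = [matvec(self.wk, t) for t in tokens]
        values = [matvec(self.wv, t) for t in tokens]
        attended = []
        for q in queries:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
            total = sum(exps)
            weights = [e / total for e in exps]
            attended.append(
                [sum(w * v[d] for w, v in zip(weights, values)) for d in range(D)]
            )
        # Mean-pool the attended tokens into one feature vector.
        return [sum(tok[d] for tok in attended) / len(attended) for d in range(D)]


class LinearHead:
    """Task-specific: maps the pooled feature to task outputs (would be trained)."""

    def __init__(self, num_outputs, rng):
        self.weights = make_matrix(num_outputs, D, rng)

    def __call__(self, feature):
        return matvec(self.weights, feature)


rng = random.Random(0)
encoder = FrozenEncoder(rng)  # a single encoder shared by every modality
tokenizers = {"time_series": Tokenizer(4, rng), "audio": Tokenizer(16, rng)}
heads = {"time_series": LinearHead(3, rng), "audio": LinearHead(5, rng)}


def predict(modality, raw):
    """Tokenize with the modality's tokenizer, encode with the shared encoder, apply the head."""
    tokens = tokenizers[modality](raw)
    feature = encoder(tokens)
    return heads[modality](feature)
```

In this sketch, only the tokenizers and heads would receive gradient updates during training. The encoder is constructed once and reused unchanged for every modality, which is the property the frozen, modality-shared design depends on.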

Mmformer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation

Multimodal Token Fusion for Vision Transformers

Cross-Modal Meta-Knowledge Transfer: A Meta-Learning Framework Adaptable for Multimodal Tasks

High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning

Multimodal Transformer for Unaligned Multimodal Language Sequences

Multimodal Learning With Transformers: A Survey

TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

Everything is a Video: Unifying Modalities through Next-Frame Prediction

CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos

UniTranSeR: A Unified Transformer Semantic Representation Framework for Multimodal Task-Oriented Dialog System

Generic Multimodal Gradient-based Meta Learner Framework

MMTrans-MT: A Framework for Multimodal Emotion Recognition Using Multitask Learning

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation