ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou
2023-05-19
Abstract: In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on a wide range of uni-modal and multi-modal tasks, including image classification (ImageNet), semantic segmentation (ADE20K), audio-text retrieval (AudioCaps, Clotho), audio classification (ESC-50, FSD50K, VGGSound), audio question answering (AVQA), image-text retrieval (MSCOCO, Flickr30K), and visual grounding (RefCOCO/+/g). Code is available at https://github.com/OFA-Sys/ONE-PEACE.
Computer Vision and Pattern Recognition, Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to build a general representation model that can be extended to an unlimited number of modalities. The authors propose ONE-PEACE, a highly extensible model that aligns and integrates representations across the vision, audio, and language modalities. The design targets three goals:

1. **Flexible architecture**: The architecture must be flexible enough to accommodate various modalities and to support multi-modal interaction.
2. **Modality-agnostic pre-training tasks**: The pre-training tasks must not only extract information within each modality but also ensure alignment across modalities.
3. **General and simple pre-training tasks**: The tasks should be general and simple so that they can be applied to different modalities.

To meet these goals, ONE-PEACE uses an architecture made up of multiple modality adapters and a modality fusion encoder. Each modality has its own adapter, which converts the raw input into a feature sequence. The modality fusion encoder is based on the Transformer architecture, and each Transformer block contains a shared self-attention layer and multiple modality-specific feed-forward networks (FFNs). With this design, a new modality can be added simply by plugging in the corresponding adapter and FFNs.

For pre-training, ONE-PEACE uses two modality-agnostic tasks:

- **Cross-modal contrastive learning**: This covers vision-language and audio-language contrastive learning, and aligns the semantic spaces of the vision, audio, and language modalities.
- **Intra-modal denoising contrastive learning**: This combines masked prediction with contrastive learning. It computes a contrastive loss between fine-grained masked features and visible features, which improves the model's fine-tuning performance on downstream tasks.
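To make the two ideas above more concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head, a ReLU FFN, one modality per forward pass, and a symmetric InfoNCE-style loss with a temperature of 0.07; all class, function, and parameter names are hypothetical and chosen only for illustration.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


class SharedAttentionBlock:
    """Hypothetical block: attention weights shared by all modalities, one FFN per modality."""

    def __init__(self, dim, modalities, rng):
        self.dim = dim
        self.rng = rng
        # Self-attention projections are shared across every modality.
        self.w_q, self.w_k, self.w_v, self.w_o = (
            rng.normal(0.0, dim ** -0.5, (dim, dim)) for _ in range(4)
        )
        self.ffn = {}
        for name in modalities:
            self.add_modality(name)

    def add_modality(self, name):
        # Extending to a new modality only adds a new FFN; shared weights are untouched.
        hidden = 4 * self.dim
        self.ffn[name] = (
            self.rng.normal(0.0, self.dim ** -0.5, (self.dim, hidden)),
            self.rng.normal(0.0, hidden ** -0.5, (hidden, self.dim)),
        )

    def __call__(self, x, modality):
        # x has shape (sequence_length, dim), as produced by a modality adapter.
        h = layer_norm(x)
        q, k, v = h @ self.w_q, h @ self.w_k, h @ self.w_v
        attn = np.exp(log_softmax(q @ k.T / np.sqrt(self.dim), axis=-1))
        x = x + (attn @ v) @ self.w_o
        # Route the tokens through the FFN that belongs to their modality.
        w_in, w_out = self.ffn[modality]
        return x + np.maximum(layer_norm(x) @ w_in, 0.0) @ w_out


def cross_modal_contrastive_loss(a, b, temperature=0.07):
    # a[i] and b[i] are embeddings of a matched pair (e.g. an image and its caption).
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    a_to_b = log_softmax(logits, axis=1)[idx, idx].mean()
    b_to_a = log_softmax(logits, axis=0)[idx, idx].mean()
    return -0.5 * (a_to_b + b_to_a)


rng = np.random.default_rng(0)
block = SharedAttentionBlock(dim=16, modalities=["vision", "language"], rng=rng)
vision_out = block(rng.normal(size=(5, 16)), "vision")

block.add_modality("audio")  # plug in a new modality after construction
audio_out = block(rng.normal(size=(7, 16)), "audio")

emb = rng.normal(size=(4, 8))
loss_aligned = cross_modal_contrastive_loss(emb, emb)
loss_mismatched = cross_modal_contrastive_loss(emb, emb[::-1])
```

In this sketch, adding the audio modality leaves the attention weights unchanged, which mirrors the extension mechanism described above. The loss is lower when matched pairs are aligned than when the pairing is reversed, which is the behavior a cross-modal contrastive objective is meant to reward.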
With these designs, ONE-PEACE achieves leading results on uni-modal tasks (such as image classification and semantic segmentation) and on multi-modal tasks (such as audio-text retrieval, audio classification, audio question answering, image-text retrieval, and visual grounding), and it shows the potential to be extended to unlimited modalities.