OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

Zehan Wang,Ziang Zhang,Hang Zhang,Luping Liu,Rongjie Huang,Xize Cheng,Hengshuang Zhao,Zhou Zhao

2024-07-17

Abstract: Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representations would be a step toward co-processing more diverse multimodal information. In this work, we present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters, which support 3D, audio, image, and language inputs. Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together. This approach enables "scaling up" by indirectly increasing the model parameters and the amount of seen data. To effectively integrate various spaces, we dynamically assign weights to different spaces by learning routers with two objectives: cross-modal overall alignment and language representation decoupling. Notably, since binding and routing spaces both only require lightweight networks, OmniBind is extremely training-efficient. Learning the largest 30B model requires merely unpaired unimodal data and approximately 3 days on a single 8-4090 node. Extensive experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications, such as any-query and composable multimodal understanding.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the difficulty of building large-scale joint representation models that cover many modalities at once. Paired data across all modalities is scarce, so training such a model from scratch is impractical. The paper proposes OmniBind, a family of multimodal joint representation models that support 3D, audio, image, and language inputs. Rather than training from scratch, OmniBind remaps and binds together the representation spaces of existing pre-trained specialist models. This "scales up" the model indirectly, since the combined system inherits both the parameters and the training data of its component models.

The main contributions of the paper are:

1. The OmniBind models, with parameter counts ranging from 7 billion to 30 billion, which handle four modalities (3D point clouds, audio, images, and language).
2. A routing strategy that integrates spaces pre-trained on different modalities and datasets, mitigating interference between knowledge from different sources and improving generality.
3. Two learning objectives for the routers, cross-modal overall alignment and language representation decoupling, which let the routers dynamically predict combination weights for every modality combination while keeping the representations distinctive.
4. Strong results on 13 benchmarks covering all modality combinations, along with demonstrated applications such as 3D-audio retrieval and any-query separation and localization. Training is lightweight: according to the abstract, the largest 30B model needs only unpaired unimodal data and approximately 3 days on a single 8-4090 node.
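The routing idea described above, dynamically weighting several pre-trained spaces, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each space already yields an embedding of the same dimension for a given input, that the router outputs one score per space, and that the spaces are combined by a softmax-weighted sum; the actual combination rule in the paper may differ. The names `route_and_bind` and `router_logits` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw router scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def route_and_bind(space_embeddings, router_logits):
    """Combine one input's embeddings from several spaces.

    space_embeddings: list of equal-length vectors, one per pre-trained space
                      (assumed to be already remapped into a shared space).
    router_logits:    one score per space; in a real system these would come
                      from a learned router network (hypothetical here).
    Returns (weights, combined_embedding).
    """
    if len(space_embeddings) != len(router_logits):
        raise ValueError("need exactly one router score per space")
    weights = softmax(router_logits)
    dim = len(space_embeddings[0])
    combined = [0.0] * dim
    for weight, emb in zip(weights, space_embeddings):
        unit = l2_normalize(emb)
        for i in range(dim):
            combined[i] += weight * unit[i]
    return weights, l2_normalize(combined)


# Usage: two toy "spaces" with equal router scores contribute equally.
weights, combined = route_and_bind([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In this toy setup only the router scores would need to be learned while the underlying encoders stay frozen, which is consistent with the abstract's remark that binding and routing require only lightweight networks.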
