Abstract: We present ImageBind-LLM, a multi-modality instruction tuning method of large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning, different from which, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic by only image-text alignment training. During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder. Then, the image features transformed by the bind network are added to word tokens of all layers in LLaMA, which progressively injects visual instructions via an attention-free and zero-initialized gating mechanism. Aided by the joint embedding of ImageBind, the simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During inference, the multi-modality inputs are fed into the corresponding ImageBind encoders, and processed by a proposed visual cache model for further cross-modal embedding enhancement. The training-free cache model retrieves from three million image features extracted by ImageBind, which effectively mitigates the training-inference modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to instructions of diverse modalities and demonstrate significant language generation quality. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
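The attention-free, zero-initialized gating described in the abstract can be illustrated with a minimal sketch. It is not the released implementation: the single linear projection standing in for the bind network, the function names `bind_project` and `inject`, and the toy dimensions are all assumptions made for illustration. The point it shows is that with the gate at zero the word tokens pass through unchanged, so the visual signal can only enter gradually as the gate is learned.

```python
def bind_project(image_feature, weight):
    """Hypothetical stand-in for the bind network: one linear map
    from the encoder feature width to the LLM token width."""
    return [sum(w * x for w, x in zip(row, image_feature)) for row in weight]


def inject(word_tokens, visual_feature, gate):
    """Add the gated visual feature to every word token.
    No attention is computed; the same vector is added to each token."""
    return [[t + gate * v for t, v in zip(token, visual_feature)]
            for token in word_tokens]


# Toy values (assumed): encoder width 4, LLM width 3, two word tokens.
image_feature = [0.5, -1.0, 2.0, 0.0]
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
word_tokens = [[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]]

visual = bind_project(image_feature, weight)

# Gate starts at zero: the layer's word tokens are left untouched.
unchanged = inject(word_tokens, visual, gate=0.0)

# A nonzero gate (as would result from training) shifts every token.
shifted = inject(word_tokens, visual, gate=0.5)
```

In the method as described, this addition is applied at every LLaMA layer; the sketch shows a single layer, and repeating `inject` with one gate per layer would be the straightforward extension.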

What problem does this paper attempt to address?

This paper addresses the limitations of large language models (LLMs) in multimodal instruction tuning. Existing work focuses mainly on language and image instruction tuning; this paper proposes **ImageBind-LLM**, which can respond to inputs from multiple modalities, including audio, 3D point clouds, and video, while being trained only on image-text alignment.

#### Main problems and challenges:

1. **Limitations of multimodal instruction tuning**:
   - Most existing methods handle instruction tuning only for images and text, and cannot effectively handle other modalities such as audio, video, and 3D point clouds.
   - How to build an LLM that follows multimodal instructions remains an under-explored problem.
2. **Modality discrepancy between training and inference**:
   - Only an image encoder is used during training, while inputs from multiple modalities must be processed at inference time, which may degrade performance.
3. **The need for efficient tuning**:
   - Multimodal instruction tuning should be achieved without significantly increasing computational cost.

### Solutions of ImageBind-LLM:

1. **Multimodal instruction response**:
   - ImageBind-LLM relies on ImageBind's joint embedding space, which aligns different modalities, so the model can understand and respond to inputs from multiple modalities.
2. **Efficient tuning method**:
   - The ImageBind image encoder is frozen and only some weights of LLaMA are fine-tuned, using parameter-efficient tuning techniques such as LoRA and bias-norm tuning.
3. **Attention-free, zero-initialized injection mechanism**:
   - Multimodal conditions are added directly to all word tokens of LLaMA through a learnable, zero-initialized gate, which simplifies the injection of visual knowledge.
4. **Cross-modal cache retrieval**:
   - A cache model is built from image features extracted by ImageBind. At inference time, similar visual features are retrieved to enhance the multimodal embedding, mitigating the modality discrepancy between training and inference.

### Summary:

This paper addresses the limitations of existing LLMs in multimodal instruction tuning by proposing ImageBind-LLM, an efficient and general method for handling inputs from multiple modalities, and further improves embedding quality at inference time through cross-modal cache retrieval.
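The cross-modal cache retrieval described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `enhance_with_cache`, the cosine-similarity top-k lookup, the softmax weighting over retrieved features, and the `blend` parameter are all hypothetical choices made to show the general idea of pulling a non-image embedding toward nearby cached image features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def enhance_with_cache(query, cache, k=2, blend=0.5):
    """Retrieve the k cached image features most similar to `query`
    and mix their softmax-weighted average into the query embedding.

    Returns (enhanced_embedding, indices_of_retrieved_features).
    """
    scored = sorted(
        ((cosine(query, feat), idx) for idx, feat in enumerate(cache)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Softmax over the similarity scores of the retrieved features.
    exps = [math.exp(score) for score, _ in scored]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(query)
    retrieved = [
        sum(w * cache[idx][d] for w, (_, idx) in zip(weights, scored))
        for d in range(dim)
    ]
    enhanced = [(1.0 - blend) * q + blend * r for q, r in zip(query, retrieved)]
    return enhanced, [idx for _, idx in scored]


# Toy cache of "image features" (assumed values) and a query embedding
# standing in for an audio or point-cloud feature from the same space.
cache = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.2]

enhanced, hits = enhance_with_cache(query, cache, k=2, blend=0.5)
passthrough, _ = enhance_with_cache(query, cache, k=2, blend=0.0)
```

Because no parameters are learned, a lookup of this kind is training-free; at the scale mentioned in the abstract (millions of cached features), an approximate nearest-neighbor index would likely replace the exhaustive scan used here.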

ImageBind-LLM: Multi-modality Instruction Tuning

LLMBind: A Unified Modality-Task Integration Framework

Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Instruction Tuning-free Visual Token Complement for Multimodal LLMs

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models

InfMLLM: A Unified Framework for Visual-Language Tasks

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models