ImageBind-LLM: Multi-modality Instruction Tuning

Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao
2023-09-12
Abstract: We present ImageBind-LLM, a multi-modality instruction tuning method of large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning, different from which, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic by only image-text alignment training. During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder. Then, the image features transformed by the bind network are added to word tokens of all layers in LLaMA, which progressively injects visual instructions via an attention-free and zero-initialized gating mechanism. Aided by the joint embedding of ImageBind, the simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During inference, the multi-modality inputs are fed into the corresponding ImageBind encoders, and processed by a proposed visual cache model for further cross-modal embedding enhancement. The training-free cache model retrieves from three million image features extracted by ImageBind, which effectively mitigates the training-inference modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to instructions of diverse modalities and demonstrate significant language generation quality. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
Multimedia, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the limitations of large language models (LLMs) in multimodal instruction tuning. Existing work focuses mainly on language and image instruction tuning. The paper proposes a new method, **ImageBind-LLM**, which can respond to inputs from multiple modalities, including audio, 3D point clouds, and video, and achieves this using only image-text alignment training.

#### Main problems and challenges:

1. **Limitations of multimodal instruction tuning**:
   - Most existing methods handle instruction tuning only for images and text, and cannot effectively handle inputs from other modalities (such as audio, video, and 3D point clouds).
   - How to build an LLM that can follow multimodal instructions remains an underexplored problem.
2. **Modality discrepancy between training and inference**:
   - Training typically uses only an image encoder, whereas inference must process inputs from multiple modalities, which may degrade performance.
3. **The need for efficient tuning**:
   - Multimodal instruction tuning should be achieved without significantly increasing computational cost.

### Solutions of ImageBind-LLM:

1. **Multimodal instruction response**:
   - ImageBind-LLM aligns data from different modalities through a joint embedding space, enabling the model to understand and respond to inputs from multiple modalities.
2. **Efficient tuning method**:
   - ImageBind's image encoder is frozen and only some of LLaMA's weights are fine-tuned, using parameter-efficient tuning techniques (such as LoRA and bias-normalization tuning).
3. **Attention-free, zero-initialized injection mechanism**:
   - Multimodal conditions are added directly to all word tokens of LLaMA through a learnable, zero-initialized gating mechanism, which simplifies the injection of visual knowledge.
4. **Cross-modal cache retrieval**:
   - A cache model is built from image features extracted by ImageBind. At inference time, retrieving similar visual features enhances the multimodal embedding and alleviates the modality discrepancy between training and inference.

### Summary:

By proposing ImageBind-LLM, this paper addresses the limitations of existing LLMs in multimodal instruction tuning, provides an efficient and general method for handling inputs from multiple modalities, and further improves model performance through cross-modal cache retrieval.
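
The two mechanisms described above, cache retrieval and zero-initialized gated injection, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `cache_enhance` and `inject`, the top-`k` selection, the softmax weighting over similarities, and the blend ratio `alpha` are all assumptions chosen for illustration, and the toy 2-dimensional vectors stand in for real ImageBind embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cache_enhance(query, cache, k=2, alpha=0.5):
    """Blend a query embedding with its k most similar cached image features.

    Hypothetical helper: k, alpha, and the softmax weighting are
    illustrative assumptions, not values taken from the paper.
    """
    # Retrieve the k cached image features closest to the query.
    top = sorted(cache, key=lambda feat: cosine(query, feat), reverse=True)[:k]
    # Turn similarities into normalized weights with a softmax.
    exps = [math.exp(cosine(query, feat)) for feat in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the retrieved features, one dimension at a time.
    retrieved = [
        sum(w * feat[i] for w, feat in zip(weights, top))
        for i in range(len(query))
    ]
    # Mix the retrieved image-like feature with the original embedding.
    return [alpha * r + (1 - alpha) * q for r, q in zip(retrieved, query)]


def inject(tokens, visual, gate):
    """Add the gated visual feature to every word-token vector (no attention).

    Hypothetical helper: with gate == 0.0 (the initial value of a
    zero-initialized gate) the tokens are returned unchanged.
    """
    return [[t + gate * v for t, v in zip(tok, visual)] for tok in tokens]


# Toy usage: a non-image embedding (e.g. audio) is enhanced by the image
# cache, then injected into the word tokens of one layer.
cache = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
audio_embedding = [1.0, 0.0]
enhanced = cache_enhance(audio_embedding, cache, k=2, alpha=0.5)

tokens = [[0.1, 0.2], [0.3, 0.4]]
assert inject(tokens, enhanced, gate=0.0) == tokens  # zero gate: no change
shifted = inject(tokens, enhanced, gate=0.1)         # gate opens as it is learned
```

Starting the gate at zero means the language model initially behaves exactly as it did before tuning, and the visual signal is introduced gradually as the gate is learned. The cache step pulls a non-image embedding toward the image features seen during training, which is the intuition behind reducing the training-inference discrepancy.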