Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, Lidong Bing
2023-10-25
Abstract: We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous works that extend LLMs to process either visual or audio signals only, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. To address the first challenge, we propose a Video Q-former to assemble a pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind, a universal embedding model aligning multiple modalities, as the pre-trained audio encoder and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the output of both visual and audio encoders with the LLM's embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then tune our model with visual-instruction datasets of moderate size but higher quality. We found that Video-LLaMA shows the ability to perceive and comprehend video content and to generate meaningful responses grounded in the visual and auditory information presented in the videos.
Computation and Language, Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses how to enable large language models (LLMs) to understand both the visual and the auditory content of videos. The authors propose Video-LLaMA, a multimodal framework that lets an LLM follow temporal changes in visual scenes, interpret the accompanying audio, and generate responses grounded in both. Unlike earlier work that extends LLMs to handle either visual or audio signals alone, Video-LLaMA processes both modalities together. Its architecture has two branches: the Vision-Language Branch, which processes video frames and extracts visual features, and the Audio-Language Branch, which handles audio signals. The paper also proposes a multi-branch cross-modal training procedure: pre-training on large-scale video/image-caption pairs, followed by fine-tuning on smaller but higher-quality instruction-following datasets to improve video understanding and instruction following. In summary, the main contribution is a framework that gives LLMs the ability to comprehend video content using visual and auditory information jointly.
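The two-branch design can be illustrated with a toy sketch. The code below is not the authors' implementation: it assumes that each Q-former can be approximated by a single cross-attention step in which a fixed set of query vectors attends over encoder features, and it uses random vectors as stand-ins for the frozen encoder outputs. The function names, dimensions, query counts, and the additive frame position embeddings are all hypothetical choices made for illustration; learned projections and the mapping into the LLM's embedding space are left out.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention with no learned projections (a simplification).

    Each query becomes a weighted average of the features, so the output
    length equals the number of queries regardless of how many features exist.
    """
    dim = len(features[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        out.append([sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)])
    return out


def random_vectors(n, dim, rng):
    """Stand-in for encoder outputs or learned parameters (hypothetical data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def encode_branch(features, queries, position_embeddings=None):
    """One branch: optionally add per-step position embeddings, then pool with queries."""
    if position_embeddings is not None:
        features = [[x + p for x, p in zip(f, pos)]
                    for f, pos in zip(features, position_embeddings)]
    return cross_attend(queries, features)


rng = random.Random(0)
DIM = 8                                   # assumed feature width
NUM_FRAMES, NUM_AUDIO_SEGMENTS = 6, 3     # assumed input lengths
NUM_VIDEO_QUERIES, NUM_AUDIO_QUERIES = 4, 2

frame_features = random_vectors(NUM_FRAMES, DIM, rng)          # frozen image encoder stand-in
audio_features = random_vectors(NUM_AUDIO_SEGMENTS, DIM, rng)  # frozen audio encoder stand-in
frame_positions = random_vectors(NUM_FRAMES, DIM, rng)         # assumed temporal embeddings
video_queries = random_vectors(NUM_VIDEO_QUERIES, DIM, rng)
audio_queries = random_vectors(NUM_AUDIO_QUERIES, DIM, rng)

video_tokens = encode_branch(frame_features, video_queries, frame_positions)
audio_tokens = encode_branch(audio_features, audio_queries)

# The pooled tokens from both branches would be placed ahead of the text prompt.
llm_prefix = video_tokens + audio_tokens
print(len(llm_prefix), len(llm_prefix[0]))  # 6 8
```

The point of the sketch is that each branch turns a variable-length feature sequence into a fixed number of tokens, so clips of different lengths yield a prefix of the same size for the language model.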