Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, Lidong Bing

2023-10-25

Abstract: We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous works that complement LLMs to process either visual or audio signals only, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. To address the first challenge, we propose a Video Q-former to assemble a pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind, a universal embedding model aligning multiple modalities, as the pre-trained audio encoder and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the output of both the visual and audio encoders with the LLM's embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then tune our model with visual-instruction datasets of moderate amount but higher quality. We found that Video-LLaMA shows the ability to perceive and comprehend video content and to generate meaningful responses grounded in the visual and auditory information presented in the videos.

Computation and Language, Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper asks how to enable large language models (LLMs) to understand both the visual and the auditory content of videos. The authors propose Video-LLaMA, a multimodal framework that lets an LLM follow temporal changes in visual scenes and the accompanying audio signals, in addition to text, and generate responses grounded in both. Unlike previous work that connects LLMs to either visual or audio input alone, Video-LLaMA processes the two modalities together.

The architecture has two branches. The Vision-Language Branch processes video frames and extracts visual features; the Audio-Language Branch handles audio signals. Training follows a multi-branch cross-modal scheme: the model is first pre-trained on large-scale video/image-caption pairs and then fine-tuned on smaller, higher-quality instruction-following datasets to improve its video understanding and instruction following.

In summary, the main contribution is the Video-LLaMA framework itself, which gives an LLM the ability to comprehend video content by processing visual and auditory information together.
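The two-branch data flow described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each branch reduces to one single-head cross-attention step (a stand-in for a Q-former, in which learned queries attend to frozen encoder features) followed by a linear projection into the LLM's embedding space. All names, dimensions, and random values are hypothetical and chosen only to show how the shapes line up.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, features):
    """Each query returns a softmax-weighted average of the feature vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / scale for f in features]
        weights = softmax(scores)
        out.append([sum(w * f[i] for w, f in zip(weights, features))
                    for i in range(len(features[0]))])
    return out

def project(vectors, weight):
    """Linear map from feature width to the (hypothetical) LLM embedding width."""
    return [[sum(v[i] * weight[i][j] for i in range(len(v)))
             for j in range(len(weight[0]))] for v in vectors]

def rand_matrix(rows, cols, rng):
    """Random stand-in for encoder outputs or learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def branch(features, queries, weight):
    """One branch: queries attend to encoder features, then get projected."""
    return project(cross_attend(queries, features), weight)

rng = random.Random(0)
FEAT_DIM, LLM_DIM = 8, 16                  # hypothetical widths
NUM_FRAMES, NUM_AUDIO_SEGMENTS = 4, 3      # hypothetical input lengths
NUM_VIDEO_QUERIES, NUM_AUDIO_QUERIES = 2, 2

# Stand-ins for the outputs of the frozen image and audio encoders.
frame_features = rand_matrix(NUM_FRAMES, FEAT_DIM, rng)
audio_features = rand_matrix(NUM_AUDIO_SEGMENTS, FEAT_DIM, rng)

# Parameters that would be trained: query embeddings and projections.
video_queries = rand_matrix(NUM_VIDEO_QUERIES, FEAT_DIM, rng)
audio_queries = rand_matrix(NUM_AUDIO_QUERIES, FEAT_DIM, rng)
video_proj = rand_matrix(FEAT_DIM, LLM_DIM, rng)
audio_proj = rand_matrix(FEAT_DIM, LLM_DIM, rng)

video_tokens = branch(frame_features, video_queries, video_proj)
audio_tokens = branch(audio_features, audio_queries, audio_proj)
text_embeddings = rand_matrix(5, LLM_DIM, rng)  # stand-in for embedded prompt tokens

# The frozen LLM would consume one sequence: video, audio, then text.
llm_input = video_tokens + audio_tokens + text_embeddings
print(len(llm_input), len(llm_input[0]))  # prints "9 16"
```

The point of the sketch is the shape contract: however many frames or audio segments come in, each branch emits a fixed number of query outputs at the LLM's embedding width, so they can be concatenated with the text embeddings. In this toy setup only the query embeddings and projection matrices would be trained; the encoder outputs and the LLM are treated as fixed, matching the frozen components described in the abstract.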

Audio-Visual LLM for Video Understanding

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

VideoLLM: Modeling Video Sequence with Large Language Models

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

VideoLLM-online: Online Video Large Language Model for Streaming Video

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

LLaViLo: Boosting Video Moment Retrieval Via Adapter-Based Multimodal Modeling

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

LLMs Meet Long Video: Advancing Long Video Comprehension with an Interactive Visual Adapter in LLMs

TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs

ST-LLM: Large Language Models Are Effective Temporal Learners

Video Understanding with Large Language Models: A Survey

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization