Abstract: We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions for the given audible videos, including the appearance and spatial location of each object, the actions of moving objects, and the sounds in the videos. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. FAVD, in contrast, requires not only audio-visual-language modeling skills but also paragraph-level language generation abilities. We construct the first fine-grained audible video description benchmark (FAVDBench) to facilitate this research. For each video clip, we first provide a one-sentence summary of the video, i.e., the caption, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. The descriptions are provided in both English and Chinese. We create two new metrics for this task: an EntityScore to gauge the completeness of entities in the visual descriptions, and an AudioScore to assess the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends an existing video captioning model with an additional audio branch. We combine the masked language modeling and auto-regressive language modeling losses to optimize our model so that it can produce paragraph-level descriptions. We illustrate the efficiency of our model in audio-visual-language modeling by evaluating it against the proposed benchmark using both conventional captioning metrics and our proposed metrics. We further put our benchmark to the test in video generation models, demonstrating that employing fine-grained video descriptions can create more intricate videos than using captions. Code and dataset are available at https://github.com/OpenNLPLab/FAVDBench. Our online benchmark is available at www.avlbench.opennlplab.cn.
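To make the idea of "completeness of entities" concrete, the following is a toy sketch only. It is not the EntityScore defined in the paper, whose exact formulation the abstract does not give. It assumes a hypothetical helper, `entity_coverage`, that measures the fraction of annotated reference entities mentioned verbatim in a generated description, using plain whitespace tokenization.

```python
def entity_coverage(reference_entities, predicted_text):
    """Toy illustration (an assumption, not the paper's EntityScore):
    fraction of reference entities that appear as words in the prediction."""
    # Lowercase and strip simple punctuation before whitespace tokenization.
    cleaned = predicted_text.lower().replace(".", " ").replace(",", " ")
    text_tokens = set(cleaned.split())

    refs = {entity.lower() for entity in reference_entities}
    if not refs:
        # No annotated entities: define coverage as 0.0 to avoid division by zero.
        return 0.0

    matched = {entity for entity in refs if entity in text_tokens}
    return len(matched) / len(refs)


# Hypothetical annotation and model output, for illustration only.
reference = ["man", "guitar", "sofa", "window"]
prediction = "A man sits on a sofa and plays a guitar."
score = entity_coverage(reference, prediction)
print(score)
```

In this example three of the four reference entities are mentioned, so the description is penalized for omitting the window. An actual metric would also need to handle synonyms, multi-word entities, and inflected forms, which exact string matching ignores.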

A CLIP-Enhanced Method for Video-Language Understanding

CLIPVQA: Video Quality Assessment via CLIP

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

CLIP4Caption ++: Multi-CLIP for Video Caption

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

How Much Can CLIP Benefit Vision-and-Language Tasks?

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Verbs in Action: Improving verb understanding in video-language models

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models

Cap4Video++: Enhancing Video Understanding with Auxiliary Captions

Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

ViLT-CLIP: Video and Language Tuning CLIP with Multimodal Prompt Learning and Scenario-Guided Optimization

CLIP4Caption: CLIP for Video Caption

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

XtremeCLIP: Extremely Parameter-efficient Tuning for Low-resource Vision Language Understanding

Fine-grained Audible Video Description

RTQ: Rethinking Video-language Understanding Based on Image-text Model

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection