Abstract: We propose a new task, dataset and model for grounded video caption generation. This task unifies captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally consistent bounding boxes. We introduce the following contributions. First, we present a task definition and a manually annotated test dataset for this task, referred to as GROunded Video Caption Generation (GROC). Second, we introduce a large-scale automatic annotation method leveraging an existing model for grounded still image captioning together with an LLM for summarising frame-level captions into temporally consistent captions in video. Furthermore, we prompt the LLM to track by language -- classifying noun phrases from the frame-level captions into noun phrases of the video-level generated caption. We apply this approach to videos from the HowTo100M dataset, which results in a new large-scale training dataset, called HowToGround, with automatically annotated captions and spatio-temporally consistent bounding boxes with coherent natural language labels. Third, we introduce a new grounded video caption generation model, called VideoGround, and train the model on the new automatically annotated HowToGround dataset. Finally, results of our VideoGround model set the state of the art for the new task of grounded video caption generation. We perform extensive ablations and demonstrate the importance of key technical contributions of our model.
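The "track by language" step in the abstract can be illustrated with a minimal Python sketch. It assumes frame-level grounded captions are already available as (noun phrase, box) pairs. The function `assign_phrase` is a hypothetical stand-in for the LLM classifier: it uses simple word overlap purely for illustration and is not the authors' prompt or method. The box convention `(x1, y1, x2, y2)` and all names are likewise assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed convention: (x1, y1, x2, y2)
STOPWORDS = {"a", "an", "the"}


def content_words(phrase: str) -> set:
    """Lower-case the phrase and drop articles."""
    return {w for w in phrase.lower().split() if w not in STOPWORDS}


def assign_phrase(frame_phrase: str, video_phrases: List[str]) -> Optional[str]:
    """Hypothetical stand-in for the LLM classifier.

    Picks the video-level phrase sharing the most content words with the
    frame-level phrase, or None when nothing overlaps.
    """
    words = content_words(frame_phrase)
    best, best_overlap = None, 0
    for candidate in video_phrases:
        overlap = len(words & content_words(candidate))
        if overlap > best_overlap:
            best, best_overlap = candidate, overlap
    return best


def track_by_language(
    frames: List[List[Tuple[str, Box]]], video_phrases: List[str]
) -> Dict[str, Dict[int, Box]]:
    """Group per-frame boxes into one track per video-level noun phrase."""
    tracks: Dict[str, Dict[int, Box]] = defaultdict(dict)
    for frame_idx, detections in enumerate(frames):
        for phrase, box in detections:
            label = assign_phrase(phrase, video_phrases)
            # Keep at most one box per phrase per frame (first match wins).
            if label is not None and frame_idx not in tracks[label]:
                tracks[label][frame_idx] = box
    return dict(tracks)


# Toy input: three frames with frame-level phrases and boxes.
frames = [
    [("a man", (10, 10, 50, 90)), ("a sharp knife", (40, 50, 60, 60))],
    [("the man", (12, 10, 52, 90))],
    [("a knife", (42, 52, 62, 62)), ("a table", (0, 80, 100, 100))],
]
video_phrases = ["a man", "a knife"]
tracks = track_by_language(frames, video_phrases)
```

In this toy input, the phrase "a table" matches no video-level phrase and is dropped, while "a knife" is tracked in frames 0 and 2 only. The result is a set of labelled box tracks that need not cover every frame.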

What problem does this paper attempt to address?

### The Problem the Paper Aims to Solve

This paper aims to address a new task in video caption generation: Grounded Video Caption Generation. Specifically, this task not only requires generating natural language text that describes the video content but also demands annotating the objects mentioned in the text with temporally consistent bounding boxes in the video. This involves two main challenges:

1. **Generating high-quality video descriptions**: Generating natural language text that accurately describes the video content.
2. **Temporal consistency in object annotation**: Ensuring that the objects mentioned in the text are correctly annotated across different frames in the video, and that these annotations are temporally consistent.

### Background and Motivation

Significant progress has been made in the field of multimodal video understanding, especially with the help of large language models (LLMs). However, existing research mostly focuses on generating video-level descriptions or locating specific moments in the video, with less attention on spatiotemporal annotation of objects mentioned in the descriptions. Such spatiotemporal annotation is crucial for the development of fields like human-computer interaction and embodied perception, but there is a lack of suitable annotated datasets and dedicated models to support this task.

### Main Contributions

1. **Task Definition and Test Dataset**: The authors propose a new task, Grounded Video Caption Generation (GROC), and create a test dataset containing 1000 manually annotated videos.
2. **Large-scale Automatic Annotation Method**: To overcome the lack of training data, the authors propose a large-scale automatic annotation method using existing image annotation models and LLMs. This method can generate automatically annotated video descriptions and temporally consistent bounding boxes from the HowTo100M dataset, forming a new large-scale training dataset, HowToGround.
3. **New Model**: The authors introduce a new model, VideoGround, specifically designed for the grounded video caption generation task. Key technical innovations of this model include:
   - **Spatiotemporal Adapter**: Efficiently modeling spatiotemporal information in videos.
   - **Bounding Box Decoder**: Generating temporally consistent bounding boxes using pre-trained weights.
   - **Temporal Objectness Head**: Explicitly modeling the appearance and disappearance of objects in the video.
4. **Experimental Results**: The authors validate the importance of each part of the model through extensive ablation experiments and demonstrate the state-of-the-art performance of the VideoGround model on the new task.

### Summary

By defining a new task, creating a test dataset, proposing a large-scale automatic annotation method, and designing a new model, this paper systematically addresses the problem of grounded video caption generation. These contributions not only advance research in the field of multimodal video understanding but also provide strong support for practical applications.
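To make the task output concrete, the following Python sketch shows one possible representation of a grounded video caption: a caption string whose noun phrases each carry one box slot per frame, with `None` marking frames where the object is absent. This is an illustrative assumption, not the data format used by the authors. The class names, the `(x1, y1, x2, y2)` box convention, and the helper `make_phrase` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed convention: (x1, y1, x2, y2)


@dataclass
class GroundedPhrase:
    text: str                    # noun phrase as it appears in the caption
    span: Tuple[int, int]        # character span [start, end) in the caption
    boxes: List[Optional[Box]]   # one slot per frame; None = object not visible

    def visible_segments(self) -> List[Tuple[int, int]]:
        """Return inclusive (first_frame, last_frame) runs where a box exists."""
        segments: List[Tuple[int, int]] = []
        start: Optional[int] = None
        for i, box in enumerate(self.boxes):
            if box is not None and start is None:
                start = i                        # object appears
            elif box is None and start is not None:
                segments.append((start, i - 1))  # object disappears
                start = None
        if start is not None:
            segments.append((start, len(self.boxes) - 1))
        return segments


@dataclass
class GroundedVideoCaption:
    caption: str
    phrases: List[GroundedPhrase]


def make_phrase(caption: str, text: str, boxes: List[Optional[Box]]) -> GroundedPhrase:
    """Hypothetical helper: locate the first occurrence of `text` in the caption."""
    start = caption.index(text)
    return GroundedPhrase(text=text, span=(start, start + len(text)), boxes=boxes)


# Toy example with four frames; the tomato is hidden in frame 2.
caption = "A man cuts a tomato"
tomato = make_phrase(
    caption,
    "a tomato",
    [(30, 40, 50, 60), (31, 40, 51, 60), None, (33, 41, 53, 61)],
)
example = GroundedVideoCaption(caption=caption, phrases=[tomato])
segments = tomato.visible_segments()
```

Keeping an explicit `None` for frames without a box separates "the object is absent" from "the box was not predicted". That is the distinction a temporal objectness signal would need to capture.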

Grounded Video Caption Generation

Grounded Video Description

Learning Comprehensive Visual Grounding for Video Captioning

Learning Visual Grounding from Generative Vision and Language Model

Comprehensive Visual Grounding for Video Description

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Top-down framework for weakly-supervised grounded image captioning

DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization.

Grounded Video Situation Recognition

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

GLIGEN: Open-Set Grounded Text-to-Image Generation

Generating Descriptions with Grounded and Co-Referenced People

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric

Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Dense Video Object Captioning from Disjoint Supervision

SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses