Abstract: Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise key events within a video. Despite recent advancements, challenges persist, notably in effectively utilising the multimodal signals inherent in videos and in addressing the long-tail distribution of words. The paper introduces a novel multimodal integrated caption generation framework for VPC that leverages information from various modalities and external knowledge bases. The framework constructs two graphs: a 'video-specific' temporal graph capturing major events and interactions between multimodal information and commonsense knowledge, and a 'theme graph' representing correlations between words of a specific theme. These graphs serve as input for a transformer network with a shared encoder-decoder architecture. The authors also introduce a node selection module to enhance decoding efficiency by selecting the most relevant nodes from the graphs. The results demonstrate superior performance across benchmark datasets.

What problem does this paper attempt to address?

This paper addresses three key problems in Video Paragraph Captioning (VPC):

1. **Effective use of multimodal signals**: Most existing VPC work relies mainly on visual information to generate captions and ignores the other modalities present in videos (such as audio and text). These additional modalities can provide important clues for video understanding.
2. **Long-tailed word distribution**: Some words appear rarely in the training data, so the model tends to overfit to high-frequency words and overlook low-frequency but important objects, attributes, or behaviors. This reduces the diversity and accuracy of the generated captions.
3. **Long-sequence processing and context selection**: Existing methods embed video features directly into the caption generation model, which makes it difficult for the model to process long sequences and select relevant context from long input streams.

To address these problems, the authors introduce GEM-VPC (Graph-Enhanced Multimodal Video Paragraph Captioning), a multimodal integrated caption generation framework with the following components:

- **Two graph structures**:
  - **Video-Specific Graph (VG)**: Captures the main events in the video and their temporal order, and represents the interactions between the information from different modalities and commonsense knowledge.
  - **Theme Graph (TG)**: Represents the associations between words related to a specific theme, providing corpus-level information.
- **Node selection module**: Selects the most relevant graph nodes for decoding, which improves decoding efficiency and reduces the influence of noisy information.
- **External commonsense knowledge**: Language labels are extracted with pre-trained action, audio, and object recognition models and text parsers, and are combined with language features from external knowledge sources to enrich the graphs.

With these components, GEM-VPC demonstrates superior performance on multiple benchmark datasets, especially in handling multimodal information and alleviating the long-tailed word distribution problem.

### Formula summary

1. **Normalized Pointwise Mutual Information (NPMI)**:

   \[ \text{PMI}(i, j) = \log\frac{p(i, j)}{p(i)\,p(j)} \]

   \[ \text{NPMI}(i, j) = \frac{\text{PMI}(i, j)}{-\log p(i, j)} \]

   where \(p(i, j) = \frac{\#S(i, j)}{\#S}\), \(p(i) = \frac{\#S(i)}{\#S}\), and \(p(j) = \frac{\#S(j)}{\#S}\). Here \(\#S(i)\) is the number of sentences containing word \(i\), \(\#S(i, j)\) is the number of sentences containing both word \(i\) and word \(j\), and \(\#S\) is the total number of sentences in the corpus.

2. **Multi-Head Self-Attention (MHA)**:

   \[ \text{MHA}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}} + M\right)V \]

   where \(Q = XW_{Q}\), \(K = XW_{K}\), \(V = XW_{V}\), the matrices \(W_{Q}, W_{K}, W_{V}\) are learnable parameters, the input is \(X = F_{VC}\), and \(M\) is a mask matrix that prevents the model from attending to future words.

Through these methods, GEM-VPC can more effectively integrate multimodal information and generate high-quality video paragraph captions.
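
As a rough illustration of the two formulas in the summary above, the following Python sketch computes NPMI scores from sentence-level co-occurrence counts and applies the masked attention formula for a single head. It is not the authors' implementation. The function names (`npmi_edges`, `matmul`, `masked_self_attention`) are hypothetical, the handling of a pair that occurs in every sentence is an assumption, and the attention code shows one head only, with \(M\) taken to be a causal mask (0 for current and past positions, \(-\infty\) for future ones).

```python
import math
from collections import Counter
from itertools import combinations


def npmi_edges(sentences):
    """Return {(word_i, word_j): NPMI} for every word pair sharing a sentence.

    `sentences` is a list of token lists. Probabilities are sentence-level
    frequencies: p(i) = #S(i) / #S and p(i, j) = #S(i, j) / #S.
    """
    total = len(sentences)
    word_count, pair_count = Counter(), Counter()
    for tokens in sentences:
        vocab = sorted(set(tokens))  # count each word once per sentence
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))

    edges = {}
    for (i, j), n_ij in pair_count.items():
        p_ij = n_ij / total
        if n_ij == total:
            # -log p(i, j) = 0 here, so NPMI is undefined.
            # Assumption: treat a pair found in every sentence as score 1.0.
            edges[(i, j)] = 1.0
            continue
        p_i, p_j = word_count[i] / total, word_count[j] / total
        pmi = math.log(p_ij / (p_i * p_j))
        edges[(i, j)] = pmi / -math.log(p_ij)
    return edges


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def masked_self_attention(X, W_q, W_k, W_v):
    """Single-head softmax(Q K^T / sqrt(d_k) + M) V with a causal mask M."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    output = []
    for t, q in enumerate(Q):
        # Adding M[t][s] = -inf for s > t is the same as setting the score to -inf.
        scores = [
            sum(a * b for a, b in zip(q, K[s])) / math.sqrt(d_k)
            if s <= t else float("-inf")
            for s in range(len(K))
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(score - peak) for score in scores]  # exp(-inf) == 0.0
        norm = sum(exps)
        weights = [e / norm for e in exps]
        output.append(
            [sum(w * V[s][c] for s, w in enumerate(weights)) for c in range(len(V[0]))]
        )
    return output
```

In a theme graph of the kind described above, the NPMI scores could serve as edge weights between word nodes; whether and how edges are thresholded is not stated in the summary. In the attention sketch, row \(t\) of the output depends only on positions \(0..t\), which is the effect the mask \(M\) is meant to achieve.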

GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning
