GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning

Eileen Wang, Caren Han, Josiah Poon
2024-10-12
Abstract: Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise key events within a video. Despite recent advancements, challenges persist, notably in effectively utilising multimodal signals inherent in videos and addressing the long-tail distribution of words. The paper introduces a novel multimodal integrated caption generation framework for VPC that leverages information from various modalities and external knowledge bases. Our framework constructs two graphs: a 'video-specific' temporal graph capturing major events and interactions between multimodal information and commonsense knowledge, and a 'theme graph' representing correlations between words of a specific theme. These graphs serve as input for a transformer network with a shared encoder-decoder architecture. We also introduce a node selection module to enhance decoding efficiency by selecting the most relevant nodes from the graphs. Our results demonstrate superior performance across benchmark datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key problems in Video Paragraph Captioning (VPC):

1. **Effective use of multimodal signals**: Most existing VPC work relies mainly on visual information to generate captions and ignores the other modalities present in videos (such as audio and text). These additional modalities can provide important clues for video understanding.
2. **Long-tailed word distribution**: Some words appear only rarely in the training data, so the model tends to overfit to high-frequency words and ignore low-frequency but important objects, attributes, or behaviors. This reduces the diversity and accuracy of the generated captions.
3. **Long-sequence processing and context selection**: Existing methods embed video features directly into the caption generation model, which makes it difficult for the model to process long sequences and to select relevant context from long input streams.

To address these problems, the authors introduce GEM-VPC (Graph-Enhanced Multimodal Video Paragraph Captioning), a new multimodal integrated caption generation framework that improves VPC in the following ways:

- **Two graph structures**:
  - **Video-Specific Graph (VG)**: Captures the main events in the video and their temporal order, and represents the interaction between information from different modalities and commonsense knowledge.
  - **Theme Graph (TG)**: Represents the associations between words related to a specific theme, providing corpus-level information.
- **Node selection module**: To improve decoding efficiency, only the most relevant nodes are selected for decoding, which reduces the influence of noisy information.
- **External commonsense knowledge**: Language labels are extracted with pre-trained action, audio, and object recognition models and text parsers, and are combined with language features from external knowledge sources to enrich the information in the graphs.

Through these improvements, GEM-VPC demonstrates superior performance on multiple benchmark datasets, especially in handling multimodal information and alleviating the long-tailed word distribution problem.

### Formula summary

1. **Normalized Pointwise Mutual Information (NPMI)**:

   \[ \mathrm{PMI}(i, j)=\log\frac{p(i, j)}{p(i)\,p(j)} \]

   \[ \mathrm{NPMI}(i, j) = \frac{\mathrm{PMI}(i, j)}{-\log p(i, j)} \]

   where \(p(i, j)=\frac{\#S(i, j)}{\#S}\), \(p(i)=\frac{\#S(i)}{\#S}\), \(p(j)=\frac{\#S(j)}{\#S}\), \(\#S(i)\) is the number of sentences containing word \(i\), \(\#S(i, j)\) is the number of sentences containing both word \(i\) and word \(j\), and \(\#S\) is the total number of sentences in the corpus.

2. **Multi-Head Self-Attention (MHA)**:

   \[ \mathrm{MHA}(Q, K, V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}+M\right)V \]

   where \(Q = XW_{Q}\), \(K = XW_{K}\), \(V = XW_{V}\); \(W_{Q}, W_{K}, W_{V}\) are learnable parameters; \(X = F_{VC}\); and \(M\) is a mask matrix that prevents the model from attending to future words.

Through these methods, GEM-VPC can integrate multimodal information more effectively and generate high-quality video paragraph captions.
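The NPMI definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `npmi_scores`, the toy corpus, and the choice to skip pairs with \(p(i, j) = 1\) (where the denominator \(-\log p(i, j)\) is zero) are assumptions made here.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(sentences):
    """Return {(word_i, word_j): NPMI} for every word pair that co-occurs
    in at least one sentence. `sentences` is a list of token lists.
    (Hypothetical helper, written only to illustrate the formula.)"""
    n = len(sentences)                      # #S: total number of sentences
    word_count = Counter()                  # #S(i): sentences containing word i
    pair_count = Counter()                  # #S(i, j): sentences containing both
    for sent in sentences:
        words = sorted(set(sent))           # count each word once per sentence
        word_count.update(words)
        pair_count.update(combinations(words, 2))

    scores = {}
    for (i, j), c_ij in pair_count.items():
        p_ij = c_ij / n
        if p_ij == 1.0:
            # Assumption: -log(1) = 0 makes NPMI undefined, so skip the pair.
            continue
        p_i = word_count[i] / n
        p_j = word_count[j] / n
        pmi = math.log(p_ij / (p_i * p_j))
        scores[(i, j)] = pmi / -math.log(p_ij)
    return scores


# Toy corpus (made up for this example).
corpus = [
    ["man", "cooks", "pasta"],
    ["man", "plays", "guitar"],
    ["woman", "cooks", "pasta"],
    ["dog", "runs"],
]
scores = npmi_scores(corpus)
```

In this toy corpus, `cooks` and `pasta` always appear together and score 1, while `cooks` and `man` co-occur exactly as often as independence would predict and score 0. How the scores are turned into theme-graph edges (for example, by thresholding) is not specified in this summary.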
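The summary says only that \(M\) stops the model from attending to future words. A small worked example, assuming the standard additive causal mask (zero on and below the diagonal, \(-\infty\) above it; this specific form is an assumption, not taken from the paper), for a three-token sequence with row index \(a\) and column index \(b\):

```latex
M_{ab} =
\begin{cases}
0       & \text{if } b \le a \\
-\infty & \text{if } b > a
\end{cases}
\qquad
M =
\begin{pmatrix}
0 & -\infty & -\infty \\
0 & 0       & -\infty \\
0 & 0       & 0
\end{pmatrix}
```

Because \(e^{-\infty} = 0\), the softmax assigns zero weight to every masked entry, so under this assumption row \(a\) of the output is a weighted average of the first \(a\) rows of \(V\) only.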