Multimodality-guided Visual-Caption Semantic Enhancement
Nan Che, Jiang Liu, Fei Yu, Lechao Cheng, Yuxuan Wang, Yuehua Li, Chenrui Liu
DOI: https://doi.org/10.1016/j.cviu.2024.104139
IF: 4.886
2024-09-25
Computer Vision and Image Understanding
Abstract: Video captions generated from a single modality, e.g., video clips, often suffer from insufficient event discovery and inadequate scene description. This paper therefore aims to improve caption quality by addressing these issues through the integration of multi-modal information. Specifically, we first construct a multi-modal dataset and introduce triplet annotations of video, audio, and text, fostering a comprehensive exploration of the associations between different modalities. Building upon this, we propose to explore the collaborative perception of audio and visual concepts, incorporating audio-visual perception priors to mitigate inaccuracies and incompleteness in the captions of vision-based benchmarks. To achieve this, we extract effective semantic features from the visual and auditory modalities, bridge the semantic gap between the audio-visual modalities and text, and form a more precise knowledge-graph-based mechanism for multimodal coherence checking and information pruning. Exhaustive experiments demonstrate that the proposed approach surpasses existing methods and generalizes well with the assistance of ChatGPT.
computer science, artificial intelligence; engineering, electrical & electronic