Enhanced Cross-Modal Transformer Model for Video Semantic Similarity Measurement

Da Li, Boqing Zhu, Kele Xu, Sen Yang, Dawei Feng, Bo Liu, Huaimin Wang
DOI: https://doi.org/10.1109/tcsii.2023.3302801
2024-01-01
Abstract: Video processing is critical to many industrial systems. Semantic similarity measures for videos aim to evaluate the semantic distance between videos, with downstream applications including deduplication, related matching, ranking, and content diversity control. Despite sustained efforts, many previous approaches require large-scale human annotation for training, which can be difficult to obtain in practical settings. Moreover, previous methods struggle to characterize the interactions between multiple modalities, which severely constrains performance. To address these challenges, in this brief, we introduce a novel framework to measure the semantic similarity of videos. Specifically, to address the scarcity of annotated datasets, we first propose a pre-training paradigm leveraging weakly-supervised label classification. Moreover, our approach proposes an Enhanced Cross-modal Transformer (ECT) block to fully utilize the interaction information between video and textual features. Various optimization strategies are also proposed to improve the performance of the model. Despite its conceptual simplicity, extensive experiments demonstrate the effectiveness of the proposed approach on semantic similarity measurement tasks.