Joint learning of video scene detection and annotation via multi-modal adaptive context network
Yifei Xu, Litong Pan, Weiguang Sang, HaiLun Luo, Li Li, Pingping Wei, Li Zhu
DOI: https://doi.org/10.1016/j.eswa.2024.123656
IF: 8.5
2024-04-03
Expert Systems with Applications
Abstract: Scene detection and annotation have attracted considerable attention as tools for understanding video content. The main challenges lie in mitigating the error propagation of shot detection, recognizing both cuts and gradual transitions, fusing hierarchical multi-modal cues, and solving the two tasks simultaneously. To address these challenges, we propose the Multi-modal Adaptive Context Network (MACN), which jointly learns scene detection and annotation from a window partitioning perspective. As a shared, task-agnostic component, Window-based Cross-modal Representation (WCR) distills complex semantic correlations from multi-modal sources for each window. To account for the long-term temporal dependency of variable-length scenes, we further develop Adaptive Context-aware Representation (ACR) to improve performance on each specific task. Unlike previous work, scene detection is formulated as locating the starting window together with its associated location offset and transition duration. Meanwhile, we assemble two multi-label sub-classifiers at different levels to predict the labels of each scene candidate. Experimental comparisons with state-of-the-art algorithms on the TAVS and ClipShots datasets indicate that the proposed method yields promising performance in both tasks. Our code and test sample videos are released at MACN.
Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic; Operations Research & Management Science
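
The abstract formulates scene detection as locating a starting window plus a location offset and a transition duration. The sketch below illustrates how such per-window outputs could be decoded into frame-level transitions. It is not the authors' implementation: the `WindowPrediction` fields, the fixed `window_size`, the score `threshold`, and the convention that the offset is measured in frames from the start of the window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WindowPrediction:
    """Hypothetical per-window output; field names are illustrative only."""
    start_score: float  # confidence that a transition starts in this window
    offset: float       # assumed: start position inside the window, in frames
    duration: float     # assumed: transition length in frames (0 = hard cut)


def decode_transitions(
    preds: List[WindowPrediction],
    window_size: int,
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Convert per-window predictions into (start_frame, end_frame) pairs."""
    transitions = []
    for index, pred in enumerate(preds):
        # Skip windows that are unlikely to contain a transition start.
        if pred.start_score < threshold:
            continue
        # Clamp the offset so the start stays inside its own window.
        offset = min(max(pred.offset, 0.0), window_size - 1)
        start = index * window_size + round(offset)
        # A zero duration models a cut; a positive one a gradual transition.
        end = start + max(round(pred.duration), 0)
        transitions.append((start, end))
    return transitions


if __name__ == "__main__":
    example = [
        WindowPrediction(0.1, 0.0, 0.0),
        WindowPrediction(0.9, 3.0, 0.0),   # cut at frame 16 + 3
        WindowPrediction(0.2, 1.0, 4.0),
        WindowPrediction(0.8, 5.0, 12.0),  # gradual transition from 48 + 5
    ]
    print(decode_transitions(example, window_size=16))
    # [(19, 19), (53, 65)]
```

Under these assumptions, a cut and a gradual transition share one representation: the cut is simply a transition whose duration is zero.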