Bridging Asymmetry Between Image and Video: Cross-modality Knowledge Transfer Based on Learning from Video

Bingxin Zhou, Jianghao Zhou, Zhongming Chen, Ziqiang Li, Long Deng, Yongxin Ge
DOI: https://doi.org/10.1016/j.eswa.2024.125873
IF: 8.5
2024-01-01
Expert Systems with Applications
Abstract: In the domain of activity-based image-to-video retrieval, dynamically consistent semantics are crucial for effective cross-modal search tasks. Existing methods face significant challenges, particularly in addressing the issue of modality asymmetry, where images and videos exhibit differing semantic representations. A key solution to this challenge lies in enhancing the learning capacity of the image encoder by leveraging knowledge from video data. To this end, we propose a Cross-Modal Knowledge Transfer (CMKT) framework that improves the behavior modeling capability of the image encoder. This enhancement is achieved through both global and local information transmission: globally, the model assimilates rich semantic information from videos across a broad temporal spectrum, while locally, it captures semantics from frames closely resembling the query image. Specifically, we design the Global Temporal Structure Transmission (GTST) Model to ensure temporal distribution consistency between query-image objects and video content. Additionally, the Local Temporal Relation Enhancement (LRTE) Module is introduced to pinpoint the most relevant action information within the video. We evaluate the effectiveness of our method on two widely adopted action recognition datasets, THUMOS14 and ActivityNet, and provide comprehensive ablation studies to substantiate the efficacy of our approach.