Video Moment Retrieval with Cross-Modal Neural Architecture Search

Xun Yang, Shanshan Wang, Jian Dong, Jianfeng Dong, Meng Wang, Tat-Seng Chua
2022-01-11
Abstract: The task of video moment retrieval (VMR) is to retrieve a specific video moment from an untrimmed video according to a textual query. It is a challenging task that requires effective modeling of complex cross-modal matching relationships. Recent efforts primarily model cross-modal interactions with hand-crafted network architectures. Despite their effectiveness, these approaches rely heavily on expert experience to select architectures and have numerous hyperparameters that need to be carefully tuned, which significantly limits their applicability in real-world scenarios. How to design flexible architectures for modeling cross-modal interactions with less manual effort is crucial for VMR but has received limited attention so far. To address this issue, we present a novel VMR approach that automatically searches for an optimal architecture to learn cross-modal matching relationships. Specifically, we develop a cross …