UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Yingsen Zeng, Yujie Zhong, Chengjian Feng, Lin Ma
2024-07-11
Abstract: Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Although they focus on different events, we observe that they have a significant connection. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we aim to investigate the potential synergy between TAD and MR. Firstly, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification score and temporal segments. Secondly, we explore the efficacy of two task fusion learning approaches, pre-training and co-training, in order to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task fusion learning scheme enables the two tasks to help each other and outperform the separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two related tasks that are currently handled independently: Temporal Action Detection (TAD) and Moment Retrieval (MR). Its goal is to explore the potential synergy between the two tasks and to improve the performance of both by integrating them.

1. **Unified Framework**: The authors propose a unified architecture, Unified Moment Detection (UniMD), that handles both TAD and MR. It transforms the inputs of the two tasks (actions for TAD, events for MR) into a common embedding space and uses two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments.
2. **Task Fusion Learning**: The paper explores two task fusion learning approaches, pre-training and co-training, to enhance the mutual benefit between TAD and MR. Two co-training methods are studied, synchronous task sampling and alternating task sampling, and the former proves effective at improving the performance of each task.
3. **Experimental Results**: Extensive experiments on three paired datasets (Ego4D, Charades-STA, and ActivityNet) show that UniMD achieves state-of-the-art results on multiple metrics. In particular, results under different amounts of training data show that, even with less data, co-training can outperform models trained separately.

In summary, the paper focuses on improving TAD and MR together through a unified model architecture and effective task fusion learning strategies. This approach not only helps reduce costs but also improves the overall effectiveness of both tasks.
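
To make the difference between the two co-training schedules concrete, here is a minimal sketch of how batches might be ordered under each one. The source names the two methods but does not describe them, so the readings here are assumptions: "synchronous" is taken to mean that every training step sees both tasks, and "alternating" that successive steps switch between them. The function names and string batch identifiers are hypothetical and are not taken from the UniMD code.

```python
from typing import Iterable, Iterator, List, Tuple

# One training step: a list of (task name, batch id) pairs.
Step = List[Tuple[str, str]]


def synchronous_schedule(tad: Iterable[str], mr: Iterable[str]) -> Iterator[Step]:
    """Assumed reading of 'synchronous': each step carries a batch from both tasks."""
    for tad_batch, mr_batch in zip(tad, mr):
        yield [("TAD", tad_batch), ("MR", mr_batch)]


def alternating_schedule(tad: Iterable[str], mr: Iterable[str]) -> Iterator[Step]:
    """Assumed reading of 'alternating': each step carries one task, and tasks take turns."""
    for tad_batch, mr_batch in zip(tad, mr):
        yield [("TAD", tad_batch)]
        yield [("MR", mr_batch)]


if __name__ == "__main__":
    tad_batches = ["tad_0", "tad_1"]
    mr_batches = ["mr_0", "mr_1"]
    for step in synchronous_schedule(tad_batches, mr_batches):
        print("sync:", step)
    for step in alternating_schedule(tad_batches, mr_batches):
        print("alt: ", step)
```

Under these assumptions, the synchronous schedule takes half as many steps over the same data, and every parameter update can combine the losses of both tasks; the alternating schedule updates on one task at a time.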