UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Yingsen Zeng, Yujie Zhong, Chengjian Feng, Lin Ma

2024-07-11

Abstract: Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Although the two tasks focus on different events, we observe that they have a significant connection. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we aim to investigate the potential synergy between TAD and MR. Firstly, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments. Secondly, we explore the efficacy of two task fusion learning approaches, pre-training and co-training, in order to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task fusion learning scheme enables the two tasks to help each other and outperform the separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two related tasks that are usually handled independently: Temporal Action Detection (TAD) and Moment Retrieval (MR). Its goal is to explore the potential synergy between the two tasks and to improve the performance of both by integrating them.

1. **Unified Framework**: The authors propose a unified architecture called Unified Moment Detection (UniMD) that handles both TAD and MR. The framework transforms the inputs of both tasks into a common embedding space and uses two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments.
2. **Task Fusion Learning**: The paper explores two task fusion learning methods, pre-training and co-training, to enhance the mutual benefits between TAD and MR. Two co-training methods are studied, synchronous task sampling and alternating task sampling, with the former proving to effectively improve the performance of each task.
3. **Experimental Results**: Extensive experiments on three paired datasets (Ego4D, Charades-STA, and ActivityNet) demonstrate that UniMD achieves state-of-the-art results on multiple metrics. In particular, experiments with different amounts of training data show that co-training with less data can still outperform models trained separately.

In summary, the paper improves the performance of TAD and MR together by designing a unified model architecture and effective fusion learning strategies. Besides improving results on both tasks, this approach helps reduce costs.
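The difference between the two co-training schedules named above can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the video dictionaries, and the reading of "synchronous" (both tasks are drawn from the same video in one step) and "alternating" (one task per step, switching each step) are assumptions made for illustration only.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy data: each video is assumed to carry labels for both tasks.
Video = Dict[str, str]
Step = List[Tuple[str, str]]  # list of (task name, video id) used in one step


def synchronous_schedule(videos: List[Video], steps: int) -> Iterator[Step]:
    """Assumed reading of synchronous sampling: every step uses one video
    and draws both its TAD and its MR annotations."""
    source = cycle(videos)
    for _ in range(steps):
        video = next(source)
        yield [("TAD", video["id"]), ("MR", video["id"])]


def alternating_schedule(videos: List[Video], steps: int) -> Iterator[Step]:
    """Assumed reading of alternating sampling: every step uses a single
    task, and the task switches from one step to the next."""
    source = cycle(videos)
    tasks = cycle(["TAD", "MR"])
    for _ in range(steps):
        video = next(source)
        yield [(next(tasks), video["id"])]


if __name__ == "__main__":
    toy_videos = [{"id": "v0"}, {"id": "v1"}]
    for step in synchronous_schedule(toy_videos, 2):
        print("sync", step)
    for step in alternating_schedule(toy_videos, 3):
        print("alt ", step)
```

In this sketch a synchronous step always sees both tasks, whereas an alternating step sees only one. How the paper actually batches videos and combines the losses is not specified in the text above.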
