CAMG: Context-Aware Moment Graph Network for Multimodal Temporal Activity Localization via Language

Yuelin Hu, Yuanwu Xu, Yuejie Zhang, Rui Feng, Tao Zhang, Xuequan Lu, Shang Gao
DOI: https://doi.org/10.1007/978-3-031-44693-1_34
2023-01-01
Abstract: Temporal Activity Localization via Language (TALL) is a challenging task in language-based video understanding, especially when a video contains multiple moments of interest and the language query describes complex context dependencies between those moments. Recent studies have proposed various ways to exploit the temporal context of adjacent moments, but two apparent limitations remain. First, only limited context information is encoded by RNNs or 2-D convolutions, which depend heavily on the pre-sorting of proposals and lack flexibility. Second, semantic context, i.e., semantically correlated content in different moments, is ignored. To address these limitations, we propose a novel GCN-based framework, the Context-Aware Moment Graph (CAMG) network, to jointly model the temporal context and semantic context. GCNs enable CAMG to capture long-range dependencies with high flexibility. Also, we design a multi-step fusion scheme to aggregate object, motion and textual