DPHANet: Discriminative Parallel and Hierarchical Attention Network for Natural Language Video Localization
Ruihan Chen,Junpeng Tan,Zhijing Yang,Xiaojun Yang,Qingyun Dai,Yongqiang Cheng,Liang Lin
DOI: https://doi.org/10.1109/tmm.2024.3395888
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract:Natural Language Video Localization (NLVL) has recently attracted much attention because of its practical significance. However, the existing methods still face the following challenges: 1) When the models learn intra-modal semantic association, the temporal causal interaction information and contextual semantic discriminative information are ignored, resulting in the lack of intra-modal semantic context connection; 2) When learning fusion representations, existing cross-modal interaction modules lack hierarchical attention function to extract inter-modal similarity information and intra-modal self-correlation information, resulting in insufficient cross-modal information interaction; and 3) When the loss function is optimized, the existing models ignore the correlation of causal inference between the start and end boundaries, resulting in inaccurate start and end boundary calibrations. To conquer the above challenges, we proposed a novel NLVL model, called Discriminative Parallel and Hierarchical Attention Network (DPHANet). Specifically, we emphasized the importance of temporal causal interaction information and contextual semantic discriminative information and correspondingly proposed a Discriminative Parallel Attention Encoder (DPAE) module to infer and encode the above critical information. Besides, to overcome the shortcomings of the existing cross-modal interaction modules, we designed a Video-Query Hierarchical Attention (VQHA) module, which can perform cross-modal interaction and intra-modal self-correlation modeling in a hierarchical manner. Furthermore, a novel deviation loss function was proposed to capture the correlation of causal inference between the start and end boundaries and force the model to focus on the continuity and temporal causality in the video. Finally, extensive experiments on three benchmark datasets demonstrated the superiority of our proposed DPHANet model, which has achieved about 1.5% and 3.5% average performance improvement and about 2.5% and 7.5% maximum performance improvement on the Charades-STA and TACoS datasets respectively.