TLNet: Temporal Span Localization Network with Collaborative Graph Reasoning for Video Question Answering

Lili Liang, Guanglu Sun, Tianlin Li, Shuai Liu, Weiping Ding
DOI: https://doi.org/10.1109/tetci.2024.3452751
2024-01-01
IEEE Transactions on Emerging Topics in Computational Intelligence
Abstract: Video question answering (VideoQA) has witnessed remarkable progress in the past few years, but challenges remain in precisely locating question-related segments and reasoning about spatiotemporal relationships. To address these challenges, a Temporal Span Localization Network (TLNet) is proposed, which comprises Temporal Span Localization (TSL) and Collaborative Graph Reasoning (CGR). TSL locates question-related segments by employing a cross-modal attention localization strategy that predicts the start and end moments of temporal span proposals. The proposals are then refined through a binarized alignment fusion approach. CGR combines a graph structure with the Transformer to reason about spatiotemporal relationships and acquire unbiased intra- and inter-modal cues for answering questions. Specifically, the Transformer is enhanced with information from the edges and nodes of the different modality graphs, which guides the multi-head attention. Channel-Wise Normalization (CW Norm) is integrated into the Transformer to unbias intra- and inter-modal cues and optimize network performance. Experimental evaluations on the TVQA and TVQA+ datasets demonstrate that TLNet outperforms previous state-of-the-art methods. Additionally, extensive ablation studies demonstrate the effectiveness of the key components.
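
The abstract states that TSL predicts the start and end moments of temporal span proposals, but it does not say how a span is chosen from those predictions. As a rough illustration only, the sketch below shows one common, generic way to decode a span from per-frame start and end scores: normalize each with a softmax and pick the pair (start, end) with start <= end that maximizes the product of the two probabilities. The function names, the `max_len` parameter, and the example scores are hypothetical and are not taken from the paper; this is not TLNet's cross-modal attention or its binarized alignment fusion.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_scores, end_scores, max_len=None):
    """Return (start, end, score) for the highest-scoring valid span.

    Assumption (not from the paper): a span's score is the product of
    its start probability and its end probability, with start <= end.
    max_len, if given, caps the number of frames a span may cover.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and equally long")

    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    n = len(p_start)

    best = (0, 0, -1.0)
    for s in range(n):
        # Only consider end positions at or after the start position.
        last = n if max_len is None else min(n, s + max_len)
        for e in range(s, last):
            score = p_start[s] * p_end[e]
            if score > best[2]:
                best = (s, e, score)
    return best


if __name__ == "__main__":
    # Hypothetical per-frame scores for a 5-frame clip.
    start = [0.1, 2.0, 0.3, 0.2, 0.1]
    end = [0.1, 0.2, 0.3, 2.5, 0.1]
    print(best_span(start, end))
```

The exhaustive search is quadratic in the number of frames, which is acceptable for short clips; the optional `max_len` bound is one simple way to keep the search small and to rule out implausibly long spans.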