LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

Jingjing Jiang, Ziyi Liu, Nanning Zheng
DOI: https://doi.org/10.1109/tmm.2022.3185900
IF: 7.3
2022-01-01
IEEE Transactions on Multimedia
Abstract: Video Question Answering (VideoQA), aiming to correctly answer a given question based on understanding multimodal video content, is challenging due to the richness of the video content. From the perspective of video understanding, a complete VideoQA framework needs to understand the video content at different semantic levels and flexibly integrate diverse video content to distill question-related content. To this end, we propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR. Specifically, LiVLR first utilizes graph-based visual and linguistic encoders to obtain multi-grained visual and linguistic representations, respectively. Subsequently, the obtained representations are integrated with the devised Diversity-aware Visual-Linguistic Reasoning module (DaVL). DaVL distinguishes different types of representations with the learnable index embedding in graph embedding. Therefore, DaVL can flexibly adjust the importance of different representations when generating the question-related joint representation. The proposed LiVLR is lightweight and shows its performance advantage on three VideoQA benchmarks, MSRVTT-QA, KnowIT VQA, and TVQA. Extensive ablation studies demonstrate the effectiveness of the key components of LiVLR.
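The idea of tagging each representation type with its own index embedding before fusing can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify DaVL's internals, so the plain dot-product attention below stands in for the graph-based reasoning, and the type names (`visual_object`, `visual_frame`, `linguistic`), the functions `add_index_embedding` and `fuse`, and the random vectors are all hypothetical. In a trained model the index embeddings would be learned parameters rather than random values.

```python
import math
import random


def add_index_embedding(nodes, index_embedding):
    """Add one type-specific vector to every node of a representation type."""
    return [[x + e for x, e in zip(node, index_embedding)] for node in nodes]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(question, groups, index_embeddings):
    """Hypothetical fusion step (an assumption, not the paper's DaVL).

    groups: dict mapping a type name to a list of node vectors.
    index_embeddings: dict mapping a type name to one vector
        (learnable in a real model, fixed here).
    Returns the question-weighted joint vector and the per-node weights.
    """
    tagged = []
    for name, nodes in groups.items():
        tagged.extend(add_index_embedding(nodes, index_embeddings[name]))

    dim = len(question)
    # Scaled dot-product score between the question and each tagged node.
    scores = [
        sum(q * x for q, x in zip(question, node)) / math.sqrt(dim)
        for node in tagged
    ]
    weights = softmax(scores)
    # Weighted sum of tagged nodes gives the joint representation.
    joint = [
        sum(w * node[i] for w, node in zip(weights, tagged))
        for i in range(dim)
    ]
    return joint, weights


random.seed(0)
dim = 4


def rand_vec():
    """Random stand-in for an encoder output or an embedding."""
    return [random.uniform(-1, 1) for _ in range(dim)]


# Illustrative multi-grained representations (names are assumptions).
groups = {
    "visual_object": [rand_vec() for _ in range(3)],
    "visual_frame": [rand_vec() for _ in range(2)],
    "linguistic": [rand_vec() for _ in range(2)],
}
index_embeddings = {name: rand_vec() for name in groups}
question = rand_vec()

joint, weights = fuse(question, groups, index_embeddings)
```

Because every node of a given type carries the same added vector, the attention scores can shift for a whole type at once, which is one simple way a model could adjust the importance of different representations when forming the joint representation.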
computer science, information systems, telecommunications, software engineering