Multi-Granularity Interaction and Integration Network for Video Question Answering

Yuanyuan Wang, Meng Liu, Jianlong Wu, Liqiang Nie
DOI: https://doi.org/10.1109/tcsvt.2023.3278492
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Video question answering, which aims to answer a natural language question about a given video, has gained popularity in the last few years. Although significant improvements have been achieved, the task still faces two challenges: sufficiently comprehending video content and handling long-tailed answers. To this end, we propose a multi-granularity interaction and integration network for video question answering. It jointly explores multi-level intra-granularity and inter-granularity relations to enhance the comprehension of videos. Specifically, we first build a word-enhanced visual representation module to achieve cross-modal alignment. We then introduce a multi-granularity interaction module to explore the intra-granularity and inter-granularity relationships. Finally, we develop a question-guided interaction module to select question-related visual representations for answer prediction. In addition, we employ the seesaw loss for open-ended tasks to alleviate the effect of the long-tailed word distribution. Quantitative and qualitative results on the TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.
engineering, electrical & electronic