Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering

Zhaohe Liao, Jiangtong Li, Li Niu, Liqing Zhang
2024-07-03
Abstract: Despite recent progress in Video Question-Answering (VideoQA), existing methods typically function as black boxes, making it difficult to understand their reasoning processes and to perform consistent compositional reasoning. To address these challenges, we propose a model-agnostic Video Alignment and Answer Aggregation (VA$^{3}$) framework, which enhances both the compositional consistency and the accuracy of existing VideoQA methods by integrating video aligner and answer aggregator modules. The video aligner hierarchically selects the relevant video clips based on the question, while the answer aggregator deduces the answer to the question from its sub-questions, with compositional consistency ensured by the information flow along the question decomposition graph and a contrastive learning strategy. We evaluate our framework on three settings of the AGQA-Decomp dataset with three baseline methods, and propose new metrics to measure the compositional consistency of VideoQA methods more comprehensively. Moreover, we propose a large language model (LLM) based automatic question decomposition pipeline to apply our framework to any VideoQA dataset. We use it to extend the MSVD and NExT-QA datasets and evaluate our VA$^{3}$ framework on broader scenarios. Extensive experiments show that our framework improves both the compositional consistency and the accuracy of existing methods, leading to more interpretable real-world VideoQA models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two weaknesses in the reasoning of existing Video Question Answering (VideoQA) methods: limited transparency and poor compositional consistency. Specifically:

1. **Lack of transparency in the reasoning process**: Existing VideoQA methods often operate as black-box models, so the reasoning behind an answer is hard to inspect. This contributes to inconsistent answers on complex questions.
2. **Poor compositional consistency**: Existing methods have limited compositional reasoning capabilities on questions that involve temporal relationships and multiple visual cues, which lowers accuracy.

To address these issues, the authors propose a model-agnostic framework called Video Alignment and Answer Aggregation (VA$^{3}$), which improves the compositional consistency and accuracy of existing VideoQA methods. The framework includes two main modules:

- **Video Aligner**: Hierarchically selects the video segments relevant to the question, aligning them at the object, appearance, and motion levels.
- **Answer Aggregator**: Infers the final answer from the sub-questions and their video-question joint representations in the Question Decomposition Graph (QDG), ensuring compositional consistency.

Additionally, the authors propose an automatic question decomposition pipeline based on large language models (LLMs) so that the framework can be applied to any VideoQA dataset, and they use it to extend the MSVD and NExT-QA datasets to validate the framework's effectiveness. Experimental results show that the framework improves both the compositional consistency and the accuracy of existing methods.
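
To make the idea of compositional consistency over a question decomposition graph more concrete, here is a minimal Python sketch. It is not the authors' implementation and does not reproduce the paper's proposed metrics. The toy graph, the question ids, the correctness flags, and the helper names `compositional_accuracy` and `inconsistent_parents` are all hypothetical, chosen only for illustration. The sketch assumes that a parent question is "consistent" when it is answered correctly whenever all of its direct sub-questions are answered correctly.

```python
from typing import Dict, List, Optional

# Hypothetical toy question decomposition graph: parent id -> direct sub-question ids.
qdg: Dict[str, List[str]] = {
    "q_main": ["q_sub1", "q_sub2"],
    "q_sub1": ["q_leaf"],
    "q_sub2": [],
    "q_leaf": [],
}

# Hypothetical flags: whether some model answered each question correctly.
correct: Dict[str, bool] = {
    "q_main": False,
    "q_sub1": True,
    "q_sub2": True,
    "q_leaf": True,
}


def eligible_parents(graph: Dict[str, List[str]],
                     is_correct: Dict[str, bool]) -> List[str]:
    """Parents that have sub-questions, all of which were answered correctly."""
    return [
        parent
        for parent, subs in graph.items()
        if subs and all(is_correct[s] for s in subs)
    ]


def compositional_accuracy(graph: Dict[str, List[str]],
                           is_correct: Dict[str, bool]) -> Optional[float]:
    """Share of eligible parents that were themselves answered correctly.

    Returns None when no parent is eligible, so the ratio is undefined.
    """
    parents = eligible_parents(graph, is_correct)
    if not parents:
        return None
    return sum(is_correct[p] for p in parents) / len(parents)


def inconsistent_parents(graph: Dict[str, List[str]],
                         is_correct: Dict[str, bool]) -> List[str]:
    """Parents answered wrongly although every direct sub-question was right."""
    return [p for p in eligible_parents(graph, is_correct) if not is_correct[p]]


if __name__ == "__main__":
    print(compositional_accuracy(qdg, correct))
    print(inconsistent_parents(qdg, correct))
```

In this toy example, `q_main` is flagged as inconsistent because both of its sub-questions are answered correctly while the parent is not. A framework such as VA$^{3}$ aims to reduce cases of this kind. The measure here only looks at direct sub-questions, which is a simplifying assumption.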