Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering

Zhaohe Liao, Jiangtong Li, Li Niu, Liqing Zhang

2024-07-03

Abstract: Despite recent progress in Video Question-Answering (VideoQA), existing methods typically function as black boxes, making it difficult to understand their reasoning processes and to perform consistent compositional reasoning. To address these challenges, we propose a model-agnostic Video Alignment and Answer Aggregation (VA$^3$) framework, which enhances both the compositional consistency and the accuracy of existing VidQA methods by integrating video aligner and answer aggregator modules. The video aligner hierarchically selects the relevant video clips based on the question, while the answer aggregator deduces the answer to the question based on its sub-questions, with compositional consistency ensured by the information flow along the question decomposition graph and a contrastive learning strategy. We evaluate our framework on three settings of the AGQA-Decomp dataset with three baseline methods, and propose new metrics to measure the compositional consistency of VidQA methods more comprehensively. Moreover, we propose a large language model (LLM) based automatic question decomposition pipeline to apply our framework to any VidQA dataset. We use it to extend the MSVD and NExT-QA datasets and evaluate our VA$^3$ framework on broader scenarios. Extensive experiments show that our framework improves both the compositional consistency and the accuracy of existing methods, leading to more interpretable real-world VidQA models.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two weaknesses of existing Video Question Answering (VideoQA) methods: an opaque reasoning process and poor compositional consistency. Specifically:

1. **Lack of transparency in the reasoning process**: Existing VideoQA methods often operate as black-box models, so the reasoning behind an answer is difficult to inspect. This leads to poor consistency when handling complex questions.
2. **Poor compositional consistency**: Existing methods have limited compositional reasoning capabilities when a question involves temporal relationships and multiple visual cues, which reduces accuracy.

To address these issues, the authors propose a model-agnostic framework called Video Alignment and Answer Aggregation (VA$^3$), which improves existing VideoQA methods by enhancing both compositional consistency and accuracy. The framework includes two main modules:

- **Video Aligner**: Hierarchically selects the video segments relevant to the question, aligning them at the object, appearance, and motion levels.
- **Answer Aggregator**: Infers the final answer from the sub-questions and their video-question joint representations in the Question Decomposition Graph (QDG), ensuring compositional consistency.

Additionally, the authors propose an automatic question decomposition pipeline based on large language models (LLMs) so that the framework can be applied to any VideoQA dataset, and they extend the MSVD and NExT-QA datasets with it to validate the framework's effectiveness. Experimental results show that the framework improves the compositional consistency and accuracy of existing methods.
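
The two modules described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`select_clips`, `aggregate`), the cosine-similarity scoring, the single-level top-k selection, and the mean-of-children blending controlled by the hypothetical `mix` parameter are all assumptions chosen to keep the example small. The paper describes a hierarchical aligner and a learned aggregator, which this sketch only approximates in spirit.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_clips(clip_feats: Sequence[Vector], question_feat: Vector, k: int) -> List[int]:
    """Toy stand-in for the video aligner (assumption: one level, cosine scoring).

    Returns the indices of the k clips most similar to the question,
    sorted back into temporal order.
    """
    ranked = sorted(
        range(len(clip_feats)),
        key=lambda i: cosine(clip_feats[i], question_feat),
        reverse=True,
    )
    return sorted(ranked[:k])


def aggregate(
    qdg: Dict[str, List[str]],
    own_repr: Dict[str, Vector],
    mix: float = 0.5,
) -> Dict[str, List[float]]:
    """Toy stand-in for the answer aggregator (assumption: mean pooling).

    qdg maps each question id to the ids of its sub-questions and is assumed
    to be acyclic. Each question's vector is blended with the mean of its
    already-refined sub-question vectors, so information flows from the
    leaves of the graph up to the main question.
    """
    refined: Dict[str, List[float]] = {}

    def visit(q: str) -> List[float]:
        if q in refined:
            return refined[q]
        vec = list(own_repr[q])
        children = qdg.get(q, [])
        if children:
            child_vecs = [visit(c) for c in children]
            mean = [sum(col) / len(child_vecs) for col in zip(*child_vecs)]
            vec = [(1 - mix) * v + mix * m for v, m in zip(vec, mean)]
        refined[q] = vec
        return vec

    for q in own_repr:
        visit(q)
    return refined


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]]
    print(select_clips(clips, [1.0, 0.0], k=2))

    graph = {"main": ["sub1", "sub2"]}
    reprs = {"main": [0.0, 0.0], "sub1": [1.0, 0.0], "sub2": [0.0, 1.0]}
    print(aggregate(graph, reprs)["main"])
```

In a real system the blended vectors would feed an answer classifier for every node of the graph, which is what makes it possible to check whether the answer to a main question agrees with the answers to its sub-questions.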

Maintaining Reasoning Consistency in Compositional Visual Question Answering

Compositional Substitutivity of Visual Reasoning for Visual Question Answering

Overcoming Language Priors in VQA via Decomposed Linguistic Representations

Language-Guided Visual Aggregation Network for Video Question Answering

Video Question Answering with Semantic Disentanglement and Reasoning

VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

ViLA: Efficient Video-Language Alignment for Video Question Answering

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Focal and Composed Vision-semantic Modeling for Visual Question Answering

Reasoning with Heterogeneous Graph Alignment for Video Question Answering

Hierarchical synchronization with structured multi-granularity interaction for video question answering

Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos

VideoQA in the Era of LLMs: An Empirical Study

Explore Multi-Step Reasoning in Video Question Answering

TG-VQA: Ternary Game of Video Question Answering

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

3D-Aware Visual Question Answering about Parts, Poses and Occlusions