Abstract: Seeking answers effectively for long videos is essential to build video question answering (videoQA) systems. Previous methods adaptively select frames and regions from long videos to save computations. However, this fails to reason over the whole video sequence, leading to sub-optimal performance. To address this problem, we introduce a state space layer (SSL) into a multi-modal Transformer to efficiently integrate global semantics of the video, which mitigates the video information loss caused by frame and region selection modules. Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations. To further enhance the controllability, we introduce a cross-modal compositional congruence (C^3) objective to encourage global semantics to align with the question. To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks, Ego-QA and MAD-QA, featuring videos of considerably long length, i.e. 17.5 minutes and 1.9 hours, respectively. Extensive experiments demonstrate the superiority of our framework on these new as well as existing datasets. The code, model, and data have been made available at https://nguyentthong.github.io/Long_form_VideoQA.

What problem does this paper attempt to address?

The paper attempts to address several key issues in long video question answering (videoQA) systems:

1. **Global Information Integration**: Existing methods save computational resources by selectively choosing video frames and regions, but this leads to insufficient reasoning over the entire video sequence, affecting performance. The paper proposes a State Space Layer (SSL) integrated into a multi-modal Transformer to efficiently integrate the global semantic information of the video, reducing the loss of video information caused by frame and region selection modules.
2. **Controllability of Global Semantics**: To enhance control over the flow of global semantics, the paper introduces a gating unit in the SSL and proposes a Cross-modal Compositional Congruence (C^3) objective function. This encourages alignment between global semantics and the question, thereby improving the model's controllability and accuracy.
3. **Long Video QA Benchmark**: To more rigorously evaluate the performance of long video QA systems, the paper constructs two new benchmark datasets: Ego-QA and MAD-QA. These datasets contain videos with an average length of 17.5 minutes and 1.9 hours, respectively, and questions that require watching up to 1200 seconds of video to answer, which is more demanding than existing datasets.

### Main Contributions

- **Gated State Space Multi-modal Transformer (GSMT)**: By introducing the State Space Layer (SSL) and gating mechanism, it effectively integrates the global information of the video and enhances the alignment of global semantics with the question through the C^3 objective function.
- **New Long Video QA Datasets**: Constructed the Ego-QA and MAD-QA datasets, which contain very long videos and complex questions, for more rigorous evaluation of long video QA systems.
- **Experimental Validation**: Extensive experiments on multiple standard datasets validate the superiority of the proposed framework.
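
To make the idea of a gated state space layer more concrete, the sketch below runs a simple linear recurrence over frame features and mixes the result back into the input through a sigmoid gate. This is only an illustration under stated assumptions: the summary does not give the paper's parameterization, so the diagonal recurrence, the residual connection, and every name here (`gated_state_space_layer`, `a`, `b`, `c`, `w_gate`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def gated_state_space_layer(x, a, b, c, w_gate):
    """Hypothetical gated state space layer (not the paper's exact design).

    x:      (T, D) array of per-frame visual features.
    a, b, c: (D,) arrays, parameters of an assumed diagonal linear recurrence.
    w_gate: (D, D) array, assumed projection that produces the gate.
    """
    num_frames, dim = x.shape
    h = np.zeros(dim)                      # running global state
    y = np.empty((num_frames, dim))        # state space output per frame
    for t in range(num_frames):
        h = a * h + b * x[t]               # h_t = a * h_{t-1} + b * x_t
        y[t] = c * h                       # y_t = c * h_t
    gate = sigmoid(x @ w_gate)             # values in (0, 1), one per feature
    return x + gate * y                    # gate controls how much global signal enters


# Usage with random features for 8 frames of dimension 4.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 4))
out = gated_state_space_layer(
    frames,
    a=np.full(4, 0.9),
    b=np.ones(4),
    c=np.ones(4),
    w_gate=rng.normal(size=(4, 4)),
)
```

Because the state `h` is carried across all frames, each output position can depend on every earlier frame, while the gate decides per feature how much of that accumulated signal is added to the local representation.
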
### Method Overview

- **Input Embedding**: Embedding video frames and questions into visual and textual representations, respectively.
- **Gated SSL**: Integrating global information through the State Space Layer and controlling the flow of global semantics via the gating unit.
- **Visual Segment and Region Selection**: Selecting visual segments and regions relevant to the question through pooling operations.
- **Multi-modal Attention**: Fusing information from the question and video through a self-attention mechanism to generate multi-modal hidden representations.
- **Answer Prediction**: Selecting the most similar answer as the final prediction by calculating the cosine similarity between candidate answer features and the multi-modal representation.
- **C^3 Objective Function**: Ensuring consistency between visual representations and the question through the Cross-modal Compositional Congruence objective function.

### Experimental Results

- The proposed method achieves superior performance over state-of-the-art methods on multiple standard datasets, including AGQA-v2, Env-QA, STAR, NExT-QA, and EgoSchema.
- The proposed method also shows significant advantages on the newly constructed Ego-QA and MAD-QA datasets, especially for questions requiring complex reasoning.

Through these improvements, the paper effectively enhances the performance and robustness of long video QA systems.
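
The answer-prediction step in the method overview (choose the candidate whose features are most cosine-similar to the multi-modal representation) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `predict_answer`, the `eps` stabilizer, and the toy vectors are assumptions, and in practice the features would come from the model's encoders.

```python
import numpy as np


def predict_answer(multimodal_repr, candidate_feats, eps=1e-8):
    """Pick the candidate answer most similar to the multi-modal representation.

    multimodal_repr: (D,) array, fused question-video representation.
    candidate_feats: (K, D) array, one feature vector per candidate answer.
    eps:             small constant (assumption) to avoid division by zero.
    Returns (index of best candidate, array of K cosine similarities).
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    m = multimodal_repr / (np.linalg.norm(multimodal_repr) + eps)
    c = candidate_feats / (np.linalg.norm(candidate_feats, axis=1, keepdims=True) + eps)
    sims = c @ m
    return int(np.argmax(sims)), sims


# Toy usage: the second candidate points in the same direction as the query.
query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
best_idx, scores = predict_answer(query, candidates)
```

Normalizing before the dot product makes the score independent of feature magnitude, so a candidate is preferred for its direction in feature space rather than its norm.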

Encoding and Controlling Global Semantics for Long-form Video Question Answering

Video Question Answering with Semantic Disentanglement and Reasoning

A Simple LLM Framework for Long-Range Video Question-Answering

Streaming Long Video Understanding with Large Language Models

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

Fusing Temporally Distributed Multi-Modal Semantic Clues for Video Question Answering

MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

Long-Term Video Question Answering Via Multimodal Hierarchical Memory Attentive Networks

Long Story Short: a Summarize-then-Search Method for Long Video Question Answering

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

SEAL: Semantic Attention Learning for Long Video Representation

Towards Long-Form Video Understanding

Retrieval-based Video Language Model for Efficient Long Video Question Answering

Koala: Key frame-conditioned long video-LLM

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

LongVLM: Efficient Long Video Understanding via Large Language Models

Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks