Abstract: Audio-visual question answering (AVQA) is an emerging task that aims to answer questions by integrating the visual content, the audio streams, and their associations within a given video. The major challenge lies in effectively fusing heterogeneous multi-modal data to comprehend complex scenes while capturing question-related clues to infer correct answers. Current AVQA models primarily employ attention mechanisms to extract question-related clues separately from the visual and audio modalities before combining them. However, these approaches have two limitations: (1) they neglect the association and complementarity between the audio and visual modalities; (2) encoding the visual or audio stream holistically limits the capacity to capture cross-modal and cross-temporal dynamic events. In this paper, we introduce the Heterogeneous Interactive Graph Network, a novel solution designed to address these limitations. Specifically, we construct heterogeneous multi-modal graphs that integrate the visual, audio, and question modalities in a unified way. This approach explores the associations and complementarity among the modalities, and it models local temporal interactions across the visual and audio streams, enabling the capture of cross-modal and cross-temporal dynamic events. Additionally, we present a cross-modal feature alignment module, which acts as a bridge over the semantic gap among heterogeneous multi-modal data. It promotes the convergence of the multi-modal data distributions into a shared feature space, facilitating more effective and efficient processing. Extensive experimental results demonstrate the superiority of our method over state-of-the-art models across various question types on the challenging MUSIC-AVQA and AVQA benchmarks.

Heterogeneous Interactive Graph Network for Audio-Visual Question Answering
