Abstract: Recent methods for video question answering (VideoQA), which aim to generate answers from given questions and video content, have made significant progress in cross-modal interaction. From the perspective of video understanding, existing frameworks concentrate on various levels of visual content, partially assisted by subtitles. However, audio information is also instrumental in reaching correct answers, especially in videos of real-life scenarios. Indeed, in some cases audio and visual content are both required and complement each other to answer a question, a setting defined as audio-visual question answering (AVQA). In this paper, we focus on incorporating raw audio for AVQA and contribute in three ways. Firstly, because no existing dataset annotates QA pairs for raw audio, we introduce E-AVQA, a large-scale, manually annotated dataset involving multiple modalities. E-AVQA consists of 34,033 QA pairs on 33,340 clips of 18,786 videos from e-commerce scenarios. Secondly, we propose a multi-granularity relational attention method, named MGN, which applies contrastive constraints between audio and visual features after their interaction. MGN captures local sequential representations through a pairwise potential attention mechanism and obtains a global multi-modal representation through a novel ternary potential attention mechanism. Thirdly, our proposed MGN outperforms the baseline on E-AVQA, achieving 20.73% on WUPS@0.0 and 19.81% on BLEU@1, which demonstrates its superiority with an improvement of at least 1.02 on WUPS@0.0 and about 10% in time complexity over the baseline.
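
The abstract does not specify the exact form of the contrastive constraints between audio and visual features. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style loss over a batch of paired clip embeddings; the function names, the temperature value, and the loss form itself are assumptions for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(audio, visual, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed form, not the paper's).

    audio[i] and visual[i] are embeddings of the same clip (a positive pair);
    every other pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in visual]
    n = len(a)
    # Cosine similarity of every audio/visual pair, scaled by the temperature.
    sim = [
        [sum(p * q for p, q in zip(a[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-visual direction: each row should peak on its diagonal entry.
    a2v = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Visual-to-audio direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    v2a = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (a2v + v2a)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    aligned = contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
    shuffled = contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
    print(f"aligned pairs: {aligned:.5f}, shuffled pairs: {shuffled:.5f}")
```

Under this assumed form, the loss is small when each audio embedding is closest to the visual embedding of its own clip and grows when the pairing is shuffled, which is the behavior a cross-modal contrastive constraint is meant to encourage.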

GPA: Global and Prototype Alignment for Audio-Text Retrieval

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval

Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

Audio–text retrieval based on contrastive learning and collaborative attention mechanism

DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval

Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training

STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Video-to-Audio Generation with Fine-grained Temporal Semantics

Mutual Alignment between Audiovisual Features for End-to-End Audiovisual Speech Recognition

Realization of Global Audio Telepresence Via a Learning-Based Model-Matching Approach with an Acoustic Array System

Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss

Video-to-Audio Generation with Hidden Alignment

Multi-Granularity Relational Attention Network for Audio-Visual Question Answering

Cross-utterance ASR Rescoring with Graph-based Label Propagation

Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description