Abstract: Because noise degrades conventional audio speech recognition (ASR), audio-visual speech recognition (AVSR) has been proposed, which uses both audio and visual signals. Existing methods have shown that aligned visual input of lip movements makes AVSR systems more robust to noise. However, paired videos are not always available during inference; this missing-visual-modality problem restricts the practicality of such systems in real-world scenarios. To address it, we propose a Discrete Feature based Visual Generative Model (DFVGM), which exploits semantic correspondences between the audio and visual modalities during training and generates visual hallucinations in lieu of real videos during inference. The primary challenge is to generate the visual hallucination from noisy audio while preserving semantic correspondences with the clean speech. We start by training the audio encoder in the Audio-Only (AO) setting, which yields continuous semantic features closely associated with the linguistic information. Simultaneously, the visual encoder is trained in the Visual-Only (VO) setting, producing visual features that are phonetically related. Next, we employ K-means to discretize the continuous audio and visual feature spaces. The discretization step allows DFVGM to capture high-level semantic structures that are more resilient to noise and to generate high-quality visual hallucinations. To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a 53% relative reduction in Word Error Rate (WER) on average (30.5% → 12.9%) compared to the current state-of-the-art AO baselines, while maintaining comparable results (< 5% difference) to the Audio-Visual (AV) setting even without video as input.
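The discretization step can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D vectors stand in for continuous encoder features, the function names (`kmeans`, `discretize`) are hypothetical, and the farthest-point initialization is an assumption chosen only to keep the example deterministic. It shows how continuous features are mapped to discrete unit indices, and how a small perturbation of the features can leave those indices unchanged.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], vec))


def kmeans(features, k, iters=20):
    """Plain Lloyd-style K-means with farthest-point initialization (assumed)."""
    centroids = [list(features[0])]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        far = max(features,
                  key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for vec in features:
            clusters[nearest(centroids, vec)].append(vec)
        for idx, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[idx] = [sum(col) / len(members)
                                  for col in zip(*members)]
    return centroids


def discretize(features, centroids):
    """Replace each continuous feature vector by its nearest-centroid index."""
    return [nearest(centroids, vec) for vec in features]


# Toy "clean" features: two well-separated groups (hypothetical data).
clean = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
codebook = kmeans(clean, k=2)
clean_units = discretize(clean, codebook)

# A small additive perturbation stands in for noise on the features.
noisy = [[v + 0.3 for v in vec] for vec in clean]
noisy_units = discretize(noisy, codebook)
```

In this toy setting the perturbed features fall into the same clusters as the clean ones, which is the intuition behind discrete units being more resilient to noise than the continuous features they summarize.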

Visual-guided scene-aware audio generation method based on hierarchical feature codec and rendering decision

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Gotta Hear Them All: Sound Source Aware Vision to Audio Generation

Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Efficient Video to Audio Mapper with Visual Scene Detection

Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation

Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

Multimodal Variational Auto-encoder based Audio-Visual Segmentation

SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound

Visual Hallucination Elevates Speech Recognition

Semantically consistent Video-to-Audio Generation using Multimodal Language Large Model

AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis

Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning

Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Audio-Visual Grouping Network for Sound Localization from Mixtures

Audio Matters in Video Super-Resolution by Implicit Semantic Guidance