Abstract:Due to the detrimental impact of noise on the conventional audio speech recognition (ASR) task, audio-visual speech recognition~(AVSR) has been proposed by incorporating both audio and visual video signals. Although existing methods have demonstrated that the aligned visual input of lip movements can enhance the robustness of AVSR systems against noise, the paired videos are not always available during inference, leading to the problem of the missing visual modality, which restricts their practicality in real-world scenarios. To tackle this problem, we propose a Discrete Feature based Visual Generative Model (DFVGM) which exploits semantic correspondences between the audio and visual modalities during training, generating visual hallucinations in lieu of real videos during inference. To achieve that, the primary challenge is to generate the visual hallucination given the noisy audio while preserving semantic correspondences with the clean speech. To tackle this challenge, we start with training the audio encoder in the Audio-Only (AO) setting, which generates continuous semantic features closely associated with the linguistic information. Simultaneously, the visual encoder is trained in the Visual-Only (VO) setting, producing visual features that are phonetically related. Next, we employ K-means to discretize the continuous audio and visual feature spaces. The discretization step allows DFVGM to capture high-level semantic structures that are more resilient to noise and generate visual hallucinations with high quality. To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a remarkable 53% relative reduction (30.5%->12.9%) in Word Error Rate (WER) on average compared to the current state-of-the-art Audio-Only (AO) baselines while maintaining comparable results (< 5% difference) under the Audio-Visual (AV) setting even without video as input.

Hallucination in Perceptual Metric-Driven Speech Enhancement Networks

Embedding and Gradient Say Wrong: A White-Box Method for Hallucination Detection

Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models

Perceptually Guided Speech Enhancement Using Deep Neural Networks

Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users using Intermediate ASR Features and Human Memory Models

In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation

Enhancing Hallucination Detection through Perturbation-Based Synthetic Data Generation in System Responses

A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech

Visual Hallucination Elevates Speech Recognition

Non-intrusive speech quality assessment using neural networks

On the Role of Noise in AudioVisual Integration: Evidence from Artificial Neural Networks that Exhibit the McGurk Effect

Multi-CMGAN+/+: Leveraging Multi-Objective Speech Quality Metric Prediction for Speech Enhancement

Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data

On the Behavior of Intrusive and Non-intrusive Speech Enhancement Metrics in Predictive and Generative Settings

Non Intrusive Intelligibility Predictor for Hearing Impaired Individuals using Self Supervised Speech Representations

Attention-Based Speech Enhancement Using Human Quality Perception Modeling

Using Speech Foundational Models in Loss Functions for Hearing Aid Speech Enhancement

Looks Too Good To Be True: An Information-Theoretic Analysis of Hallucinations in Generative Restoration Models

Optimizing the Perceptual Quality of Time-Domain Speech Enhancement with Reinforcement Learning

Detecting Hallucinations in Virtual Histology with Neural Precursors

Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned