Abstract: Because noise degrades conventional audio speech recognition (ASR), audio-visual speech recognition (AVSR) has been proposed to incorporate both audio and visual signals. Existing methods have demonstrated that aligned visual input of lip movements makes AVSR systems more robust to noise, but paired videos are not always available during inference. This missing-visual-modality problem restricts the practicality of such systems in real-world scenarios. To tackle it, we propose a Discrete Feature based Visual Generative Model (DFVGM), which exploits semantic correspondences between the audio and visual modalities during training and generates visual hallucinations in lieu of real videos during inference. The primary challenge is to generate the visual hallucination from noisy audio while preserving semantic correspondence with the clean speech. To address this challenge, we first train the audio encoder in the Audio-Only (AO) setting, so that it produces continuous semantic features closely associated with the linguistic information. In parallel, the visual encoder is trained in the Visual-Only (VO) setting, producing visual features that are phonetically related. Next, we employ K-means to discretize the continuous audio and visual feature spaces. This discretization step allows DFVGM to capture high-level semantic structures that are more resilient to noise and to generate high-quality visual hallucinations. To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a 53% relative reduction in Word Error Rate (WER) on average (from 30.5% to 12.9%) compared to the current state-of-the-art AO baselines, while maintaining comparable results (< 5% difference) to the Audio-Visual (AV) setting even without video as input.
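
The K-means discretization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy random vectors stand in for encoder features, the function names `fit_kmeans` and `assign_units` are hypothetical, and the cluster count and feature dimension are arbitrary choices, since the abstract does not state them. The sketch only shows how continuous frames map to discrete unit IDs, and why small perturbations of a frame tend to leave its unit unchanged.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_units(frames, codebook):
    """Map each continuous frame to the index of its nearest centroid."""
    return [
        min(range(len(codebook)), key=lambda j: sq_dist(frame, codebook[j]))
        for frame in frames
    ]


def fit_kmeans(frames, k, n_iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids (the codebook)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct frames.
    codebook = [list(frame) for frame in rng.sample(frames, k)]
    for _ in range(n_iters):
        units = assign_units(frames, codebook)
        for j in range(k):
            members = [f for f, u in zip(frames, units) if u == j]
            if members:  # keep the old centroid if a cluster is empty
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


# Assumption: random 8-dimensional vectors stand in for encoder outputs.
rng = random.Random(0)
clean_frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]

codebook = fit_kmeans(clean_frames, k=4)
clean_units = assign_units(clean_frames, codebook)

# Slightly perturbed copies mimic mild noise in the continuous features.
noisy_frames = [[x + rng.gauss(0.0, 0.01) for x in f] for f in clean_frames]
noisy_units = assign_units(noisy_frames, codebook)

# Fraction of frames whose discrete unit is unchanged by the perturbation.
agreement = sum(c == n for c, n in zip(clean_units, noisy_units)) / len(clean_units)
print(f"unit agreement under perturbation: {agreement:.2f}")
```

Because every frame inside a cluster's region maps to the same ID, perturbations that do not push a frame across a cluster boundary have no effect on the discrete sequence, which is the intuition behind the noise resilience claimed for the discretized features.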

AVR: Synergizing Foundation Models for Audio-Visual Humor Detection

We are humor beings: Understanding and predicting visual humor

Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection

AVATAR: Unconstrained Audiovisual Speech Recognition

Advancing Humor-Focused Sentiment Analysis through Improved Contextualized Embeddings and Model Architecture

LOLgorithm: Integrating Semantic,Syntactic and Contextual Elements for Humor Classification

Audiovisual Highlight Detection in Videos

MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection

AVA-AVD: Audio-Visual Speaker Diarization in the Wild

DeHumor: Visual Analytics for Decomposing Humor

Visual Hallucination Elevates Speech Recognition

Highlight Timestamp Detection Model for Comedy Videos via Multimodal Sentiment Analysis

Human Audio-Visual Consonant Recognition Analyzed With Three Bimodal Integration Models

Towards Multimodal Prediction of Spontaneous Humour: A Novel Dataset and First Results

Comment-Aware Multi-Modal Heterogeneous Pre-Training for Humor Detection in Short-Form Videos

Revisiting audio visual scene-aware dialog

Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models

QuMIN: quantum multi-modal data fusion for humor detection

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention