Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts

Dongjie Fu, Xize Cheng, Xiaoda Yang, Hanting Wang, Zhou Zhao, Tao Jin
DOI: https://doi.org/10.1145/3664647.3681347
2024-01-01
Abstract: In the burgeoning field of Audio-Visual Speech Recognition (AVSR), existing research has concentrated predominantly on training paradigms tailored to high-quality resources. However, owing to the challenges inherent in real-world data collection, audio-visual data are frequently affected by modality-distortion, which encompasses audio-visual asynchrony, video noise, and audio noise. The recognition accuracy of existing AVSR methods is significantly compromised when multiple modality-distortions coexist in low-resource data. In light of these challenges, we propose PCD: cluster-Prompt with Contrastive Decomposition, a robust framework for modality-distortion speech recognition, specifically devised to transfer pre-trained knowledge from the high-resource domain to the target domain by leveraging contrast-augmented prompts. In contrast to previous studies, we take into consideration the possibility of various types of distortion in both the audio and visual modalities. Concretely, we design bespoke prompts to delineate each modality-distortion, guiding the model to achieve speech recognition applicable to various distortion scenarios with very few learnable parameters. To materialize the prompt mechanism, we employ multiple cluster-based strategies that better suit the pre-trained audio-visual model. Additionally, we design a contrastive decomposition mechanism to restrict the explicit relationships among various modality conditions, given their shared task knowledge and disparate modality priors. Extensive results on the LRS2 dataset demonstrate that PCD achieves state-of-the-art performance for audio-visual speech recognition under the constraints of distorted resources. Code is available at https://github.com/ballooncatt/PCD.