Abstract: The goal of this work is to reconstruct speech from silent video in both speaker-dependent and speaker-independent settings. Unlike previous works, which have mostly been restricted to the speaker-dependent setting, we propose a Visual Voice memory that restores essential auditory information so that proper speech can be generated for different speakers, including unseen ones. The proposed memory takes additional auditory information that corresponds to the input face movements and stores auditory contexts that can be recalled by the given input visual features. Specifically, the Visual Voice memory contains value and key memory slots: the value memory slots save the audio features, and the key memory slots store the visual features at the same locations as the saved audio features. By guiding each memory to save its respective features properly, the model can produce adequate speech. Hence, our method uses both video and audio information during training but does not require any additional auditory input during inference. Our key contributions are: (1) proposing the Visual Voice memory, which supplies rich audio information that complements the visual features and thus produces high-quality speech from silent video, and (2) enabling training in multi-speaker and unseen-speaker settings by memorizing auditory features and the corresponding visual features. We validate the proposed framework on the GRID and Lip2Wav datasets and show that our method surpasses the performance of previous works in both multi-speaker and speaker-independent settings. We also demonstrate that the Visual Voice memory contains meaningful information for reconstructing speech.
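
The key-value recall described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the memory is addressed, so cosine-similarity addressing with a softmax, the function name `recall_audio`, the `scale` parameter, and all feature dimensions are assumptions made purely for illustration. The sketch only shows how visual features could query key slots and read back audio features from the value slots at the same locations, with no audio input needed at recall time.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def recall_audio(visual_feats, key_memory, value_memory, scale=10.0):
    """Hypothetical key-value memory read.

    visual_feats: (T, D_v) visual features, one row per video frame.
    key_memory:   (N, D_v) key slots holding visual features.
    value_memory: (N, D_a) value slots holding audio features; slot i of
                  value_memory pairs with slot i of key_memory.
    scale:        assumed sharpening factor for the addressing softmax.
    Returns the recalled audio features (T, D_a) and the addressing
    weights (T, N).
    """
    eps = 1e-8
    q = visual_feats / (np.linalg.norm(visual_feats, axis=1, keepdims=True) + eps)
    k = key_memory / (np.linalg.norm(key_memory, axis=1, keepdims=True) + eps)
    # Cosine similarity between each query and each key slot (assumption).
    addr = softmax(scale * (q @ k.T), axis=1)
    # Read audio features from the value slots with the same weights.
    return addr @ value_memory, addr


# Toy example with random slots; in a trained model these would be learned.
rng = np.random.default_rng(0)
key_memory = rng.normal(size=(8, 16))     # 8 slots, 16-dim visual features
value_memory = rng.normal(size=(8, 32))   # 8 slots, 32-dim audio features
visual_feats = rng.normal(size=(5, 16))   # 5 frames of silent video

recalled, addr = recall_audio(visual_feats, key_memory, value_memory)

# A query equal to key slot 3 addresses slot 3 most strongly.
_, addr_single = recall_audio(key_memory[3:4], key_memory, value_memory)
```

In this sketch the recalled audio features would be combined with the visual features before speech generation; how the actual model trains the two memories and fuses the result is not stated in the abstract and is left out here.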

Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

Multimodal Pretraining from Monolingual to Multilingual

Learning Cross-lingual Visual Speech Representations

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

Multilingual Approach to Joint Speech and Accent Recognition with DNN-HMM Framework

A Visual Attention Grounding Neural Model for Multimodal Machine Translation

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Towards Language-Universal Mandarin-English Speech Recognition

Speech Reconstruction With Reminiscent Sound Via Visual Voice Memory

Integrating international Chinese visualization teaching and vocational skills training: leveraging attention-connectionist temporal classification models

A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition

Multi-task Recurrent Model for True Multilingual Speech Recognition

Audio-visual Speaker Recognition with a Cross-modal Discriminative Network

Collaborative Joint Training With Multitask Recurrent Model for Speech and Speaker Recognition

Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

Multi-Temporal Lip-Audio Memory for Visual Speech Recognition