Abstract: The goal of this work is to reconstruct speech from silent video in both speaker-dependent and speaker-independent settings. Unlike previous works, which have mostly been restricted to the speaker-dependent setting, we propose the Visual Voice memory, which restores the essential auditory information needed to generate proper speech for different speakers, including unseen ones. During training, the proposed memory takes additional auditory information that corresponds to the input face movements and stores auditory contexts that can later be recalled from the input visual features. Specifically, the Visual Voice memory contains value and key memory slots: value memory slots save the audio features, and key memory slots store the visual features at the same locations as the saved audio features. By guiding each memory to properly save its respective features, the model can adequately produce speech. Hence, our method employs both video and audio information during training but does not require any additional auditory input during inference. Our key contributions are: (1) proposing the Visual Voice memory, which provides rich audio information that complements the visual features and thus produces high-quality speech from silent video, and (2) enabling multi-speaker and unseen-speaker training by memorizing auditory features and the corresponding visual features. We validate the proposed framework on the GRID and Lip2Wav datasets and show that our method surpasses the performance of previous works in both multi-speaker and speaker-independent settings. We also demonstrate that the Visual Voice memory contains meaningful information for reconstructing speech.
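
The key-value recall described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the addressing scheme, so cosine similarity followed by a scaled softmax is an assumption here, and the names `recall_audio`, `key_memory`, `value_memory`, and `scale` are hypothetical. The sketch only shows the inference-time idea that a visual feature addresses the key slots and the matching value slots return an audio feature, so no audio input is needed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def softmax(scores, scale=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recall_audio(visual_query, key_memory, value_memory, scale=10.0):
    """Recall an audio feature from a visual feature (assumed addressing).

    key_memory[i] holds a visual feature and value_memory[i] holds the audio
    feature saved at the same slot index i. The query is compared with every
    key, and the values are mixed using the resulting attention weights.
    """
    weights = softmax([cosine(visual_query, k) for k in key_memory], scale)
    dim = len(value_memory[0])
    recalled = [
        sum(w * v[d] for w, v in zip(weights, value_memory))
        for d in range(dim)
    ]
    return recalled, weights


# Toy memory with three slots: 2-D visual keys, 3-D audio values (made-up numbers).
key_memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
value_memory = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]]

# A visual query close to the first key mostly recalls the first audio value.
recalled, weights = recall_audio([0.9, 0.1], key_memory, value_memory)
```

In this toy setup the query is nearly aligned with the first key, so the first slot receives almost all of the attention weight and the recalled vector is close to the first stored audio value. In a trained model the slots would be learned parameters rather than hand-written lists.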

Speaker Retrieval for TV Show Videos by Associating Audio Speaker Recognition Result to Visual Faces

Audio-Visual System for Robust Speaker Recognition

Efficient Identification Of Speakers In News Video Based On Shot Segmentation

Person-Based Video Summarization And Retrieval By Tracking And Clustering Temporal Face Sequences

Video parsing based on head tracking and face recognition

Seeing Through Noise: Visually Driven Speaker Separation and Enhancement

Audio-visual Speaker Recognition with a Cross-modal Discriminative Network

Speaker Clustering Aided by Visual Dialogue Analysis

Audio-visual multi-person tracking and identification for smart environments

Integration of audio-visual information for multi-speaker multimedia speaker recognition

Robust Speaking Face Identification For Video Analysis

Audio-Visual Talker Localization in Video for Spatial Sound Reproduction

VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection

Speech Reconstruction With Reminiscent Sound Via Visual Voice Memory

Unsupervised Person Clustering in Videos with Cross-Modal Communication

Audio visual character profiles for detecting background characters in entertainment media

Character-aware audio-visual subtitling in context

Automatic Naming of Speakers in Video via Name-Face Mapping

DyViSE: Dynamic Vision-Guided Speaker Embedding for Audio-Visual Speaker Diarization

A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model

Appearance Matters, So Does Audio: Revealing the Hidden Face via Cross-Modality Transfer