Abstract: As an alternative approach, viseme-based lipreading systems have shown promising performance in decoding videos of people uttering entire sentences. However, the overall performance of such systems is significantly affected by how efficiently visemes are converted to words during lipreading. As shown in the literature, this conversion has become a bottleneck: performance can drop dramatically from a high viseme classification accuracy (e.g., over 90%) to a comparatively low word classification accuracy (e.g., just over 60%). The underlying cause is that roughly half of the words in the English language are homophemes, i.e., a single sequence of visemes can map to multiple words, e.g., “time” and “some”. To tackle this issue, this paper proposes a deep learning network model with an attention-based Gated Recurrent Unit for efficient viseme-to-word conversion and compares it against three other approaches. The proposed approach features strong robustness, high efficiency, and short execution time. It has been verified through analysis and practical experiments in predicting sentences from the benchmark LRS2 and LRS3 datasets. The main contributions of the paper are as follows: (1) a model is developed that is effective in converting visemes to words, discriminates between homopheme words, and is robust to incorrectly classified visemes; (2) the proposed model uses few parameters and therefore requires little overhead and time to train and execute; and (3) improved performance is achieved in predicting spoken sentences from the LRS2 dataset, with an attained word accuracy rate of 79.6%, an improvement of 15.0% compared with the state-of-the-art approaches.
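The homopheme problem described in the abstract can be illustrated with a small sketch. The phoneme-to-viseme table and the pronunciation lexicon below are hypothetical toy data chosen for illustration only; they are not the mapping or vocabulary used in the paper. The sketch groups words by their viseme sequence and shows that several distinct words can collapse onto the same sequence, which is why a viseme-to-word converter needs sentence context to pick the right one.

```python
from collections import defaultdict

# Hypothetical, deliberately coarse phoneme-to-viseme table using
# ARPAbet-style symbols. This is an assumption for illustration and
# NOT the mapping used in the paper.
PHONEME_TO_VISEME = {
    "P": "P", "B": "P", "M": "P",
    "T": "T", "D": "T", "S": "T", "Z": "T", "N": "T",
    "K": "K", "G": "K",
    "AE": "AH", "AH": "AH", "AY": "AH",
    "IY": "IY", "IH": "IY",
}

# Tiny hand-written pronunciation lexicon (also an assumption).
LEXICON = {
    "time": ["T", "AY", "M"],
    "some": ["S", "AH", "M"],
    "bat": ["B", "AE", "T"],
    "pat": ["P", "AE", "T"],
    "mat": ["M", "AE", "T"],
    "key": ["K", "IY"],
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


def homopheme_groups(lexicon):
    """Group words that share the same viseme sequence."""
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        groups[to_visemes(phonemes)].append(word)
    return {visemes: sorted(words) for visemes, words in groups.items()}


groups = homopheme_groups(LEXICON)

# Keep only viseme sequences that are ambiguous, i.e. map to several words.
ambiguous = {v: w for v, w in groups.items() if len(w) > 1}

for visemes, words in ambiguous.items():
    print(visemes, "->", words)
```

Under this toy mapping, "time" and "some" share one viseme sequence, and "bat", "mat", and "pat" share another. A lookup table alone cannot choose between them, so a converter has to use the surrounding words to decide.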

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Accent-VITS: accent transfer for end-to-end TTS

VITS-Based Singing Voice Conversion Leveraging Whisper and multi-scale F0 Modeling

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Improving Model Stability and Training Efficiency in Fast, High Quality Expressive Voice Conversion System

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion

FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis

PolyVoice: Language Models for Speech to Speech Translation

An Effective Conversion of Visemes to Words for High-Performance Automatic Lipreading

Zero-shot Cross-lingual Voice Transfer for TTS

Building Multi lingual TTS using Cross Lingual Voice Conversion

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion

ViSpeR: Multilingual Audio-Visual Speech Recognition

Improving Cross-Lingual Speech Synthesis with Triplet Training Scheme

VITS-based Singing Voice Conversion System with DSPGAN post-processing for SVCC2023

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge