Abstract: A growing number of people in China live with some degree of visual impairment, which has made modal conversion between a single image or video frame and audio conveying the same information a research hotspot. Deep learning approaches such as OCR+Vocoder and Im2Wav enable English audio synthesis or image-to-sound matching in a self-supervised manner. However, the audio data available for training is limited, and English is not accessible to all visually impaired people, whose educational levels vary. To address these problems of data volume and language applicability, and thereby improve the reading efficiency of visually impaired people, an image-to-speech framework for the Chinese context, CLIP-KNN-Fastspeech2, was constructed. The framework integrates multiple base models and adopts a strategy of independent pre-training followed by joint fine-tuning. First, the Chinese CLIP model and the Fastspeech2 text-to-speech model were pre-trained on the public datasets MUGE and Baker, respectively, and their convergence was verified. Subsequently, joint fine-tuning was performed on a self-built Braille image dataset. Experimental results on multiple public datasets, including VGGSound, Flickr8k, and ImageHear, and on the self-built Braille dataset BIT-DP show that the model improves objective metrics such as BLEU4, FAD (Fréchet Audio Distance), and WER (Word Error Rate), as well as inference speed. These results indicate that the constructed model can still synthesize high-quality speech with limited data, and they demonstrate the effectiveness of a joint training strategy that integrates multiple base models.
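As a concrete illustration of one of the metrics named in the abstract, the following is a minimal sketch of word error rate under its standard definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. It is not the authors' evaluation script. The function name `word_error_rate` is hypothetical, and the sketch assumes whitespace-separated tokens, so Chinese text would first need word segmentation or splitting into characters.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are separated by whitespace. For Chinese text,
    segment into words or characters before calling this function.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev[j] + 1       # reference token missing from hypothesis
            insertion = curr[j - 1] + 1  # extra token in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the dog sat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference; lower is better.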

Multi-Modal Knowledge Transfer for Target Speaker Lipreading with Improved Audio-Visual Pretraining and Cross-Lingual Fine-Tuning

Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Summary on the Chat-Scenario Chinese Lipreading (ChatCLR) Challenge

Multi-Grained Spatio-temporal Modeling for Lip-reading

Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading

Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder

Cross-Modal Language Modeling in Multi-Motion-Informed Context for Lip Reading

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition

CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2

A practical approach to the child with multiple congenital anomalies.

Synchronous Bidirectional Learning for Multilingual Lip Reading

Lip-to-Speech Synthesis in the Wild with Multi-task Learning