Abstract: Deep learning (DL) based semantic communication methods have been explored in recent years for the efficient transmission of images, text, and speech. In contrast to traditional wireless communication methods, which focus on the transmission of abstract symbols, semantic communication approaches attempt to achieve better transmission efficiency by sending only the semantically relevant information in the source data. In this paper, we consider semantic-oriented speech transmission, which transmits only the semantic-relevant information over the channel for the speech recognition task, plus a compact additional set of semantic-irrelevant information for the speech reconstruction task. We propose a novel end-to-end DL-based transceiver that extracts and encodes the semantic information from the input speech spectra at the transmitter and outputs the corresponding transcriptions from the decoded semantic information at the receiver. In particular, we employ a soft alignment module and a redundancy removal module to extract only the text-related semantic features while dropping semantically redundant content, greatly reducing the amount of semantic redundancy compared to existing methods. We also propose a semantic correction module that further corrects the predicted transcription with semantic knowledge by leveraging a pretrained language model. For speech-to-speech transmission, we further include a CTC alignment module that extracts a small amount of additional semantic-irrelevant but speech-related information, such as the duration, pitch, power, and speaker identity of the speech, for better reconstruction of the original speech signals at the receiver. We also introduce a two-stage training scheme that speeds up the training of the proposed DL model.
The simulation results confirm that our proposed method outperforms current methods in terms of the accuracy of the predicted text for speech-to-text transmission and the quality of the recovered speech signals for speech-to-speech transmission, while significantly improving transmission efficiency. More specifically, the proposed method sends only 16% of the transmitted symbols required by existing methods while achieving about a 10% reduction in word error rate (WER) for speech-to-text transmission. For speech-to-speech transmission, the improvement in transmission efficiency is even more pronounced: the method needs only 0.2% of the transmitted symbols required by the existing method while preserving comparable quality of the reconstructed speech signals.
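As a rough illustration of why removing frame-level redundancy shrinks the number of transmitted symbols, the sketch below collapses a CTC-style sequence of per-frame labels into one token per spoken unit and keeps each token's duration in frames as side information. It is not the transceiver described above: the blank symbol `_`, the function name `collapse_frames`, and the toy frame sequence are all assumptions made for this example, and a real system would operate on learned feature vectors rather than characters.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank label, as used in CTC-style alignments


def collapse_frames(frame_labels, blank=BLANK):
    """Collapse per-frame labels into (token, duration_in_frames) pairs.

    Consecutive identical labels are merged into a single token, and
    blank frames are dropped (their durations are discarded here).
    """
    pairs = []
    for label, run in groupby(frame_labels):
        run_length = sum(1 for _ in run)
        if label != blank:
            pairs.append((label, run_length))
    return pairs


# Toy alignment: 20 frames that spell "hello" with repeats and blanks.
frames = list("__hh_eee_ll_lll_oo__")
pairs = collapse_frames(frames)

tokens = [token for token, _ in pairs]        # text-related content
durations = [frames_ for _, frames_ in pairs]  # speech-related side information
ratio = len(tokens) / len(frames)              # fraction of frames kept as tokens

print(tokens, durations, ratio)
```

In this toy case only 5 of the 20 frames survive as tokens. The per-token durations are the kind of compact, semantic-irrelevant side information that a receiver could use to re-expand the tokens to frame rate when reconstructing speech.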
