Deep Learning Enabled Semantic Communications with Speech Recognition and Synthesis

Zhenzi Weng, Zhijin Qin, Xiaoming Tao, Chengkang Pan, Guangyi Liu, Geoffrey Ye Li
DOI: https://doi.org/10.48550/arXiv.2205.04603
2023-03-31
Abstract: In this paper, we develop a deep learning based semantic communication system for speech transmission, named DeepSC-ST. We take speech recognition and speech synthesis as the transmission tasks of the communication system. First, the speech recognition-related semantic features are extracted for transmission by a joint semantic-channel encoder, and the text is recovered at the receiver from the received semantic features, which significantly reduces the required amount of data transmission without performance degradation. Then, we perform speech synthesis at the receiver, which regenerates the speech signals by feeding the recognized text and the speaker information into a neural network module. To make DeepSC-ST adaptive to dynamic channel environments, we identify a robust model to cope with different channel conditions. According to the simulation results, the proposed DeepSC-ST significantly outperforms conventional communication systems and existing DL-enabled communication systems, especially in the low signal-to-noise ratio (SNR) regime. A software demonstration is further developed as a proof-of-concept of DeepSC-ST.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses the inefficiency of traditional communication systems when transmitting speech signals. It proposes a deep-learning-based semantic communication system for speech transmission, DeepSC-ST, which extracts and transmits only task-related semantic features. This significantly reduces the amount of data to transmit without degrading performance. The main objectives are:

1. **Improve transmission efficiency**: Extract and transmit low-dimensional text-related semantic features instead of the complete speech signal, significantly reducing network traffic.
2. **Adapt to dynamic channel environments**: Design a robust model that copes with different channel conditions.
3. **Support diverse user needs**: Provide text information and, when the user requires it, regenerate the speech signal.
4. **Perform well at low signal-to-noise ratio (SNR)**: In the low-SNR regime, DeepSC-ST performs significantly better than traditional communication systems and other deep-learning-based communication systems.

### System overview

The main components of the DeepSC-ST system are:

- **Semantic Encoder**: Uses convolutional neural network (CNN) and bidirectional recurrent neural network (BRNN) modules to extract text-related semantic features from the input speech signal.
- **Channel Encoder**: Converts the extracted semantic features into symbols for transmission over the physical channel.
- **Channel Decoder**: Receives the transmitted symbols and recovers the text-related semantic features.
- **Feature Decoder**: Decodes the recovered semantic features into the final text transcription.
- **Speech Synthesis Module**: Regenerates the speech signal from the recognized text and the user ID.

### Main contributions

1. **Propose a new semantic communication system (DeepSC-ST)**: The system targets communication scenarios with speech input and uses a joint semantic-channel coding scheme.
2. **Significantly reduce the amount of transmitted data**: Extracting text-related semantic features reduces the required communication resources without affecting performance.
3. **Provide diverse system outputs**: By supporting both speech recognition and speech synthesis tasks, the system can deliver text information or speech signals according to user needs.
4. **Build a demonstration system with an operation interface**: The demo takes real human speech as input and generates the recognized text and the synthesized speech.

### Performance evaluation

The paper evaluates the system with the following metrics:

1. **Speech recognition task**:
   - **Character Error Rate (CER)**: Based on the number of character substitutions, deletions, and insertions between the recovered text and the original text.
   - **Word Error Rate (WER)**: Based on the number of word substitutions, deletions, and insertions between the recovered text and the original text.
2. **Speech synthesis task**:
   - **Unconditional Fréchet DeepSpeech Distance (FDSD)**: Measures the distribution similarity between the synthesized speech and the real speech.
   - **Unconditional Kernel DeepSpeech Distance (KDSD)**: Measures the distribution similarity between the synthesized speech and the real speech through the Maximum Mean Discrepancy (MMD).

With these metrics, the paper demonstrates the advantages of DeepSC-ST in both transmission efficiency and performance.
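
The two recognition metrics above can be illustrated with a minimal sketch. It assumes the standard definitions, in which the count of substitutions, deletions, and insertions (the Levenshtein distance) is divided by the length of the reference text. The function names `edit_distance`, `wer`, and `cer` are hypothetical and chosen for illustration; this is not the paper's evaluation code.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of substitutions, deletions, and insertions
    needed to turn `hyp` into `ref` (Levenshtein distance)."""
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance against an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, assuming normalization by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, assuming normalization by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    original = "the cat sat on the mat"
    recovered = "the cat sit on mat"
    print(f"WER: {wer(original, recovered):.3f}")
    print(f"CER: {cer(original, recovered):.3f}")
```

Lower values are better for both metrics. Because insertions are counted, either rate can exceed 1 when the recovered text is much longer than the original.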