Abstract: Currently, Foreign Accent Conversion (FAC) models are built on deep neural network architectures and on ensembles of neural networks for speech recognition and speech generation. The use of these models is limited by architectural features that do not allow flexible changes to the timbre of the generated speech and that require accumulating context, which increases generation delay and makes these systems unsuitable for real-time multi-user communication scenarios. We have developed a non-autoregressive model for real-time accent conversion with voice cloning. The model generates native-sounding L1 speech with minimal latency from input L2-accented speech. The model consists of interconnected modules for extracting accent, gender, and speaker embeddings, converting speech, generating spectrograms, and decoding the resulting spectrogram into an audio signal. The model can preserve, clone, and change the timbre, gender, and accent of the speaker's voice in real time. The results of the objective assessment show that the model improves speech quality, leading to enhanced recognition performance in existing ASR systems. The results of subjective tests show that the proposed accent and gender encoder improves the generation quality. The developed model demonstrates high-quality, low-latency accent conversion, voice cloning, and speech enhancement capabilities, making it suitable for real-time multi-user communication scenarios.
What problem does this paper attempt to address?
The main problem this paper addresses is how to apply **Foreign Accent Conversion (FAC)** in real-time multi-user communication scenarios. Specifically, existing FAC models have the following limitations:
1. **Lack of flexibility**: Existing FAC architectures do not allow the timbre of the generated speech to be adjusted flexibly.
2. **High latency**: Existing models must accumulate context, which increases generation latency and makes them unsuitable for real-time applications.
3. **Dependence on paired data**: Many FAC models require paired L2 (second language) and L1 (native language) data in both the training and inference stages, which increases the cost and complexity of data collection and processing.
4. **Fixed accent set**: Some methods can only handle a predefined set of accents and are difficult to adapt to new ones.
To solve these problems, the authors propose a **non-autoregressive real-time accent conversion model combined with voice cloning**. The model converts foreign-accented L2 speech into native-sounding L1 speech in real time, without relying on reference examples or paired data, while retaining the speaker's timbre, gender, and emotional characteristics.
### Main features of the model
- **Modular design**: The model consists of multiple modules: extractors for accent, gender, and speaker embeddings, plus modules for voice conversion, spectrogram generation, and decoding the spectrogram into an audio signal.
- **Non-autoregressive structure**: It avoids the latency problems common in autoregressive models and improves real-time processing capability.
- **Real-time voice cloning**: It can preserve, clone, and modify the speaker's timbre, gender, and accent in real time.
- **High quality and low latency**: The experimental results show that the model keeps generation latency low while maintaining high quality, which makes it suitable for real-time multi-user communication scenarios.
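
The data flow implied by the modular, non-autoregressive design can be sketched as below. This is only an illustrative skeleton, not the authors' implementation: the class names, the three-stage split (`embed`, `convert`, `vocode`), and the toy stand-in functions are all assumptions made for the example. The point it illustrates is that each audio chunk is converted without feeding previously generated output back into the model, and that the conditioning embeddings can be swapped to change the voice.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Frame = List[float]      # placeholder for one spectrogram frame
Embedding = List[float]  # placeholder for an embedding vector


@dataclass
class Conditioning:
    """Embeddings that steer generation (hypothetical structure)."""
    accent: Embedding
    gender: Embedding
    speaker: Embedding


@dataclass
class AccentConversionPipeline:
    # Each stage is a plain callable so a real network could replace the toy one.
    embed: Callable[[Sequence[float]], Conditioning]
    convert: Callable[[Sequence[float], Conditioning], List[Frame]]
    vocode: Callable[[List[Frame]], List[float]]

    def process_chunk(self, chunk: Sequence[float],
                      override: Optional[Conditioning] = None) -> List[float]:
        # No previously generated audio is fed back in: the output for this
        # chunk depends only on the input chunk and the conditioning.
        cond = override if override is not None else self.embed(chunk)
        frames = self.convert(chunk, cond)
        return self.vocode(frames)


# --- Toy stand-ins (assumptions; a real system would use trained networks) ---
def toy_embed(chunk: Sequence[float]) -> Conditioning:
    mean = sum(chunk) / len(chunk)
    return Conditioning(accent=[0.0], gender=[0.0], speaker=[mean])


def toy_convert(chunk: Sequence[float], cond: Conditioning,
                frame_size: int = 4) -> List[Frame]:
    # Identity "conversion": just cut the chunk into fixed-size frames.
    return [list(chunk[i:i + frame_size]) for i in range(0, len(chunk), frame_size)]


def toy_vocode(frames: List[Frame]) -> List[float]:
    # Identity "vocoder": concatenate frames back into samples.
    return [sample for frame in frames for sample in frame]


def stream(pipeline: AccentConversionPipeline, audio: Sequence[float],
           chunk_size: int) -> List[float]:
    """Process audio chunk by chunk, as a streaming caller would."""
    out: List[float] = []
    for start in range(0, len(audio), chunk_size):
        out.extend(pipeline.process_chunk(audio[start:start + chunk_size]))
    return out


pipe = AccentConversionPipeline(embed=toy_embed, convert=toy_convert, vocode=toy_vocode)
audio = [float(i) for i in range(16)]
converted = stream(pipe, audio, chunk_size=8)
```

Passing an explicit `override` to `process_chunk` stands in for changing timbre, gender, or accent at run time; how the actual model combines these embeddings is not described in this summary.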
### Experimental verification
The authors verify the effectiveness of the model through objective and subjective evaluations:
- **Objective evaluation**: Multiple automatic speech recognition (ASR) models transcribe both the original and the converted audio. The converted audio shows significant improvements in both Word Error Rate (WER) and Character Error Rate (CER).
- **Subjective evaluation**: Human listeners rate the naturalness of the converted speech, its similarity to the original speaker, and the degree to which the foreign accent is removed. The proposed model outperforms the ablation model on these measures.
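
For readers unfamiliar with the two objective metrics, the following minimal sketch shows the standard definitions: WER is the word-level edit distance between the reference transcript and the ASR output divided by the number of reference words, and CER is the same ratio at the character level. The example sentences are invented for illustration and are not data from the paper; the paper's exact text normalization and ASR setup are not stated in this summary.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Invented example: ASR output on accented input vs. on converted audio.
reference = "the cat sat on the mat"
asr_on_original = "the cat sad on mat"
asr_on_converted = "the cat sat on the mat"

print(f"original:  WER={wer(reference, asr_on_original):.3f} "
      f"CER={cer(reference, asr_on_original):.3f}")
print(f"converted: WER={wer(reference, asr_on_converted):.3f} "
      f"CER={cer(reference, asr_on_converted):.3f}")
```

Lower values are better for both metrics, so an improvement after conversion appears as a drop in WER and CER for the same ASR system.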
In summary, this paper aims to develop an efficient, flexible accent conversion model suitable for real-time applications, with the goal of improving the quality and efficiency of cross-language communication.