Voice-preserving Zero-shot Multiple Accent Conversion

Mumin Jin, Prashant Serai, Jilong Wu, Andros Tjandra, Vimal Manohar, Qing He
2023-10-14
Abstract: Most people who have tried to learn a foreign language would have experienced difficulties understanding or speaking with a native speaker's accent. For native speakers, understanding or speaking a new accent is likewise a difficult task. An accent conversion system that changes a speaker's accent but preserves that speaker's voice identity, such as timbre and pitch, has the potential for a range of applications, such as communication, language learning, and entertainment. Existing accent conversion models tend to change the speaker identity and accent at the same time. Here, we use adversarial learning to disentangle accent-dependent features while retaining other acoustic characteristics. What sets our work apart from existing accent conversion models is the capability to convert an unseen speaker's utterance to multiple accents while preserving its original voice identity. Subjective evaluations show that our model generates audio that sounds closer to the target accent and like the original speaker.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to convert speech in any accent into multiple target accents while retaining the speaker's original voice characteristics (such as timbre and pitch)**. Existing accent conversion models usually change the speaker's identity and accent at the same time, so the converted speech no longer sounds like the original speaker. In addition, many existing models require additional training for each new speaker or rely on reference pronunciations, which limits their scope of application.

To address these problems, the authors propose an adversarial-learning-based method that achieves zero-shot multi-accent conversion by disentangling accent-related features from other acoustic features. The method can convert the speech of unseen speakers into multiple target accents without changing the speaker's identity. Potential applications include cross-cultural communication, language learning, and entertainment.

### Main contributions:

1. **First zero-shot multi-accent conversion**: The model converts speech in any accent into multiple target accents without changing voice characteristics unrelated to accent.
2. **No need for text labels or speaker ID labels**: Training requires neither text labels nor speaker ID labels for the accented speech, although a pre-trained ASR model is used to extract linguistic features.
3. **Synchronous conversion**: The output stays in sync with the input, which suits applications such as video dubbing.

### Method overview:

- **Pronunciation Encoder**: Generates pronunciation sequences for a specific accent.
- **Acoustic Encoder**: Removes accent information through adversarial training while retaining other acoustic features.
- **HiFiGAN Decoder**: Recombines the processed features and generates the audio waveform.

### Experimental results:

- **Audio quality**: Listeners' quality scores for the converted audio are close to those for the original audio, indicating that the model largely preserves audio quality.
- **Speaker similarity**: Most listeners judge the converted audio to sound very similar to the original speaker.
- **Accent conversion effect**: The trained model outperforms the baseline model in accent conversion. The effect is most pronounced when the target is an American accent, where listeners are more likely to judge the converted audio as closer to the target accent.

Overall, this research provides a solution that achieves high-quality multi-accent conversion while retaining the speaker's identity.
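
The adversarial training mentioned in the method overview can be illustrated with a toy objective. The summary does not state the exact loss the authors use, so the following is only a minimal sketch under assumptions: an accent classifier reads the acoustic encoder's output, the classifier minimizes cross-entropy, and the encoder is penalized when the classifier succeeds. All function names and the `adv_weight` parameter are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true accent class."""
    return -math.log(softmax(logits)[target_index])


def classifier_loss(accent_logits, accent_label):
    # Assumed adversary: it tries to recognize the accent from the
    # acoustic encoder's features, so it minimizes cross-entropy.
    return cross_entropy(accent_logits, accent_label)


def encoder_loss(reconstruction_loss, accent_logits, accent_label, adv_weight=0.1):
    # Assumed encoder objective: reconstruct the audio well while making
    # the classifier fail, so the classifier's cross-entropy is subtracted.
    # adv_weight is a hypothetical trade-off hyperparameter.
    return reconstruction_loss - adv_weight * cross_entropy(accent_logits, accent_label)


if __name__ == "__main__":
    confident = [4.0, 0.0, 0.0]  # classifier clearly detects accent 0
    uniform = [0.0, 0.0, 0.0]    # classifier cannot tell the accents apart
    print(encoder_loss(1.0, confident, 0))
    print(encoder_loss(1.0, uniform, 0))
```

With the reconstruction loss held equal, the encoder's loss is lower when the classifier's prediction is uniform than when it is confident, which is the pressure that pushes accent information out of the acoustic features in this kind of setup.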