Abstract: Most people who have tried to learn a foreign language have experienced difficulties understanding or speaking with a native speaker's accent. For native speakers, understanding or speaking a new accent is likewise a difficult task. An accent conversion system that changes a speaker's accent but preserves that speaker's voice identity, such as timbre and pitch, has the potential for a range of applications, such as communication, language learning, and entertainment. Existing accent conversion models tend to change the speaker identity and accent at the same time. Here, we use adversarial learning to disentangle accent-dependent features while retaining other acoustic characteristics. What sets our work apart from existing accent conversion models is the capability to convert an unseen speaker's utterance to multiple accents while preserving its original voice identity. Subjective evaluations show that our model generates audio that sounds closer to the target accent and like the original speaker.

What problem does this paper attempt to address?

The problem this paper addresses is: **how to convert speech in any accent into multiple target accents while retaining the speaker's original voice characteristics (such as timbre and pitch)**. Existing accent conversion models usually change the speaker's identity characteristics and accent at the same time, so the converted speech no longer sounds like the original speaker. In addition, many existing models require additional training for each new speaker or rely on reference pronunciations, which limits their scope of application.

To solve these problems, the authors propose a method based on adversarial learning that achieves zero-shot multi-accent conversion by decoupling accent-related features from other acoustic features. The method can convert the speech of unseen speakers into multiple target accents without changing the speaker's identity. The technology has broad application prospects, including cross-cultural communication, language learning, and entertainment.

### Main contributions:

1. **Zero-shot multi-accent conversion, achieved for the first time**: The model can convert speech in any accent into multiple target accents without changing voice characteristics unrelated to accent.
2. **No need for text labels or speaker ID labels**: Training requires no text labels or speaker ID labels for the accented speech, although a pre-trained ASR model is used to extract linguistic features.
3. **Synchronous conversion**: The output stays in sync with the input, which suits application scenarios such as video dubbing.

### Method overview:

- **Pronunciation Encoder**: Generates pronunciation sequences specific to a given accent.
- **Acoustic Encoder**: Removes accent information through adversarial training while retaining other acoustic features.
- **HiFiGAN Decoder**: Recombines the processed features and generates the audio waveform.

### Experimental results:

- **Audio quality**: Listeners' quality scores for the converted audio are close to those for the original audio, indicating that the model preserves audio quality well.
- **Speaker similarity**: Most listeners judged the converted audio to sound very similar to the original speaker.
- **Accent conversion effect**: The trained model outperforms the baseline model in accent conversion. In particular, when converting to an American accent, listeners were more likely to judge the converted audio as closer to the target accent.

Overall, this research provides a solution that achieves high-quality multi-accent conversion while retaining the speaker's identity.
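The summary says only that the Acoustic Encoder removes accent information "through adversarial training"; it does not give the loss. As a minimal sketch of one common way such an objective is set up (an assumption, not necessarily the paper's formulation), an auxiliary accent classifier is trained to predict the accent from the encoder output, while the encoder is rewarded when that classifier fails. The names `adversarial_losses`, `recon_loss`, and `adv_weight` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target_index])


def adversarial_losses(recon_loss, accent_logits, accent_label, adv_weight=1.0):
    """Return (classifier_loss, encoder_loss) for one example.

    Assumed setup (not confirmed by the source):
    - the accent classifier minimizes its cross-entropy on the encoder output;
    - the encoder minimizes reconstruction loss MINUS the weighted classifier
      loss, so it is pushed toward features the classifier cannot use.
    """
    classifier_loss = cross_entropy(accent_logits, accent_label)
    encoder_loss = recon_loss - adv_weight * classifier_loss
    return classifier_loss, encoder_loss


# Classifier is confident and correct: accent is still visible in the features.
leaky = adversarial_losses(0.5, [5.0, 0.0, 0.0], accent_label=0)

# Classifier is at chance over three accents: accent has been removed.
clean = adversarial_losses(0.5, [0.0, 0.0, 0.0], accent_label=0)
```

Under these assumptions the encoder loss is lower in the `clean` case than in the `leaky` case, which is the pressure that drives accent information out of the acoustic features. In a real training loop the same effect is often obtained with a gradient reversal layer or with alternating updates of the classifier and the encoder.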

Voice-preserving Zero-shot Multiple Accent Conversion

Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network

Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision

Accent conversion using discrete units with parallel data synthesized from controllable accented TTS

Transfer the linguistic representations from TTS to accent conversion with non-parallel data

Residual Speaker Representation for One-Shot Voice Conversion

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training

Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

TTS-Guided Training for Accent Conversion Without Parallel Data

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

Defending Your Voice: Adversarial Attack on Voice Conversion

End-To-End Accent Conversion Without Using Native Utterances

Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion

LM-VC: Zero-Shot Voice Conversion via Speech Generation Based on Language Models

End-to-end accent conversion method

Accent Conversion with Articulatory Representations

A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction