Abstract: Scarcity of parallel data is the key challenge in accent conversion (AC), in which both the pronunciation units and the prosody pattern need to be converted. We propose a two-stage generative framework, "convert-and-speak", in which conversion operates only at the semantic token level and speech is synthesized, conditioned on the converted semantic tokens, by a speech generative model in the target accent domain. This decoupled design enables the "speaking" module to use a massive amount of target-accent speech and reduces the parallel data required for the "conversion" module. Using semantic tokens as a bridge also relaxes the requirement for data with text transcriptions and unlocks the use of language pre-training technology to further reduce the need for parallel accent speech data. To reduce the complexity and latency of "speaking", a single-stage autoregressive (AR) generative model is designed to achieve good quality at lower computation cost. Experiments on Indian-English to general American-English conversion show that the proposed framework achieves state-of-the-art performance in accent similarity, speech quality, and speaker maintenance with only 15 minutes of weakly parallel data, which is not constrained to the same speaker. Extensive experiments with diverse accent types suggest that the framework is highly adaptable and readily scales to other accents with low-resource data. Audio samples are available at <a class="link-external link-https" href="https://www.microsoft.com/en-us/research/project/convert-and-speak-zero-shot-accent-conversion-with-minimumsupervision/" rel="external noopener nofollow">this https URL</a>.

End-To-End Accent Conversion Without Using Native Utterances

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

Accent conversion using discrete units with parallel data synthesized from controllable accented TTS

End-to-end accent conversion method

TTS-Guided Training for Accent Conversion Without Parallel Data

Transfer the linguistic representations from TTS to accent conversion with non-parallel data

Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision

A New Approach to Accent Recognition and Conversion for Mandarin Chinese

Zero-Shot Accent Conversion using Pseudo Siamese Disentanglement Network

Accent Conversion with Articulatory Representations

Voice-preserving Zero-shot Multiple Accent Conversion

Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS

MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion

Disentangling segmental and prosodic factors to non-native speech comprehensibility

Evaluating Methods for Ground-Truth-Free Foreign Accent Conversion

Synthetic Cross-accent Data Augmentation for Automatic Speech Recognition

Accent Recognition with Hybrid Phonetic Features

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Non-autoregressive real-time Accent Conversion model with voice cloning

AccentBox: Towards High-Fidelity Zero-Shot Accent Generation