Abstract: Accent conversion (AC) transforms a non-native speaker's accent into a native accent while preserving the speaker's voice timbre. In this paper, we propose approaches that improve both the applicability and the quality of accent conversion. First, we assume that no reference speech is available at the conversion stage, and hence we employ an end-to-end text-to-speech system trained on native speech to generate native reference speech. To improve the quality and accent of the converted speech, we introduce reference encoders, which allow us to exploit multi-source information. The motivation is that acoustic features extracted from the native reference, together with linguistic information, are complementary to conventional phonetic posteriorgrams (PPGs), so they can be concatenated with the PPGs as features to improve a baseline system based only on PPGs. Moreover, we optimize the model architecture by using GMM-based attention instead of windowed attention to improve synthesis performance. Experimental results indicate that, when the proposed techniques are applied, the integrated system significantly raises the scores for acoustic quality (30% relative increase in mean opinion score) and native accent (68% relative preference) while retaining the voice identity of the non-native speaker.
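As a rough illustration of the GMM-based attention mentioned in the abstract, the sketch below computes one decoder step of a common formulation in which each mixture component's mean can only move forward along the encoder sequence. The abstract does not specify which variant the authors use, so the function name `gmm_attention_step`, the parameter layout, and the softmax/exponential parameterisation are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def gmm_attention_step(params, prev_mu, num_encoder_steps):
    """One decoder step of a GMM attention mechanism (illustrative sketch).

    params: array of shape (K, 3) holding unnormalised (weight, step, width)
            values for K mixture components. In a real model these would be
            predicted from the decoder state; here they are plain inputs.
    prev_mu: array of shape (K,), component means from the previous step.
    Returns (alignment over encoder steps, updated means).
    """
    w_hat, delta_hat, sigma_hat = params[:, 0], params[:, 1], params[:, 2]

    # Softmax keeps the mixture weights positive and summing to one
    # (an assumed parameterisation).
    w = np.exp(w_hat - w_hat.max())
    w /= w.sum()

    # exp() makes the step and width strictly positive, so every mean
    # moves forward at each decoder step (monotonic alignment).
    delta = np.exp(delta_hat)
    sigma = np.exp(sigma_hat)
    mu = prev_mu + delta

    # Evaluate each Gaussian bump at every encoder position j, then sum
    # over the K components to get the alignment.
    j = np.arange(num_encoder_steps)[None, :]            # shape (1, N)
    z = (j - mu[:, None]) / sigma[:, None]               # shape (K, N)
    alignment = (w[:, None] * np.exp(-0.5 * z ** 2)).sum(axis=0)
    return alignment, mu
```

Given encoder outputs of shape `(N, D)`, the context vector for the step would then be `alignment @ encoder_outputs`. Because the means never move backwards, this kind of attention cannot skip back and repeat earlier input, which is one plausible reason to prefer it over a fixed attention window.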

Voice Conversion with Smoothed GMM and MAP Adaptation

Voice conversion using dynamic inter-frame features

An improved method for voice conversion based on Gaussian mixture model

Voice Conversion Based on Gaussian Mixture Modules with Minimum Distance Spectral Mapping

A hybrid method to convert acoustic features for voice conversion

GMM-based Voice Conversion with Explicit Modelling on Feature Transform

An Improved Spectral and Prosodic Transformation Method in STRAIGHT-Based Voice Conversion

A hybrid GMM and codebook mapping method for spectral conversion

A novel voice conversion system based on codebook mapping with phoneme-tied weighting

Text-Independent Voice Conversion Based on State Mapped Codebook

A Compact Framework For Voice Conversion Using Wavenet Conditioned On Phonetic Posteriorgrams

Improving the Performance of HMM-based Voice Conversion Using Context Clustering Decision Tree and Appropriate Regression Matrix Format

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion

Voice Conversion Based on Speaker Independent Model

Voice Conversion towards Arbitrary Speakers With Limited Data

RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

Residual Speaker Representation for One-Shot Voice Conversion

A Parametric Model for Voice Conversion

Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning