Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

Yan Rong, Li Liu

2024-09-01

Abstract: Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) difficulty obtaining facial embeddings that are well aligned with the speaker's voice identity information, and (2) inadequate decoupling of content and speaker identity information in the audio input. To address these issues, we present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC). More precisely, we propose an Identity-Aware Query-based Contrastive Learning (IAQ-CL) module to extract speaker-specific facial features, and a Mutual Information-based Dual Decoupling (MIDD) module to purify content features from audio, ensuring clear and high-quality voice conversion. In addition, unlike prior work, our method accepts either audio or text input, offering controllable speech generation with adjustable emotional tone and speed. Extensive experiments demonstrate that ID-FaceVC achieves state-of-the-art performance across various metrics, with qualitative and user study results confirming its effectiveness in naturalness, similarity, and diversity. The project website, with audio samples and code, is at https://id-facevc.github.io.

Sound, Artificial Intelligence, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses two main issues in Face-based Voice Conversion (FVC):

1. How to obtain facial embeddings that align with the speaker's voice characteristics.
2. How to disentangle content information from speaker identity information in the audio input.

According to the authors, existing methods handle neither challenge well, so the generated voice is of limited quality and does not reflect the target speaker's individual characteristics. To solve these problems, they propose a new zero-shot face-based voice conversion method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC). It has two key components:

1. **Identity-Aware Query-based Contrastive Learning (IAQ-CL)**: extracts the facial features most closely tied to the speaker's identity, avoiding interference from information that is not speaker-specific.
2. **Mutual Information-based Dual Decoupling (MIDD)**: separates content information from speaker identity information in the speech signal, which improves the quality of the generated voice.

In addition, the method accepts either audio or text as input and lets users adjust the emotional tone and speed of the generated voice, giving greater flexibility and controllability. The reported experiments show that ID-FaceVC achieves state-of-the-art performance across the evaluation metrics, and qualitative and user study results support its effectiveness in naturalness, similarity, and diversity.
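The general idea of contrastive face-voice alignment can be illustrated with a small sketch. This is not the authors' IAQ-CL module, whose query-based design is not described here. It is a generic symmetric InfoNCE loss over paired embeddings, written in plain Python. The function names, the temperature value, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(faces, voices):
    """Cosine similarity between every face embedding and every voice embedding."""
    f = [l2_normalize(v) for v in faces]
    a = [l2_normalize(v) for v in voices]
    return [[sum(x * y for x, y in zip(fi, aj)) for aj in a] for fi in f]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(faces, voices, temperature=0.1):
    """Generic symmetric InfoNCE: face i should match voice i, and vice versa.

    Illustrative only; the temperature is an assumed hyperparameter.
    """
    sims = similarity_matrix(faces, voices)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # voice-to-face direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy embeddings (hypothetical): two speakers, two dimensions.
faces = [[1.0, 0.0], [0.0, 1.0]]
aligned_voices = [[0.9, 0.1], [0.1, 0.9]]   # voice i resembles face i
swapped_voices = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = symmetric_info_nce(faces, aligned_voices)
loss_swapped = symmetric_info_nce(faces, swapped_voices)
```

With these toy inputs the aligned pairs yield a much smaller loss than the swapped pairs, which is the signal a contrastive objective uses to pull each face embedding toward the voice embedding of the same speaker.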

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts

Zero-shot voice conversion based on feature disentanglement

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

Who is Authentic Speaker

Voice-preserving Zero-shot Multiple Accent Conversion

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

Hear Your Face: Face-based voice conversion with F0 estimation

CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

One-shot Voice Conversion For Style Transfer Based On Speaker Adaptation

Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion

Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching

Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units