Multi-speaker Text-to-speech Training with Speaker Anonymized Data

Wen-Chin Huang, Yi-Chiao Wu, Tomoki Toda

2024-05-20

Abstract: The trend of scaling up speech generation models poses a threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while maintaining other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which is further used to train an end-to-end TTS model, VITS, to perform unseen speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained using those data. Importantly, we found that UTMOS, a data-driven subjective rating predictor model, and GVD, a metric that measures the gain of voice distinctiveness, are good indicators of the downstream TTS performance. We summarize insights in the hope of helping future researchers determine the goodness of the SA system for multi-speaker TTS training.

Audio and Speech Processing, Cryptography and Security, Sound

What problem does this paper attempt to address?

The paper addresses the risk of biometric information leakage from the training data of multi-speaker text-to-speech (TTS) systems. As speech generation models scale up, they may memorize parts of their training data, raising privacy and security concerns. For example, the voice of a speaker in the training data could be maliciously exploited to deceive voice authentication systems. To reduce this risk, the authors investigate training the TTS model on data that has undergone speaker anonymization (SA), a process intended to hide the speaker identity of the input speech while maintaining its other attributes. Specifically, the paper applies two signal processing-based and three deep neural network-based SA methods to VCTK, a multi-speaker TTS dataset. The anonymized data are then used to train an end-to-end TTS model, VITS, which performs unseen-speaker TTS at test time. Extensive objective and subjective experiments evaluate both the anonymized training data and the downstream TTS model trained on it. The study finds that UTMOS (a data-driven predictor of subjective ratings) and GVD (a metric that measures the gain of voice distinctiveness) are good indicators of downstream TTS performance. In summary, the paper explores how to train multi-speaker TTS models while protecting speaker privacy, and it summarizes insights to help future researchers judge how well an SA system suits multi-speaker TTS training.
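The page does not define GVD, so the following is only a minimal sketch, assuming the definition commonly used in the VoicePrivacy Challenge evaluations: GVD = 10 log10(D_diag(M_aa) / D_diag(M_oo)), where M_oo and M_aa are speaker-by-speaker voice similarity matrices computed on original and anonymized speech, and D_diag is the absolute gap between the mean diagonal and mean off-diagonal entries. The function names and the toy matrices are hypothetical and are not taken from the paper. A GVD of 0 dB would mean voice distinctiveness is preserved, and a negative value would mean anonymized voices are harder to tell apart.

```python
import math


def diagonal_dominance(sim):
    """Absolute gap between mean same-speaker (diagonal) and mean
    cross-speaker (off-diagonal) similarity in a square matrix."""
    n = len(sim)
    if n < 2 or any(len(row) != n for row in sim):
        raise ValueError("expected a square matrix with at least 2 speakers")
    diag_mean = sum(sim[i][i] for i in range(n)) / n
    off_mean = sum(
        sim[i][j] for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return abs(diag_mean - off_mean)


def gvd_db(sim_original, sim_anonymized):
    """Gain of voice distinctiveness in dB (assumed VoicePrivacy-style definition)."""
    ratio = diagonal_dominance(sim_anonymized) / diagonal_dominance(sim_original)
    return 10.0 * math.log10(ratio)


# Toy similarity matrices for two speakers (illustrative values only).
toy_original = [[0.9, 0.1], [0.1, 0.9]]
toy_anonymized = [[0.5, 0.1], [0.1, 0.5]]

print(gvd_db(toy_original, toy_anonymized))
```

In practice the similarity matrices would come from a speaker verification system scoring pairs of speakers; that step is omitted here because the page gives no details about it.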

Speaker Anonymization for Personal Information Protection Using Voice Conversion Techniques

Distinctive and Natural Speaker Anonymization via Singular Value Transformation-assisted Matrix

Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models

Reprogramming Self-supervised Learning-based Speech Representations for Speaker Anonymization

Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions

A Benchmark for Multi-speaker Anonymization

Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques

Distinguishable Speaker Anonymization Based on Formant and Fundamental Frequency Scaling

Two-Stage Voice Anonymization for Enhanced Privacy

V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization

Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices

Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard

Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization

Speaker anonymization using orthogonal Householder neural network

GhostVec: A New Threat to Speaker Privacy of End-to-End Speech Recognition System

Evaluation of Speaker Anonymization on Emotional Speech

NPU-NTU System for Voice Privacy 2024 Challenge

Adversarial speech for voice privacy protection from Personalized Speech generation

MUSA: Multi-lingual Speaker Anonymization via Serial Disentanglement