Abstract: Multi-speaker text-to-speech synthesis generates speech in the voice of an individual speaker from a reference waveform and an input sequence of graphemes or phonemes. Deep neural networks for this task are typically trained on a large amount of speech recorded from a specific speaker so that they can generate audio in that voice. Learning a new speaker not seen during training then requires retraining the model on a large dataset, which is expensive in time and resources. Reducing that cost is therefore a key requirement for such techniques. In this paper, a multi-speaker text-to-speech synthesis method using a generalized end-to-end loss function is developed. Given a speech reference from a user and a text string as input, it generates speech in real time and carries the speaker's characteristics from the reference into the generated speech. The effect of the speaker encoder on the spontaneity and fluency of the generated speech is assessed using the mean opinion score (MOS). The speaker encoder is trained on varying hours of audio data, and the effect on the produced speech is observed. Furthermore, an extensive analysis is performed on the impact of the training dataset on the speaker encoder and the resulting generated speech, and on various speaker encoder models for the speaker verification task. Based on the loss and the Equal Error Rate (EER), an advanced GRU is selected for use with the generalized end-to-end loss function. The speaker verification regression test shows that the proposed model generates speech that a regression algorithm can separate into two sets, male and female, while a second test shows that the speaker embeddings form separate clusters, indicating that each speaker is uniquely identified.
In terms of results, the proposed model achieved a MOS of 4.02 when trained on 'Train-clean-100', 3.74 on 'Train-clean-360', and 3.25 on 'Train-clean-500'. The MOS test compares our method with prior models and demonstrates its superior performance. Finally, a cross-similarity matrix offers a visual representation of the similarity and disparity between utterances, underscoring the model's robustness and efficacy.
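As a rough illustration of the two evaluation ideas named in the abstract, the cross-similarity matrix and the Equal Error Rate, the following minimal sketch computes both on toy data. It is not the paper's implementation: the embedding values, the speaker names `spk_a` and `spk_b`, the use of cosine similarity, and the threshold-sweep approximation of the EER are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_similarity(embeddings):
    """Pairwise cosine similarity between all utterance embeddings."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]

def equal_error_rate(scores, labels):
    """Approximate EER: sweep thresholds over the observed scores and
    return the point where false-accept and false-reject rates are closest."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    best_gap, eer = None, None
    for t in sorted(set(scores)):
        far = sum(s >= t for s in negatives) / len(negatives)  # false accepts
        frr = sum(s < t for s in positives) / len(positives)   # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy embeddings (made up): two utterances each from two hypothetical speakers.
utterances = [
    ("spk_a", [1.0, 0.1, 0.0]),
    ("spk_a", [0.9, 0.2, 0.1]),
    ("spk_b", [0.0, 1.0, 0.2]),
    ("spk_b", [0.1, 0.9, 0.3]),
]
speakers = [s for s, _ in utterances]
matrix = cross_similarity([e for _, e in utterances])

# Score every distinct pair; the label is True when both share a speaker.
scores, labels = [], []
for i in range(len(utterances)):
    for j in range(i + 1, len(utterances)):
        scores.append(matrix[i][j])
        labels.append(speakers[i] == speakers[j])

eer = equal_error_rate(scores, labels)
```

In this toy setup, same-speaker pairs score higher than cross-speaker pairs, so a threshold exists that separates them. Real speaker embeddings would come from the trained speaker encoder rather than hand-written vectors.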

Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation

SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios

Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis

Realistic multi-microphone data simulation for distant speech recognition

Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora

Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition

Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation

Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information

Data Efficient Child-Adult Speaker Diarization with Simulated Conversations

Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation

SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems

Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation

Employing Deep Learning Model to Evaluate Speech Information in Acoustic Simulations of Auditory Implants

SONAR: A Synthetic AI-Audio Detection Framework and Benchmark

Utilizing Speaker Profiles for Impersonation Audio Detection

Multi speaker text-to-speech synthesis using generalized end-to-end loss function

ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis

3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement

Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework