Abstract: Speaker embedding is an important front-end module that extracts discriminative speaker features for speech applications requiring speaker information. Current state-of-the-art (SOTA) backbone networks for speaker embedding aggregate multi-scale features from an utterance using multi-branch network architectures. However, naively adding many multi-scale branches with simple fully convolutional operations does not improve performance efficiently, because the number of model parameters and the computational complexity increase rapidly. As a result, most current SOTA architectures can afford only a few branches, covering a limited number of temporal scales. To address this problem, we propose an effective temporal multi-scale (TMS) model in which multi-scale branches can be added to a speaker embedding network with almost no increase in computational cost. The new model is based on the conventional time-delay neural network (TDNN), whose architecture is separated into two modeling operators: a channel-modeling operator and a temporal multi-branch modeling operator. Adding temporal scales in the temporal multi-branch operator requires only a slight increase in the number of parameters, which leaves more of the computational budget for additional branches with large temporal scales. Moreover, for the inference stage, we developed a systemic re-parameterization method that converts the TMS-based model into a single-path topology to increase inference speed. We evaluated the TMS method for automatic speaker verification (ASV) under in-domain and out-of-domain conditions. Results show that the TMS-based model achieved a significant performance improvement over SOTA ASV models while also providing faster inference.
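
The re-parameterization idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact layer definitions, so the following assumes (hypothetically) that each temporal branch is a depthwise 1-D convolution with an odd kernel size and "same" zero padding, that branch outputs are summed, and that normalization layers are omitted. Under those assumptions, the parallel branches collapse into a single kernel by zero-padding each kernel to the largest size and adding them. The function names `depthwise_conv1d`, `multi_branch`, and `merge_branches` are illustrative and not taken from the paper.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Depthwise 'same' convolution over time (no kernel flip).

    x: (channels, time); kernel: (channels, k) with odd k.
    """
    channels, k = kernel.shape
    pad = k // 2
    x_padded = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros_like(x, dtype=float)
    for c in range(channels):
        # mode="valid" on the padded signal yields exactly `time` outputs
        out[c] = np.correlate(x_padded[c], kernel[c], mode="valid")
    return out

def multi_branch(x, kernels):
    """Training-time view: sum of parallel branches with different kernel sizes."""
    return sum(depthwise_conv1d(x, k) for k in kernels)

def merge_branches(kernels):
    """Inference-time view: fold all branches into one kernel.

    Each kernel is centered inside a zero kernel of the largest size,
    then all are summed (valid because convolution is linear).
    """
    k_max = max(k.shape[1] for k in kernels)
    merged = np.zeros((kernels[0].shape[0], k_max))
    for k in kernels:
        offset = (k_max - k.shape[1]) // 2
        merged[:, offset:offset + k.shape[1]] += k
    return merged

# Assumed toy setup: 4 channels, 50 frames, branches with kernel sizes 1, 3, 5, 7.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 50))
kernels = [rng.standard_normal((4, k)) for k in (1, 3, 5, 7)]

merged = merge_branches(kernels)
max_diff = np.abs(multi_branch(x, kernels) - depthwise_conv1d(x, merged)).max()
print(merged.shape, max_diff < 1e-9)
```

In this sketch each extra depthwise branch of kernel size k adds only channels × k parameters during training, and the merged single-path kernel has the size of the largest branch alone, which is consistent with the efficiency argument in the abstract. A real model would also need to fold any per-branch normalization into the kernels before merging.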

Few-Shot Custom Speech Synthesis with Multi-Angle Fusion

TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS

Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

Multispeech: Multi-speaker text to speech with transformer

Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice

VF-Taco2: Towards Fast and Lightweight Synthesis for Autoregressive Models with Variation Autoencoder and Feature Distillation

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

The THU-HCSI Multi-Speaker Multi-Lingual Few-Shot Voice Cloning System for LIMMITS'24 Challenge

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora

Deep Learning based Multilingual Speech Synthesis using Multi Feature Fusion Methods

AdaptiveFormer: A Few-shot Speaker Adaptive Speech Synthesis Model Based on FastSpeech2

Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

A Multi-task Framework of Speaker Recognition with TTS Data Augmentation

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding

DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech

Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models