Abstract: The speaker encoder is an important front-end module that extracts discriminative speaker features for many speech applications requiring speaker information. Current speaker encoders aggregate multi-scale features from utterances using multi-branch network architectures. However, naively adding many branches through fully convolutional operations cannot efficiently improve the capability to capture multi-scale features, because the number of model parameters and the computational complexity increase rapidly. Therefore, current network architectures include only a few branches, corresponding to a limited number of temporal scales, for capturing speaker features. To address this problem, this paper proposes an effective temporal multi-scale (TMS) model in which multi-scale branches can be efficiently designed in a speaker encoder with a negligible increase in computational cost. The TMS model is based on a time-delay neural network (TDNN), where the network architecture is separated into channel-modeling and temporal multi-branch modeling operators. In the TMS model, adding temporal multi-scale elements in the temporal multi-branch operator only slightly increases the model's parameters, thus saving more of the computational budget for adding branches with large temporal scales. After model training, we further develop a systemic re-parameterization method that converts the multi-branch network topology into a single-path topology to increase the inference speed. We conducted automatic speaker verification (ASV) experiments under in-domain (VoxCeleb) and out-of-domain (CNCeleb) conditions to investigate the proposed TMS model's performance. Experimental results show that the TMS-based model outperformed state-of-the-art ASV models (e.g., ECAPA-TDNN) and improved robustness. Moreover, the proposed model achieved a 29%–46% increase in inference speed compared to ECAPA-TDNN.
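The re-parameterization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each temporal branch is a per-channel (depthwise) 1-D convolution with an odd kernel size, that branch outputs are combined by plain summation, and that no normalization or nonlinearity sits between the branches; none of these details are stated in the abstract. Under those assumptions the branches are linear, so their kernels can be zero-padded to a common length and added into one equivalent kernel. All function names and numeric values below are hypothetical.

```python
def depthwise_conv1d(x, kernel):
    """'Same'-padded 1-D correlation of a single channel with an odd-length kernel."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(kernel[j] * padded[t + j] for j in range(len(kernel)))
        for t in range(len(x))
    ]


def multi_branch(x, kernels):
    """Training-time view: run every temporal branch, then sum the outputs."""
    outputs = [depthwise_conv1d(x, k) for k in kernels]
    return [sum(values) for values in zip(*outputs)]


def merge_kernels(kernels):
    """Inference-time view: centre each kernel in the largest one and add them up."""
    size = max(len(k) for k in kernels)
    merged = [0.0] * size
    for k in kernels:
        offset = (size - len(k)) // 2  # zero-padding on each side
        for j, weight in enumerate(k):
            merged[offset + j] += weight
    return merged


# Hypothetical branches with temporal scales 1, 3 and 5 for one channel.
branch_kernels = [[0.5], [0.25, -1.0, 0.25], [0.1, 0.0, 0.3, 0.0, -0.2]]
signal = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.5]

merged_kernel = merge_kernels(branch_kernels)
multi_out = multi_branch(signal, branch_kernels)
single_out = depthwise_conv1d(signal, merged_kernel)

# The two paths should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(multi_out, single_out))
print(max_diff < 1e-9)
```

The same arithmetic suggests why such branches can be cheap: under the depthwise assumption, a branch with kernel size k over C channels adds C·k weights, whereas a full convolution over the same channels would add C·C·k.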
