Abstract: The speaker encoder is an important front-end module that extracts discriminative speaker features for the many speech applications that require speaker information. Current speaker encoders aggregate multi-scale features from utterances using multi-branch network architectures. However, naively adding many branches through fully convolutional operations cannot efficiently improve the encoder's capability to capture multi-scale features, because the model parameters and computational complexity increase rapidly. Therefore, current network architectures include only a few branches, corresponding to a limited number of temporal scales, for capturing speaker features. To address this problem, this paper proposes an effective temporal multi-scale (TMS) model in which multi-scale branches can be efficiently designed in a speaker encoder with a negligible increase in computational cost. The TMS model is based on a time-delay neural network (TDNN), where the network architecture is separated into channel-modeling and temporal multi-branch modeling operators. In the TMS model, adding temporal multi-scale elements to the temporal multi-branch operator only slightly increases the number of model parameters, which leaves more of the computational budget for adding branches with large temporal scales. After model training, we further develop a systemic re-parameterization method that converts the multi-branch network topology into a single-path topology to increase the inference speed. We conducted automatic speaker verification (ASV) experiments under in-domain (VoxCeleb) and out-of-domain (CNCeleb) conditions to investigate the proposed TMS model's performance. Experimental results show that the TMS-based model outperformed state-of-the-art ASV models (e.g., ECAPA-TDNN) and improved robustness. Moreover, the proposed model achieved a 29%–46% increase in inference speed compared to ECAPA-TDNN.
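The abstract does not spell out the re-parameterization procedure, so the following is only a minimal sketch of the general idea under stated assumptions: the temporal branches are taken to be depthwise 1-D convolutions with odd kernel sizes whose outputs are summed, with no batch normalization, dilation, or nonlinearity inside the branches. Under those assumptions, the branches can be folded into one kernel by zero-padding each to the largest size and adding them. The function names `depthwise_conv1d` and `merge_branches` and the kernel sizes are illustrative, not taken from the paper.

```python
import numpy as np

def depthwise_conv1d(x, w):
    """Zero-padded ('same') depthwise 1-D correlation.

    x: (channels, time); w: (channels, k) with odd k.
    Each channel is filtered only with its own kernel.
    """
    c, t = x.shape
    k = w.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c, t))
    for i in range(k):
        out += w[:, i:i + 1] * xp[:, i:i + t]
    return out

def merge_branches(kernels):
    """Fold summed branches into one kernel (assumed setup, see lead-in).

    Every kernel is centered inside a zero kernel of the largest size,
    and the padded kernels are added together.
    """
    k_max = max(w.shape[1] for w in kernels)
    merged = np.zeros((kernels[0].shape[0], k_max))
    for w in kernels:
        off = (k_max - w.shape[1]) // 2
        merged[:, off:off + w.shape[1]] += w
    return merged

rng = np.random.default_rng(0)
channels, frames = 4, 50
x = rng.standard_normal((channels, frames))

# Hypothetical temporal scales; the paper's actual choices may differ.
kernels = [rng.standard_normal((channels, k)) for k in (1, 3, 5, 7)]

# Training-time view: run every branch and sum the outputs.
multi_branch = sum(depthwise_conv1d(x, w) for w in kernels)

# Inference-time view: a single convolution with the merged kernel.
single_path = depthwise_conv1d(x, merge_branches(kernels))

max_abs_diff = float(np.max(np.abs(multi_branch - single_path)))
print(max_abs_diff)
```

Because convolution is linear, the two views agree up to floating-point rounding. The sketch also hints at why extra depthwise branches are cheap: a branch with kernel size k adds only `channels * k` weights, whereas a full convolution would add `channels * channels * k`.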

Branch-ECAPA-TDNN: A Parallel Branch Architecture to Capture Local and Global Features for Speaker Verification

ECAPA++: Fine-grained Deep Embedding Learning for TDNN Based Speaker Verification

PCF: ECAPA-TDNN with Progressive Channel Fusion for Speaker Verification

DFR-ECAPA: Diffusion Feature Refinement for Speaker Verification Based on ECAPA-TDNN

DS-TDNN: Dual-stream Time-delay Neural Network with Global-aware Filter for Speaker Verification

NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification

TMS: Temporal multi-scale in time-delay neural network for speaker verification

Depth-First Neural Architecture with Attentive Feature Fusion for Efficient Speaker Verification

Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification

P-vectors: A Parallel-Coupled TDNN/Transformer Network for Speaker Verification

Combination of Multiple Embeddings for Speaker Retrieval

Improving ECAPA-TDNN Performance with Coordinate Attention

Branch-Transformer: A Parallel Branch Architecture to Capture Local and Global Features for Language Identification

End-to-End Feature Learning for Text-Independent Speaker Verification

CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking

Dual-model self-regularization and fusion for domain adaptation of robust speaker verification

An Effective Deep Embedding Learning Architecture for Speaker Verification

Dilated Residual Networks with Multi-Level Attention for Speaker Verification

Dual Path Embedding Learning for Speaker Verification with Triplet Attention

ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention

ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings