Abstract: Deep speaker embedding learning based on neural networks has become the predominant approach in speaker verification (SV). Prior studies have investigated various network architectures, but few have addressed how to achieve a better trade-off between model performance and computational complexity. In this paper, we focus on efficient architecture design for speaker verification. First, we systematically study the effect of network depth and width on performance and empirically find that depth is more important than width for the speaker verification task. Based on this observation, we propose a novel depth-first (DF) architecture design rule. Applying it to ResNet and ECAPA-TDNN yields two new families of much deeper models, namely DF-ResNets and DF-ECAPAs. In addition, to further boost the performance of small models in the low-computation regime, we propose two novel attentive feature fusion (AFF) schemes, sequential AFF (S-AFF) and parallel AFF (P-AFF), which dynamically fuse features in a learnable way. Experimental results on the VoxCeleb dataset show that the proposed DF-ResNets and DF-ECAPAs achieve a much better trade-off between performance and complexity than the original ResNet and ECAPA-TDNN. Moreover, small models obtain up to 40% relative improvement in equal error rate (EER) by adopting the AFF schemes at negligible computational cost. Finally, a comprehensive comparison with other published SV systems shows that our proposed models achieve the best trade-off between performance and complexity in both low- and high-computation scenarios.
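To make the idea of fusing features "in a learnable way" concrete, here is a minimal sketch of a generic gated fusion of two feature vectors. It is not the S-AFF or P-AFF scheme from the paper, whose exact formulation the abstract does not give. The function name `attentive_fuse` and the parameters `weights` and `bias` are hypothetical and stand in for whatever learnable layers a real model would use.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def attentive_fuse(x, y, weights, bias):
    """Fuse two feature vectors with an input-dependent gate (illustrative only).

    x, y    : lists of floats of equal length C (two feature branches).
    weights : C rows of length 2*C; a hypothetical learnable projection
              applied to the concatenation of x and y.
    bias    : list of C floats; hypothetical learnable bias.
    """
    concat = x + y  # concatenate the two branches
    fused = []
    for c in range(len(x)):
        # Attention score for channel c, computed from both inputs.
        score = sum(w * v for w, v in zip(weights[c], concat)) + bias[c]
        gate = sigmoid(score)
        # Convex combination: the gate decides how much each branch contributes.
        fused.append(gate * x[c] + (1.0 - gate) * y[c])
    return fused
```

With all-zero `weights` and `bias`, every gate is 0.5 and the fusion reduces to a plain average of the two branches. Training would move the gates away from 0.5 wherever one branch is more informative, in contrast to a fixed sum or concatenation.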

Deep Embedding Learning for Text-Dependent Speaker Verification

An Effective Deep Embedding Learning Architecture for Speaker Verification

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

An Effective Deep Embedding Learning Method Based on Dense-Residual Networks for Speaker Verification

End-to-End Feature Learning for Text-Independent Speaker Verification

Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification

An Improved Deep Embedding Learning Method for Short Duration Speaker Verification

Deep Speaker Feature Learning for Text-independent Speaker Verification

Deep neural network-based speaker embeddings for end-to-end speaker verification

Depth-First Neural Architecture with Attentive Feature Fusion for Efficient Speaker Verification

Learning Discriminative Speaker Embedding by Improving Aggregation Strategy and Loss Function for Speaker Verification

Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting

Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System

ECAPA++: Fine-grained Deep Embedding Learning for TDNN Based Speaker Verification

Speaker Verification With Deep Features

Deep Speaker Embedding with Multi-Part Information Aggregation in Frequency-Time Domain for ASV

Improving the Generalized Performance of Deep Embedding for Text-Independent Speaker Verification

Deep Segment Attentive Embedding for Duration Robust Speaker Verification

Deep Speaker Embedding Using Hybrid Network of Multi-Feature Aggregation and Multi-Loss Fusion for TI-SV

Deep Speaker Vectors for Semi Text-independent Speaker Verification

An Effective Speaker Recognition Method Based on Joint Identification and Verification Supervisions