Depth-First Neural Architecture with Attentive Feature Fusion for Efficient Speaker Verification.

Bei Liu, Zhengyang Chen, Yanmin Qian
DOI: https://doi.org/10.1109/taslp.2023.3273417
2023-01-01
IEEE/ACM Transactions on Audio Speech and Language Processing
Abstract: Deep speaker embedding learning based on neural networks is currently the predominant approach to speaker verification (SV). Prior studies have investigated various network architectures, but few have addressed how to achieve a better trade-off between model performance and computational complexity. In this paper, we focus on efficient architecture design for speaker verification. First, we systematically study the effect of network depth and width on performance and empirically find that depth matters more than width for the speaker verification task. Based on this observation, we propose a novel depth-first (DF) architecture design rule. Applying it to ResNet and ECAPA-TDNN yields two new families of much deeper models, namely DF-ResNets and DF-ECAPAs. In addition, to further boost the performance of small models in the low-computation regime, we propose two novel attentive feature fusion (AFF) schemes, sequential AFF (S-AFF) and parallel AFF (P-AFF), which dynamically fuse features in a learnable way. Experimental results on the VoxCeleb dataset show that the proposed DF-ResNets and DF-ECAPAs achieve a much better trade-off between performance and complexity than the original ResNet and ECAPA-TDNN. Moreover, small models obtain up to 40% relative improvement in EER by adopting an AFF scheme, at negligible computational cost. Finally, a comprehensive comparison with other published SV systems shows that our proposed models achieve the best trade-off between performance and complexity in both low- and high-computation scenarios.
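The abstract describes attentive feature fusion only as fusing features "dynamically" and "in a learnable way". As a rough illustration of that general idea, the sketch below gates two feature vectors with a learned, input-dependent weight. It is not the paper's S-AFF or P-AFF: the function name `attentive_fuse`, the per-element parameters `weights` and `bias`, and the sigmoid gate are all assumptions made for this example.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def attentive_fuse(x, y, weights, bias):
    """Fuse two equal-length feature vectors with a per-element gate.

    Hypothetical parameters (stand-ins for what training would learn):
      weights: list of (w_x, w_y) pairs, one pair per element
      bias:    list of floats, one per element
    """
    if not (len(x) == len(y) == len(weights) == len(bias)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for xi, yi, (wx, wy), b in zip(x, y, weights, bias):
        # The gate depends on the inputs themselves, so the mix is dynamic.
        gate = sigmoid(wx * xi + wy * yi + b)
        # Convex combination: gate near 1 favors x, gate near 0 favors y.
        fused.append(gate * xi + (1.0 - gate) * yi)
    return fused


# With all parameters at zero the gate is 0.5, so fusion reduces to averaging.
average = attentive_fuse([1.0, 2.0], [3.0, 6.0], [(0.0, 0.0)] * 2, [0.0] * 2)
```

A fixed element-wise sum or average is the special case where the gate is constant; making the gate a function of the inputs is what lets the network decide, per element, how much each branch contributes.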