Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification

Nian Li, Jianguo Wei
2024-05-30
Abstract: Transformer-based architectures for speaker verification typically require more training data than ECAPA-TDNN. Therefore, recent work has generally been trained on VoxCeleb1&2. We propose a backbone network based on self-attention, which can achieve competitive results when trained on VoxCeleb2 alone. The network alternates between neighborhood attention and global attention to capture local and global features, then aggregates features of different hierarchical levels, and finally performs attentive statistics pooling. Additionally, we employ a progressive channel fusion strategy to expand the receptive field in the channel dimension as the network deepens. We trained the proposed PCF-NAT model on VoxCeleb2 and evaluated it on VoxCeleb1 and the validation sets of VoxSRC. The EER and minDCF of the shallow PCF-NAT are on average more than 20% lower than those of similarly sized ECAPA-TDNN. Deep PCF-NAT achieves an EER lower than 0.5% on VoxCeleb1-O. The code and models are publicly available at https://github.com/ChenNan1996/PCF-NAT.
Sound, Audio and Speech Processing
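The abstract's distinction between neighborhood attention and global attention can be illustrated with a small sketch. This is not the authors' implementation: it assumes a 1D sequence of frames, a single head, no learned query/key/value projections and no positional bias, and the names `neighborhood_indices` and `attention` are hypothetical. Each frame attends only to its `k` nearest frames, with the window shifted at the sequence edges so it always contains `k` frames; passing `k=None` gives ordinary global attention.

```python
import math


def neighborhood_indices(i, n, k):
    """Indices of the k frames that frame i attends to.

    The window is centred on i where possible and shifted inward at the
    edges, so every frame always sees exactly k neighbours.
    """
    if k > n:
        raise ValueError("kernel size k must not exceed sequence length n")
    start = min(max(i - k // 2, 0), n - k)
    return list(range(start, start + k))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, k=None):
    """Single-head scaled dot-product self-attention over feature vectors x.

    k=None: global attention (every frame attends to every frame).
    k=int:  neighborhood attention with a window of k frames.
    Assumption for illustration: queries, keys and values are all x itself.
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        idx = list(range(n)) if k is None else neighborhood_indices(i, n, k)
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in idx
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * x[j][c] for w, j in zip(weights, idx)) for c in range(d)]
        )
    return out
```

With `k` equal to the sequence length the two modes coincide, which is one way to see neighborhood attention as a restricted form of global attention. A network that alternates the two, as the abstract describes, would apply `attention(x, k=...)` and `attention(x)` in successive blocks.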
What problem does this paper attempt to address?
This paper proposes PCF-NAT, a self-attention backbone for speaker verification that combines a Neighborhood Attention Transformer (NAT) design with a progressive channel fusion (PCF) strategy. The authors note that Transformer-based architectures typically need more training data than ECAPA-TDNN, which is why recent Transformer-based work has generally been trained on VoxCeleb1&2. Their goal is a model that achieves competitive results when trained on VoxCeleb2 alone. The network alternates between neighborhood attention and global attention to capture local and global features, aggregates features from different hierarchical levels, and then applies attentive statistics pooling. Drawing on PCF-ECAPA-TDNN, it uses grouped 1D convolutions whose groups are progressively merged, so that the receptive field in the channel dimension expands as the network deepens. Experimental results show that, compared to a similarly sized ECAPA-TDNN, the shallow PCF-NAT reduces EER and minDCF by more than 20% on average, and the deep PCF-NAT achieves an EER below 0.5% on VoxCeleb1-O. PCF-NAT also performs well on the VoxSRC validation sets and exhibits good scalability. The paper concludes with an ablation study that validates the effectiveness of the model's components, and discusses future directions, including more efficient downsampling methods and training PCF-NAT on larger datasets to evaluate its generalization ability.
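To make the reported metric concrete, here is a minimal sketch of how an equal error rate (EER) and a relative reduction could be computed from trial scores. It is not the paper's evaluation code. It assumes that higher scores mean "same speaker", sweeps thresholds over the observed scores only, and uses hypothetical function names; the numbers in the usage example are made up for illustration and are not results from the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: similarity scores, higher means more likely the same speaker.
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = None, None
    for t in sorted(set(scores)):
        # A trial is accepted when its score is at least the threshold.
        false_reject = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        false_accept = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        frr = false_reject / n_target
        far = false_accept / n_nontarget
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, proposed):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - proposed) / baseline


# Toy trial list (illustrative values only).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = equal_error_rate(scores, labels)
```

A statement such as "EER is more than 20% lower" is a relative comparison: under this definition, a hypothetical baseline EER of 1.0% and a proposed EER of 0.8% give `relative_reduction(1.0, 0.8)` of about 0.2.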