Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification

Nian Li, Jianguo Wei

2024-05-30

Abstract: Transformer-based architectures for speaker verification typically require more training data than ECAPA-TDNN. Therefore, recent work has generally been trained on VoxCeleb1&2. We propose a backbone network based on self-attention, which can achieve competitive results when trained on VoxCeleb2 alone. The network alternates between neighborhood attention and global attention to capture local and global features, then aggregates features of different hierarchical levels, and finally performs attentive statistics pooling. Additionally, we employ a progressive channel fusion strategy to expand the receptive field in the channel dimension as the network deepens. We trained the proposed PCF-NAT model on VoxCeleb2 and evaluated it on VoxCeleb1 and the validation sets of VoxSRC. The EER and minDCF of the shallow PCF-NAT are on average more than 20% lower than those of similarly sized ECAPA-TDNN. Deep PCF-NAT achieves an EER lower than 0.5% on VoxCeleb1-O. The code and models are publicly available at https://github.com/ChenNan1996/PCF-NAT.
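
To make the difference between neighborhood attention and global attention concrete, here is a minimal pure-Python sketch over a 1-D sequence of frame vectors. It is an illustration only, not the authors' implementation: the function names (`neighborhood`, `attend`), the single-head dot-product form, and the choice to shift the window at the sequence edges (so that every query still sees `kernel_size` keys) are assumptions made for this example.

```python
import math


def neighborhood(t, length, kernel_size):
    """Key indices that query position t attends to.

    Assumption: at the sequence edges the window is shifted rather than
    padded, so every query sees exactly `kernel_size` keys.
    """
    if kernel_size > length:
        raise ValueError("kernel_size must not exceed the sequence length")
    half = kernel_size // 2
    start = min(max(t - half, 0), length - kernel_size)
    return list(range(start, start + kernel_size))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, kernel_size=None):
    """Single-head scaled dot-product attention over a sequence of vectors.

    kernel_size=None gives global attention (every query sees every key);
    an integer restricts each query to its local neighborhood.
    """
    length, dim = len(queries), len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        if kernel_size is None:
            idx = list(range(length))
        else:
            idx = neighborhood(t, length, kernel_size)
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in idx
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * values[j][d] for w, j in zip(weights, idx))
             for d in range(dim)]
        )
    return outputs
```

With `kernel_size=None` each frame mixes information from the whole utterance, while a small `kernel_size` limits it to nearby frames. A network that alternates the two kinds of layer, as the abstract describes, sees both local and global context.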

Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper proposes PCF-NAT, a self-attention backbone for speaker verification that combines a Neighborhood Attention Transformer (NAT) with a progressive channel fusion (PCF) strategy. The authors point out that Transformer-based architectures typically need more training data than ECAPA-TDNN, which is why recent Transformer-based work has generally been trained on VoxCeleb1&2. Their aim is a model that remains competitive when trained on VoxCeleb2 alone.

The network alternates between neighborhood attention and global attention to capture local and global features, aggregates features from different hierarchical levels, and finally applies attentive statistics pooling. It also draws on the progressive channel fusion strategy of PCF-ECAPA-TDNN to expand the receptive field in the channel dimension as the network deepens. This is done with 1D convolutions whose number of groups is gradually reduced, so that each output channel draws on a progressively wider range of input channels.

The model was trained on VoxCeleb2 and evaluated on VoxCeleb1 and the validation sets of VoxSRC. Compared to similarly sized ECAPA-TDNN, the shallow PCF-NAT reduces the EER and minDCF by more than 20% on average, and the deep PCF-NAT achieves an EER below 0.5% on VoxCeleb1-O. PCF-NAT also performs well on the VoxSRC validation sets and exhibits good scalability. The paper concludes with an ablation study that validates the effectiveness of the model's components, and discusses future directions, including exploring more efficient downsampling methods and training PCF-NAT on larger datasets to evaluate its generalization ability.
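
The effect of progressive channel fusion on the channel receptive field can be illustrated with a short pure-Python sketch. It only tracks which input channels each output channel can reach through a stack of grouped 1D convolutions. It assumes equal input and output channel counts, and the group schedule `[8, 4, 2, 1]` is a hypothetical example rather than the configuration used in the paper. The function names are likewise made up for this illustration.

```python
def grouped_inputs(channels, groups):
    """For one grouped convolution, list the input channels each output
    channel reads from (assumes in_channels == out_channels == channels)."""
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    size = channels // groups  # channels per group
    return [
        set(range((c // size) * size, (c // size + 1) * size))
        for c in range(channels)
    ]


def progressive_fusion(channels, group_schedule):
    """Return, per layer, how many original input channels the first
    output channel can reach after that layer."""
    # Before any layer, each channel only "sees" itself.
    reach = [{c} for c in range(channels)]
    widths = []
    for groups in group_schedule:
        layer = grouped_inputs(channels, groups)
        # A channel's reach is the union of the reach of everything it reads.
        reach = [
            set().union(*(reach[i] for i in layer[c]))
            for c in range(channels)
        ]
        widths.append(len(reach[0]))
    return widths


# Hypothetical schedule: fewer groups in deeper layers.
print(progressive_fusion(16, [8, 4, 2, 1]))  # [2, 4, 8, 16]
```

Under these assumptions the reachable channel range doubles at every layer until the last layer, with a single group, mixes all channels. This is the sense in which the receptive field in the channel dimension grows with depth.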

PCF: ECAPA-TDNN with Progressive Channel Fusion for Speaker Verification

NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification

P-vectors: A Parallel-Coupled TDNN/Transformer Network for Speaker Verification

ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings

VOT: Revolutionizing Speaker Verification with Memory and Attention Mechanisms

CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking

TMS: Temporal multi-scale in time-delay neural network for speaker verification

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Self-Attention Networks for Text-Independent Speaker Verification

Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning

Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations

ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency

MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Improving Transformer-based Networks With Locality For Automatic Speaker Verification

Improving Deep CNN Networks with Long Temporal Context for Text-Independent Speaker Verification

NResNet: nested residual network based on channel and frequency domain attention mechanism for speaker verification in classroom

Hierarchically Attending Time-Frequency and Channel Features for Improving Speaker Verification

CNN with Phonetic Attention for Text-Independent Speaker Verification

Dual-model self-regularization and fusion for domain adaptation of robust speaker verification