Abstract: Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature fusion approach has been proposed to effectively capture speaker characteristics from short utterances. Constrained by the model's size, a robust backbone Enhanced Res2Net (ERes2Net) combining global and local feature fusion demonstrates sub-optimal performance in short-duration speaker verification. To further improve the short-duration feature extraction capability of ERes2Net, we expand the channel dimension within each stage. However, this modification also increases the number of model parameters and computational complexity. To alleviate this problem, we propose an improved ERes2NetV2 by pruning redundant structures, ultimately reducing both the model parameters and its computational cost. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of ERes2NetV2, which achieves an EER of 0.61% for the full-duration trial, 0.98% for the 3s-duration trial, and 1.48% for the 2s-duration trial on VoxCeleb1-O.
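The EER (equal error rate) figures quoted in the abstract are the operating point at which the false-acceptance rate equals the false-rejection rate. The sketch below shows one simple way to estimate it from trial scores. It is not the authors' evaluation code: the function name `compute_eer` and the toy scores are hypothetical, and the threshold sweep is a plain quadratic-time loop chosen for readability rather than speed.

```python
def compute_eer(target_scores, nontarget_scores):
    """Estimate the equal error rate from verification scores.

    target_scores: scores of same-speaker trials (higher = more similar).
    nontarget_scores: scores of different-speaker trials.
    Returns (eer, threshold) at the threshold where the false-acceptance
    and false-rejection rates are closest to each other.
    """
    # Candidate thresholds: every score that actually occurs.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for thr in thresholds:
        # A trial is accepted when its score is >= the threshold.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Toy example (made-up scores, not results from the paper).
target = [0.9, 0.8, 0.7, 0.4]
nontarget = [0.6, 0.3, 0.2, 0.1]
eer, thr = compute_eer(target, nontarget)
print(f"EER = {eer:.2%} at threshold {thr}")  # EER = 25.00% at threshold 0.6
```

In this toy case one of four target trials is rejected and one of four non-target trials is accepted at threshold 0.6, so both error rates equal 25%.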

What problem does this paper attempt to address?

This paper aims to improve speaker verification on short-duration speech, particularly trials that are only a few seconds long. Because existing systems degrade on such short utterances, the paper proposes an improved model, ERes2NetV2, which combines Bottom-up Dual-stage Feature Fusion (BDFF) and Bottleneck-like Local Feature Fusion (BLFF). ERes2NetV2 strengthens feature extraction by expanding the channel dimension of each stage, and offsets the added cost by pruning redundant structures to reduce model parameters and computational complexity.

BDFF fuses multi-scale feature maps from different stages along a bottom-up path to capture global information and reduce structural redundancy. BLFF adopts a bottleneck-like structure that first expands the channel dimension of the feature maps and then compresses the channel dimension of the segmented features, strengthening short-duration feature extraction while reducing the number of parameters and computational complexity.

Experiments on the VoxCeleb and 3D-Speaker datasets show that ERes2NetV2 outperforms the baseline system in the full-duration, 3-second, and 2-second trials, improving verification performance especially on short-duration speech. A t-SNE visualization further indicates that the short-duration speaker embeddings extracted by ERes2NetV2 are more discriminative. In summary, the paper addresses the challenges of short-duration speaker verification and proposes a more efficient, better-performing model for that task.
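The parameter trade-off described above for the bottleneck-like design can be illustrated with simple arithmetic. The sketch below counts convolution weights in a generic Res2Net-style bottleneck block (a 1x1 convolution, per-split 3x3 convolutions, then a 1x1 convolution). It is an assumption-laden illustration, not the ERes2NetV2 block: the function names, the channel sizes, and the omission of biases, normalization, and fusion modules are all hypothetical simplifications.

```python
def conv_params(c_in, c_out, kernel):
    """Weights in a 2D convolution with a square kernel and no bias."""
    return kernel * kernel * c_in * c_out


def block_params(c_in, width, scale, c_out):
    """Weights in a generic Res2Net-style bottleneck block (illustrative).

    c_in:  input channels of the block
    width: channels of each split after the first 1x1 convolution
    scale: number of splits
    c_out: output channels of the block
    """
    first = conv_params(c_in, width * scale, 1)        # 1x1 conv feeding the splits
    # In Res2Net, the first split is passed through unchanged, so only
    # (scale - 1) splits get their own 3x3 convolution.
    local = (scale - 1) * conv_params(width, width, 3)
    last = conv_params(width * scale, c_out, 1)        # 1x1 conv back to c_out
    return first + local + last


# Hypothetical channel sizes, chosen only to show the trend.
wide = block_params(c_in=128, width=32, scale=2, c_out=128)
narrow = block_params(c_in=128, width=16, scale=2, c_out=128)
print(wide, narrow)  # 25600 10496
```

Halving the split width while keeping the block's input and output channels fixed cuts the 3x3 cost by a factor of four and the 1x1 costs by a factor of two, which is the general mechanism by which a block can keep wide stage-level feature maps while staying cheap internally.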

ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency