Abstract: This technical report describes the Johns Hopkins University speaker recognition system submitted to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21) Track 3: Self-supervised speaker verification (closed). Our overall training process is similar to the one proposed by the first-place team in last year's VoxSRC-20 challenge. The main difference is that a non-contrastive self-supervised method recently proposed in computer vision (CV), distillation with no labels (DINO), is used to train our initial model; it outperformed last year's contrastive learning approach based on momentum contrast (MoCo). In addition, it requires only a few iterations in the iterative clustering stage, where pseudo labels for supervised embedding learning are updated based on clusters of the embeddings generated by a model that is continually fine-tuned over iterations. In the final stage, Res2Net50 is trained on the final pseudo labels from the iterative clustering stage. This is our best model submitted to the challenge, achieving EERs of 1.89%, 6.50%, and 6.89% on the VoxCeleb1 test O, VoxSRC-21 validation, and VoxSRC-21 test trials, respectively.
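The pseudo-label update in the iterative clustering stage can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, so plain k-means with Euclidean distance and a deterministic farthest-point initialization is assumed here purely for illustration; the function name `assign_pseudo_labels` and its parameters are hypothetical.

```python
def assign_pseudo_labels(embeddings, num_clusters, num_iters=10):
    """Cluster embeddings with plain k-means; return one cluster id per embedding.

    Assumption: k-means is a stand-in; the source does not specify the algorithm.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Deterministic init: start from the first embedding, then repeatedly
    # add the embedding farthest from all centroids chosen so far.
    centroids = [list(embeddings[0])]
    while len(centroids) < num_clusters:
        farthest = max(embeddings, key=lambda e: min(sq_dist(e, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [0] * len(embeddings)
    for _ in range(num_iters):
        # Assignment step: each embedding takes the id of its nearest centroid.
        for i, emb in enumerate(embeddings):
            dists = [sq_dist(emb, c) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for k in range(num_clusters):
            members = [e for e, lab in zip(embeddings, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

In a full pipeline, the returned cluster ids would serve as speaker pseudo labels for supervised fine-tuning, after which embeddings would be re-extracted and clustered again in the next iteration.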
What problem does this paper attempt to address?
The paper addresses Track 3 of the VoxSRC-21 speaker recognition challenge: self-supervised speaker verification. Participants may train only on the VoxCeleb2 dataset without using speaker labels, and validate their systems on the VoxCeleb1 test set or other provided verification trials. The core contribution is the use of DINO, a non-contrastive self-supervised learning method recently proposed in computer vision, to train the initial model. According to the abstract, this outperforms the momentum contrast (MoCo)-based contrastive approach used in the previous year's winning system. With DINO, the team needed only a few iterations in the iterative clustering phase, and then improved system performance further by training a larger model (Res2Net50) on the resulting pseudo-labels. The team also adjusted the training segment length to avoid overfitting and to preserve performance on shorter speech segments. The final system achieved equal error rates (EER) of 1.89%, 6.50%, and 6.89% on the VoxCeleb1 test O, VoxSRC-21 validation, and VoxSRC-21 test trials, respectively.
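For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from trial scores, assuming higher scores mean "same speaker"; the function name and the simple threshold sweep are illustrative and are not the challenge's official scoring tool.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER (as a fraction) by sweeping thresholds over all scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy example: one of four trials is misclassified on each side.
same_speaker = [0.9, 0.8, 0.7, 0.3]
different_speaker = [0.6, 0.4, 0.2, 0.1]
print(f"EER = {100 * equal_error_rate(same_speaker, different_speaker):.2f}%")  # EER = 25.00%
```

With a finite trial list the two error rates rarely cross exactly, so this sketch reports their average at the closest threshold; production scoring tools typically interpolate instead.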