Shuffle is What You Need

Wan Lin, Lantian Li, Dong Wang
DOI: https://doi.org/10.1109/ISCSLP57327.2022.10037896
2022-01-01
Abstract: Self-supervised learning has gained extensive attention in speaker recognition, partly because speaker-labelled data is difficult to collect at scale. Contrastive learning is among the most popular approaches in this setting: positive (similar) pairs are sampled from the same utterance, while negative (dissimilar) pairs are sampled from different utterances. Despite the promising results reported in the literature, we argue that this random sampling approach may leave undesirable content residue in speaker embeddings, because the model learns the content dependency shared within positive pairs. In this paper, we investigate a novel frame shuffle approach, which constructs positive pairs by shuffling the frames of the anchor segment. Our experimental results on the VCTK dataset show that the new approach obtains comparable or better performance than random sampling. Moreover, the frame shuffle approach fully corrupts the linguistic content in the training data, which forces the learned model to be language independent. We tested this hypothesis in both multi-lingual and cross-lingual scenarios and observed remarkable performance improvement over the random sampling baseline.
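The core idea of building a positive pair by shuffling the frames of the anchor segment can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `frame_shuffle`, the 200-frame by 40-dimension feature shape, and the use of NumPy are all assumptions made for illustration, and the abstract does not specify the features, segment length, or shuffling granularity actually used.

```python
import numpy as np


def frame_shuffle(anchor: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a copy of `anchor` (frames x features) with its frame order permuted.

    Hypothetical helper: each frame is kept intact, only the temporal order
    changes, so the sequence that carries linguistic content is disrupted.
    """
    perm = rng.permutation(anchor.shape[0])  # random ordering of frame indices
    return anchor[perm]


rng = np.random.default_rng(0)

# Assumed toy input: 200 frames of 40-dimensional acoustic features.
anchor = rng.standard_normal((200, 40))

# The shuffled segment serves as the positive sample for the anchor.
positive = frame_shuffle(anchor, rng)
```

The positive sample contains exactly the same frames as the anchor, so frame-level statistics are unchanged, while the temporal order is destroyed. Under a contrastive objective, segments from other utterances would still serve as negatives, as described in the abstract.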