VarASV: Enabling Pitch-variable Automatic Speaker Verification Via Multi-task Learning
Yizhuo Gao, Xinfeng Li, Chaohao Li, Weinong Sun, Xiaoyu Ji, Wenyuan Xu
DOI: https://doi.org/10.1109/ei252483.2021.9713653
2021-01-01
Abstract: With the proliferation of voice assistants, voice commands are becoming common practice in the Energy Internet; for example, inspectors can direct a substation inspection robot to perform automatic detection and information queries by voice. However, existing automatic speaker verification (ASV) systems rarely account for changes in a speaker's pitch and thus perform poorly in pitch-variable scenarios. In this paper, we propose VarASV, a pitch-robust ASV system that verifies the identity of a speaker whose pitch may vary across situations. To overcome the challenge of variable pitch, we design a multi-task learning (MTL) framework comprising speaker, gender, and language verification tasks to train a feature extractor based on a residual network (ResNet). We then use the feature extractor to generate the identity embedding used for verification. Because gender and language are pitch-related labels in the MTL framework, they can make the feature extractor more robust to pitch-variable utterances. Evaluated on an open-source dataset (JukeBox), VarASV achieves an equal error rate (EER) of 21.68%, an 18.37% improvement over the baseline i-vector model.
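The multi-task training objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulation, the task weights, or the scoring method, so everything below is an assumption made for illustration: each task is treated as a classification head trained with softmax cross-entropy, the per-task losses are combined by a weighted sum with hypothetical weights, and two identity embeddings are compared with cosine similarity. The function names (`cross_entropy`, `multitask_loss`, `cosine_score`) are hypothetical and not taken from the paper.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multitask_loss(logits, labels, weights):
    """Weighted sum of per-task cross-entropy losses.

    All three arguments are dicts keyed by task name. The weighted-sum
    form and the weight values are assumptions, not the paper's method.
    """
    return sum(weights[t] * cross_entropy(logits[t], labels[t]) for t in logits)


def cosine_score(a, b):
    """Cosine similarity between two embeddings (assumed scoring rule)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy logits standing in for the outputs of three heads on a shared extractor.
logits = {
    "speaker": [2.0, 0.5, -1.0],
    "gender": [0.0, 0.0],
    "language": [1.0, 1.0, 1.0],
}
labels = {"speaker": 0, "gender": 1, "language": 2}
weights = {"speaker": 1.0, "gender": 0.5, "language": 0.5}  # hypothetical values

loss = multitask_loss(logits, labels, weights)
score = cosine_score([0.2, 0.9, -0.1], [0.25, 0.8, 0.0])
print(f"multi-task loss: {loss:.4f}, verification score: {score:.4f}")
```

In a setup like this, the auxiliary gender and language losses shape the shared feature extractor during training, while only the embedding and a score threshold are used at verification time.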