Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR
Shaojun Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, ZongYao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang
2024-07-02
Abstract: Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, performance can degrade when vocal characteristics differ between training and testing data, particularly when little adaptation data is available for the target speaker. We propose a novel speaker adaptation approach, Speaker-Smoothed kNN, which uses k-Nearest Neighbors (kNN) retrieval to improve model output by retrieving correctly pronounced tokens from a pre-built datastore during decoding. Moreover, we use x-vectors to dynamically adjust the kNN interpolation parameters, which mitigates the data sparsity issue. We validate the approach on the KeSpeech and MagicData corpora under in-domain and all-domain settings. Our method consistently performs comparably to fine-tuning, without the performance degradation that fine-tuning suffers when the speaker changes. Furthermore, in the all-domain setting, our method achieves state-of-the-art results, reducing the CER in both single-speaker and multi-speaker test scenarios.
Sound, Audio and Speech Processing
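
The abstract describes interpolating the model's output with a kNN distribution retrieved from a datastore, with the interpolation weight adjusted from x-vectors. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes a standard kNN interpolation of the form p = (1 - lambda) * p_model + lambda * p_kNN; the rule in `speaker_lambda` (scaling lambda by x-vector cosine similarity) and all function names, parameters, and default values are hypothetical illustrations, since the abstract does not specify them.

```python
import numpy as np


def knn_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over tokens.

    keys:   (N, D) array of stored decoder hidden states (assumed layout)
    values: (N,) integer array of the token emitted at each stored state
    """
    dists = np.sum((keys - query) ** 2, axis=1)  # squared L2 distance to each key
    nearest = np.argsort(dists)[:k]              # indices of the k closest keys
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())      # numerically stable softmax
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)   # sum weights of neighbors sharing a token
    return p_knn


def speaker_lambda(xvec_query, xvec_store, lam_max=0.5):
    """Hypothetical rule: scale lambda by x-vector cosine similarity, floored at 0."""
    cos = float(xvec_query @ xvec_store) / (
        np.linalg.norm(xvec_query) * np.linalg.norm(xvec_store)
    )
    return lam_max * max(0.0, cos)


def interpolate(p_model, p_knn, lam):
    """Mix the model and kNN distributions with weight lam on the kNN side."""
    return (1.0 - lam) * p_model + lam * p_knn


# Toy usage with made-up numbers.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
values = np.array([0, 1, 1, 2])
p_knn = knn_distribution(np.array([0.0, 0.0]), keys, values, vocab_size=3, k=3)
lam = speaker_lambda(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
p_final = interpolate(np.array([0.2, 0.3, 0.5]), p_knn, lam)
```

Under this assumed rule, a test speaker whose x-vector is close to that of the datastore speaker places more weight on the retrieved tokens, while a dissimilar speaker falls back toward the unadapted model output.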