Local Information Modeling with Self-Attention for Speaker Verification

Bing Han, Zhengyang Chen, Yanmin Qian
DOI: https://doi.org/10.1109/icassp43922.2022.9746050
2022-05-23
Abstract: Transformers based on the self-attention mechanism have demonstrated state-of-the-art performance in most natural language processing (NLP) tasks, but have not been very competitive when applied to speaker verification in previous work. Generally, speaker identity is mostly reflected by the relationship between adjacent tokens, whose extraction mainly depends on local modeling ability. However, the self-attention module, the key component of the transformer, helps the model make full use of global information but is insufficient for capturing local information. To address this limitation, we strengthen local information modeling in two ways: restricting the attention context to be local and introducing convolution operations into the transformer. Experiments conducted on VoxCeleb show that the proposed methods notably improve system performance, verifying the significance of local information for the speaker verification task.
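The first of the two ideas, restricting the attention context to be local, can be illustrated with a band mask over the attention scores. The following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the `window` parameter, and the omission of learned query/key/value projections and multiple heads are all illustrative choices that do not come from the paper.

```python
import numpy as np

def local_attention_mask(num_frames: int, window: int) -> np.ndarray:
    """Boolean (T, T) mask: True where frame i may attend to frame j,
    i.e. where |i - j| <= window. `window` is a hypothetical parameter."""
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def local_self_attention(x: np.ndarray, window: int):
    """Single-head self-attention restricted to a local band.

    x has shape (T, d). Learned projections are omitted for brevity
    (an assumption of this sketch), so queries, keys, and values are all x.
    Returns the attended output (T, d) and the attention weights (T, T).
    """
    num_frames, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)                    # scaled dot-product scores
    mask = local_attention_mask(num_frames, window)
    scores = np.where(mask, scores, -np.inf)           # block out-of-window frames
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                           # exp(-inf) == 0 for masked entries
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Usage: 6 frames of 4-dimensional features, each attending to +/- 1 neighbor.
rng = np.random.default_rng(0)
features = rng.standard_normal((6, 4))
output, attn = local_self_attention(features, window=1)
```

With `window=1`, each frame's output is a weighted average of itself and its immediate neighbors only, whereas unrestricted self-attention would mix in every frame of the utterance.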