Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification

Mufan Sang, John H.L. Hansen
2024-03-01
Abstract: With excellent generalization ability, self-supervised speech models have shown impressive performance on various downstream speech tasks in the pre-training and fine-tuning paradigm. However, as pre-trained models grow in size, fine-tuning becomes practically infeasible due to heavy computation and storage overhead, as well as the risk of overfitting. Adapters are lightweight modules inserted into pre-trained models to facilitate parameter-efficient adaptation. In this paper, we propose an effective adapter framework designed for adapting self-supervised speech models to the speaker verification task. With a parallel adapter design, our proposed framework inserts two types of adapters into the pre-trained model, allowing the adaptation of latent features within intermediate Transformer layers and output embeddings from all Transformer layers. We conduct comprehensive experiments to validate the efficiency and effectiveness of the proposed framework. Experimental results on the VoxCeleb1 dataset demonstrate that the proposed adapters surpass fine-tuning and other parameter-efficient transfer learning methods, achieving superior performance while updating only 5% of the parameters.
Audio and Speech Processing, Machine Learning, Sound
What problem does this paper attempt to address?
The main problem this paper addresses is how to adapt large-scale pre-trained speech models to Automatic Speaker Verification (ASV) efficiently, avoiding the high computational and storage costs and the overfitting risk of full fine-tuning. Specifically, the authors propose an adapter framework that achieves parameter-efficient adaptation by inserting lightweight modules (adapters) into the pre-trained model, so that it matches or even exceeds full fine-tuning while updating only a small fraction of the parameters.

### Background and Motivation of the Paper

With the development of Self-Supervised Learning (SSL), large-scale pre-trained speech models perform well on a variety of downstream speech tasks. Because these models are so large, however, full fine-tuning is expensive in both computation and storage and is prone to overfitting, especially when data is limited. Finding more efficient ways to use these pre-trained models has therefore become an important research direction.

### Proposed Method

To address these challenges, the authors propose a framework with two types of adapters:

1. **Inner-layer Adapter**: Inserted into the intermediate Transformer layers to adapt their latent features.
2. **Inter-layer Adapter**: Inserted after the weighted-sum operation to adapt the aggregated hidden representations drawn from all layers.

In addition, the authors introduce a parallel adapter design: adapter branches are inserted in parallel, and a scaling operation controls the adapter output, balancing task-independent and task-related feature learning.
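
The two adapter types and the parallel design described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the bottleneck shape (down-projection, ReLU, up-projection), the zero initialization, the toy sizes, the random stand-ins for frozen sub-layers, and the residual connection around the inter-layer adapter are all assumptions made for illustration, and every name (`BottleneckAdapter`, `parallel_sublayer`, `weighted_layer_sum`) is hypothetical. It shows only the forward pass, with no training.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(x, 0.0)


class BottleneckAdapter:
    """Down-project, ReLU, up-project, then scale (an assumed adapter shape)."""

    def __init__(self, dim, bottleneck, scale):
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Zero-initialised up-projection: the adapter branch starts as a no-op.
        self.w_up = np.zeros((bottleneck, dim))
        self.scale = scale

    def __call__(self, x):
        return self.scale * (relu(x @ self.w_down) @ self.w_up)

    def num_params(self):
        return self.w_down.size + self.w_up.size


def parallel_sublayer(x, frozen_fn, adapter):
    # The frozen branch and the adapter branch read the same input;
    # their outputs are summed together with the residual.
    return x + frozen_fn(x) + adapter(x)


def weighted_layer_sum(hidden_states, layer_logits):
    # Softmax over layers, then a weighted sum of the per-layer outputs.
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()
    return np.tensordot(w, hidden_states, axes=1)


# Hypothetical toy sizes: 3 layers, 5 frames, 16-dim features, bottleneck 4.
num_layers, frames, dim, bottleneck = 3, 5, 16, 4

# Stand-ins for frozen pre-trained sub-layers (random weights, never updated).
frozen_weights = [rng.normal(0.0, 0.1, size=(dim, dim)) for _ in range(num_layers)]
inner_adapters = [BottleneckAdapter(dim, bottleneck, scale=0.5) for _ in range(num_layers)]
inter_adapter = BottleneckAdapter(dim, bottleneck, scale=0.5)
layer_logits = np.zeros(num_layers)  # trainable in practice; uniform here

x = rng.normal(size=(frames, dim))
h = x
per_layer = []
for w_frozen, adapter in zip(frozen_weights, inner_adapters):
    # Inner-layer adapter: runs in parallel with the frozen sub-layer.
    h = parallel_sublayer(h, lambda t, w=w_frozen: relu(t @ w), adapter)
    per_layer.append(h)
hidden_states = np.stack(per_layer)  # (num_layers, frames, dim)

aggregated = weighted_layer_sum(hidden_states, layer_logits)  # (frames, dim)
# Inter-layer adapter: applied after the weighted sum.
# Assumption: it is wrapped in a residual connection.
adapted = aggregated + inter_adapter(aggregated)
```

In this sketch only the adapter matrices and `layer_logits` would be trained, while `frozen_weights` stay fixed; the `scale` argument plays the role of the scaling operation that controls how much the adapter branch contributes.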

### Experimental Results

Experimental results on the VoxCeleb1 dataset show that the proposed adapter framework outperforms both full fine-tuning and other parameter-efficient transfer learning methods while updating only 5% of the parameters. The method also performs well on the 1st48-UTD forensic dataset, indicating that it remains effective and robust in complex scenarios.

### Main Contributions

1. **An effective adapter framework** that makes full use of speaker-related information at different levels of the pre-trained model.
2. **A parallel adapter design** that helps the pre-trained model learn complementary task-specific knowledge.
3. **Comprehensive experiments** that verify the framework's effectiveness and efficiency, showing strong performance under a small parameter budget.

In conclusion, through its adapter design, this paper addresses the problem of efficiently adapting large-scale pre-trained speech models to Automatic Speaker Verification and offers new ideas for future research.