Abstract: With excellent generalization ability, self-supervised speech models have shown impressive performance on various downstream speech tasks in the pre-training and fine-tuning paradigm. However, with the growing size of pre-trained models, fine-tuning becomes practically infeasible due to heavy computation and storage overhead, as well as the risk of overfitting. Adapters are lightweight modules inserted into pre-trained models to facilitate parameter-efficient adaptation. In this paper, we propose an effective adapter framework designed for adapting self-supervised speech models to the speaker verification task. With a parallel adapter design, our proposed framework inserts two types of adapters into the pre-trained model, allowing the adaptation of latent features within intermediate Transformer layers and output embeddings from all Transformer layers. We conduct comprehensive experiments to validate the efficiency and effectiveness of the proposed framework. Experimental results on the VoxCeleb1 dataset demonstrate that the proposed adapters surpass fine-tuning and other parameter-efficient transfer learning methods, achieving superior performance while updating only 5% of the parameters.

What problem does this paper attempt to address?

The main problem this paper addresses is how to adapt large-scale pre-trained speech models to Automatic Speaker Verification (ASV) more efficiently, avoiding the high computational and storage costs and the overfitting risk that come with full fine-tuning. Specifically, the authors propose an adapter framework that achieves parameter-efficient adaptation by inserting lightweight modules (adapters) into the pre-trained model, with the aim of matching or exceeding full fine-tuning while updating only a small number of parameters.

### Background and Motivation of the Paper

With the development of Self-Supervised Learning (SSL), large-scale pre-trained speech models perform well on various downstream speech tasks. However, because these models are so large, full fine-tuning is costly in both computation and storage, and it is prone to overfitting, especially when data is limited. Using these pre-trained models more efficiently has therefore become an important research direction.

### Proposed Method

To address these challenges, the authors propose a framework with two types of adapters:

1. **Inner-layer Adapter**: Inserted into the intermediate Transformer layers to adjust the latent features of those layers.
2. **Inter-layer Adapter**: Inserted after the weighted-sum operation to adjust the aggregated hidden representations extracted from all layers.

In addition, the authors introduce a parallel adapter design. Adapter branches are inserted in parallel, and a scaling operation controls the adapter output, which balances task-independent and task-related feature learning.
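The two adapter placements and the scaled parallel branch can be illustrated with a small NumPy sketch. This is only a rough illustration under stated assumptions, not the authors' implementation: the bottleneck shape (down-projection, ReLU, up-projection), the zero initialization of the up-projection, the softmax layer weights, the use of a parallel branch for the inter-layer adapter, and all sizes and names (`bottleneck_adapter`, `parallel_adapter`, `weighted_layer_sum`, `scale`) are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck_adapter(x, w_down, w_up):
    """Down-project, apply ReLU, up-project (an assumed, common adapter shape)."""
    return np.maximum(x @ w_down, 0.0) @ w_up

def parallel_adapter(x, frozen_fn, w_down, w_up, scale):
    """Frozen branch plus a scaled adapter branch, both fed the same input."""
    return frozen_fn(x) + scale * bottleneck_adapter(x, w_down, w_up)

def weighted_layer_sum(layer_outputs, layer_logits):
    """Softmax-weighted sum over per-layer hidden states (assumed weighting)."""
    weights = np.exp(layer_logits - layer_logits.max())
    weights /= weights.sum()
    return np.tensordot(weights, layer_outputs, axes=1)

# Hypothetical sizes, chosen only to keep the example small.
num_layers, frames, dim, bottleneck = 4, 10, 16, 4

# A fixed random map stands in for a frozen pre-trained sublayer.
w_frozen = rng.normal(size=(dim, dim)) * 0.1
frozen_fn = lambda h: np.tanh(h @ w_frozen)

x = rng.normal(size=(frames, dim))

# Inner-layer adapters: one parallel branch per layer.
hidden_states = []
h = x
for _ in range(num_layers):
    w_down = rng.normal(size=(dim, bottleneck)) * 0.1
    w_up = np.zeros((bottleneck, dim))  # zero init: the branch starts as a no-op
    h = parallel_adapter(h, frozen_fn, w_down, w_up, scale=0.5)
    hidden_states.append(h)

# Aggregate all layers, then apply the inter-layer adapter to the result.
layer_outputs = np.stack(hidden_states)          # (num_layers, frames, dim)
layer_logits = np.zeros(num_layers)              # uniform weights before training
aggregated = weighted_layer_sum(layer_outputs, layer_logits)

w_down_inter = rng.normal(size=(dim, bottleneck)) * 0.1
w_up_inter = rng.normal(size=(bottleneck, dim)) * 0.1
speaker_features = parallel_adapter(
    aggregated, lambda a: a, w_down_inter, w_up_inter, scale=0.5
)
print(speaker_features.shape)
```

In this sketch only the adapter matrices and the layer logits would be trained, while `w_frozen` stays fixed. The `scale` factor sets how strongly the task-specific branch contributes relative to the frozen branch.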

### Experimental Results

The experimental results show that the proposed adapter framework significantly outperforms other transfer learning methods and full fine-tuning on the VoxCeleb1 dataset. It matches or exceeds the performance of full fine-tuning while updating only 5% of the parameters. The method also performs well on the 1st48-UTD forensic dataset, indicating that it remains effective and robust in complex scenarios.

### Main Contributions

1. **Proposed an effective adapter framework** that makes full use of speaker-related information at different levels of the pre-trained model.
2. **Introduced a parallel adapter design** that helps the pre-trained model learn complementary task-specific knowledge.
3. **Verified the effectiveness and efficiency of the framework through comprehensive experiments**, demonstrating strong performance under a parameter-efficient budget.

In conclusion, through its adapter design, this paper addresses the efficient adaptation of large-scale pre-trained speech models to Automatic Speaker Verification and offers new ideas for future research.

Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification

SE/BN Adapter: Parametric Efficient Domain Adaptation for Speaker Recognition

Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition

Exploring efficient-tuning methods in self-supervised speech models

Efficient Adapters for Giant Speech Models

ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks

CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models

Lightweight Adapter Tuning for Multilingual Speech Translation

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation

Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding

An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Exploiting Adapters for Cross-Lingual Low-Resource Speech Recognition

Improving Speaker Verification with Self-Pretrained Transformer Models

Parameter-efficient Dysarthric Speech Recognition Using Adapter Fusion and Householder Transformation

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Leveraging ASR Pretrained Conformers for Speaker Verification through Transfer Learning and Knowledge Distillation

Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

Adapting Pre-Trained Self-Supervised Learning Model for Speech Recognition with Light-Weight Adapters

Adaptable Adapters