Abstract: End-to-end models have shown superior performance for automatic speech recognition (ASR). However, such models are often very large and thus challenging to deploy on resource-constrained edge devices. While quantisation can reduce model sizes, it can lead to increased word error rates (WERs). Although improved quantisation methods have been proposed to address this performance degradation, the fact that quantised models deployed on edge devices often target only a small group of users is under-explored. To this end, we propose personalisation for quantised models (P4Q), a novel strategy that uses speaker adaptation (SA) to improve quantised end-to-end ASR models by fitting them to the characteristics of the target speakers. In this paper, we study the P4Q strategy based on Whisper and Conformer attention-based encoder-decoder (AED) end-to-end ASR models, using a 4-bit block-wise NormalFloat4 (NF4) approach for quantisation and the low-rank adaptation (LoRA) approach for SA. Experimental results on the LibriSpeech and the TED-LIUM 3 corpora show that, with a 7-fold reduction in model size and 1% extra speaker-specific parameters, 15.1% and 23.3% relative WER reductions were achieved on quantised Whisper and Conformer AED models respectively, compared with the full-precision models.

What problem does this paper attempt to address?

The main problem this paper addresses is how to improve a quantized end-to-end automatic speech recognition (ASR) model deployed on resource-constrained edge devices through personalized adaptation (i.e., speaker adaptation), thereby reducing its word error rate (WER). Quantization can significantly reduce model size and make deployment on edge devices easier, but it usually degrades performance, which shows up as a higher WER. To address this, the authors propose personalisation for quantised models (P4Q), a strategy that combines block-wise NormalFloat4 (NF4) quantization with low-rank adaptation (LoRA) for speaker adaptation. In this way, P4Q improves recognition performance for specific target speakers while keeping the model small.

### Main contributions

1. **Propose the P4Q strategy**: Combining quantization with speaker adaptation improves the performance of the quantized model.
2. **Experimental verification**: Experiments on the LibriSpeech and TED-LIUM 3 datasets show that the P4Q strategy reduces the word error rate of the quantized model.

### Specific technical details

- **NF4 quantization**: Model weights are quantized to 4 bits with the block-wise NormalFloat4 method, which is intended to keep quantization error low.
- **LoRA speaker adaptation**: The model is fine-tuned through low-rank adaptation, avoiding the over-fitting that full fine-tuning may cause.

### Experimental results

- On the Whisper and Conformer AED models, P4Q reduces model size by about 7 times and, with 1% extra speaker-specific parameters, achieves relative WER reductions of 15.1% and 23.3% respectively, compared with the full-precision models.
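To make the two reported quantities concrete, here is a small sketch of how a relative WER reduction is computed and of one way a roughly 7-fold size reduction can arise from 4-bit block-wise quantization. The WER values in the usage lines are made up for illustration. The 32-bit baseline, the block size of 64 and the 32-bit per-block scale are assumptions, not settings stated in the text above.

```python
def relative_wer_reduction(wer_baseline: float, wer_adapted: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - adapted) / baseline."""
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (wer_baseline - wer_adapted) / wer_baseline


def blockwise_compression_ratio(
    full_bits: int = 32,    # assumed full-precision width per weight
    code_bits: int = 4,     # 4-bit code per weight
    scale_bits: int = 32,   # assumed width of the scale stored per block
    block_size: int = 64,   # assumed number of weights sharing one scale
) -> float:
    """Full-precision bits per weight divided by quantized bits per weight."""
    bits_per_weight = code_bits + scale_bits / block_size
    return full_bits / bits_per_weight


# Hypothetical WERs, for illustration only (not results from the paper).
print(relative_wer_reduction(10.0, 8.0))          # 20.0
print(round(blockwise_compression_ratio(), 2))    # 7.11
```

Under these assumptions each weight costs 4.5 bits instead of 32, which lands near 7 rather than 8. The actual figure in the paper may also depend on which layers are left unquantized.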
In summary, this paper aims to improve the performance of quantized end-to-end ASR models on edge devices by introducing speaker adaptation, so as to achieve more efficient and accurate speech recognition.
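The combination described above, a frozen block-wise 4-bit base plus a small trainable low-rank update per speaker, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The codebook is built from normal quantiles and only approximates the idea behind NF4 (it is not the exact NF4 table). The block size, rank, scaling factor and layer shapes are hypothetical, and all function names are invented for this example.

```python
import random
from statistics import NormalDist


def make_codebook(levels=16):
    """Levels at evenly spaced normal quantiles, rescaled to [-1, 1] (NF4-like, not exact NF4)."""
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def quantise_blockwise(weights, codebook, block_size=64):
    """Store one absmax scale per block and one 4-bit codebook index per weight."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        scales.append(scale)
        for w in block:
            target = w / scale
            codes.append(min(range(len(codebook)), key=lambda i: abs(codebook[i] - target)))
    return codes, scales


def dequantise_blockwise(codes, scales, codebook, block_size=64):
    """Look up each code and rescale it by the scale of its block."""
    return [codebook[c] * scales[i // block_size] for i, c in enumerate(codes)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w_base, lora_a, lora_b, x, alpha=1.0):
    """y = W_base x + (alpha / r) * B (A x); only A and B would be trained per speaker."""
    rank = len(lora_a)
    base = matvec(w_base, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [y + (alpha / rank) * u for y, u in zip(base, update)]


random.seed(0)
d_out, d_in, rank = 8, 16, 2  # hypothetical layer shape and LoRA rank
flat = [random.gauss(0.0, 0.02) for _ in range(d_out * d_in)]
codebook = make_codebook()
codes, scales = quantise_blockwise(flat, codebook)
deq = dequantise_blockwise(codes, scales, codebook)
w_base = [deq[r * d_in:(r + 1) * d_in] for r in range(d_out)]  # frozen quantized base

lora_a = [[random.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]  # zero init: no change before adaptation
x = [random.gauss(0.0, 1.0) for _ in range(d_in)]
y = lora_forward(w_base, lora_a, lora_b, x)
```

Because `lora_b` starts at zero, the adapted layer initially reproduces the quantized base exactly. Speaker adaptation would then update only `lora_a` and `lora_b`, which is what keeps the per-speaker parameter count small.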

Speaker Adaptation for Quantised End-to-End ASR Models

Enhancing Quantised End-to-End ASR Models via Personalisation

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

2-bit Conformer quantization for automatic speech recognition

SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR

A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization

USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models

Sub-8-bit quantization for on-device speech recognition: a regularization-free approach

Towards Green ASR: Lossless 4-bit Quantization of a Hybrid TDNN System on the 300-hr Switchboard Corpus

Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization

PTQ4ADM: Post-Training Quantization for Efficient Text Conditional Audio Diffusion Models

L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models

Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition

Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR

RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models

On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech

Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

Adapting an ASR Foundation Model for Spoken Language Assessment

Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis