Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization

Bei Liu, Haoyu Wang, Yanmin Qian
2024-07-22
Abstract: Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids customized for each network layer based on k-means clustering. By applying it to pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance the performance of low-bit quantized models, a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy is further introduced. Unlike uniform precision quantization, the mixed precision approach allows for the assignment of varying bit widths to different network layers. Once the bit combination is determined, MSFT is employed to progressively quantize and fine-tune the network in a specific order. Finally, we design two distinct binary quantization schemes to mitigate the performance degradation of 1-bit quantized models: the static and adaptive quantizers. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of around 8. Moreover, compared to the uniform precision approach, mixed precision quantization not only obtains additional performance improvements with a similar model size but also offers the flexibility to generate a bit combination for any desirable model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Finally, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper addresses the high storage and computational demands of modern speaker verification (SV) systems, which hinder their deployment on mobile devices. Although existing deep neural networks (DNNs) perform well on speaker verification tasks, they typically require expensive storage and computing resources, which limits their use on resource-constrained devices. To this end, the paper explores adaptive neural network quantization to build lightweight speaker verification systems. Specifically, it proposes the following methods:

1. **Adaptive Uniform Precision Quantization**: A new adaptive uniform precision quantization method dynamically generates quantization centroids for each network layer based on k-means clustering, thereby reducing quantization error. Applied to pre-trained SV systems, it produces quantized models with different bit-widths, suitable for various application scenarios.
2. **Mixed Precision Quantization**: To further improve the performance of low-bit quantized models, the paper introduces a mixed precision quantization algorithm and a Multi-Stage Fine-Tuning (MSFT) strategy. Unlike uniform precision quantization, mixed precision quantization allows different bit-widths to be assigned to different network layers. Once the bit combination is determined, MSFT progressively quantizes and fine-tunes the network in a specific order.
3. **Binary Quantization Schemes**: To address the performance degradation of 1-bit quantized models, the paper designs two binary quantization schemes: a static quantizer and an adaptive quantizer. These schemes reduce quantization error and improve the performance of binary models through entropy-preserving weight regularization and dynamic generation of binary sets, respectively.

The paper evaluates these methods in experiments on the VoxCeleb dataset.
The results show that 4-bit uniform precision quantization is lossless on both ResNets and DF-ResNets, giving a compression ratio of around 8. Mixed precision quantization yields further performance gains over uniform precision at a similar model size and can generate a bit combination for any desired model size. The proposed 1-bit quantization schemes also substantially improve the performance of binarized models. Finally, compared with existing lightweight speaker verification systems, the proposed models outperform all previous methods by a large margin across various model size ranges.
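To make the idea of per-layer k-means centroids concrete, the following is a minimal sketch and not the authors' implementation. It clusters one layer's weights into 2^b centroids with plain 1-D Lloyd iterations and replaces each weight with its nearest centroid. The function names `kmeans_quantize` and `compression_ratio`, the evenly spaced centroid initialization, and the 32-bit full-precision baseline are all assumptions made for illustration.

```python
import numpy as np


def kmeans_quantize(weights, bits, iters=20):
    """Quantize one layer's weights to 2**bits centroids via 1-D k-means.

    Hypothetical helper: the initialization and iteration count are
    illustrative assumptions, not details taken from the paper.
    """
    flat = weights.ravel().astype(np.float64)
    k = 2 ** bits
    # Assumption: start with centroids spread evenly over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each non-empty centroid to the mean of its assigned weights.
        for c in range(k):
            members = flat[idx == c]
            if members.size:
                centroids[c] = members.mean()
    # Final assignment against the updated centroids.
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    quantized = centroids[idx].reshape(weights.shape)
    return quantized, centroids, idx.reshape(weights.shape)


def compression_ratio(bits, full_precision_bits=32):
    """Idealized ratio that ignores the small per-layer centroid table."""
    return full_precision_bits / bits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer = rng.normal(size=(64, 32))  # stand-in for one layer's weights
    q4, centroids, _ = kmeans_quantize(layer, bits=4)
    print("distinct values:", len(np.unique(q4)))
    print("mean squared error:", np.mean((layer - q4) ** 2))
    print("compression ratio:", compression_ratio(4))
```

Under the assumed 32-bit baseline, storing a 4-bit index per weight gives an idealized ratio of 32 / 4 = 8, which is consistent with the "around 8" reported above; the centroid table adds a small overhead that this sketch ignores.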