Abstract: Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. First, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids customized for each network layer based on k-means clustering. By applying it to pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance the performance of low-bit quantized models, a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy is further introduced. Unlike uniform precision quantization, the mixed precision approach allows for the assignment of varying bit widths to different network layers. Once the bit combination is determined, MSFT is employed to progressively quantize and fine-tune the network in a specific order. Finally, we design two distinct binary quantization schemes to mitigate the performance degradation of 1-bit quantized models: the static and adaptive quantizers. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of around 8. Moreover, compared to the uniform precision approach, mixed precision quantization not only obtains additional performance improvements with a similar model size but also offers the flexibility to generate a bit combination for any desired model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Finally, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges.
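
The per-layer centroid generation described in the abstract can be illustrated with a minimal sketch. It assumes plain 1-D k-means (Lloyd's algorithm) run on the flattened weights of one layer, with centroids initialized evenly over the weight range. The function names, the initialization, the iteration count, and the synthetic Gaussian weights are all assumptions made for illustration, not the authors' implementation.

```python
import random


def nearest(centroids, w):
    """Index of the centroid closest to weight w."""
    return min(range(len(centroids)), key=lambda j: abs(w - centroids[j]))


def kmeans_centroids(weights, bits, iters=20):
    """Fit 2**bits centroids to one layer's weights with 1-D k-means."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    # Assumed initialization: centroids spread evenly over the weight range.
    centroids = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        # Assignment step: each weight joins its nearest centroid.
        for w in weights:
            j = nearest(centroids, w)
            sums[j] += w
            counts[j] += 1
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [sums[j] / counts[j] if counts[j] else centroids[j]
                     for j in range(k)]
    return centroids


def quantize(weights, centroids):
    """Replace every weight by its nearest centroid."""
    return [centroids[nearest(centroids, w)] for w in weights]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for one layer's weights (hypothetical data).
rng = random.Random(0)
layer = [rng.gauss(0.0, 0.05) for _ in range(2000)]

c4 = kmeans_centroids(layer, bits=4)   # 16 centroids
c2 = kmeans_centroids(layer, bits=2)   # 4 centroids
q4 = quantize(layer, c4)
q2 = quantize(layer, c2)
print("4-bit MSE:", mse(layer, q4))
print("2-bit MSE:", mse(layer, q2))
```

Because the centroids are fitted separately for each layer, a layer with a narrow weight distribution gets a finer grid than one with a wide distribution, which is the motivation for making the quantizer adaptive.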

What problem does this paper attempt to address?

The paper addresses the high storage and computational demands of modern speaker verification (SV) systems, which limit their deployment on resource-constrained mobile devices even though the underlying deep neural networks (DNNs) perform well on the task. To this end, it explores adaptive neural network quantization as a route to lightweight speaker verification. Specifically, the paper proposes the following methods:

1. **Adaptive Uniform Precision Quantization**: A new adaptive uniform precision quantization method dynamically generates quantization centroids for each network layer using k-means clustering, thereby reducing quantization error. Applied to pre-trained SV systems, it yields quantized models at different bit widths, suitable for various application scenarios.
2. **Mixed Precision Quantization**: To further improve the performance of low-bit quantized models, the paper introduces a mixed precision quantization algorithm and a Multi-Stage Fine-Tuning (MSFT) strategy. Unlike uniform precision quantization, mixed precision quantization allows different bit widths to be assigned to different network layers. Once the bit combination is determined, MSFT progressively quantizes and fine-tunes the network in a specific order.
3. **Binary Quantization Schemes**: To address the performance degradation of 1-bit quantized models, the paper designs two binary quantization schemes: a static quantizer and an adaptive quantizer. These reduce quantization error and improve the performance of binary models through entropy-preserving weight regularization and dynamic generation of binary sets, respectively.

The methods are evaluated on the VoxCeleb dataset. The results show that 4-bit uniform precision quantization is lossless on both ResNets and DF-ResNets, with a compression ratio of around 8. Mixed precision quantization achieves better performance at a similar model size and also provides the flexibility to generate a bit combination for any desired model size. The proposed 1-bit quantization schemes significantly improve the performance of binarized models. Finally, compared with existing lightweight speaker verification systems, the proposed models outperform previous methods by a large margin across various model size ranges.
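
The relationship between bit widths and model size can be made concrete with a small sketch. It assumes a 32-bit floating-point baseline and counts only the weight bits, ignoring any overhead such as stored centroids. The layer sizes and the mixed bit assignment are hypothetical values chosen for illustration and are not taken from the paper.

```python
def model_size_bytes(param_counts, bit_widths):
    """Storage for the weights only: sum of (parameters x bits) per layer."""
    total_bits = sum(n * b for n, b in zip(param_counts, bit_widths))
    return total_bits / 8


def compression_ratio(param_counts, bit_widths, full_bits=32):
    """Size of the full-precision model divided by the quantized size."""
    full = model_size_bytes(param_counts, [full_bits] * len(param_counts))
    return full / model_size_bytes(param_counts, bit_widths)


# Hypothetical per-layer parameter counts (not from the paper).
layers = [1_000, 50_000, 200_000, 20_000]

# Uniform precision: every layer uses the same bit width.
uniform4 = [4] * len(layers)

# Mixed precision: a hypothetical assignment that keeps the small first and
# last layers at 8 bits and pushes the largest layer down to 2 bits.
mixed = [8, 4, 2, 8]

print("uniform 4-bit ratio:", compression_ratio(layers, uniform4))
print("mixed precision ratio:", compression_ratio(layers, mixed))
print("mixed precision size (bytes):", model_size_bytes(layers, mixed))
```

Under these assumptions, uniform 4-bit quantization of a 32-bit model gives a ratio of exactly 32 / 4 = 8, in line with the "around 8" reported in the abstract. A mixed assignment can reach other sizes by trading bits between layers, which is the flexibility the summary refers to.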

Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization
