Abstract: Modern speaker verification (SV) systems typically demand expensive storage and computing resources, which hinders their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. First, we propose a novel adaptive uniform precision quantization method that dynamically generates quantization centroids customized for each network layer based on k-means clustering. By applying it to pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance the performance of low-bit quantized models, we further introduce a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy. Unlike uniform precision quantization, the mixed precision approach allows different bit widths to be assigned to different network layers. Once the bit combination is determined, MSFT is employed to progressively quantize and fine-tune the network in a specific order. Finally, we design two distinct binary quantization schemes, the static and adaptive quantizers, to mitigate the performance degradation of 1-bit quantized models. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of around 8. Moreover, compared to the uniform precision approach, mixed precision quantization not only obtains additional performance improvements at a similar model size but also offers the flexibility to generate a bit combination for any desired model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Lastly, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges.
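
The idea of generating per-layer quantization centroids with k-means can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `kmeans_quantize`, the evenly spaced centroid initialization, the iteration count, and the random stand-in layer are all assumptions made for illustration. The sketch clusters one layer's weights into 2^b centroids and replaces each weight with its nearest centroid. The final line computes the nominal compression ratio under the assumption of a 32-bit floating-point baseline and ignoring codebook overhead, which agrees with the ratio of around 8 reported for 4-bit quantization.

```python
import numpy as np

def kmeans_quantize(weights, bits, iters=20):
    """Quantize one layer's weights to 2**bits centroids with 1-D k-means.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    flat = weights.ravel()
    k = 2 ** bits
    # Assumed initialization: centroids evenly spaced over the weight range.
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each non-empty centroid to the mean of its assigned weights.
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    # Final assignment with the converged centroids.
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids[idx].reshape(weights.shape), centroids

# Stand-in for a pre-trained layer (random values, for demonstration only).
rng = np.random.default_rng(0)
layer = rng.normal(size=(64, 32)).astype(np.float32)

quantized, centroids = kmeans_quantize(layer, bits=4)

# Nominal ratio, assuming a 32-bit float baseline and no codebook overhead.
compression_ratio = 32 / 4
```

Because the centroids are fitted separately for each layer, every layer gets a codebook matched to its own weight distribution; a mixed precision variant would simply call the same routine with a different `bits` value per layer.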

Effective SVD-Based Deep Network Compression for Automatic Speech Recognition

Speech Recognition Model Compression

Model Compression for DNN-based Speaker Verification Using Weight Quantization

EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

On the Application and Compression of Deep Time Delay Neural Network for Embedded Statistical Parametric Speech Synthesis

A Model Compression Method with Matrix Product Operators for Speech Enhancement

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization

Accurate and Structured Pruning for Efficient Automatic Speech Recognition

SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression

Comprehensive SNN Compression Using ADMM Optimization and Activity Regularization

Improving deep neural networks for LVCSR using dropout and shrinking structure

Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition

SpeechNAS: Towards Better Trade-off Between Latency and Accuracy for Large-Scale Speaker Verification

Deep learning model compression using network sensitivity and gradients

An Effective Deep Embedding Learning Method Based on Dense-Residual Networks for Speaker Verification

Efficient Binary Weight Convolutional Network Accelerator for Speech Recognition

USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models

An Effective Deep Embedding Learning Architecture for Speaker Verification

Highly Efficient Neural Network Language Model Compression Using Soft Binarization Training

SVD-Based Channel Pruning for Convolutional Neural Network in Acoustic Scene Classification Model