LIGHT-WEIGHT VISUALVOICE: NEURAL NETWORK QUANTIZATION ON AUDIO VISUAL SPEECH SEPARATION

Yifei Wu, Chenda Li, Yanmin Qian
DOI: https://doi.org/10.1109/icasspw59220.2023.10193263
2023-01-01
Abstract: As multi-modal systems show superior performance on more tasks, the large amount of computational resources they require has become a critical problem. In this work, we explore neural network quantization methods to reduce the resource requirements of VisualVoice, a state-of-the-art audio-visual speech separation system. The model is first fine-tuned with an ADMM-based quantization-aware training approach to produce a fixed-precision quantized version. Then three strategies, namely manual selection, Hessian trace-based selection, and KL divergence-based greedy search, are explored to find the optimal mixed-precision setting of the model. The results show that the best strategy yields a satisfactory trade-off between space, speed, and performance for the final system. The KL divergence-based strategy reaches 7.2 dB SDR at a 3-bit equivalent setup, outperforming the fixed-precision setup and the other two mixed-precision strategies. We also discuss the effect of quantizing different parts of the multi-modal system.
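The two stages named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's quantizer, its ADMM details, or how the KL divergence is measured, so everything below is an assumption: `quantize_uniform` is a generic min-max uniform quantizer, `admm_update` is the textbook ADMM projection and dual update, and `greedy_bit_allocation` is one plausible greedy rule that reads "3-bit equivalent" as a parameter-weighted average bit-width. All function names, layer names, and numbers are hypothetical.

```python
def quantize_uniform(weights, bits):
    """Project each weight onto a uniform grid of 2**bits levels (min-max range)."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def admm_update(weights, duals, bits):
    """One generic ADMM projection and dual update: Z = proj(W + U), U = U + W - Z.

    The gradient step on W (with the proximal term) happens in the training
    loop and is omitted here.
    """
    shifted = [w + u for w, u in zip(weights, duals)]
    z = quantize_uniform(shifted, bits)
    new_duals = [u + w - zi for u, w, zi in zip(duals, weights, z)]
    return z, new_duals


def greedy_bit_allocation(kl_cost, sizes, candidate_bits, target_avg_bits):
    """Lower bit-widths greedily until the size-weighted average meets the target.

    kl_cost[layer][bits] is assumed to be a precomputed KL divergence between
    full-precision and quantized outputs when only that layer is quantized.
    """
    bits_desc = sorted(candidate_bits, reverse=True)
    choice = {layer: 0 for layer in kl_cost}  # index into bits_desc
    total = sum(sizes.values())

    def avg_bits():
        return sum(sizes[l] * bits_desc[i] for l, i in choice.items()) / total

    while avg_bits() > target_avg_bits:
        best = None
        for layer, i in choice.items():
            if i + 1 >= len(bits_desc):
                continue  # already at the lowest candidate bit-width
            cur, nxt = bits_desc[i], bits_desc[i + 1]
            extra_kl = kl_cost[layer][nxt] - kl_cost[layer][cur]
            saved = sizes[layer] * (cur - nxt)
            ratio = extra_kl / saved  # KL increase per bit of storage saved
            if best is None or ratio < best[0]:
                best = (ratio, layer)
        if best is None:
            break  # no layer can be lowered further
        choice[best[1]] += 1
    return {layer: bits_desc[i] for layer, i in choice.items()}


# Made-up sensitivities and sizes, for illustration only.
kl_cost = {
    "layer_a": {4: 0.01, 3: 0.05, 2: 0.40},
    "layer_b": {4: 0.01, 3: 0.02, 2: 0.04},
}
sizes = {"layer_a": 100, "layer_b": 100}
print(greedy_bit_allocation(kl_cost, sizes, [2, 3, 4], target_avg_bits=3.0))
# → {'layer_a': 4, 'layer_b': 2}
```

With these made-up numbers the less sensitive layer drops to 2 bits while the more sensitive one stays at 4, giving a 3-bit average. This is the kind of uneven allocation a mixed-precision search is meant to find.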