Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping

Lun Wang, Om Thakkar, Zhong Meng, Nicole Rafidi, Rohit Prabhavalkar, Arun Narayanan
2024-06-06
Abstract: Gradient clipping plays a vital role in training large-scale automatic speech recognition (ASR) models. It is typically applied to minibatch gradients to prevent gradient explosion, and to the individual sample gradients to mitigate unintended memorization. This work systematically investigates the impact of a specific granularity of gradient clipping, namely per-core clipping (PCC), across training a wide range of ASR models. We empirically demonstrate that PCC can effectively mitigate unintended memorization in ASR models. Surprisingly, we find that PCC positively influences ASR performance metrics, leading to improved convergence rates and reduced word error rates. To avoid tuning the additional hyperparameter introduced by PCC, we further propose a novel variant, adaptive per-core clipping (APCC), for streamlined optimization. Our findings highlight the multifaceted benefits of PCC as a strategy for robust, privacy-forward ASR model training.
Cryptography and Security, Computation and Language, Sound, Audio and Speech Processing
### What problems does this paper attempt to solve?

This paper addresses two key problems that arise when training large-scale automatic speech recognition (ASR) models:

1. **Unintended memorization**
   - Many large neural networks, including generative vision and language models, reproduce their training data at inference time, which creates privacy risks. Research has shown that non-generative ASR models are also vulnerable to privacy attacks.
   - Protecting user privacy therefore requires effective techniques that keep the model from memorizing sensitive information in its training data.
2. **Computational efficiency**
   - Differentially Private Stochastic Gradient Descent (DP-SGD) provides strong privacy protection, but its computational and memory overheads make it impractical for training large-scale ASR models.
   - What is needed is a method that suppresses memorization without significantly increasing the computational burden.

To address these problems, the authors propose a gradient clipping method, **per-core clipping (PCC)**, and validate it on multiple ASR models:

- **PCC**: Clips the gradient on each compute core instead of on each sample, which greatly reduces the computational and memory overheads.
- **Adaptive PCC (APCC)**: Adds an adaptive mechanism that removes the need to tune the clipping bound manually, simplifying optimization.

The experimental results show that PCC not only suppresses unintended memorization effectively but also, unexpectedly, improves ASR performance: it reduces the word error rate (WER) and speeds up convergence. These findings highlight the multiple advantages of PCC as a robust, privacy-conscious training strategy for ASR models.
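To make the difference between per-sample and per-core clipping concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: each "core" is simulated as a list of per-sample gradient vectors, the cross-core all-reduce is replaced by a simple mean, and the helper names (`clip_to_norm`, `per_core_clipped_gradient`) and the `bound` parameter are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def clip_to_norm(vec, bound):
    """Rescale vec so that its L2 norm is at most bound."""
    norm = l2_norm(vec)
    scale = min(1.0, bound / norm) if norm > 0.0 else 1.0
    return [x * scale for x in vec]


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def per_core_clipped_gradient(per_core_sample_grads, bound):
    """per_core_sample_grads[c][i] is the gradient of sample i on core c."""
    clipped = []
    for sample_grads in per_core_sample_grads:
        # Average locally first, as ordinary data-parallel training does.
        core_grad = mean_vectors(sample_grads)
        # One clip per core instead of one clip per sample.
        clipped.append(clip_to_norm(core_grad, bound))
    # Stand-in for the cross-core all-reduce (assumed to be a mean).
    return mean_vectors(clipped)


# Toy data: 2 cores, 2 samples per core, 2 parameters.
grads = [
    [[3.0, 4.0], [3.0, 4.0]],  # core 0: averaged gradient has norm 5
    [[0.3, 0.4], [0.3, 0.4]],  # core 1: averaged gradient has norm 0.5
]
update = per_core_clipped_gradient(grads, bound=1.0)
print(update)  # approximately [0.45, 0.6]
```

In this toy run only core 0 exceeds the bound, so only its averaged gradient is rescaled. The number of clipping operations grows with the number of cores, not with the batch size, which is the source of the savings over per-sample clipping described above.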
### Summary

The main contributions of this paper are:

1. Proposing a new gradient clipping method, per-core clipping (PCC), which effectively suppresses unintended memorization with almost no increase in computational overhead.
2. Conducting extensive empirical evaluations of PCC, demonstrating improvements in performance and convergence rate on multiple ASR models.
3. Introducing adaptive PCC (APCC), which eliminates the need to tune an additional hyperparameter and further simplifies optimization.

Through these contributions, the paper provides new solutions for privacy protection and computational efficiency in large-scale ASR models.
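The adaptive variant in contribution 3 can be sketched in the same style. The summary above does not say how the clipping bound is adapted, so the rule used here is an assumption made purely for illustration: at each step the bound is set to the smallest L2 norm among the per-core gradients. The function name `adaptive_per_core_clip` is hypothetical, and the final mean again stands in for a cross-core all-reduce.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def adaptive_per_core_clip(per_core_grads):
    """per_core_grads[c] is the already-averaged gradient on core c.

    Returns the bound chosen for this step and the aggregated gradient.
    """
    norms = [l2_norm(g) for g in per_core_grads]
    # ASSUMED rule (not stated in the text above): use the smallest
    # per-core norm as the bound, so no bound has to be tuned by hand.
    bound = min(norms)
    clipped = []
    for grad, norm in zip(per_core_grads, norms):
        scale = bound / norm if norm > 0.0 else 1.0
        clipped.append([x * scale for x in grad])
    n = len(clipped)
    aggregated = [sum(column) / n for column in zip(*clipped)]
    return bound, aggregated


# Toy data: averaged gradients from 2 cores, 2 parameters each.
bound, update = adaptive_per_core_clip([[3.0, 4.0], [0.3, 0.4]])
print(bound, update)  # bound is about 0.5, update about [0.3, 0.4]
```

Under this assumed rule every per-core gradient is rescaled to the same norm before aggregation, so the only inputs are the gradients themselves and there is no clipping hyperparameter left to tune.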