Abstract: Gradient clipping plays a vital role in training large-scale automatic speech recognition (ASR) models. It is typically applied to minibatch gradients to prevent gradient explosion, and to the individual sample gradients to mitigate unintended memorization. This work systematically investigates the impact of a specific granularity of gradient clipping, namely per-core clipping (PCC), across training a wide range of ASR models. We empirically demonstrate that PCC can effectively mitigate unintended memorization in ASR models. Surprisingly, we find that PCC positively influences ASR performance metrics, leading to improved convergence rates and reduced word error rates. To avoid tuning the additional hyperparameter introduced by PCC, we further propose a novel variant, adaptive per-core clipping (APCC), for streamlined optimization. Our findings highlight the multifaceted benefits of PCC as a strategy for robust, privacy-forward ASR model training.


### What problems does this paper attempt to solve?

This paper aims to solve two key problems encountered when training large-scale automatic speech recognition (ASR) models:

1. **Unintended Memorization**:
   - Many large neural networks, including generative vision and language models, reproduce their training data at inference time, which creates privacy risks. Research has shown that non-generative ASR models are also vulnerable to privacy attacks.
   - To protect user privacy and keep the model from memorizing sensitive information in the training data, effective techniques are needed to suppress this unintended memorization.
2. **Computational Efficiency**:
   - When training large-scale ASR models, the traditional Differentially Private Stochastic Gradient Descent (DP-SGD) method can provide strong privacy protection, but its computational and memory overheads are large enough to make it impractical in real applications.
   - A method is needed that can effectively suppress memorization without significantly increasing the computational burden.

To solve these problems, the authors propose a gradient clipping method, **Per-core Clipping (PCC)**, and validate it on multiple ASR models. Specifically:

- **PCC**: Clips the gradient on each compute core instead of on each sample, which greatly reduces the computational and memory overheads.
- **Adaptive PCC (APCC)**: Adds an adaptive mechanism that removes the need to tune the clipping bound manually, simplifying the optimization process.

The experimental results show that PCC not only suppresses unintended memorization effectively, but also unexpectedly improves the performance of ASR models: it reduces the word error rate (WER) and speeds up convergence. These findings highlight the multiple advantages of PCC as a robust, privacy-conscious training strategy for ASR models.
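The per-core idea described above can be illustrated with a minimal pure-Python sketch. It simulates data-parallel training by grouping per-sample gradients by core: each core averages its own gradients, the averaged gradient is clipped to an L2 bound, and the clipped results are averaged across cores. All function names here are hypothetical and not taken from the paper. The adaptive branch (using the smallest per-core gradient norm in the step as the bound) is an assumption made for illustration only; the summary above says only that APCC removes manual tuning of the bound.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def clip_to_norm(vec, bound):
    """Rescale vec so its L2 norm is at most bound."""
    norm = l2_norm(vec)
    if norm <= bound or norm == 0.0:
        return list(vec)
    scale = bound / norm
    return [x * scale for x in vec]


def per_core_clipped_gradient(per_core_sample_grads, clip_bound=None):
    """Average gradients on each core, clip each core's average, then combine.

    per_core_sample_grads: one list of per-sample gradient vectors per core.
    clip_bound: fixed L2 bound; if None, an ASSUMED adaptive rule is used
    (the minimum per-core gradient norm in this step).
    """
    # Each core only ever needs its own averaged gradient, so no
    # per-sample gradients have to be clipped individually.
    core_means = [mean_vector(grads) for grads in per_core_sample_grads]
    if clip_bound is None:
        # Assumption for illustration, not confirmed by the text above.
        clip_bound = min(l2_norm(g) for g in core_means)
    clipped = [clip_to_norm(g, clip_bound) for g in core_means]
    return mean_vector(clipped)


# Two simulated cores with two samples each.
grads = [
    [[3.0, 4.0], [3.0, 4.0]],   # core 0: averaged gradient has norm 5
    [[0.6, 0.0], [0.0, 0.8]],   # core 1: averaged gradient has norm 0.5
]
fixed = per_core_clipped_gradient(grads, clip_bound=1.0)
adaptive = per_core_clipped_gradient(grads)
print(fixed, adaptive)
```

With a fixed bound of 1.0, only core 0's averaged gradient is rescaled, and the combined gradient is approximately `[0.45, 0.6]`. Under the assumed adaptive rule, both cores end up contributing gradients of equal norm. The clipping cost scales with the number of cores rather than the number of samples, which is the efficiency argument the summary makes against per-sample clipping.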
### Summary

The main contributions of this paper include:

1. Proposing a new gradient clipping method, Per-core Clipping (PCC), which can effectively suppress unintended memorization with almost no increase in computational overhead.
2. Conducting extensive empirical evaluations of PCC, demonstrating its performance and convergence rate improvements on multiple ASR models.
3. Introducing Adaptive PCC (APCC), eliminating the need for additional hyperparameter tuning and further simplifying the optimization process.

Through these contributions, the paper provides new solutions for privacy protection and computational efficiency in large-scale ASR models.

Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping

Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation

Unintended Memorization in Large ASR Models, and How to Mitigate It

Training Large ASR Encoders with Differential Privacy

Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models

Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systems

Sequential Editing for Lifelong Training of Speech Recognition Models

Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

Rehearsal-Free Online Continual Learning for Automatic Speech Recognition

Puncturing the Memory Wall

Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement

Retrieve and Copy: Scaling ASR Personalization to Large Catalogs

Differentially Private Parameter-Efficient Fine-tuning for Large ASR Models

Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-based Speech Recognition

Accurate and Structured Pruning for Efficient Automatic Speech Recognition

Optimizing Byte-level Representation for End-to-end ASR

EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

Complexity boosted adaptive training for better low resource ASR performance

Extremely Low Footprint End-to-End ASR System for Smart Device

Learning to Recognize Code-switched Speech Without Forgetting Monolingual Speech Recognition