Abstract: While federated learning (FL) has recently emerged as a promising approach to train machine learning models, it has seen only preliminary exploration in the domain of automatic speech recognition (ASR). Moreover, FL does not inherently guarantee user privacy and requires the use of differential privacy (DP) for robust privacy guarantees. However, we are not aware of prior work on applying DP to FL for ASR. In this paper, we aim to bridge this research gap by formulating an ASR benchmark for FL with DP and establishing the first baselines. First, we extend the existing research on FL for ASR by exploring different aspects of recent $\textit{large end-to-end transformer models}$: architecture design, seed models, data heterogeneity, domain shift, and the impact of cohort size. With a $\textit{practical}$ number of central aggregations, we are able to train $\textbf{FL models}$ that are $\textbf{nearly optimal}$ even with heterogeneous data, a seed model from another domain, or no pre-trained seed model. Second, we apply DP to FL for ASR, which is non-trivial since DP noise severely affects model training, especially for large transformer models, due to highly imbalanced gradients in the attention block. We counteract the adverse effect of DP noise by reviving per-layer clipping and explaining why its effect is more apparent in our case than in prior work. Remarkably, we achieve user-level ($7.2$, $10^{-9}$)-$\textbf{DP}$ (resp. ($4.5$, $10^{-9}$)-$\textbf{DP}$) with a 1.3% (resp. 4.6%) absolute drop in the word error rate for extrapolation to high (resp. low) population scale for $\textbf{FL with DP in ASR}$.
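
The abstract names per-layer clipping as the remedy for DP noise under imbalanced gradients but does not spell out the procedure. The following is a minimal sketch of one common formulation, not the authors' implementation. It assumes that the total clipping budget `clip_total` is split uniformly across layers as `clip_total / sqrt(num_layers)`, and that Gaussian noise with standard deviation `noise_multiplier * clip_total` is added to the summed updates before averaging. All function names, layer names, and numeric values are hypothetical and chosen only for illustration.

```python
import math
import random


def clip_per_layer(update, clip_total):
    """Clip each layer of one user's update to clip_total / sqrt(num_layers).

    `update` maps a layer name to a flat list of floats. The uniform split of
    the budget is an assumption; it keeps the overall L2 norm <= clip_total.
    """
    per_layer_bound = clip_total / math.sqrt(len(update))
    clipped = {}
    for name, values in update.items():
        norm = math.sqrt(sum(v * v for v in values))
        scale = min(1.0, per_layer_bound / norm) if norm > 0 else 1.0
        clipped[name] = [v * scale for v in values]
    return clipped


def clip_global(update, clip_total):
    """Baseline for comparison: clip the whole update with a single norm."""
    norm = global_norm(update)
    scale = min(1.0, clip_total / norm) if norm > 0 else 1.0
    return {name: [v * scale for v in vals] for name, vals in update.items()}


def global_norm(update):
    """L2 norm over all layers of an update."""
    return math.sqrt(sum(v * v for vals in update.values() for v in vals))


def aggregate_with_dp(user_updates, clip_total, noise_multiplier, rng):
    """Sum per-layer-clipped user updates, add Gaussian noise, then average."""
    clipped = [clip_per_layer(u, clip_total) for u in user_updates]
    cohort_size = len(clipped)
    sigma = noise_multiplier * clip_total  # assumed noise calibration
    result = {}
    for name in clipped[0]:
        dim = len(clipped[0][name])
        summed = [sum(c[name][i] for c in clipped) for i in range(dim)]
        result[name] = [(s + rng.gauss(0.0, sigma)) / cohort_size for s in summed]
    return result


# Toy updates with a deliberately imbalanced "attention" layer (hypothetical).
updates = [
    {"attention": [30.0, 40.0], "ffn": [0.3, 0.4]},
    {"attention": [-6.0, 8.0], "ffn": [0.03, 0.04]},
]
per_layer = clip_per_layer(updates[0], clip_total=1.0)
single_norm = clip_global(updates[0], clip_total=1.0)
noisy_mean = aggregate_with_dp(updates, 1.0, 0.5, random.Random(0))
```

In this toy case the large attention update dominates the single global norm, so global clipping shrinks the small `ffn` layer along with it, whereas per-layer clipping leaves `ffn` untouched because its norm is already below the per-layer bound. That is one plausible reading of why per-layer clipping helps when gradients are highly imbalanced across layers.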

Parameter-Efficient Transfer Learning under Federated Learning for Automatic Speech Recognition

Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks

Resource-Efficient Transfer Learning from Speech Foundation Model Using Hierarchical Feature Fusion

A GDPR-compliant Ecosystem for Speech Recognition with Transfer, Federated, and Evolutionary Learning

Differentially Private Adapters for Parameter Efficient Acoustic Modeling

Federated Self-Learning with Weak Supervision for Speech Recognition

Domain Adaptation Using Factorized Hidden Layer for Robust Automatic Speech Recognition

Integrated Adaptation with Multi-Factor Joint-Learning for Far-Field Speech Recognition

Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition

Parameter-efficient Dysarthric Speech Recognition Using Adapter Fusion and Householder Transformation

Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding

Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

FedSP - Federated Speaker Verification with Personal Privacy Preservation

Efficient Federated Learning with Pre-Trained Large Language Model Using Several Adapter Mechanisms

Efficient Transfer Learning Methods Using Parameterization in Few-Shot Speech Recognition

Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

Federated Marginal Personalization for ASR Rescoring

Federated Learning with Differential Privacy for End-to-End Speech Recognition

Differentially Private Parameter-Efficient Fine-tuning for Large ASR Models

Efficient Domain Adaptation for Speech Foundation Models

Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition