End-to-End Speech Recognition from Federated Acoustic Models

Yan Gao, Titouan Parcollet, Salah Zaiem, Javier Fernandez-Marques, Pedro P. B. de Gusmao, Daniel J. Beutel, Nicholas D. Lane
DOI: https://doi.org/10.1109/icassp43922.2022.9747161
2022-05-23
Abstract: Training Automatic Speech Recognition (ASR) models under federated learning (FL) settings has attracted a lot of attention recently. However, the FL scenarios often presented in the literature are artificial and fail to capture the complexity of real FL systems. In this paper, we construct a challenging and realistic ASR federated experimental setup consisting of clients with heterogeneous data distributions using the French and Italian sets of the CommonVoice dataset, a large heterogeneous dataset containing thousands of different speakers, acoustic environments and noises. We present the first empirical study on an attention-based sequence-to-sequence End-to-End (E2E) ASR model with three aggregation weighting strategies – standard FedAvg, loss-based aggregation and a novel word error rate (WER)-based aggregation – compared in two realistic FL scenarios: cross-silo with 10 clients and cross-device with 2K and 4K clients. This 4K cross-device ASR experiment is the largest ever performed. Our first-of-its-kind analysis on E2E ASR from heterogeneous and realistic federated acoustic models provides the foundations for future research and development of realistic FL ASR applications.
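The three aggregation weighting strategies named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the following are assumptions made for illustration only: FedAvg weights are taken proportional to each client's number of training samples, loss-based weights are a softmax over the negative client losses, and WER-based weights are a softmax over (1 - WER). All function names are hypothetical, and model parameters are represented as flat lists of floats.

```python
import math


def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    return normalize([math.exp(s - peak) for s in scores])


def fedavg_weights(num_samples):
    # Standard FedAvg: weight each client by its share of the training samples.
    return normalize(num_samples)


def loss_weights(losses):
    # Assumption: a lower loss should earn a larger weight,
    # so the softmax is taken over the negated losses.
    return softmax([-loss for loss in losses])


def wer_weights(wers):
    # Assumption: a lower WER should earn a larger weight,
    # so the softmax is taken over (1 - WER).
    return softmax([1.0 - wer for wer in wers])


def aggregate(client_params, weights):
    """Weighted average of client parameter vectors (flat lists of floats)."""
    size = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(size)
    ]


# Two hypothetical clients, each holding a two-parameter model.
clients = [[0.0, 2.0], [4.0, 6.0]]
global_fedavg = aggregate(clients, fedavg_weights([10, 30]))
global_wer = aggregate(clients, wer_weights([0.2, 0.5]))
```

Only the weight computation differs between the three strategies; the weighted averaging step in `aggregate` is shared. Under these assumptions, the loss-based and WER-based variants favor clients whose local models perform better rather than clients that simply hold more data.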