Abstract: Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. Such embeddings can form the basis for speech search, indexing and discovery systems when conventional speech recognition is not possible. In zero-resource settings where unlabelled speech is the only available resource, we need a method that gives robust embeddings on an arbitrary language. Here we explore multilingual transfer: we train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. We consider three multilingual recurrent neural network (RNN) models: a classifier trained on the joint vocabularies of all training languages; a Siamese RNN trained to discriminate between same and different words from multiple languages; and a correspondence autoencoder (CAE) RNN trained to reconstruct word pairs. In a word discrimination task on six target languages, all of these models outperform state-of-the-art unsupervised models trained on the zero-resource languages themselves, giving relative improvements of more than 30% in average precision. When using only a few training languages, the multilingual CAE performs better, but with more training languages the other multilingual models perform similarly. Using more training languages is generally beneficial, but improvements are marginal on some languages. We present probing experiments which show that the CAE encodes more phonetic, word duration, language identity and speaker information than the other multilingual models.
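
The word discrimination evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract states only that models are compared by average precision; the use of cosine distance between embeddings, the pairing of every segment with every other segment, and the function name `same_different_average_precision` are assumptions made here for illustration, not details taken from the paper.

```python
import numpy as np


def same_different_average_precision(embeddings, labels):
    """Average precision for a same-different word discrimination task.

    Assumption: pairs are ranked by cosine distance between their
    fixed-dimensional embeddings; a pair counts as a positive when both
    segments carry the same word label.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Normalise rows so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Enumerate every unordered pair of segments (i < j).
    i, j = np.triu_indices(len(labels), k=1)
    distances = 1.0 - np.sum(unit[i] * unit[j], axis=1)
    same = labels[i] == labels[j]
    if not same.any():
        raise ValueError("need at least one same-word pair")

    # Rank pairs from closest to farthest and average the precision
    # measured at the rank of each same-word pair.
    order = np.argsort(distances, kind="stable")
    same_sorted = same[order]
    hits = np.cumsum(same_sorted)
    precision_at_rank = hits / np.arange(1, len(same_sorted) + 1)
    return float(precision_at_rank[same_sorted].mean())


# Toy usage: four 2-dimensional "embeddings" of two word types.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(same_different_average_precision(toy_embeddings, ["a", "a", "b", "b"]))
```

In this sketch, a higher score means that segments of the same word lie closer together than segments of different words, which is the property the embedding models in the abstract are compared on.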

Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling

Transfer learning of language-independent end-to-end ASR with language model fusion

Language-agnostic Multilingual Modeling

Monolingual Recognizers Fusion for Code-switching Speech Recognition

Configurable Multilingual ASR with Speech Summary Representations

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

Multilingual Meta-Transfer Learning for Low-Resource Speech Recognition

Multi-Dialect Speech Recognition With A Single Sequence-To-Sequence Model

A Survey of Multilingual Models for Automatic Speech Recognition

Improved acoustic word embeddings for zero-resource languages using multilingual transfer

Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition

Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR

Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages

Advanced Recurrent Network-Based Hybrid Acoustic Models for Low Resource Speech Recognition

Automatic Call Routing with Multiple Language Models

Prompting Large Language Models with Speech Recognition Abilities

Leveraging native language information for improved accented speech recognition

Anatomy of Industrial Scale Multilingual ASR

Aligning Speech to Languages to Enhance Code-switching Speech Recognition

Low Resource Malay Dialect Automatic Speech Recognition Modeling Using Transfer Learning from a Standard Malay Model

Towards Language-Universal Mandarin-English Speech Recognition