Abstract: Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. Such embeddings can form the basis for speech search, indexing and discovery systems when conventional speech recognition is not possible. In zero-resource settings where unlabelled speech is the only available resource, we need a method that gives robust embeddings on an arbitrary language. Here we explore multilingual transfer: we train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. We consider three multilingual recurrent neural network (RNN) models: a classifier trained on the joint vocabularies of all training languages; a Siamese RNN trained to discriminate between same and different words from multiple languages; and a correspondence autoencoder (CAE) RNN trained to reconstruct word pairs. In a word discrimination task on six target languages, all of these models outperform state-of-the-art unsupervised models trained on the zero-resource languages themselves, giving relative improvements of more than 30% in average precision. When using only a few training languages, the multilingual CAE performs better, but with more training languages the other multilingual models perform similarly. Using more training languages is generally beneficial, but improvements are marginal on some languages. We present probing experiments which show that the CAE encodes more phonetic, word duration, language identity and speaker information than the other multilingual models.
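The word discrimination task and its average precision score can be illustrated with a minimal sketch. It assumes that every pair of embedded word segments is ranked by cosine distance and that a pair counts as correct when both segments carry the same word label; the abstract does not spell out these details, so the distance measure, the function name `same_different_ap` and the toy data are illustrative assumptions rather than the authors' evaluation code.

```python
import numpy as np


def same_different_ap(embeddings, labels):
    """Average precision of a same-different word discrimination task.

    Assumption (not from the abstract): pairs are ranked by cosine distance,
    and at least one pair of segments shares a word label.
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)

    # Cosine distance between every pair of length-normalised embeddings.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - X @ X.T

    # Keep each unordered pair once (upper triangle, diagonal excluded).
    i, j = np.triu_indices(len(y), k=1)
    pair_dist = dist[i, j]
    pair_same = y[i] == y[j]

    # Rank pairs from closest to farthest.
    order = np.argsort(pair_dist, kind="stable")
    ranked_same = pair_same[order]

    # Precision at each rank, averaged over the ranks of same-word pairs.
    hits = np.cumsum(ranked_same)
    precision = hits / np.arange(1, len(ranked_same) + 1)
    return float(precision[ranked_same].mean())


# Toy data: four 2-dimensional "embeddings" forming two tight clusters.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ap_matched = same_different_ap(toy_embeddings, ["a", "a", "b", "b"])
ap_mismatched = same_different_ap(toy_embeddings, ["a", "b", "a", "b"])
```

With labels that follow the clusters, the same-word pairs are ranked first and the score is at its maximum; with labels that cut across the clusters, the same-word pairs fall lower in the ranking and the score drops.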

Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios

Low-Resource Language Identification From Speech Using Transfer Learning

Acoustic Modeling Based on Deep Learning for Low-Resource Speech Recognition: An Overview

Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages

Articulatory Feature Based Multilingual MLPs for Low-Resource Speech Recognition

Multilingual acoustic word embedding models for processing zero-resource languages

Quantifying Language Variation Acoustically with Few Resources

Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition

Exploiting Cross-Lingual Knowledge in Unsupervised Acoustic Modeling for Low-Resource Languages

Language-invariant Bottleneck Features from Adversarial End-to-end Acoustic Models for Low Resource Speech Recognition

Massively Parallel Cross-Lingual Learning in Low-Resource Target Language Translation

Improved acoustic word embeddings for zero-resource languages using multilingual transfer

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario

Multilingual acoustic word embeddings for zero-resource languages

Cross-Lingual and Ensemble MLPs Strategies for Low-Resource Speech Recognition

Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech

Targeted Multilingual Adaptation for Low-resource Language Families

NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics

Creating Spoken Dialog Systems in Ultra-Low Resourced Settings