Multilingual acoustic word embedding models for processing zero-resource languages

Herman Kamper, Yevgen Matusevych, Sharon Goldwater
DOI: https://doi.org/10.48550/arXiv.2002.02109
2020-02-21
Abstract: Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. In settings where unlabelled speech is the only available resource, such embeddings can be used in "zero-resource" speech search, indexing and discovery systems. Here we propose to train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. For this transfer learning approach, we consider two multilingual recurrent neural network models: a discriminative classifier trained on the joint vocabularies of all training languages, and a correspondence autoencoder trained to reconstruct word pairs. We test these using a word discrimination task on six target zero-resource languages. When trained on seven well-resourced languages, both models perform similarly and outperform unsupervised models trained on the zero-resource languages. With just a single training language, the second model works better, but performance depends more on the particular training-testing language pair.
Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to process speech in zero-resource languages, for which no annotated data is available, and in particular how to obtain high-quality acoustic word embeddings for them. The authors propose training a single supervised multilingual embedding model on labelled data from well-resourced languages and then applying it, unchanged, to unseen zero-resource languages. This avoids the difficulty of collecting large amounts of annotated data for low-resource languages, while aiming to improve zero-resource tasks such as example-based speech search, indexing and discovery.
The paper considers two multilingual recurrent neural network models: a discriminative classifier trained on the joint vocabularies of all training languages, and a correspondence autoencoder trained to reconstruct word pairs. Both are evaluated with a word discrimination task on six target zero-resource languages. When trained on seven well-resourced languages, the two models perform similarly, and both outperform unsupervised models trained on the zero-resource languages themselves. When trained on a single well-resourced language, the correspondence autoencoder performs better, but performance then depends more strongly on the particular training-testing language pair.
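To make the word discrimination task more concrete, the sketch below shows one common way such an evaluation can be set up once every speech segment has been mapped to a fixed-dimensional embedding. The summary above does not state the distance function or the metric, so cosine distance and average precision over all segment pairs are assumptions here, and the function names and the toy two-dimensional embeddings are purely illustrative rather than taken from the paper.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_precision(embeddings, labels):
    """Average precision for deciding whether two segments are the same word.

    All pairs of segments are ranked by embedding distance (closest first);
    a pair counts as a positive when both segments carry the same word label.
    """
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        dist = cosine_distance(embeddings[i], embeddings[j])
        pairs.append((dist, labels[i] == labels[j]))
    pairs.sort(key=lambda p: p[0])

    hits = 0
    precision_sum = 0.0
    for rank, (_, same_word) in enumerate(pairs, start=1):
        if same_word:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy data (hypothetical): two segments of each of two word types.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = ["water", "water", "house", "house"]
toy_ap = average_precision(toy_embeddings, toy_labels)
```

In this toy case, segments of the same word lie closer to each other than to segments of the other word, so the ranking is perfect. Word labels are needed only for scoring; the embedding model applied to a zero-resource language would not see them.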