Bridging Language Gaps in Audio-Text Retrieval

Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang
2024-06-17
Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the approach exhibits proficiency in retrieving content in seven other languages with only 10% of additional language-enhanced training data, yielding promising results. The source code is publicly available at https://github.com/zyyan4/ml-clap.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the limitations of existing audio-text retrieval systems on multilingual data. Most current audio-text retrieval models focus on English descriptions, which restricts their applicability in practice because real-world data contains a large amount of non-English content. In addition, existing audio encoders perform poorly on audio segments of variable length, which degrades overall system performance.

To address these problems, the authors make two main contributions:

1. **Language Enhancement (LE)**: A multilingual text encoder (such as SONAR) encodes the text data so that it includes language-specific information. In this way, training data in multiple languages can be generated and used in retrieval tasks to bridge the gap between languages.
2. **Audio encoder optimization**: The audio encoder is optimized by applying Consistent Ensemble Distillation (CED), which improves its handling of variable-length audio-text retrieval.

With these improvements, the method achieves state-of-the-art performance on common English audio-text retrieval datasets (such as AudioCaps and Clotho) and can also retrieve content in seven other languages with only 10% additional language-enhanced training data.
### Formula Summary

The authors use the following key formulas to describe the model's training objective. Here $E_A$ is the audio encoder applied to an audio clip $A$, and $E_{MT}$ is the multilingual text encoder applied to a caption $T$.

- **Generation of Embedding Vectors**:

\[ e_a = E_A(A), \quad e_t = E_{MT}(T) \]

\[ a = \text{Project}_A(e_a), \quad t = \text{Project}_{MT}(e_t) \]

- **Similarity Score Calculation**:

\[ s_{A \sim MT} = \frac{a^{\top} t}{\|a\| \cdot \|t\|} \]

- **InfoNCE Loss Function**:

\[ L_{A \rightarrow MT}^{(i)} = -\log \frac{\exp(s_{A \sim MT}(i, i) / \tau)}{\sum_{j = 1}^N \exp(s_{A \sim MT}(i, j) / \tau)} \]

\[ L_{MT \rightarrow A}^{(i)} = -\log \frac{\exp(s_{A \sim MT}(i, i) / \tau)}{\sum_{j = 1}^N \exp(s_{A \sim MT}(j, i) / \tau)} \]

\[ L = \frac{1}{N} \sum_{i = 1}^N (L_{A \rightarrow MT}^{(i)} + L_{MT \rightarrow A}^{(i)}) \]

where $\tau$ is a temperature hyperparameter, $N$ is the number of audio-text pairs being contrasted, and $s_{A \sim MT}(i, j)$ is the similarity between the $i$-th audio embedding and the $j$-th text embedding.

Together, the language-enhanced text encoding and this contrastive objective train the model for audio-text retrieval in a multilingual setting.
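The similarity score and the symmetric InfoNCE loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the list-of-lists embedding format, and the default temperature `tau=0.07` are all assumptions made for the example, and the inputs are taken to be the already projected embeddings $a$ and $t$.

```python
import math


def cosine_similarity(a, t):
    """s = (a . t) / (||a|| * ||t||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, t))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_t = math.sqrt(sum(y * y for y in t))
    return dot / (norm_a * norm_t)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def symmetric_info_nce(audio_embs, text_embs, tau=0.07):
    """Average of (audio->text + text->audio) InfoNCE terms over N pairs.

    audio_embs[i] and text_embs[i] are assumed to form a matching pair.
    tau=0.07 is an illustrative default, not a value from the paper.
    """
    n = len(audio_embs)
    # s[i][j]: similarity between the i-th audio and the j-th text.
    s = [[cosine_similarity(a, t) for t in text_embs] for a in audio_embs]
    total = 0.0
    for i in range(n):
        positive = s[i][i] / tau
        row = [s[i][j] / tau for j in range(n)]  # audio i vs. all texts
        col = [s[j][i] / tau for j in range(n)]  # text i vs. all audios
        # -log(exp(pos) / sum(exp(.))) == logsumexp(.) - pos
        total += (log_sum_exp(row) - positive) + (log_sum_exp(col) - positive)
    return total / n


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print("matched:", symmetric_info_nce(audio, matched_text))
    print("swapped:", symmetric_info_nce(audio, swapped_text))
```

Correctly paired embeddings should give a loss close to zero, while swapping the captions should give a much larger value, which is the behaviour the contrastive objective rewards during training. A real training loop would compute the same quantity on batched tensors in a deep learning framework.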