Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the approach exhibits proficiency in retrieving content in seven other languages with only 10% of additional language-enhanced training data, yielding promising results. The source code is publicly available at https://github.com/zyyan4/ml-clap.

What problem does this paper attempt to address?

The paper addresses the limitations of existing audio-text retrieval systems on multilingual data. Most current audio-text retrieval models focus on English descriptions, which restricts their applicability in practice because real-world data contains a large amount of non-English content. In addition, existing audio encoders perform poorly on audio segments of variable length, which degrades overall system performance. To address these problems, the authors make two main contributions:

1. **Language Enhancement (LE)**:
   - Use a multilingual text encoder (SONAR) to encode text data so that it carries language-specific information. In this way, training data in multiple languages can be generated and used in retrieval tasks to bridge the gap between languages.
2. **Optimize the audio encoder**:
   - Optimize the audio encoder by applying Consistent Ensemble Distillation (CED), improving its performance on variable-length audio-text retrieval.

With these improvements, the method achieves state-of-the-art performance on common English audio-text retrieval datasets (AudioCaps and Clotho) and can also retrieve content in seven other languages using only 10% additional language-enhanced training data.
### Formula Summary

The authors use the following key formulas to describe the model's training objective:

- **Generation of Embedding Vectors**:
  \[ e_a = E_A(A), \quad e_t = E_{MT}(T) \]
  \[ a = \text{Project}_A(e_a), \quad t = \text{Project}_{MT}(e_t) \]
- **Similarity Score Calculation**:
  \[ s_{A \sim MT} = \frac{a^T \cdot t}{\|a\| \cdot \|t\|} \]
- **InfoNCE Loss Function**:
  \[ L_{A \rightarrow MT}^{(i)} = -\log \frac{\exp(s_{A \sim MT}(i, i) / \tau)}{\sum_{j = 1}^N \exp(s_{A \sim MT}(i, j) / \tau)} \]
  \[ L_{MT \rightarrow A}^{(i)} = -\log \frac{\exp(s_{A \sim MT}(i, i) / \tau)}{\sum_{j = 1}^N \exp(s_{A \sim MT}(j, i) / \tau)} \]
  \[ L = \frac{1}{N} \sum_{i = 1}^N (L_{A \rightarrow MT}^{(i)} + L_{MT \rightarrow A}^{(i)}) \]

where $\tau$ is a temperature hyperparameter and $s_{A \sim MT}(i, j)$ denotes the similarity score between the $i$-th audio embedding and the $j$-th text embedding. Together, these formulas define the contrastive objective used to train the model for multilingual audio-text retrieval.
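The loss above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`cosine`, `log_sum_exp`, `symmetric_info_nce`), the toy embeddings, and the default `tau=0.07` are all assumptions made for illustration. The inputs stand in for the projected embeddings $a$ and $t$, with row $i$ of each list forming a matched audio-text pair.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def symmetric_info_nce(audio, text, tau=0.07):
    """Mean of the audio->text and text->audio InfoNCE terms.

    audio[i] and text[i] are assumed to be a matched pair.
    tau is the temperature; 0.07 is an illustrative default only.
    """
    n = len(audio)
    # s[i][j]: temperature-scaled similarity of audio i and caption j
    s = [[cosine(a, t) / tau for t in text] for a in audio]
    total = 0.0
    for i in range(n):
        # audio -> text: the denominator sums over captions j (row i)
        a2t = log_sum_exp(s[i]) - s[i][i]
        # text -> audio: the denominator sums over audio clips j (column i)
        t2a = log_sum_exp([s[j][i] for j in range(n)]) - s[i][i]
        total += a2t + t2a
    return total / n


# Toy embeddings (hypothetical): three mutually orthogonal unit vectors.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched = symmetric_info_nce(audio, audio)
mismatched = symmetric_info_nce(audio, audio[1:] + audio[:1])
```

With these toy inputs, `matched` is close to zero because each audio embedding is most similar to its own caption, while `mismatched` is much larger because the captions have been rotated out of alignment. Each direction is written as a log-sum-exp minus the diagonal score, which is the negative log of the softmax ratio in the formulas above.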

Bridging Language Gaps in Audio-Text Retrieval

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Audio Retrieval with WavText5K and CLAP Training

Audio Retrieval with Natural Language Queries: A Benchmark Study

Retrieval-Augmented Text-to-Audio Generation

End-to-end Code-switched TTS with Mix of Monolingual Recordings.

Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2

Looking and Listening: Audio Guided Text Recognition

Audio–text retrieval based on contrastive learning and collaborative attention mechanism

Language-Queried Target Sound Extraction Without Parallel Training Data

Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Advancing Natural-Language Based Audio Retrieval with PaSST and Large Audio-Caption Data Sets

Text-based Audio Retrieval by Learning from Similarities between Audio Captions

Exploring the Role of Audio in Video Captioning

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

Do Audio-Language Models Understand Linguistic Variations?