Abstract: Research on multi-modal contrastive learning for audio and text has rapidly gained interest. Contrastively trained Audio-Language Models (ALMs) such as CLAP, which establish a unified representation across the audio and language modalities, have improved performance on various downstream tasks by providing audio encoders that are well aligned with text, and vice versa. These improvements are evident in areas such as zero-shot audio classification and audio retrieval. However, the ability of these models to understand natural language and temporal relations remains largely unexplored. In this paper, we propose TeminAL, a temporal instillation method that equips multi-modal ALMs with temporal understanding without losing their prior capabilities on audio-language tasks. We implement a two-stage training scheme, TeminAL A $\&$ B: in TeminAL A the model first learns to differentiate between multiple sounds, and in TeminAL B it is instilled with a sense of time, which enhances its temporal understanding. This approach yields an average performance gain of $5.28\%$ in temporal understanding on the ESC-50 dataset, while the model remains competitive in zero-shot retrieval and classification tasks on the AudioCap/Clotho datasets. We also note the lack of proper evaluation techniques for contrastive ALMs and propose ZSTE, a general-purpose strategy for evaluating ALMs in zero-shot settings. We use ZSTE to evaluate various prior models, and it applies in principle to any zero-shot contrastive model. The model trained with TeminAL outperforms current models on most downstream tasks.
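To make the zero-shot setting concrete, the following is a minimal sketch of how a contrastive model with a shared audio-text embedding space is typically used for classification: each candidate text prompt is scored by its cosine similarity to the audio embedding, and the highest-scoring prompt wins. It is not the paper's ZSTE procedure or its training code. The function names, the two temporally ordered prompts, and the 3-dimensional toy embeddings are all hypothetical; in practice the vectors would come from the model's audio and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_classify(audio_embedding, prompt_embeddings):
    """Pick the text prompt whose embedding is closest to the audio embedding.

    prompt_embeddings maps each candidate prompt to its (hypothetical)
    text-encoder output. Returns the best prompt and all similarity scores.
    """
    scores = {
        prompt: cosine_similarity(audio_embedding, emb)
        for prompt, emb in prompt_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Toy, hand-written embeddings (assumptions for illustration only).
prompts = {
    "dog barking before rain": [0.9, 0.1, 0.0],
    "rain before dog barking": [0.1, 0.9, 0.0],
}
audio = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(audio, prompts)
print(label)
```

The two prompts contain the same sound events in a different order, so a model without temporal understanding would tend to assign them nearly identical scores. That is the kind of distinction the abstract says TeminAL is meant to instill.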

Audio–text retrieval based on contrastive learning and collaborative attention mechanism

Cacophony: An Improved Contrastive Audio-Text Model

Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

Speaker-Text Retrieval via Contrastive Learning

Retrieval-Augmented Text-to-Audio Generation

Audio-Text Models Do Not Yet Leverage Natural Language

Text-based Audio Retrieval by Learning from Similarities between Audio Captions

Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval

On Metric Learning for Audio-Text Cross-Modal Retrieval

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

Voice Keyword Retrieval Method Using Attention Mechanism and Multimodal Information Fusion

Bridging Language Gaps in Audio-Text Retrieval

Dissecting Temporal Understanding in Text-to-Audio Retrieval

Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval

Unsupervised Improvement of Audio-Text Cross-Modal Representations

Audio Retrieval with WavText5K and CLAP Training

Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training

Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Audio Contrastive based Fine-tuning