Abstract: Research on multi-modal contrastive learning strategies for audio and text has rapidly gained interest. Contrastively trained Audio-Language Models (ALMs) such as CLAP, which establish a unified representation across the audio and language modalities, have improved performance on various downstream tasks by providing good text-aligned audio encoders and vice versa. These improvements are evident in areas such as zero-shot audio classification and audio retrieval. However, the ability of these models to understand natural language and temporal relations remains largely unexplored. In this paper, we propose TeminAL, a temporal instillation method that equips multi-modal ALMs with temporal understanding without losing their prior capabilities on audio-language tasks. We implement a two-stage training scheme, TeminAL A $\&$ B: in TeminAL A the model first learns to differentiate between multiple sounds, and in TeminAL B it is instilled with a sense of time, which enhances its temporal understanding. This approach results in an average performance gain of $5.28\%$ in temporal understanding on the ESC-50 dataset, while the model remains competitive in zero-shot retrieval and classification tasks on the AudioCap/Clotho datasets. We also note the lack of proper evaluation techniques for contrastive ALMs and propose ZSTE, a general-purpose strategy for evaluating ALMs in zero-shot settings. We use ZSTE to evaluate various prior models; the strategy is applicable to all zero-shot (ZS) contrastive models. The model trained with TeminAL outperforms current models on most downstream tasks.

Audio Contrastive based Fine-tuning

AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks

Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training

Audio–text retrieval based on contrastive learning and collaborative attention mechanism

Learning Speech Representation From Contrastive Token-Acoustic Pretraining

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

Audio-free Prompt Tuning for Language-Audio Models

Cacophony: An Improved Contrastive Audio-Text Model

Optimizing Audio Augmentations for Contrastive Learning of Health-Related Acoustic Signals

Contrastive Learning for improving End-to-end Speaker Verification

Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling

Exploring the Role of Audio in Video Captioning

Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining

Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Grounding

Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation

Improving Speaker Representations Using Contrastive Losses on Multi-scale Features

Advancing Test-Time Adaptation in Wild Acoustic Test Settings

Semi-supervised Feature Selection for Audio Classification Based on Constraint Compensated Laplacian Score

Audio Enhancement for Computer Audition—An Iterative Training Paradigm Using Sample Importance

Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

A contrastive-learning approach for auditory attention detection