Abstract: Research on multi-modal contrastive learning strategies for audio and text has rapidly gained interest. Contrastively trained Audio-Language Models (ALMs) such as CLAP, which establish a unified representation across the audio and language modalities, have improved performance on various downstream tasks by providing text-aligned audio encoders and audio-aligned text encoders. These improvements are evident in areas such as zero-shot audio classification and audio retrieval. However, the ability of these models to understand natural language and temporal relations remains largely unexplored. In this paper, we propose TeminAL, a temporal instillation method that equips multi-modal ALMs with temporal understanding without losing their prior capabilities on audio-language tasks. We implement a two-stage training scheme, TeminAL A $\&$ B: in TeminAL A the model first learns to differentiate between multiple sounds, and in TeminAL B it is instilled with a sense of time, enhancing its temporal understanding. This approach yields an average performance gain of $5.28\%$ in temporal understanding on the ESC-50 dataset, while the model remains competitive in zero-shot retrieval and classification tasks on the AudioCaps/Clotho datasets. We also note the lack of proper evaluation techniques for contrastive ALMs and propose ZSTE, a general-purpose strategy for evaluating ALMs in zero-shot (ZS) settings. We use ZSTE to evaluate various prior models and show that it applies to ZS contrastive models in general. The model trained with TeminAL outperforms current models on most downstream tasks.
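As a rough illustration of the zero-shot classification setting the abstract refers to, the sketch below picks the text prompt whose embedding is most similar to an audio embedding in a shared space. It is a minimal sketch under stated assumptions, not the paper's TeminAL training or ZSTE evaluation: the function names `cosine` and `zero_shot_classify` are hypothetical, and the three-dimensional vectors are toy values standing in for the outputs of real audio and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_classify(audio_emb, text_embs):
    """Return the best-matching label and the per-label similarity scores.

    audio_emb: embedding of one audio clip (assumed to come from an audio encoder).
    text_embs: dict mapping each candidate label to its prompt embedding
               (assumed to come from the paired text encoder).
    """
    scores = {label: cosine(audio_emb, emb) for label, emb in text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (hypothetical values, not produced by any real model).
text_embs = {
    "dog barking": [0.9, 0.1, 0.0],
    "rain falling": [0.0, 0.2, 0.9],
}
audio_emb = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(audio_emb, text_embs)
print(label)
```

No class-specific training is involved here: classification reduces to a nearest-neighbour lookup among prompt embeddings, which is why the quality of the audio-text alignment matters so much for these models.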

TimbreCLIP: Connecting Timbre to Text and Images

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects

Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models

Accommodating Audio Modality in CLIP for Multimodal Processing

Timbre transfer using image-to-image denoising diffusion implicit models

The Timbre Toolbox: extracting audio descriptors from musical signals

SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces

Intelligent Text-Conditioned Music Generation

Timbre Analysis of Music Audio Signals with Convolutional Neural Networks

Images that Sound: Composing Images and Sounds on a Single Canvas

Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning

CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment

Vector-Quantized Timbre Representation

Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

Wav2CLIP: Learning Robust Audio Representations From CLIP