Abstract: Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, it must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute), and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it's coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2m) rather than a position described using natural language (e.g., "next to me"). To address these gaps, we present ELSA, a spatially-aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non-spatial audio, spatial audio, and open-vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open-source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non-spatial audio, and the semantics and spatial attributes of spatial audio, using contrastive learning. ELSA is competitive with the state of the art for both semantic retrieval and 3D source localization. In particular, ELSA improves mean audio-to-text and text-to-audio R@1 by 2.8% over the baseline, and reduces mean absolute error in 3D source localization by 11.6° relative to the baseline.
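As a rough illustration of the multimodal contrastive learning mentioned in the abstract, the following is a minimal sketch of a generic symmetric (CLIP/CLAP-style) contrastive loss over paired audio and text embeddings. It is not ELSA's actual training code: the function name `symmetric_contrastive_loss`, the temperature value of 0.07, and the use of NumPy are all assumptions made for illustration, and the embeddings are assumed to come from some audio encoder and text encoder not shown here.

```python
import numpy as np

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic symmetric contrastive loss (illustrative, not ELSA's code).

    audio_emb, text_emb: arrays of shape (batch, dim); row i of each
    is assumed to be a matching audio/caption pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the (assumed) temperature.
    logits = a @ t.T / temperature  # shape (batch, batch)

    def cross_entropy_on_diagonal(z):
        # Numerically stable log-softmax over each row; the correct
        # "class" for row i is column i (the matching pair).
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

Under this kind of objective, matching audio and caption pairs are pulled together in the shared embedding space while mismatched pairs in the batch are pushed apart, which is what makes retrieval metrics such as R@1 meaningful for the learned embeddings.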

Leveraging Sound Local and Global Features for Language-Queried Target Sound Extraction

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Language-Queried Target Sound Extraction Without Parallel Training Data

Leveraging Language Model Capabilities for Sound Event Detection

Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection

Separate Anything You Describe

Selective Listening by Synchronizing Speech with Lips

Sound source localization based on residual network and channel attention module

Selector-Enhancer: Learning Dynamic Selection of Local and Non-local Attention Operation for Speech Enhancement

Learning Spatially-Aware Language and Audio Embeddings

CLAPSep: Leveraging Contrastive Pre-trained Model for Multi-Modal Query-Conditioned Target Sound Extraction

Enhancing Sound Source Localization via False Negative Elimination

A Feature Integration Network for Multi-Channel Speech Enhancement

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection

SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

Can Large Language Models Understand Spatial Audio?

Exploring Text-Queried Sound Event Detection with Audio Source Separation

CASE-Net: Integrating local and non-local attention operations for speech enhancement