Compositional Audio Representation Learning

Sripathi Sridhar, Mark Cartwright
2024-09-15
Abstract: Human auditory perception is compositional in nature -- we identify auditory streams from auditory scenes with multiple sound events. However, such auditory scenes are typically represented using clip-level representations that do not disentangle the constituent sound sources. In this work, we learn source-centric audio representations where each sound source is represented using a distinct, disentangled source embedding in the audio representation. We propose two novel approaches to learning source-centric audio representations: a supervised model guided by classification and an unsupervised model guided by feature reconstruction, both of which outperform the baselines. We thoroughly evaluate the design choices of both approaches using an audio classification task. We find that supervision is beneficial to learn source-centric representations, and that reconstructing audio features is more useful than reconstructing spectrograms to learn unsupervised source-centric representations. Leveraging source-centric models can help unlock the potential of greater interpretability and more flexible decoding in machine listening.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the limitations of current audio representation models in scenes that contain multiple sound sources. Conventional approaches typically use clip-level representations, which do not disentangle the individual sound sources that make up an auditory scene. This makes source-level reasoning difficult in applications such as individual identification, sound localization, and tracking, and it restricts the potential of machine listening in these more advanced tasks.

To address this, the authors propose a source-centric learning framework, **Compositional Audio Representation Learning (CARL)**, which aims to learn the semantic content of each sound source directly and encode it in a distinct source embedding. The audio information needed for a given downstream task can then be decoded more flexibly, and the representation becomes easier to interpret.

### Main Problem Summary:

1. **Representing multi-source audio scenes**: Conventional methods do not distinguish and represent each sound source in a scene that contains several sources.
2. **Limitations of existing methods**: Existing audio representation models focus mainly on clip-level representations and lack a fine-grained view of individual sound sources.
3. **Improving machine listening**: For machine listening to better reflect the compositional nature of human hearing, methods are needed that can understand and represent multiple sound sources.

### Solutions:

- Propose a new source-centric audio representation learning framework, CARL.
- Learn source-centric audio representations with both a supervised model (guided by classification) and an unsupervised model (guided by feature reconstruction).
- Explore different design choices, such as reconstruction targets and decoder types, to optimize model performance.

With these methods, the authors aim to enable more interpretable and flexible machine listening, particularly for source-level reasoning in complex audio environments.
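To make the contrast between a clip-level and a source-centric representation concrete, the following is a minimal NumPy sketch. It is not the architecture from the paper: the function names, the `queries` array, and the attention-style pooling are all assumptions chosen for illustration. It only shows the difference in shape between one vector per clip and one embedding per source.

```python
import numpy as np


def clip_level_embedding(frames):
    """Mean-pool frame features (T, D) into a single clip vector (D,)."""
    return frames.mean(axis=0)


def source_centric_embeddings(frames, queries):
    """Pool frame features into K per-source embeddings (illustrative only).

    frames:  (T, D) frame-level audio features
    queries: (K, D) hypothetical per-source query vectors (an assumption,
             not something specified in the summary above)
    returns: (K, D) array with one embedding per source slot
    """
    d = frames.shape[1]
    scores = queries @ frames.T / np.sqrt(d)          # (K, T) similarities
    scores -= scores.max(axis=0, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    # Softmax over slots: the K slots compete for each frame.
    weights /= weights.sum(axis=0, keepdims=True)
    # Normalize over time so each slot takes a weighted mean of the frames.
    weights /= weights.sum(axis=1, keepdims=True) + 1e-8
    return weights @ frames                           # (K, D)


rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 16))    # T=100 frames, D=16 features
queries = rng.normal(size=(3, 16))     # K=3 hypothetical source slots

clip = clip_level_embedding(frames)                  # shape (16,)
sources = source_centric_embeddings(frames, queries)  # shape (3, 16)
print(clip.shape, sources.shape)
```

In this toy setup, a downstream decoder could read each row of `sources` separately, whereas the clip-level vector mixes all sources together. With a single slot (`K = 1`), the pooling above reduces to plain mean pooling over time.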