Computer Audition: From Task-Specific Machine Learning to Foundation Models

Andreas Triantafyllopoulos, Iosif Tsangko, Alexander Gebhard, Annamaria Mesaros, Tuomas Virtanen, Björn Schuller
2024-07-22
Abstract: Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition -- the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily available interaction with human users. Naturally, these promises have created substantial excitement in the audio community, and have led to a wave of early attempts to build new, general-purpose foundation models for audio. In the present contribution, we give an overview of computational audio analysis as it transitions from traditional pipelines towards auditory foundation models. Our work highlights the key operating principles that underpin those models, and showcases how they can accommodate multiple tasks that the audio community previously tackled separately.
Sound; Audio and Speech Processing
What problem does this paper attempt to address?
The paper examines the shift in computer audition from traditional, task-specific machine learning methods to foundation models (FMs). It addresses four key issues:

1. **Integration of Multiple Tasks**: Traditional methods design a specialized solution for each task (such as acoustic scene classification or sound event detection), whereas foundation models aim to handle multiple tasks within a unified framework, improving efficiency and generality.
2. **Cross-Modal Knowledge Utilization**: Foundation models can draw on information from other modalities, for example by combining audio with text, to achieve richer representations and understanding.
3. **Human-Computer Interaction**: Foundation models make direct interaction with human users readily available, which makes systems more flexible and user-friendly.
4. **Transition from Traditional Methods to Foundation Models**: The paper gives an overview of this transition, covering the operating principles that underpin auditory foundation models, their advantages over traditional pipelines, and directions for future research.

In summary, the paper aims to show how foundation models can accommodate the many tasks in computer audition that were previously tackled separately, and to move the field in a more integrated and efficient direction.