Soundscape Captioning using Sound Affective Quality Network and Large Language Model

Yuanbo Hou, Qiaoqiao Ren, Andrew Mitchell, Wenwu Wang, Jian Kang, Tony Belpaeme, Dick Botteldooren
2024-06-10
Abstract: We live in a rich and varied acoustic world, which is experienced by individuals or communities as a soundscape. Computational auditory scene analysis, disentangling acoustic scenes by detecting and classifying events, focuses on objective attributes of sounds, such as their category and temporal characteristics, ignoring the effect of sounds on people and failing to explore the relationship between sounds and the emotions they evoke within a context. To fill this gap and to automate soundscape analysis, which traditionally relies on labour-intensive subjective ratings and surveys, we propose the soundscape captioning (SoundSCap) task. SoundSCap generates context-aware soundscape descriptions by capturing the acoustic scene, event information, and the corresponding human affective qualities. To this end, we propose an automatic soundscape captioner (SoundSCaper) composed of an acoustic model, SoundAQnet, and a general large language model (LLM). SoundAQnet simultaneously models multi-scale information about acoustic scenes, events, and perceived affective qualities, while LLM generates soundscape captions by parsing the information captured by SoundAQnet to a common language. The soundscape caption's quality is assessed by a jury of 16 audio/soundscape experts. The average score (out of 5) of SoundSCaper-generated captions is lower than the score of captions generated by two soundscape experts by 0.21 and 0.25, respectively, on the evaluation set and the model-unknown mixed external dataset with varying lengths and acoustic properties, but the differences are not statistically significant. Overall, SoundSCaper-generated captions show promising performance compared to captions annotated by soundscape experts. The models' code, LLM scripts, human assessment data and instructions, and expert evaluation statistics are all publicly available.
Audio and Speech Processing, Sound, Signal Processing
What problem does this paper attempt to address?
The main problem this paper addresses is the lack of affective information in automated soundscape analysis. Traditional soundscape analysis relies on time-consuming, labor-intensive subjective ratings and surveys, while computational auditory scene analysis focuses on objective properties of sounds, such as their category and temporal characteristics, and ignores the effect of sounds on people and the emotions they evoke in a given context. To fill this gap, the paper proposes the soundscape captioning (SoundSCap) task, which aims to generate context-aware soundscape descriptions by capturing the acoustic scene, event information, and the corresponding human affective qualities.

### Core contributions of the paper:

1. **Proposing the soundscape captioning (SoundSCap) task**: Generate soundscape descriptions in natural language from three perspectives: acoustic scenes (AS), audio events (AE), and affective qualities (AQ), thereby bridging the gap between audio descriptions and the affective qualities perceived by humans.
2. **Designing the multi-scale graph fusion network (SoundAQnet)**: Simultaneously model coarse-grained acoustic scenes, fine-grained audio events, and human-perceived affective qualities, and explore scene and affective attributes at different temporal resolutions.
3. **Automatic soundscape captioner (SoundSCaper) based on a large language model (LLM)**: Combine acoustic scene, audio event, and affective information to generate soundscape captions that are easy for humans to understand, rather than being limited to single numerical features.
4. **Introducing the Transparent Human Benchmark (THumBS)**: Used as the quality evaluation metric for the soundscape captioning task; the quality of the captions generated by SoundSCaper is verified through human evaluation by 16 audio/soundscape experts.
### Specific problems solved:

- **Lack of affective information**: Existing methods focus mainly on the objective properties of sounds and ignore the emotional impact of sounds on people.
- **Automated soundscape analysis**: Reduce the dependence on manual subjective evaluations and improve the efficiency of soundscape analysis.
- **Modeling multi-scale information**: Simultaneously process acoustic scene, event, and affective information at different time scales.

Through these contributions, the paper aims to enable machines to understand and describe soundscapes more comprehensively: not only identifying and classifying sounds, but also combining affective computing with contextual interpretation to produce richer and more detailed soundscape descriptions. This could help people understand soundscapes more deeply and may find application in fields such as the creation of virtual environments and urban soundscape planning.
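
The two-stage design described above (an acoustic model whose scene, event, and affective-quality outputs are turned into text by an LLM) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `SoundscapePrediction` container, the `build_prompt` helper, the threshold, the label names, and the rating values are hypothetical and are not taken from the authors' released code or LLM scripts. The sketch only assembles a text prompt and does not call any LLM.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SoundscapePrediction:
    """Hypothetical container for the outputs of an acoustic model."""
    scene: str                                                  # coarse-grained acoustic scene
    events: Dict[str, float] = field(default_factory=dict)      # event label -> probability
    affective: Dict[str, float] = field(default_factory=dict)   # quality -> predicted rating


def build_prompt(pred: SoundscapePrediction,
                 event_threshold: float = 0.5,
                 top_k: int = 2) -> str:
    """Turn structured predictions into a plain-text prompt for an LLM."""
    # Keep only events at or above the threshold, most probable first.
    events = sorted(
        (name for name, prob in pred.events.items() if prob >= event_threshold),
        key=lambda name: -pred.events[name],
    )
    # Keep the top_k affective qualities with the highest predicted rating.
    qualities = sorted(pred.affective, key=pred.affective.get, reverse=True)[:top_k]
    lines = [
        "Describe the following soundscape in one or two sentences.",
        f"Acoustic scene: {pred.scene}",
        f"Audio events: {', '.join(events) if events else 'none detected'}",
        f"Dominant affective qualities: {', '.join(qualities)}",
    ]
    return "\n".join(lines)


# Illustrative values only; they are not results from the paper.
example = SoundscapePrediction(
    scene="park",
    events={"bird": 0.9, "traffic": 0.2, "speech": 0.6},
    affective={"pleasant": 4.2, "calm": 3.9, "annoying": 1.3},
)
print(build_prompt(example))
```

The resulting string would then be passed to whichever LLM is used. Separating prediction from prompt construction keeps the acoustic model independent of the language model, which matches the paper's description of the LLM as a general-purpose component.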