Abstract: We live in a rich and varied acoustic world, which is experienced by individuals or communities as a soundscape. Computational auditory scene analysis, disentangling acoustic scenes by detecting and classifying events, focuses on objective attributes of sounds, such as their category and temporal characteristics, ignoring the effect of sounds on people and failing to explore the relationship between sounds and the emotions they evoke within a context. To fill this gap and to automate soundscape analysis, which traditionally relies on labour-intensive subjective ratings and surveys, we propose the soundscape captioning (SoundSCap) task. SoundSCap generates context-aware soundscape descriptions by capturing the acoustic scene, event information, and the corresponding human affective qualities. To this end, we propose an automatic soundscape captioner (SoundSCaper) composed of an acoustic model, SoundAQnet, and a general large language model (LLM). SoundAQnet simultaneously models multi-scale information about acoustic scenes, events, and perceived affective qualities, while the LLM generates soundscape captions by parsing the information captured by SoundAQnet to a common language. The soundscape caption's quality is assessed by a jury of 16 audio/soundscape experts. The average score (out of 5) of SoundSCaper-generated captions is lower than the score of captions generated by two soundscape experts by 0.21 and 0.25, respectively, on the evaluation set and the model-unknown mixed external dataset with varying lengths and acoustic properties, but the differences are not statistically significant. Overall, SoundSCaper-generated captions show promising performance compared to captions annotated by soundscape experts. The models' code, LLM scripts, human assessment data and instructions, and expert evaluation statistics are all publicly available.

What problem does this paper attempt to address?

The main problem this paper addresses is the lack of affective information in automated soundscape analysis. Traditional soundscape analysis relies on time-consuming, labor-intensive subjective ratings and surveys, while computational auditory scene analysis focuses on objective properties of sounds, such as category and temporal characteristics, and ignores the effect of sounds on people and the emotional responses they evoke in a given context. To fill this gap, the paper proposes the soundscape captioning (SoundSCap) task, which aims to generate context-aware soundscape descriptions by capturing the acoustic scene, event information, and the corresponding human affective qualities.

### Core contributions of the paper:

1. **Proposing the soundscape captioning (SoundSCap) task**: Generate soundscape descriptions in natural language from three perspectives: acoustic scenes (AS), audio events (AE), and perceived affective qualities (AQ), thereby bridging the gap between audio descriptions and the affective qualities perceived by humans.
2. **Designing SoundAQnet, a multi-scale graph fusion network**: Simultaneously model coarse-grained acoustic scenes, fine-grained audio events, and human-perceived affective qualities, and explore scene and affective attributes at different temporal resolutions.
3. **Building the automatic soundscape captioner (SoundSCaper) on a large language model (LLM)**: Combine acoustic scene, audio event, and affective information to generate soundscape captions that are easy for humans to understand, rather than isolated numerical features.
4. **Introducing the Transparent Human Benchmark (THumBS)**: Use it as the quality evaluation protocol for the soundscape captioning task, and verify the quality of SoundSCaper-generated captions through human assessment by 16 audio/soundscape experts.

### Specific problems solved:

- **Lack of affective information**: Traditional methods mainly focus on the objective properties of sounds and ignore the emotional impact of sounds on people.
- **Automated soundscape analysis**: Reduce the dependence on manual subjective evaluations and improve the efficiency and accuracy of soundscape analysis.
- **Modeling multi-scale information**: Simultaneously process acoustic scene and affective information at different time scales to improve the robustness and generalization ability of the model.

Through these contributions, the paper aims to enable machines to understand and describe soundscapes more comprehensively: not only identifying and classifying sounds, but also drawing on affective computing and contextual interpretation to provide richer, more detailed soundscape descriptions. This can help people understand soundscapes more deeply and can be applied in fields such as virtual environment creation and urban soundscape planning.
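The hand-off from the acoustic model to the LLM described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `AcousticPrediction` container, the `build_caption_prompt` helper, the event threshold, the prompt wording, and the 1-to-5 affective rating scale are all assumptions made for the example, and the numbers are invented placeholders rather than model outputs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AcousticPrediction:
    """Hypothetical container for the three kinds of information in the text."""
    scene: str                       # coarse-grained acoustic scene (AS)
    events: List[Tuple[str, float]]  # audio events (AE) with probabilities
    affective: Dict[str, float]      # affective qualities (AQ), assumed 1-5 scale


def build_caption_prompt(pred: AcousticPrediction,
                         event_threshold: float = 0.5,
                         top_k: int = 3) -> str:
    """Turn numeric predictions into a text prompt for a general-purpose LLM."""
    # Keep only the most probable events above the (assumed) threshold.
    ranked = sorted(pred.events, key=lambda item: item[1], reverse=True)
    kept = [name for name, prob in ranked if prob >= event_threshold][:top_k]
    events_text = ", ".join(kept) if kept else "no prominent events"

    # Render affective ratings in a stable, alphabetical order.
    aq_text = ", ".join(
        f"{name}: {score:.1f}/5" for name, score in sorted(pred.affective.items())
    )

    return (
        "Describe the soundscape in one or two sentences.\n"
        f"Acoustic scene: {pred.scene}\n"
        f"Audio events: {events_text}\n"
        f"Affective qualities: {aq_text}\n"
        "Mention how the sounds are likely to be perceived by a listener."
    )


# Illustrative values only; they are not results from the paper.
example = AcousticPrediction(
    scene="park",
    events=[("bird song", 0.91), ("speech", 0.62), ("traffic", 0.18)],
    affective={"pleasant": 4.2, "eventful": 2.8, "annoying": 1.3},
)
prompt = build_caption_prompt(example)
print(prompt)
```

The resulting string would then be sent to whichever LLM is used. Keeping the prompt construction separate from the acoustic model means the language model can be swapped without retraining, which is consistent with the summary's description of a general LLM.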

Soundscape Captioning using Sound Affective Quality Network and Large Language Model

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM.

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network.

SECap: Speech Emotion Captioning with Large Language Model

Seeing and Hearing Too: Audio Representation for Video Captioning.

Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models

CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

AI-based soundscape analysis: Jointly identifying sound sources and predicting annoyance

ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds

EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation

Exploring the Role of Audio in Video Captioning

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

AudioLog: LLMs-Powered Long Audio Logging with Hybrid Token-Semantic Contrastive Learning

Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation

Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics

ALCAP: Alignment-Augmented Music Captioner