Investigating Pooling Strategies and Loss Functions for Weakly-Supervised Text-to-Audio Grounding via Contrastive Learning.

Xuenan Xu, Mengyue Wu, Kai Yu
DOI: https://doi.org/10.1109/ICASSPW59220.2023.10192960
2023-01-01
Abstract: Text-to-audio grounding (TAG) aims to detect sound events described by natural language in an audio clip. Strongly-supervised TAG requires extensive human annotations of the events' onsets and offsets. To mitigate the reliance on strongly-annotated data, weakly-supervised TAG (WSTAG) is proposed to train TAG on audio captioning data based on contrastive learning. However, crucial components in WSTAG, namely pooling strategies and loss functions, remain unexplored. Directly borrowing their counterparts from closely related tasks, such as sound event detection (SED) and audio-text retrieval, does not necessarily fit this task, because TAG uniquely requires fine-grained alignment with free text. In this work, we first refine the TAG dataset into AudioGrounding v2 to obtain a more reliable TAG performance indicator. Then we extensively investigate the effects of these components on WSTAG. The results on the refined dataset demonstrate that the pooling strategy is crucial to model performance, while the loss function has much less influence. By combining proper pooling strategies and loss functions, we arrive at a more effective WSTAG framework that significantly enhances the ability to detect events, especially short-duration ones.