Enhance audio-visual segmentation with hierarchical encoder and audio guidance

Cunhan Guo, Heyan Huang, Yanghao Zhou
DOI: https://doi.org/10.1016/j.neucom.2024.127885
IF: 6
2024-05-20
Neurocomputing
Abstract: As one of the pivotal technologies leading towards embodied intelligence, audio-visual segmentation aims to precisely segment sounding objects and has broad application prospects in scenarios such as emergency rescue and natural exploration. Nevertheless, its performance is limited by two challenges: adapting and fusing cross-modal information during encoding, and decoding and generating masks. To address these issues, this paper explores the adaptation of multi-modal information based on a shared encoder, employing a neural architecture search method to design a hierarchical encoder cooperation module for enhanced information interaction. An intermediate loss helps the encoder retain spatial knowledge. Furthermore, an audio-guided class-aware decoder is devised to guide mask generation. The approach yields competitive experimental results across multiple datasets, supporting its effectiveness.
computer science, artificial intelligence