Abstract: The audio-visual segmentation (AVS) task aims to segment sounding objects in a given video. Existing works mainly focus on fusing the audio and visual features of a video to produce sounding-object masks. However, we observe that prior methods are prone to segmenting a salient object in a video regardless of the audio information, because sounding objects are often the most salient ones in the AVS dataset. Current AVS methods might therefore fail to localize genuine sounding objects due to this dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video with an object segmentation network, and then associates the sounding-object candidates with the given audio. We notice that an object can be a sounding object in one video but a silent one in another. This introduces ambiguity when training our object segmentation network, as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of the audio is unknown, especially when there are multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate the audio with potential objects. Specifically, we attend predicted audio category scores to potential instance masks, so that these scores highlight the corresponding sounding instances while suppressing inaudible ones. By enforcing the attended instance masks to resemble the ground-truth mask, we establish the audio-visual semantic correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased toward salient objects.
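
The attention step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the array shapes, the dot-product weighting, the max-based aggregation, and the binary cross-entropy loss are all assumptions chosen for illustration. It only shows how audio category scores might up-weight instance masks of matching categories and suppress the rest, so that the combined mask can be compared with a ground-truth mask.

```python
import numpy as np

def attend_audio_to_instances(instance_masks, instance_class_probs, audio_class_scores):
    """Weight candidate instance masks by their agreement with the audio (hypothetical helper).

    instance_masks:       (N, H, W) soft masks in [0, 1], one per candidate object
    instance_class_probs: (N, C) per-instance category probabilities
    audio_class_scores:   (C,) predicted audio category scores
    """
    # One scalar per instance: how strongly the audio supports this instance's category.
    weights = instance_class_probs @ audio_class_scores        # shape (N,)
    # Scale every mask by its weight (broadcast over H and W).
    attended = weights[:, None, None] * instance_masks          # shape (N, H, W)
    # Assumed aggregation: keep the strongest attended response at each pixel.
    return attended.max(axis=0), weights

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean pixel-wise BCE between a soft mask and a binary ground-truth mask."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)).mean())

# Toy scene with two candidate instances: index 0 is a "dog", index 1 is a "guitar".
masks = np.zeros((2, 4, 4))
masks[0, :2, :2] = 1.0   # dog occupies the top-left corner
masks[1, 2:, 2:] = 1.0   # guitar occupies the bottom-right corner
class_probs = np.array([[0.9, 0.1],
                        [0.1, 0.9]])
audio_scores = np.array([1.0, 0.0])   # the audio is classified as "dog"

attended_mask, weights = attend_audio_to_instances(masks, class_probs, audio_scores)
loss = binary_cross_entropy(attended_mask, masks[0])  # ground truth: only the dog sounds
```

In this toy scene the dog instance receives the larger weight, so its region dominates the attended mask while the silent guitar is suppressed. Minimizing the loss against the ground-truth mask would push the audio scores and instance categories toward agreement.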
