A Two-Stage Framework for Multiple Sound-Source Localization
Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin
2020-01-01
Abstract: Humans usually perceive the world through information in different modalities, e.g., vision and hearing. By leveraging the relevance and complementarity between audio and vision, humans can clearly distinguish different sound sources and infer which object is making sound. Machines, for their part, have been shown capable of processing audio and visual information separately using deep neural networks. But can they benefit from joint audiovisual learning? Recent works mainly focus on establishing multi-modal relationships based on temporally synchronized audio and visual signals [1, 3, 8]. This synchronization works effectively for simple scenes [2, 9], i.e., single-source conditions. However, in unconstrained videos, various sounds are usually mixed, and scene-level supervision is too coarse to provide a precise alignment between each sound and its visual source. To tackle this problem, [6, 7] establish audiovisual clusters to associate sound-object pairs, but require a pre-determined number of clusters, which is difficult to set in unconstrained scenarios and thus greatly affects alignment performance. [2, 9, 11] further apply audiovisual learning to sound localization, but mainly focus on simple scenes and are usually unable to find source-specific objects from mixed audio. [13] constructs a pretext task and then localizes sound through the energy of each pixel. To sum up, existing dominant methods mostly lack the ability to analyze complex audiovisual scenes, and fail to effectively utilize the latent alignment between sound and visual source pairs in unconstrained videos. This is because there are two major challenges in complex audiovisual scene analysis: one is how to distinguish different sound sources, the other is how to ensure that the established sound-object alignment is satisfactory without one-to-one annotations. To address these challenges, we develop a two-stage audiovisual learning framework.
In the first stage, we employ a multi-task framework consisting of classification and audiovisual correspondence to provide a reference of the audiovisual content for the second stage. In the second stage, based on the classification predictions, we use Class Activation Mapping (CAM) [14, 10, 4] to extract class-specific feature representations as the potential sound-object pairs (Fig. 1), then perform alignment in
[Fig. 1 labels: Shouting, Boating, Stream; Audiovisual Pair]
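To make the CAM step concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard CAM setting: a backbone producing a feature map of shape (C, H, W), followed by global average pooling and a bias-free linear classifier with weights of shape (K, C). The function names, the ReLU-and-normalize weighting, and the random inputs are all illustrative assumptions.

```python
import numpy as np


def class_activation_maps(features, fc_weights):
    """Standard CAM: weight each channel map by the classifier weight.

    features:   (C, H, W) feature map before global average pooling.
    fc_weights: (K, C) weights of a linear classifier applied after pooling.
    Returns:    (K, H, W), one activation map per class.
    """
    return np.einsum("kc,chw->khw", fc_weights, features)


def class_specific_features(features, cams, eps=1e-8):
    """CAM-weighted spatial pooling (an assumed choice for this sketch).

    Keeps only positive evidence in each map, normalizes it to sum to one
    over space, and uses it to average the feature map.
    Returns: (K, C), one feature vector per class.
    """
    positive = np.maximum(cams, 0.0)
    weights = positive / (positive.sum(axis=(1, 2), keepdims=True) + eps)
    return np.einsum("khw,chw->kc", weights, features)


# Hypothetical sizes: 8 channels, a 4x4 spatial grid, 3 classes.
rng = np.random.default_rng(0)
features = rng.random((8, 4, 4))
fc_weights = rng.standard_normal((3, 8))

cams = class_activation_maps(features, fc_weights)
class_feats = class_specific_features(features, cams)
```

Under these assumptions, the spatial mean of each map equals the corresponding class logit, which is why the maps can be read as a spatial decomposition of the classifier's prediction. The same pooling could be applied to audio spectrogram features to obtain the other half of each candidate sound-object pair.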