Sound Localization by Self-Supervised Time Delay Estimation

Ziyang Chen, David F. Fouhey, Andrew Owens
DOI: https://doi.org/10.48550/arXiv.2204.12489
2023-01-29
Abstract: Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face. Project site: https://ificl.github.io/stereocrw/
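For orientation, the correspondence problem described in the abstract is classically approached with cross-correlation: slide one channel against the other and pick the shift at which they agree best. The sketch below illustrates only that hand-crafted baseline idea, not the self-supervised method the paper proposes. The function name `estimate_delay`, the synthetic white-noise source, and the 16 kHz sample rate are all illustrative assumptions and do not come from the paper.

```python
import math
import random


def estimate_delay(left, right, max_lag):
    """Return the lag in samples that maximizes the cross-correlation.

    A positive result means `right` is a delayed copy of `left`.
    Hypothetical helper for illustration; brute force, O(n * max_lag).
    """
    best_lag, best_score = 0, -math.inf
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic stereo pair: the right channel hears the source 5 samples late.
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(400)]
true_delay = 5
left = source
right = [0.0] * true_delay + source[:-true_delay]

estimated = estimate_delay(left, right, max_lag=20)

# Convert samples to seconds under an assumed 16 kHz sample rate.
sample_rate = 16000
delay_seconds = estimated / sample_rate
```

The sign of the estimated lag indicates which microphone the sound reached first. A clean synthetic signal like this one is the easy case; the abstract's motivation for learned correspondences is real "in the wild" recordings.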
Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing