Abstract: Real-world sound is a mixture of different sources. The sound scene of a busy coffeehouse, for example, usually consists of several conversations, music playing, laughter, perhaps a baby crying, a door being slammed, machines operating in the background, and more. When humans are confronted with these sounds, they rapidly and automatically orient themselves in this complex sound environment and attend to the sound source of interest. In psychoacoustics, this ability is known as Auditory Scene Analysis (ASA). Its counterpart in machine listening is Computational Auditory Scene Analysis (CASA), the effort to build computer models that perform auditory scene analysis. Research on CASA has led to great advances in machine systems capable of analyzing complex sound scenes, for tasks such as audio source separation and multiple pitch estimation. However, such systems often fail in the presence of corrupted or incomplete sound scenes. In a real-world sound scene, different sounds overlap in time and frequency, interfering with and canceling each other. Sometimes critical information about the sound of interest is missing entirely, as in an old recording from a scratched CD or a band-limited telephone speech signal. In a world filled with incomplete sounds, the human auditory system has the ability, known as Auditory Scene Induction (ASI), to estimate the missing parts of a continuous auditory scene briefly covered by noise or other interference, and to perceptually resynthesize them. Since humans are able to infer the missing elements in an auditory scene, it is important for machine systems to have the same capability. However, there have been very few efforts in computer audition to realize this ability computationally. This thesis focuses on the computational realization of auditory scene induction: Computational Auditory Scene Induction (CASI).
More specifically, the goal of my research is to build computer models that are capable of resynthesizing the missing information in an audio scene. Building upon existing statistical models for audio representation (NMF, PLCA, HMM and N-HMM), I formulate this ability as a model-based spectrogram analysis and inference problem under the expectation–maximization (EM) framework with missing data in the observation. Various sources of information, including the spectral and temporal structure of audio and top-down knowledge about speech, are incorporated into the proposed models to produce accurate reconstructions of the missing information in an audio scene. The effectiveness of the proposed systems is demonstrated on three audio signal processing tasks: singing melody extraction, audio imputation and audio bandwidth expansion. Each system is assessed through experiments on real-world audio data and compared to the state of the art. Although far from perfect, the proposed systems show many advantages and significant improvement over existing systems. In addition, this thesis shows that different applications related to missing audio data can be considered under the unified framework of CASI. This opens a new avenue of research in the computer audition community.
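
To make the idea of spectrogram inference with missing data concrete, the following is a minimal sketch and not the models proposed in the thesis. It assumes a plain NMF with masked KL-divergence multiplicative updates as a simple stand-in for the PLCA and N-HMM formulations, fits the factors only on observed time-frequency bins, and fills the missing bins from the low-rank reconstruction. The function name `masked_nmf_impute` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def masked_nmf_impute(V, mask, rank=4, n_iter=200, eps=1e-9, seed=0):
    """Fill missing bins of a nonnegative spectrogram with masked KL-NMF.

    V    : (F, T) nonnegative magnitude spectrogram (values at missing
           bins are ignored).
    mask : (F, T) boolean array, True where V is observed.
    Returns (filled, W, H), where filled keeps the observed bins of V and
    uses the reconstruction W @ H at the missing bins.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps   # spectral basis vectors
    H = rng.random((rank, T)) + eps   # per-frame activations
    M = mask.astype(float)
    Vm = V * M                        # zero out the unobserved bins

    for _ in range(n_iter):
        # Multiplicative KL updates; only observed bins contribute.
        WH = W @ H + eps
        W *= ((Vm / WH) @ H.T) / (M @ H.T + eps)
        WH = W @ H + eps
        H *= (W.T @ (Vm / WH)) / (W.T @ M + eps)

    V_hat = W @ H
    filled = np.where(mask, V, V_hat)  # keep observations, impute the rest
    return filled, W, H


if __name__ == "__main__":
    # Synthetic rank-2 "spectrogram" with roughly 20% of bins removed.
    rng = np.random.default_rng(1)
    V = rng.random((20, 2)) @ rng.random((2, 30))
    mask = rng.random(V.shape) > 0.2
    filled, W, H = masked_nmf_impute(V, mask, rank=2)
    err = np.linalg.norm(filled[~mask] - V[~mask]) / np.linalg.norm(V[~mask])
    print(f"relative error on missing bins: {err:.3f}")
```

This sketch uses only the spectral structure captured by the basis vectors. It has no temporal model and no prior knowledge about speech, which are the additional sources of information the abstract describes.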

Computational Modeling of Environment Deviant Sound Detection Based on Human Auditory Cognitive Mechanism

Salient environmental sound detection framework for machine awareness.

Implementation of Abnormal Sound Detection in Intelligent Surveillance Front-end System

Simplified model for generating 3D realistic sound in the multimedia and virtual reality systems

Computational auditory scene induction

MIMII-Gen: Generative Modeling Approach for Simulated Evaluation of Anomalous Sound Detection System

VisibleSound: Perceiving Environmental Sound with 4D Form

Attention Driven Computational Model of the Auditory Midbrain for Sound Localization in Reverberant Environments.

Assessing Behavioral and Neural Correlates of Change Detection in Spatialized Acoustic Scenes

Auditory Attention Detection via Cross-Modal Attention

Experimental Analysis on Auditory Attention Saliency Calculation Models

Application of a model for auditory attention to the design of urban soundscapes

Linear Multivariate Evaluation Models for Spatial Perception of Soundscape.

Using Soundscape Model to Control Ambient Noise Based on Sound Preference Evaluation

Probing sensitivity to statistical structure in rapid sound sequences using deviant detection tasks

Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio

Semantic-Physically Conflicting Speech Perception And Human Cognitive Principle Inspired ASR System Design

Algorithm of Pure Tone Audiometry Based on Multiple Judgment.

Unified Audio-Visual Saliency Model for Omnidirectional Videos with Spatial Audio

Context-based environmental audio event recognition for scene understanding

Rapid and Stimulus-Specific Deviance Detection in the Human Inferior Colliculus