Abstract: This study addresses a framework for a robot audition system, including sound source localization (SSL) and sound source separation (SSS), that can robustly recognize simultaneous speech in a real environment. Because SSL estimates not only the locations of the speakers but also their number, a robust SSL framework is essential for simultaneous speech recognition. Improving SSS performance is equally crucial, because the robot has to recognize each individual speech source. For simultaneous speech recognition, current robot audition systems mainly require noise robustness, high resolution, and real-time implementation. Multiple signal classification (MUSIC) based on standard eigenvalue decomposition (SEVD-MUSIC) and geometric-constrained high-order decorrelation-based source separation (GHDSS) are microphone-array processing techniques used for SSL and SSS, respectively. To make SSL more robust against noise while detecting simultaneous speech, we improved SEVD-MUSIC by incorporating generalized eigenvalue decomposition (GEVD). However, GEVD-based MUSIC (GEVD-MUSIC) and GHDSS have two main issues: (1) the resolution of the pre-measured transfer functions (TFs) determines the resolution of SSL and SSS, and (2) their computational cost is too high for real-time processing. For the first issue, we propose a TF-interpolation method that integrates time-domain-based and frequency-domain-based interpolation. The interpolation achieves super-resolution robot audition, that is, a resolution higher than that of the pre-measured TFs. For the second issue, we propose two methods for SSL: MUSIC based on generalized singular value decomposition (GSVD-MUSIC) and hierarchical SSL (H-SSL). GSVD-MUSIC drastically reduces the computational cost while maintaining noise robustness for localization. In addition, H-SSL reduces the computational cost by introducing a hierarchical search algorithm instead of using a greedy search for localization. These techniques are integrated into a robot audition system using a robot-embedded microphone array. The preliminary experiments for each technique showed the following: (1) the proposed interpolation achieved approximately 1-degree resolution in both SSL and SSS although the TFs are available only at 30-degree intervals; (2) GSVD-MUSIC attained 46.4% and
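
The abstract does not give the details of H-SSL, so the following is only a generic coarse-to-fine sketch of what a hierarchical direction search can look like, not the authors' algorithm. The function names (`scan`, `hierarchical_search`), the step schedule `(30, 5, 1)`, and the single-peak `toy_spectrum` that stands in for a MUSIC spatial spectrum are all assumptions made for illustration.

```python
def scan(spectrum, lo, hi, step):
    """Evaluate `spectrum` on an integer angle grid and return
    (angle with the largest value, number of evaluations)."""
    angles = list(range(lo, hi + 1, step))
    best = max(angles, key=spectrum)
    return best, len(angles)


def hierarchical_search(spectrum, lo, hi, steps):
    """Coarse-to-fine peak search (hypothetical sketch).

    `steps` is a decreasing sequence of grid spacings in degrees.
    Each level rescans only a window of +/- the previous spacing
    around the best angle found so far. Azimuth wraparound at
    0/360 degrees is deliberately ignored to keep the sketch short.
    """
    best, evals = scan(spectrum, lo, hi, steps[0])
    for prev, step in zip(steps, steps[1:]):
        w_lo = max(lo, best - prev)
        w_hi = min(hi, best + prev)
        best, n = scan(spectrum, w_lo, w_hi, step)
        evals += n
    return best, evals


def toy_spectrum(angle_deg):
    """Assumed stand-in for a spatial spectrum: one smooth peak at 47 degrees."""
    return 1.0 / (1.0 + (angle_deg - 47.0) ** 2)


if __name__ == "__main__":
    # Full scan at 1-degree spacing versus the three-level search.
    print(scan(toy_spectrum, 0, 359, 1))
    print(hierarchical_search(toy_spectrum, 0, 359, (30, 5, 1)))
```

On this toy spectrum both searches find the same peak, but the hierarchical one evaluates far fewer directions. The trade-off is that a peak narrower than the coarse spacing can be missed at the first level, so a real system would have to choose the coarse grid with the width of the spectrum's main lobe in mind.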

[Figure 1: Process flow of robot audition. Input: acoustic signal. Stages: sound source localization, sound source separation, acoustic feature extraction, automatic speech recognition. Output: ASR results.]
