Audio-Visual Beat Tracking Based on a State-Space Model for a Robot Dancer Performing with a Human Dancer

Misato Ohkita, Yoshiaki Bando, Eita Nakamura, Katsutoshi Itoyama, Kazuyoshi Yoshii
DOI: https://doi.org/10.20965/jrm.2017.p0125
2017-02-20
Journal of Robotics and Mechatronics
Abstract: [Figure: An overview of real-time audio-visual beat tracking for music audio signals and human dance moves] This paper presents a real-time beat-tracking method that probabilistically integrates audio and visual information so that a humanoid robot can dance in synchronization with music and human dancers. Most conventional music robots use either music audio signals or the movements of human dancers, but not both, to detect and predict beat times in real time. However, because a robot must record the music with its own microphones, the audio signals are severely contaminated with loud environmental noise. To solve this problem, we propose a state-space model whose state is a pair consisting of a tempo and a beat time, and which represents how acoustic and visual features are generated from a given state. The acoustic features consist of tempo likelihoods and onset likelihoods obtained from music audio signals, and the visual features are tempo likelihoods obtained from dance movements. The current tempo and the next beat time are estimated online from the history of observed features by using a particle filter. Experimental results show that the proposed multi-modal method, which uses a depth sensor (Kinect) to extract skeleton features, outperformed conventional mono-modal methods in beat-tracking accuracy in a noisy and reverberant environment.
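The estimation scheme described in the abstract can be illustrated with a minimal bootstrap particle filter. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not give the transition model, the form of the likelihoods, or any parameter values, so the Gaussian likelihoods, the noise levels, the point observations, and the names `gauss_like`, `pf_step`, and `run_demo` are all hypothetical. Each particle holds a (tempo, beat time) hypothesis, is weighted by the product of an audio tempo likelihood, a visual tempo likelihood, and an onset likelihood, and is then resampled.

```python
import math
import random


def gauss_like(x, mu, sigma):
    """Unnormalized Gaussian likelihood (assumed stand-in for the paper's feature likelihoods)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def pf_step(particles, obs, rng):
    """One predict / weight / resample step of a bootstrap particle filter.

    particles: list of (bpm, beat_time) hypotheses.
    obs: (audio_bpm, visual_bpm, onset_time), hypothetical point observations.
    Returns (resampled particles, weighted-mean bpm, weighted-mean beat time).
    """
    audio_bpm, visual_bpm, onset_time = obs
    predicted, weights = [], []
    for bpm, beat in particles:
        # Predict: tempo drifts slowly; the next beat falls one period later.
        bpm = min(240.0, max(40.0, bpm + rng.gauss(0.0, 1.0)))
        beat = beat + 60.0 / bpm + rng.gauss(0.0, 0.01)
        predicted.append((bpm, beat))
        # Weight: product of audio tempo, visual tempo, and onset likelihoods.
        weights.append(
            gauss_like(audio_bpm, bpm, 8.0)
            * gauss_like(visual_bpm, bpm, 4.0)
            * gauss_like(onset_time, beat, 0.05)
        )
    total = sum(weights)
    if total <= 0.0:
        # Every hypothesis is implausible: fall back to uniform weights.
        weights = [1.0] * len(predicted)
        total = float(len(predicted))
    est_bpm = sum(w * p[0] for w, p in zip(weights, predicted)) / total
    est_beat = sum(w * p[1] for w, p in zip(weights, predicted)) / total
    # Multinomial resampling in proportion to the weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, est_bpm, est_beat


def run_demo(true_bpm=120.0, n_steps=20, n_particles=500, seed=0):
    """Track a synthetic constant-tempo beat sequence; all noise levels are assumptions."""
    rng = random.Random(seed)
    particles = [
        (rng.uniform(60.0, 180.0), rng.gauss(0.0, 0.02)) for _ in range(n_particles)
    ]
    period = 60.0 / true_bpm
    est_bpm = est_beat = None
    for k in range(1, n_steps + 1):
        obs = (
            true_bpm + rng.gauss(0.0, 6.0),     # noisy audio tempo estimate
            true_bpm + rng.gauss(0.0, 3.0),     # visual tempo estimate
            k * period + rng.gauss(0.0, 0.02),  # noisy onset time near the beat
        )
        particles, est_bpm, est_beat = pf_step(particles, obs, rng)
    return est_bpm, est_beat, n_steps * period


if __name__ == "__main__":
    bpm, beat, true_beat = run_demo()
    print(f"estimated tempo {bpm:.1f} BPM, beat at {beat:.3f} s (true {true_beat:.3f} s)")
```

In this sketch the next beat time would be predicted as the estimated beat time plus `60.0 / est_bpm`. Multiplying the audio and visual likelihoods is what lets one modality compensate when the other is unreliable, which is the intuition behind the multi-modal result reported in the abstract.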