Vision-referential speech enhancement of an audio signal using mask information captured as visual data

Mitsuharu Matsumoto
DOI: https://doi.org/10.1121/1.5087563
Abstract: This paper describes a vision-referential method for enhancing the speech in an audio signal using mask information captured as visual data. Smartphones and tablet devices have become popular in recent years, and most have both a microphone and a camera. Although the frame rate of the camera in such devices is very low compared to the sampling rate of the audio signal from the microphone, the camera signal can help enhance the speech signal when the two are combined appropriately. In the proposed method, the speaker broadcasts not only his/her speech signal through a loudspeaker but also its mask information through a display. The receiver enhances the speech by combining the speech signal captured by the microphone with the reference signal captured by the camera. Experiments were conducted to evaluate the effectiveness of the proposed method compared to a typical sparse approach. They confirmed that the speech could be enhanced even in environments containing different kinds of noise and a high level of real noise. Further experiments assessed the sound quality of the proposed method by comparing its output with clean audio data compressed in MP3 format at various bit rates. The sound quality was sufficient for practical application.
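The abstract does not specify what form the mask information takes, so the following is only a minimal sketch of the general idea of mask-referenced enhancement, under stated assumptions: the mask is taken to be a binary time-frequency mask computed from the clean speech on the sender side, and the display-to-camera channel is not modelled at all (the mask is simply handed to the receiver). The function names, frame size, and threshold are hypothetical choices for illustration, not the author's implementation.

```python
import numpy as np

FRAME = 256          # assumed STFT frame length (illustrative)
HOP = FRAME // 2     # 50% overlap
# Periodic Hann window: copies shifted by HOP sum to 1, so plain overlap-add inverts the STFT.
WINDOW = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(FRAME) / FRAME)


def stft(x):
    """Short-time Fourier transform: one row per frame, one column per frequency bin."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, length):
    """Inverse STFT by overlap-add (exact away from the signal edges)."""
    frames = np.fft.irfft(spec, n=FRAME, axis=1)
    out = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + FRAME] += frame
    return out


def make_mask(clean, rel_threshold=0.05):
    """Sender side (assumption): mark time-frequency cells where clean speech is strong."""
    mag = np.abs(stft(clean))
    return mag > rel_threshold * mag.max()


def enhance(noisy, mask):
    """Receiver side (assumption): keep only the masked cells of the noisy spectrogram."""
    return istft(stft(noisy) * mask, len(noisy))


def snr_db(reference, estimate):
    """Signal-to-noise ratio of `estimate` against `reference`, in dB."""
    err = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))


def demo(seed=0):
    """Synthetic stand-in for speech (two tones) in white noise; returns SNR before and after."""
    rng = np.random.default_rng(seed)
    fs = 8000
    t = np.arange(fs) / fs
    clean = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
    noisy = clean + rng.normal(scale=1.0, size=clean.shape)
    enhanced = enhance(noisy, make_mask(clean))
    inner = slice(FRAME, -FRAME)  # skip edges where overlap-add is incomplete
    return snr_db(clean[inner], noisy[inner]), snr_db(clean[inner], enhanced[inner])


if __name__ == "__main__":
    before, after = demo()
    print(f"SNR before: {before:.1f} dB, after: {after:.1f} dB")
```

In a real system the mask would have to survive the low camera frame rate and the optical channel mentioned in the abstract; this sketch ignores both and only shows why side information about where the speech energy lies can suppress noise at the receiver.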