Modeling the Synchrony between Audio and Visual Modalities for Speaker Identification

Yu WANG, WU Zhiyong, Lianhong CAI, Helen M. MENG
2008-01-01
Abstract: This work aims to understand and model the inter-modal temporal relations between the audio and visual modalities of speech, and to validate whether the captured relations can improve audio-visual bimodal modeling for applications such as audio-visual speaker identification. We propose to extend our audio-visual correlative model (AVCM) with explicit durational modeling of the partial temporal synchrony between the two speech modalities, i.e., where the audio may lead, lag, or remain synchronized with the video. We refer to the extended model as Durational-AVCM. Experiments on the CMU database and a homegrown database demonstrate that Durational-AVCM can improve the accuracy of audio-visual speaker identification over the original AVCM at all acoustic signal-to-noise ratios (SNRs) from 0 dB to 30 dB under varying acoustic conditions. The results indicate the importance of incorporating the partial temporal synchrony between the audio and visual modalities in audio-visual bimodal modeling.