Fusion of visual and acoustic signals for command-word recognition

R. Kober, U. Harz, J. Schiffers
DOI: https://doi.org/10.1109/icassp.1997.596233
2024-01-08
Abstract: We investigate how the visual information of lip movement contributes to command-word recognition. The acoustic and visual signals can be fused either at the feature level or at the class level. Fusion at the feature level means merging the acoustic and visual features into a combined feature vector, which is fed into an HMM system. Fusion at the class level means classifying the two sources of information separately and then combining the classification results. An HMM classifier is used for the acoustic signal and three different classifiers (HMM, DTW and ClaRe) for the visual signal. The classification results are combined using the C4.5 decision tree classifier. The recognition rates of the two fusion schemes are comparable. Compared with the acoustic system alone, the acoustic/visual system yields small improvements at high SNRs and larger improvements (up to 12%) at low SNRs.
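The difference between the two fusion schemes can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the assumption that both streams share the same frame count, and the per-word score dictionaries are all hypothetical. In particular, the class-level combiner below is a plain weighted sum used as a stand-in, whereas the paper combines classifier outputs with a C4.5 decision tree.

```python
from typing import Dict, List, Sequence


def fuse_features(acoustic: Sequence[Sequence[float]],
                  visual: Sequence[Sequence[float]]) -> List[List[float]]:
    """Feature-level fusion: concatenate per-frame feature vectors.

    Assumption: both streams were already resampled to the same
    number of frames, so frame i of one matches frame i of the other.
    """
    if len(acoustic) != len(visual):
        raise ValueError("streams must have the same number of frames")
    # Each fused frame is the acoustic vector followed by the visual vector.
    return [list(a) + list(v) for a, v in zip(acoustic, visual)]


def fuse_classes(acoustic_scores: Dict[str, float],
                 visual_scores: Dict[str, float],
                 acoustic_weight: float = 0.5) -> str:
    """Class-level fusion: combine per-word scores of two classifiers.

    Stand-in combiner (weighted sum), NOT the C4.5 tree from the paper.
    Returns the command word with the highest combined score.
    """
    combined = {
        word: acoustic_weight * acoustic_scores[word]
        + (1.0 - acoustic_weight) * visual_scores[word]
        for word in acoustic_scores
    }
    return max(combined, key=combined.get)


# Hypothetical toy data: two frames, two command words.
fused = fuse_features([[1.0, 2.0], [3.0, 4.0]], [[0.5], [0.6]])
word = fuse_classes({"start": 0.4, "stop": 0.6},
                    {"start": 0.9, "stop": 0.1})
```

In the feature-level case, the fused frames would be passed to a single classifier. In the class-level case, each stream keeps its own classifier and only the scores are merged, so the relative weight of the two streams can be adjusted separately.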