mmMIC: Multi-modal Speech Recognition Based on mmWave Radar.

Long Fan, Lei Xie, Xinran Lu, Yi Li, Chuyu Wang, Sanglu Lu
DOI: https://doi.org/10.1109/infocom53939.2023.10229085
2023-01-01
Abstract: Despite the proliferation of voice assistants, microphone-based speech recognition often performs poorly when multiple sound sources and ambient noise are present. In this paper, we propose a novel mmWave-based speech recognition solution that addresses both problems by precisely extracting multi-modal features of lip motion and vocal-cord vibration from a single mmWave channel. We propose a difference-based feature extraction method for lip motion that suppresses dynamic interference from body and head motion. We propose a speech detection method based on cross-validation of lip motion and vocal-cord vibration, which avoids wasting computing resources on non-speaking activities. We propose a multi-modal fusion framework that fuses the signal features of lip motion and vocal-cord vibration using an attention mechanism. We implemented a prototype system and evaluated its performance in real test-beds. Experimental results show an average speech recognition accuracy of 92.8% in realistic environments.
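To make the fusion step concrete, the following is a minimal sketch of attention-weighted fusion of two modality feature vectors. It is not the authors' implementation: the abstract does not specify the network, so the function name `attention_fuse`, the `query` vector, and the scaled dot-product scoring are all assumptions chosen for illustration. In a trained model the query would be learned rather than passed in by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(lip_feat, vocal_feat, query):
    """Fuse two equal-length feature vectors with scaled dot-product attention.

    Hypothetical helper: `lip_feat` and `vocal_feat` stand in for the
    lip-motion and vocal-cord vibration features; `query` stands in for a
    learned attention query. Returns (fused_vector, attention_weights).
    """
    if not (len(lip_feat) == len(vocal_feat) == len(query)):
        raise ValueError("all vectors must have the same length")
    if not query:
        raise ValueError("vectors must be non-empty")

    feats = [lip_feat, vocal_feat]
    scale = math.sqrt(len(query))

    # One relevance score per modality: scaled dot product with the query.
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in feats]
    weights = softmax(scores)

    # The fused feature is the attention-weighted sum of the two modalities.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, feats))
        for i in range(len(query))
    ]
    return fused, weights
```

With this scoring, the modality whose features align more strongly with the query receives the larger weight, so a noisy or uninformative modality contributes less to the fused representation.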