Silent speech recognition using data augmentation based on a three-dimensional lip model

Kenko Ota
DOI: https://doi.org/10.1121/10.0023437
2023-10-01
The Journal of the Acoustical Society of America
Abstract: In this study, we proposed a data augmentation method for machine lip reading, which estimates the content of speech without using audio data, based on a three-dimensional model of the speaker's face. The proposed method converted a speaker's face into a three-dimensional model using DECA (Detailed Expression Capture and Animation) and generated a large amount of training data by rotating the model in different directions. We then extracted the facial features around the lips as time series data using dlib, a machine learning toolkit that provides facial landmark detection. The time series data were fed into an end-to-end recognition model of the kind used in continuous speech recognition. The recognition model is based on DeepSpeech2; we modified the filter size of the CNN (convolutional neural network) and the number of layers of the BiGRU (bidirectional gated recurrent unit) to suit our task. We evaluated the proposed method on ten ordinary Japanese words, using the phoneme error rate as the evaluation index. The method achieved an error rate of about 0.32 even when the speakers in the evaluation data were not included in the training data.
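The abstract does not state how the phoneme error rate is computed. A common definition, assumed here, is the total edit distance between reference and recognized phoneme sequences divided by the total number of reference phonemes. The following is a minimal sketch under that assumption; the function names and the example phoneme sequences are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences (lists of strings)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a spurious phoneme
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by total reference length (assumed definition)."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must contain the same number of utterances")
    total_phonemes = sum(len(r) for r in refs)
    if total_phonemes == 0:
        raise ValueError("reference set contains no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_phonemes


if __name__ == "__main__":
    # Hypothetical example: one reference word and a recognition result
    # that drops a single phoneme.
    reference = [["k", "o", "N", "n", "i", "ch", "i", "w", "a"]]
    recognized = [["k", "o", "n", "i", "ch", "i", "w", "a"]]
    print(phoneme_error_rate(reference, recognized))
```

Under this definition, a value of about 0.32 would mean that roughly one in three reference phonemes is involved in a substitution, insertion, or deletion error.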
acoustics, audiology & speech-language pathology