Automatic Detection of the Temporal Segmentation of Hand Movements in British English Cued Speech

Li Liu, Jianze Li, Gang Feng, Xiao-Ping Zhang
DOI: https://doi.org/10.21437/interspeech.2019-2353
2019-01-01
Abstract: Cued Speech (CS) is a multi-modal system that complements lip reading with manual hand cues at the phonetic level to make spoken language visible. Lip and hand movements have been found to be asynchronous in CS, so studying the temporal organization of the hand is very important for multi-modal CS feature fusion. In this work, we propose a novel diphthong-hand preceding model (D-HPM) by investigating the relationship between hand preceding time (HPT) and diphthong time instants in sentences for British English CS. In addition, we demonstrate that the HPTs of the first and second parts of diphthongs are very strongly correlated. Combining the monophthong-HPM (M-HPM) and the D-HPM, we present a hybrid temporal segmentation detection algorithm (HTSDA) for hand movement in CS. The proposed algorithm is evaluated in a hand position recognition experiment using a multi-Gaussian classifier as well as a long short-term memory (LSTM) network. The results show that the HTSDA significantly improves recognition performance compared with the baseline (i.e., audio-based segmentation) and the state-of-the-art M-HPM. To the best of our knowledge, this is the first work to study the temporal organization of hand movements in British English CS.