A Two-Stream CNN Framework for American Sign Language Recognition Based on Multimodal Data Fusion

Qing Gao, Uchenna Emeoha Ogenyi, Jinguo Liu, Zhaojie Ju, Honghai Liu
DOI: https://doi.org/10.1007/978-3-030-29933-0_9
2019-08-30
Abstract: Vision-based hand gesture recognition plays an important role in human-robot interaction (HRI). This non-contact method enables natural and friendly interaction between people and robots. To this end, a two-stream CNN framework (2S-CNN) is proposed to recognize American Sign Language (ASL) hand gestures based on multimodal (RGB and depth) data fusion. First, the hand gesture data is enhanced to remove the influence of background and noise. Second, RGB and depth features of the hand gesture are extracted for recognition by CNNs on two separate streams. Finally, a fusion layer is designed to fuse the recognition results of the two streams. This method uses multimodal data to increase the recognition accuracy of ASL hand gestures. Experiments show that the recognition accuracy of 2S-CNN can reach 92.08% on the ASL fingerspelling database, which is higher than that of the baseline methods.
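The fusion step described in the abstract can be illustrated with a minimal late-fusion sketch. The abstract does not specify how the fusion layer combines the two streams, so the weighted average of per-class probabilities used here is an assumption, not the authors' design. The function names (`softmax`, `fuse_scores`, `predict`) and the `rgb_weight` parameter are hypothetical, and the per-stream logits stand in for the outputs of the RGB and depth CNNs.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(rgb_logits, depth_logits, rgb_weight=0.5):
    """Weighted average of the two streams' class probabilities.

    Assumption: this simple late fusion is only a stand-in for the
    paper's fusion layer, whose exact form the abstract does not give.
    """
    if len(rgb_logits) != len(depth_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= rgb_weight <= 1.0:
        raise ValueError("rgb_weight must lie in [0, 1]")
    p_rgb = softmax(rgb_logits)
    p_depth = softmax(depth_logits)
    return [
        rgb_weight * r + (1.0 - rgb_weight) * d
        for r, d in zip(p_rgb, p_depth)
    ]


def predict(rgb_logits, depth_logits, rgb_weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_scores(rgb_logits, depth_logits, rgb_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical three-class example: the RGB stream favors class 0,
# while the depth stream strongly favors class 1.
rgb_logits = [2.0, 1.0, 0.1]
depth_logits = [0.1, 3.0, 0.2]
fused_class = predict(rgb_logits, depth_logits)
```

With equal weights, the more confident depth stream decides the fused prediction in this toy example. Setting `rgb_weight=1.0` reduces the fusion to the RGB stream alone, which is one way to compare single-stream and fused behavior.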