MVTN: A Multiscale Video Transformer Network for Hand Gesture Recognition

Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan
2024-09-06
Abstract: In this paper, we introduce a novel Multiscale Video Transformer Network (MVTN) for dynamic hand gesture recognition. Multiscale features can capture hands of varying size, pose, and shape, which is a central challenge in hand gesture recognition. The proposed model incorporates a multiscale feature hierarchy to capture diverse levels of detail and context within hand gestures. This hierarchy is obtained by extracting different dimensions of attention in different transformer stages, with the initial stages modeling high-resolution features and the later stages modeling low-resolution features. Our approach also leverages multimodal data, using depth maps, infrared data, and surface normals along with RGB images from the NVGesture and Briareo datasets. Experiments show that the proposed MVTN achieves state-of-the-art results with lower computational complexity and fewer parameters. The source code is available at https://github.com/mallikagarg/MVTN.
Computer Vision and Pattern Recognition, Human-Computer Interaction
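
As a rough illustration of the abstract's idea of using a different attention dimension in each transformer stage, the NumPy sketch below stacks three single-head self-attention stages whose query, key, and value projections shrink from stage to stage. It is not the authors' implementation (that is in the linked repository). The helper names `make_stage` and `attention_stage`, the model width of 64, the per-stage dimensions `[64, 32, 16]`, and the 40-token input are all assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_stage(d_model, d_attn, rng):
    """Random weights for one single-head attention stage (illustrative only)."""
    return {
        "wq": rng.normal(0.0, 1.0 / np.sqrt(d_model), (d_model, d_attn)),
        "wk": rng.normal(0.0, 1.0 / np.sqrt(d_model), (d_model, d_attn)),
        "wv": rng.normal(0.0, 1.0 / np.sqrt(d_model), (d_model, d_attn)),
        "wo": rng.normal(0.0, 1.0 / np.sqrt(d_attn), (d_attn, d_model)),
    }

def attention_stage(x, w):
    """Self-attention over x of shape (tokens, d_model), computed in d_attn dims."""
    q, k, v = x @ w["wq"], x @ w["wk"], x @ w["wv"]   # linear projections
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (tokens, tokens)
    return x + (scores @ v) @ w["wo"]                 # project back, add residual

def count_params(w):
    return sum(m.size for m in w.values())

rng = np.random.default_rng(0)
d_model = 64
attn_dims = [64, 32, 16]  # hypothetical per-stage attention dimensions
stages = [make_stage(d_model, d, rng) for d in attn_dims]

x = rng.normal(size=(40, d_model))  # e.g. 40 frame tokens from one clip
for w in stages:
    x = attention_stage(x, w)

params = [count_params(w) for w in stages]
print(x.shape, params)  # (40, 64) [16384, 8192, 4096]
```

In this sketch, halving the attention dimension halves both the projection parameters and the cost of the `q @ k.T` product for a stage, which is one way a staged design can cut cost. How the actual model reduces the dimension, and whether the number of tokens also changes between stages, is not specified here.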
What problem does this paper attempt to address?
This paper addresses three key problems in dynamic hand gesture recognition:

1. **Multi-scale feature extraction**: When hand pose, size, and shape vary widely, single-scale feature extraction struggles to cope. The paper proposes a Multiscale Video Transformer Network (MVTN) that extracts multi-scale attention features across its transformer stages, so it can handle gestures at different scales.
2. **Reducing computational complexity and parameter count**: On high-resolution inputs, conventional Transformer models incur high computational and memory costs because self-attention is global and its cost grows quadratically with the number of tokens. MVTN progressively reduces the attention dimension through linear projection, lowering the computational cost and parameter count while maintaining performance.
3. **Use of multi-modal data**: To improve recognition accuracy, MVTN uses data from multiple sensors: RGB images, depth maps, infrared data, and surface normals. Fusing these modalities lets the model recognize gestures more reliably under different input conditions.

### Main contributions of the paper

1. **Multi-scale attention pyramid structure**: MVTN introduces a multi-scale attention mechanism that extracts features at different scales in different transformer stages, helping the model handle variation in hand pose, size, and shape.
2. **Convolution-free multi-scale Transformer design**: By progressively reducing the attention dimension through linear projection, MVTN forms a convolution-free multi-scale Transformer that lowers computational cost while learning multi-level context.
3. **Effective fusion of multi-modal data**: MVTN is evaluated with single-modal and multi-modal inputs on the NVGesture and Briareo datasets. The results show that the model benefits from multi-modal fusion and outperforms existing state-of-the-art methods.

### Summary

The paper tackles the variation in hand pose, size, and shape that makes dynamic gesture recognition difficult by combining multi-scale attention with linear projection, and it further improves recognition accuracy through multi-modal data fusion. MVTN also reduces computational complexity and parameter count, giving an efficient and robust approach to dynamic gesture recognition.
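
The text above does not say how the modalities are fused. One common option, shown here purely as an assumption, is late fusion: each modality gets its own classifier, and the per-class probabilities are averaged. The function name `late_fusion`, the three-class setup, and the logit values are hypothetical and are not taken from the paper or its datasets.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of class logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def late_fusion(logits_by_modality):
    """Average per-modality class probabilities; return (probs, predicted class)."""
    probs = np.stack([softmax(l) for l in logits_by_modality.values()])
    fused = probs.mean(axis=0)
    return fused, int(fused.argmax())

# Made-up logits for a 3-class toy problem, one vector per modality.
logits = {
    "rgb": np.array([2.0, 1.0, 0.1]),
    "depth": np.array([0.5, 2.5, 0.2]),
    "infrared": np.array([0.3, 2.0, 0.4]),
    "normals": np.array([1.0, 1.2, 0.1]),
}

fused, label = late_fusion(logits)
rgb_only = int(softmax(logits["rgb"]).argmax())
print(rgb_only, label)  # 0 1
```

In this toy case the RGB stream alone favors class 0, while the averaged prediction is class 1 because the other three modalities agree on it. That is the kind of correction multi-modal fusion is meant to provide.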