Abstract: In this paper, we introduce a novel Multiscale Video Transformer Network (MVTN) for dynamic hand gesture recognition. Multiscale features can capture hands of varying size, pose, and shape, which is a central challenge in hand gesture recognition. The proposed model incorporates a multiscale feature hierarchy to capture diverse levels of detail and context within hand gestures, which strengthens its recognition ability. This hierarchy is obtained by using a different attention dimension in each transformer stage, with the initial stages modeling high-resolution features and the later stages modeling low-resolution features. Our approach also leverages multimodal data, using depth maps, infrared data, and surface normals along with RGB images from the NVGesture and Briareo datasets. Experiments show that the proposed MVTN achieves state-of-the-art results with lower computational complexity and fewer parameters. The source code is available at https://github.com/mallikagarg/MVTN.

What problem does this paper attempt to address?

This paper addresses several key problems in dynamic hand gesture recognition:

1. **Multi-scale feature extraction**: When hand pose, size, and shape vary widely, single-scale feature extraction struggles to cope. The paper proposes a multi-scale video Transformer network (MVTN) that extracts multi-scale attention features across its Transformer stages, so it can handle gestures that appear at different scales.
2. **Optimization of computational complexity and the number of parameters**: On high-resolution tasks, traditional Transformer models have high computational complexity and memory consumption, because they extract features globally and the cost of attention grows quadratically. MVTN gradually reduces the attention dimension through linear projection, which lowers the computational cost and the number of parameters while maintaining model performance.
3. **Utilization of multi-modal data**: To improve recognition accuracy, MVTN uses data from multiple sensors: RGB images, depth maps, infrared data, and surface normals. By fusing this multi-modal information, the model can better recognize gestures under different input conditions.

### Main contributions of the paper

1. **Multi-scale attention pyramid structure**: MVTN introduces a multi-scale attention mechanism that extracts features at different scales in different Transformer stages, helping the model handle changes in hand pose, size, and shape.
2. **Convolution-free multi-scale Transformer design**: By gradually reducing the attention dimension through linear projection, MVTN forms a convolution-free multi-scale Transformer that reduces the computational cost while learning multi-level context information.
3. **Effective fusion of multi-modal data**: MVTN is evaluated with single-modal and multi-modal inputs on the NVGesture and Briareo datasets. The results show that the model benefits from multi-modal data fusion and outperforms existing state-of-the-art methods.

### Summary

This paper addresses the challenges that changes in hand pose, size, and shape pose for dynamic gesture recognition by introducing multi-scale attention and linear projection, and it further improves recognition accuracy through multi-modal data fusion. MVTN also reduces computational complexity and the number of parameters, providing an efficient and robust solution for dynamic gesture recognition.
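The claim that shrinking the attention dimension through linear projection lowers both the parameter count and the compute can be illustrated with a rough count. The sketch below is not taken from the MVTN code. It assumes a single-head self-attention layer without biases, and the token count, model width, and per-stage attention dimensions are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageCost:
    attn_dim: int  # dimension that Q, K, V are projected to in this stage
    params: int    # weights in the four projection matrices (no biases)
    macs: int      # multiply-accumulate operations for one forward pass


def attention_stage_cost(num_tokens: int, model_dim: int, attn_dim: int) -> StageCost:
    """Rough cost of one single-head self-attention layer (an assumed layout)."""
    # Q, K, V projections (model_dim -> attn_dim) plus the output
    # projection (attn_dim -> model_dim).
    params = 3 * model_dim * attn_dim + attn_dim * model_dim
    # Every token passes through all four projections once.
    proj_macs = num_tokens * params
    # Score matrix QK^T and the weighted sum over V each cost
    # num_tokens^2 * attn_dim, which is the quadratic term.
    attn_macs = 2 * num_tokens * num_tokens * attn_dim
    return StageCost(attn_dim, params, proj_macs + attn_macs)


def pyramid_costs(num_tokens: int, model_dim: int, attn_dims: List[int]) -> List[StageCost]:
    """Cost per stage when the attention dimension changes from stage to stage."""
    return [attention_stage_cost(num_tokens, model_dim, d) for d in attn_dims]


if __name__ == "__main__":
    # Hypothetical configuration: 40 tokens, width 512, attention
    # dimension halved at each stage.
    for stage in pyramid_costs(num_tokens=40, model_dim=512, attn_dims=[512, 256, 128]):
        print(stage)
```

Under these assumptions, halving the attention dimension halves both the projection weights and the quadratic attention term of that stage, which is the direction of the saving described above. The actual figures for MVTN depend on its real configuration.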

MVTN: A Multiscale Video Transformer Network for Hand Gesture Recognition

GestFormer: Multiscale Wavelet Pooling Transformer Network for Dynamic Hand Gesture Recognition

ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals

ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition

MVHANet: Multi-view Hierarchical Aggregation Network for Skeleton-Based Hand Gesture Recognition

HGR-ViT: Hand Gesture Recognition with Vision Transformer

MVTN: Learning Multi-View Transformations for 3D Understanding

Improved mKLT and low layered HG-CNN based dynamic gesture recognition hardware system

Multiview Transformers for Video Recognition

Surface Electromyography-based Gesture Recognition by Multi-view Deep Learning

Multi-Scale Attention 3D Convolutional Network for Multimodal Gesture Recognition

A Novel Approach to Surface EMG-based Gesture Classification Using a Vision Transformer Integrated with Convolutive Blind Source Separation

Integration of Convolutional Neural Network and Vision Transformer for Gesture Recognition Using sEMG

Static Hand Gesture Recognition Method Based on the Vision Transformer

Single Shot Detector CNN and Deep Dilated Masks for Vision-Based Hand Gesture Recognition From Video Sequences

Multimode Gesture Recognition Algorithm Based on Convolutional Long Short-Term Memory Network

Multiresolution Match Kernels for Gesture Video Classification

Short-Term Temporal Convolutional Networks for Dynamic Hand Gesture Recognition

Multimodal Gesture Recognition Using Multi-stream Recurrent Neural Network

A Convolutional-Transformer-Based Approach for Dynamic Gesture Recognition of Data Gloves

Spatiotemporal features representation with dynamic mode decomposition for hand gesture recognition using deep neural networks