Abstract: Recurrent neural networks (RNNs) are widely used to capture spatiotemporal features in video data. However, their effectiveness relies heavily on the spatial features on which they are trained. This paper introduces ensembles of features for constructing frame-wise representations, produced by neural network models with dedicated training pipelines. These features are designed to improve RNN-based recognition of hand gesture videos by leveraging temporal information. Recognizing hand gestures from videos is a challenging task. One notable challenge is overlap in gesture motion, where different gesture categories exhibit similar hand poses within a single video clip. To address this, we develop extensive and diverse features that describe gesture video clips more comprehensively, thereby mitigating recognition errors caused by overlapping images. The resulting diverse features improve the recognition of hand gestures from videos, particularly where overlap poses a significant challenge. We combine features extracted from a deep neural network trained from scratch with features obtained from standard neural networks (Self-Organizing Map, Radial Basis Function) that are trained to enhance the deep features. Combining these shared features yields new frame-wise image features. Furthermore, we provide a performance comparison of the newly constructed frame-wise features, applied through time-sharing, for training an RNN for recognition. The proposed models are evaluated on two hand gesture video datasets in which preserving the gesture sequence is crucial because of overlapping motions. Our work demonstrates a significant improvement in performance on both datasets.
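The feature combination described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: the deep features are random stand-ins, the RBF centers are random rather than learned, the Self-Organizing Map branch is omitted, and the recurrent model is a plain untrained tanh RNN. All function names, shapes, and the `gamma` value are assumptions chosen for illustration only.

```python
import numpy as np

def rbf_features(x, centers, gamma=1.0):
    """Gaussian RBF activations of each frame vector with respect to each center."""
    # x: (T, D), centers: (K, D) -> squared distances of shape (T, K)
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def combine_framewise(deep_feats, centers, gamma=1.0):
    """Concatenate deep features with their RBF activations, frame by frame."""
    rbf = rbf_features(deep_feats, centers, gamma)
    return np.concatenate([deep_feats, rbf], axis=1)

def rnn_last_state(seq, w_in, w_rec, b):
    """Run a plain tanh RNN over the frames in order and return the final hidden state."""
    h = np.zeros(w_rec.shape[0])
    for frame in seq:
        h = np.tanh(frame @ w_in + h @ w_rec + b)
    return h

rng = np.random.default_rng(0)
T, D, K, H = 16, 32, 8, 24             # frames, deep dim, RBF centers, hidden units (all hypothetical)
deep_feats = rng.normal(size=(T, D))   # stand-in for per-frame deep network features
centers = rng.normal(size=(K, D))      # stand-in for trained RBF centers

combined = combine_framewise(deep_feats, centers, gamma=0.01)

w_in = rng.normal(scale=0.1, size=(D + K, H))
w_rec = rng.normal(scale=0.1, size=(H, H))
b = np.zeros(H)
clip_embedding = rnn_last_state(combined, w_in, w_rec, b)
```

Because the frames are processed in order, the final hidden state depends on the gesture sequence rather than on an unordered pool of frames, which is the property the abstract highlights for clips with overlapping poses.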

Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video

Short-Term Temporal Convolutional Networks for Dynamic Hand Gesture Recognition

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

Multimodal Spatiotemporal Feature Map for Dynamic Gesture Recognition

Deep learning models beyond temporal frame-wise features for hand gesture video recognition

Temporal Pyramid Relation Network for Video-Based Gesture Recognition

Gesture recognition based on deep deformable 3D convolutional neural networks

Convolutional neural network with spatial pyramid pooling for hand gesture recognition

High Performance Gesture Recognition Via Effective and Efficient Temporal Modeling

Rapid Decoding of Hand Gestures in Electrocorticography Using Recurrent Neural Networks

Temporal-attentive Covariance Pooling Networks for Video Recognition

Selective spatiotemporal features learning for dynamic gesture recognition

A Spatio-Temporal Multilayer Perceptron for Gesture Recognition

A New Spiking Convolutional Recurrent Neural Network (SCRNN) With Applications to Event-Based Hand Gesture Recognition

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition

Spatiotemporal features representation with dynamic mode decomposition for hand gesture recognition using deep neural networks

Temporal Decoupling Graph Convolutional Network for Skeleton-Based Gesture Recognition

Large-scale Isolated Gesture Recognition Using Convolutional Neural Networks

GestFormer: Multiscale Wavelet Pooling Transformer Network for Dynamic Hand Gesture Recognition

Efficient Gesture Recognition on Spiking Convolutional Networks Through Sensor Fusion of Event-Based and Depth Data