Zero-Shot Underwater Gesture Recognition

Sandipan Sarma, Gundameedi Sai Ram Mohan, Hariansh Sehgal, Arijit Sur
2024-07-19
Abstract: Hand gesture recognition allows humans to interact with machines non-verbally, which has important applications in underwater exploration using autonomous underwater vehicles. Recently, a new gesture-based language called CADDIAN has been devised for divers, and supervised learning methods have been applied to recognize the gestures with high accuracy. However, such methods fail when they encounter unseen gestures in real time. In this work, we advocate the need for zero-shot underwater gesture recognition (ZSUGR), where the objective is to train a model with visual samples of gestures from a few "seen" classes only and transfer the gained knowledge at test time to recognize semantically similar unseen gesture classes as well. After discussing the problem and dataset-specific challenges, we propose new seen-unseen splits for gesture classes in the CADDY dataset. Then, we present a two-stage framework, where a novel transformer learns strong visual gesture cues and feeds them to a conditional generative adversarial network that learns to mimic the feature distribution. We use the trained generator as a feature synthesizer for unseen classes, enabling zero-shot learning. Extensive experiments demonstrate that our method outperforms the existing zero-shot techniques. We conclude by providing useful insights into our framework and suggesting directions for future research.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses zero-shot learning (ZSL) for underwater gesture recognition. Specifically, the authors ask how to train a model on a set of seen gesture classes so that it can also recognize unseen gesture classes for which it had no training samples. This matters underwater because divers may communicate with gestures that were not predefined or labeled in advance, and existing supervised learning methods cannot handle such unseen gestures.

### Problem Background

1. **Challenges in the Underwater Environment**: Underwater images usually suffer from low contrast, blurriness, and color distortion, which makes them difficult for traditional gesture recognition methods to analyze effectively.
2. **Data Scarcity**: Because the underwater environment is complex and dangerous, it is very difficult to collect enough labeled data to cover all possible gestures.
3. **Limitations of Existing Methods**: Most existing gesture recognition models are supervised, so they can only recognize gesture classes that are present in the training set.

### Solution

To address these problems, the authors propose the zero-shot underwater gesture recognition (ZSUGR) task and design a two-stage framework:

1. **First Stage: Feature Extraction**
   - Uses a novel Gated Cross-Attention Transformer (GCAT) to extract strong visual features from a pre-trained ResNet-50.
   - GCAT incorporates the visual representations of the CLIP image encoder through self-attention and cross-attention to produce more effective gesture features.
2. **Second Stage: Generative Adversarial Network (GAN)**
   - Uses a conditional Wasserstein GAN (c-WGAN) to synthesize visual features of unseen gesture classes.
   - Once trained, the c-WGAN generates visual features for unseen gesture classes, which enables zero-shot learning.

### Main Contributions

1. **Introducing a New Task**: Proposes and studies the zero-shot learning problem in underwater gesture recognition for the first time.
2. **Dataset Partition**: Proposes new seen-unseen class splits of the CADDY dataset for zero-shot training and model evaluation.
3. **Two-Stage Framework**: Designs a two-stage framework, consisting of a new transformer and a GAN, that extracts underwater visual features and synthesizes features for unseen gestures.

### Experimental Results

Experiments show that the method outperforms existing zero-shot classification methods in both the conventional zero-shot learning (CZSL) and generalized zero-shot learning (GZSL) settings. In the GZSL setting in particular, it recognizes unseen gesture classes better and achieves the highest harmonic mean of seen- and unseen-class accuracy (29.53 ± 7.06%).

In conclusion, this paper uses zero-shot learning to address the data scarcity and class imbalance problems encountered in underwater gesture recognition, providing a new approach for underwater exploration and human-machine interaction.
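
The harmonic mean reported for the GZSL setting combines accuracy on seen and unseen classes into one score, so a model cannot score well by favoring seen classes alone. The following is a minimal sketch of that metric, not the authors' evaluation code. It assumes that accuracy is averaged per class, which is a common GZSL convention the summary does not state, and the function names `per_class_accuracy` and `harmonic_mean` are hypothetical.

```python
from typing import Sequence


def per_class_accuracy(
    y_true: Sequence[int], y_pred: Sequence[int], classes: Sequence[int]
) -> float:
    """Mean of per-class top-1 accuracies over `classes` (assumed convention)."""
    accuracies = []
    for c in classes:
        # Indices of test samples whose ground-truth label is class c.
        idx = [i for i, label in enumerate(y_true) if label == c]
        if not idx:
            continue  # skip classes with no test samples
        correct = sum(1 for i in idx if y_pred[i] == c)
        accuracies.append(correct / len(idx))
    return sum(accuracies) / len(accuracies) if accuracies else 0.0


def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """H = 2 * S * U / (S + U); defined as 0 when both accuracies are 0."""
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / total


if __name__ == "__main__":
    # Toy labels: classes 0 and 1 are "seen", classes 2 and 3 are "unseen".
    y_true = [0, 0, 1, 1, 2, 2, 3, 3]
    y_pred = [0, 0, 1, 0, 2, 0, 1, 3]
    s = per_class_accuracy(y_true, y_pred, classes=[0, 1])
    u = per_class_accuracy(y_true, y_pred, classes=[2, 3])
    print(f"seen={s:.2f} unseen={u:.2f} H={harmonic_mean(s, u):.2f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model with high seen-class accuracy but near-zero unseen-class accuracy still receives a score close to zero.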