Abstract: Hand gesture recognition allows humans to interact with machines non-verbally, which has a huge application in underwater exploration using autonomous underwater vehicles. Recently, a new gesture-based language called CADDIAN has been devised for divers, and supervised learning methods have been applied to recognize the gestures with high accuracy. However, such methods fail when they encounter unseen gestures in real time. In this work, we advocate the need for zero-shot underwater gesture recognition (ZSUGR), where the objective is to train a model with visual samples of gestures from a few "seen" classes only and transfer the gained knowledge at test time to recognize semantically-similar unseen gesture classes as well. After discussing the problem and dataset-specific challenges, we propose new seen-unseen splits for gesture classes in the CADDY dataset. Then, we present a two-stage framework, where a novel transformer learns strong visual gesture cues and feeds them to a conditional generative adversarial network that learns to mimic feature distribution. We use the trained generator as a feature synthesizer for unseen classes, enabling zero-shot learning. Extensive experiments demonstrate that our method outperforms the existing zero-shot techniques. We conclude by providing useful insights into our framework and suggesting directions for future research.

What problem does this paper attempt to address?

This paper addresses zero-shot learning (ZSL) for underwater gesture recognition. Specifically, the authors study how to train a model on a set of seen gesture classes so that it can also recognize gesture classes it never saw during training. This matters underwater because divers may communicate with gestures that were not predefined or labeled in advance, and existing supervised learning methods cannot handle such unseen gestures.

### Problem Background

1. **Challenges in the Underwater Environment**: Underwater images usually suffer from low contrast, blurriness, and color distortion, which makes them difficult for traditional gesture recognition methods to analyze effectively.
2. **Data Scarcity**: Because the underwater environment is complex and dangerous, it is very difficult to collect enough labeled data to cover all possible gestures.
3. **Limitations of Existing Methods**: Most existing gesture recognition models are supervised, so they can only recognize gesture classes that appear in the training set and cannot recognize unseen gestures.

### Solution

To address these problems, the authors propose the zero-shot underwater gesture recognition (ZSUGR) task and design a two-stage framework:

1. **First Stage: Feature Extraction**
   - Use a novel Gated Cross-Attention Transformer (GCAT) to extract strong visual features from the pre-trained ResNet-50.
   - GCAT incorporates the visual representations from the CLIP image encoder through self-attention and cross-attention to produce more effective gesture features.
2. **Second Stage: Generative Adversarial Network (GAN)**
   - Use a conditional Wasserstein GAN (c-WGAN) to synthesize the visual features of unseen gesture classes.
   - Once trained, the c-WGAN generates visual features for unseen gesture classes, which enables zero-shot learning.

### Main Contributions

1. **Introducing a New Task**: Propose and study the zero-shot learning problem in underwater gesture recognition for the first time.
2. **Dataset Partition**: Propose a new seen-unseen class partition of the CADDY dataset for zero-shot training and model evaluation.
3. **Two-Stage Framework**: Design a two-stage framework, consisting of a new Transformer and a GAN, that extracts underwater visual features and synthesizes features for unseen gestures.

### Experimental Results

Experiments show that this method outperforms existing zero-shot classification methods in both the conventional zero-shot learning (CZSL) and generalized zero-shot learning (GZSL) settings. In the GZSL setting in particular, it recognizes unseen gesture classes better and achieves the highest harmonic mean (29.53±7.06%).

In conclusion, this paper addresses the data scarcity and class imbalance problems encountered in underwater gesture recognition by introducing zero-shot learning, providing a new approach for underwater exploration and human-machine interaction.
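The harmonic mean reported for the GZSL setting is commonly computed from the accuracy on seen classes and the accuracy on unseen classes. The sketch below assumes that standard definition, H = 2·S·U / (S + U); the summary does not state the paper's exact evaluation protocol, and the function name and input values are illustrative only, not taken from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean H = 2*S*U / (S + U) of seen and unseen accuracy.

    Assumes the usual GZSL convention; returns 0.0 when both inputs are 0
    to avoid dividing by zero.
    """
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / total


if __name__ == "__main__":
    # Hypothetical accuracies, chosen only to illustrate the behaviour.
    balanced = harmonic_mean(0.40, 0.40)
    skewed = harmonic_mean(0.75, 0.05)
    print(f"balanced model: H = {balanced:.3f}")
    print(f"seen-biased model: H = {skewed:.3f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model that scores well on seen classes but poorly on unseen ones receives a low H, which is why this metric is used to compare GZSL methods.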

Zero-Shot Underwater Gesture Recognition
