Abstract: Automatic fingerspelling recognition aims to overcome communication barriers between people who are deaf and those who can hear. RGB-D cameras are widely used to handle finger occlusion, which usually hinders fingerspelling recognition. However, color-depth misalignment, an intrinsic property of RGB-D cameras, hinders the simultaneous processing of color and depth images when the intrinsic parameters of the camera are unavailable. Furthermore, fine-grained hand gestures performed by different people and captured from multiple views make discriminative feature extraction difficult because of intra-class variability and inter-class similarity. Inspired by the human visual mechanism, we propose a network that learns discriminative features related to fine-grained hand gestures while suppressing the effect of color-depth misalignment. Unlike existing approaches that process RGB-D images independently, the proposed dual-path depth-aware attention network learns a fingerspelling representation in separate RGB and depth paths and progressively fuses the features learned from the two paths. As the hand is usually the closest object to the camera, depth information can help emphasize the key fingers related to a letter sign. Thus, we develop a depth-aware attention module (DAM) that exploits spatial relations in the depth feature maps and refines the RGB and depth feature maps across a bottleneck structure. The module establishes a lateral connection between the RGB and depth paths and provides a depth-aware salient map to both paths. Experimental results demonstrate that the proposed network improves accuracy (+0.83%) and $F$ score (+1.55%) over state-of-the-art methods on a publicly available fingerspelling dataset. Visualization of the network processes demonstrates that the DAM facilitates the selection of representative hand regions from the RGB-D images. Furthermore, the number of parameters and the computational overhead added by the DAM are negligible relative to the whole network. The code is available at https://github.com/cweizen/cweizen-DDaNet_model_master.
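
To make the idea of a depth-aware salient map shared by both paths more concrete, the following is a minimal NumPy sketch. It is not the authors' DAM: the function name `depth_aware_attention`, the weight shapes, the ReLU inside the bottleneck, the sigmoid gating, and the residual reweighting are all assumptions chosen for illustration. It only shows how a single-channel map derived from depth features through a channel bottleneck could reweight the RGB and depth feature maps alike.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def depth_aware_attention(rgb_feat, depth_feat, w_reduce, w_expand):
    """Hypothetical depth-aware attention step (not the authors' exact DAM).

    rgb_feat, depth_feat: arrays of shape (C, H, W).
    w_reduce: (C_mid, C) weights of a 1x1 convolution that squeezes channels.
    w_expand: (1, C_mid) weights of a 1x1 convolution that yields one map.
    """
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    squeezed = np.einsum("mc,chw->mhw", w_reduce, depth_feat)
    squeezed = np.maximum(squeezed, 0.0)  # ReLU inside the bottleneck (assumed)
    logits = np.einsum("om,mhw->ohw", w_expand, squeezed)
    salient_map = sigmoid(logits)  # shape (1, H, W), values in (0, 1)

    # Lateral connection: the same depth-derived map reweights both paths.
    # The residual form x + x * map is an assumption that keeps the input signal.
    rgb_out = rgb_feat + rgb_feat * salient_map
    depth_out = depth_feat + depth_feat * salient_map
    return rgb_out, depth_out, salient_map


# Toy feature maps with random values, purely for illustration.
rng = np.random.default_rng(0)
C, C_mid, H, W = 8, 2, 4, 4
rgb = rng.standard_normal((C, H, W))
depth = rng.standard_normal((C, H, W))
w_reduce = rng.standard_normal((C_mid, C)) * 0.1
w_expand = rng.standard_normal((1, C_mid)) * 0.1

rgb_out, depth_out, salient = depth_aware_attention(rgb, depth, w_reduce, w_expand)
print(rgb_out.shape, depth_out.shape, salient.shape)
```

Because the map has a single channel, the extra weights in this sketch number only `C_mid * C + C_mid`, which is consistent in spirit with the abstract's remark that the module adds little overhead; the actual parameter count of the published DAM is not stated here.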

DDaNet: Dual-Path Depth-Aware Attention Network for Fingerspelling Recognition Using RGB-D Images

FDN: Feature Decoupling Network for Head Pose Estimation

ACNet: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation

A Fine-Grained Visual Attention Approach for Fingerspelling Recognition in the Wild

Weakly-supervised Disentanglement Network for Video Fingerspelling Detection

Sign language recognition based on dual-path background erasure convolutional neural network

Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition

Lexicon-Free Fingerspelling Recognition from Video: Data, Models, and Signer Adaptation

Recognizing American Sign Language Manual Signs from RGB-D Videos

DPANet: Depth Potentiality-Aware Gated Attention Network for RGB-D Salient Object Detection

DCANet: Differential Convolution Attention Network for RGB-D Semantic Segmentation

TANet: Transformer-based Asymmetric Network for RGB-D Salient Object Detection

Depth awakens: A depth-perceptual attention fusion network for RGB-D camouflaged object detection

Multi-Task and Multi-Modal Learning for RGB Dynamic Gesture Recognition

Regional Attention with Architecture-Rebuilt 3D Network for RGB-D Gesture Recognition

DDRNet: Depth Map Denoising and Refinement for Consumer Depth Cameras Using Cascaded CNNs

Attention Based Dual Branches Fingertip Detection Network and Virtual Key System

CFIDNet: cascaded feature interaction decoder for RGB-D salient object detection

DADCNet: Dual Attention Densely Connected Network for More Accurate Real Iris Region Segmentation

RGB-D Grasp Detection via Depth Guided Learning with Cross-modal Attention

A Lightweight Hand-Gesture Recognition Network With Feature Fusion Prefiltering and FMCW Radar Spatial Angle Estimation