Abstract: 3D hand reconstruction from a single RGB image is challenging due to articulated motion, self-occlusion, and interaction with objects. Existing state-of-the-art (SOTA) methods employ attention-based transformers to learn 3D hand pose and shape, but they fail to achieve robust and accurate performance because they insufficiently model the spatial relations between joints. To address this problem, we propose a novel graph-guided Mamba framework, named Hamba, which bridges graph learning and state space modeling. Our core idea is to reformulate Mamba's scanning into graph-guided bidirectional scanning for 3D reconstruction using a few effective tokens. This enables us to learn the joint relations and spatial sequences that enhance reconstruction performance. Specifically, we design a novel Graph-guided State Space (GSS) block that learns the graph-structured relations and spatial sequences of joints and uses 88.5% fewer tokens than attention-based methods. Additionally, we integrate the state space features and the global features using a fusion module. By utilizing the GSS block and the fusion module, Hamba effectively leverages the graph-guided state space modeling features and jointly considers global and local features to improve performance. Extensive experiments on several benchmarks and in-the-wild tests demonstrate that Hamba significantly outperforms existing SOTA methods, achieving a PA-MPVPE of 5.3 mm and an F@15mm of 0.992 on FreiHAND. Hamba currently ranks first on two challenging competition leaderboards for 3D hand reconstruction. The code will be available upon acceptance. [Website](https://humansensinglab.github.io/Hamba/).
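
For readers unfamiliar with the F@15mm figure quoted above, the following is a minimal sketch of an F-score between two 3D point sets at a distance threshold, assuming the common definition as the harmonic mean of precision and recall over nearest-neighbour distances. The function name `f_score` and the toy points are illustrative assumptions, not the authors' evaluation code; benchmark implementations typically also align the predicted mesh to the ground truth first, which is omitted here, and may differ in whether the threshold comparison is strict.

```python
import math


def f_score(pred, gt, threshold):
    """F-score between two 3D point sets at a distance threshold.

    pred, gt: non-empty sequences of (x, y, z) tuples in the same units
    as `threshold` (e.g. millimetres). Illustrative sketch only.
    """
    def frac_within(src, dst):
        # Fraction of points in src whose nearest neighbour in dst
        # lies within the threshold (brute-force search).
        hits = sum(
            1 for p in src
            if min(math.dist(p, q) for q in dst) <= threshold
        )
        return hits / len(src)

    precision = frac_within(pred, gt)  # predicted points near the ground truth
    recall = frac_within(gt, pred)     # ground-truth points near the prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical coordinates, in mm).
pred_points = [(0, 0, 0), (10, 0, 0), (100, 0, 0)]
gt_points = [(0, 0, 5), (10, 0, 5), (20, 0, 0)]
score = f_score(pred_points, gt_points, threshold=15)
```

In this toy case two of the three predicted points lie within 15 mm of the ground truth while every ground-truth point lies within 15 mm of a prediction, so precision is 2/3 and recall is 1. The brute-force nearest-neighbour search is quadratic in the number of points, which is acceptable for a hand mesh with a few hundred vertices.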

Local Spherical Harmonics Improve Skeleton-Based Hand Action Recognition

Shifting Perspective to See Difference: A Novel Multi-View Method for Skeleton Based Action Recognition

Online Robust Action Recognition Based on a Hierarchical Model

On the Utility of 3D Hand Poses for Action Recognition

MVHANet: Multi-view Hierarchical Aggregation Network for Skeleton-Based Hand Gesture Recognition

Effective Human Action Recognition Using Global and Local Offsets of Skeleton Joints

3d Human Action Recognition Based On The Spatial-Temporal Moving Skeleton Descriptor

Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition

Exploring Self-Supervised Skeleton-Based Human Action Recognition under Occlusions

Human Action Recognition Based on Kinematic Similarity in Real Time

GEARS: Local Geometry-aware Hand-object Interaction Synthesis

CaSAR: Contact-aware Skeletal Action Recognition

Domain and View-point Agnostic Hand Action Recognition

Unveiling the Hidden Realm: Self-supervised Skeleton-based Action Recognition in Occluded Environments

Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba

Real-Time Monocular Skeleton-Based Hand Gesture Recognition Using 3D-Jointsformer

XHand: Real-time Expressive Hand Avatar

Dynamic Hand Gesture Recognition Based On 3D Skeleton

Local and Global Point Cloud Reconstruction for 3D Hand Pose Estimation

3D Human Activity Recognition Using Skeletal Data from RGBD Sensors

Hand Avatar: Free-Pose Hand Animation and Rendering from Monocular Video