Abstract: Hands are a fundamental means of communication, expression, and interaction in the physical world, which makes them central to Human-Computer Interaction (HCI). Augmented Reality (AR) overlays digital information onto the real world and has opened new possibilities in gaming, education, healthcare, and industrial training. Accurate hand pose and shape estimation is crucial for natural and intuitive interaction within AR environments. This study proposes an approach for hand pose and shape estimation in AR applications. The method begins with a pre-trained Single Shot MultiBox Detector (SSD) model that detects and crops the hand. The cropped hand image is converted to the HSV color model, and histogram equalization is applied to the value band. To isolate the hand, bounds are set for each band of the HSV color space to generate a mask. Contouring is then applied to the mask, and gap-filling methods are used, to refine it and reduce noise. The refined mask is combined with the original cropped image through a logical AND operation to delineate the hand boundaries, which keeps hand detection robust even in complex scenes. Features are then extracted from the detected hand by two concurrent processes. First, the Scale-Invariant Feature Transform (SIFT) algorithm identifies distinctive keypoints on the hand's outer surface. Simultaneously, MobileNet, a pre-trained lightweight Convolutional Neural Network (CNN), extracts 3D hand landmarks, the hand's center (the middle finger metacarpophalangeal joint), and handedness information. These features (keypoints, landmarks, center, and handedness) are aggregated into a CSV file for further analysis. A Gated Recurrent Unit (GRU) then processes the features, capturing the dependencies between them, and predicts the 3D hand pose with high accuracy even in dynamic scenarios. The evaluation results are promising: the Mean Per Joint Position Error (MPJPE) in 3D between the predicted pose and the ground-truth hand landmarks is 0.0596, and the Percentage of Correct Keypoints (PCK) is 95%. Once the hand pose is predicted, a mesh representation is used to reconstruct the 3D shape of the hand. The mesh represents the hand's structure and orientation, enhancing the realism and usability of the AR application. By integrating detection, feature extraction, and predictive modeling, the method contributes to more immersive and intuitive AR experiences.
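The hand-isolation steps described in the abstract (equalize the value band, threshold each HSV band, then AND the mask with the cropped image) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the crop has already been converted to an 8-bit HSV array, the bounds are hypothetical values chosen for the example, and the contouring and gap-filling refinement is omitted because it would normally rely on an image-processing library.

```python
import numpy as np

def equalize(channel):
    """Histogram-equalize one 8-bit channel using its cumulative histogram."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = cdf[-1] - cdf_min
    if denom == 0:  # constant channel: nothing to equalize
        return channel.copy()
    lut = np.round((cdf - cdf_min) / denom * 255).clip(0, 255).astype(np.uint8)
    return lut[channel]

def hand_mask(hsv, lower, upper):
    """Boolean mask of pixels whose H, S and equalized V lie within the bounds."""
    hsv = hsv.copy()
    hsv[..., 2] = equalize(hsv[..., 2])
    lower, upper = np.asarray(lower), np.asarray(upper)
    return np.all((hsv >= lower) & (hsv <= upper), axis=-1)

def apply_mask(image, mask):
    """Keep image pixels where the mask is set and zero the rest (the AND step)."""
    return np.where(mask[..., None], image, 0)

# Tiny synthetic 2x2 crop; only the top-left pixel falls inside the bounds.
hsv = np.array([[[10, 100, 200], [90, 100, 50]],
                [[10, 10, 200], [10, 100, 50]]], dtype=np.uint8)
image = np.full((2, 2, 3), 7, dtype=np.uint8)

# Hypothetical (H, S, V) bounds, not taken from the paper.
mask = hand_mask(hsv, lower=(0, 40, 60), upper=(25, 255, 255))
out = apply_mask(image, mask)
```

Equalizing the value band before thresholding spreads brightness over the full range, so a single set of bounds is less sensitive to how dark or bright a particular crop is.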
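The two reported metrics can be illustrated with a short sketch. MPJPE is the mean Euclidean distance between corresponding predicted and ground-truth joints, and PCK is the fraction of joints whose error falls within a distance threshold. The abstract does not state the coordinate units or the PCK threshold, so the joint coordinates and the threshold of 0.2 below are assumptions made purely for illustration.

```python
import math

def mpjpe(pred, truth):
    """Mean Euclidean distance between matching predicted and true joints."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists)

def pck(pred, truth, threshold):
    """Fraction of joints whose error is at most `threshold`."""
    hits = sum(1 for p, t in zip(pred, truth) if math.dist(p, t) <= threshold)
    return hits / len(pred)

# Hypothetical three-joint example (not data from the paper).
truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0), (0.0, 1.3, 0.0)]

error = mpjpe(pred, truth)                 # per-joint errors are 0.1, 0.0, 0.3
accuracy = pck(pred, truth, threshold=0.2)  # two of three joints are within 0.2
```

A lower MPJPE and a higher PCK both indicate better agreement with the ground truth, but PCK values are only comparable when the same threshold is used.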
