Abstract: Real-time 2D keypoint detection plays an essential role in computer vision. Although CNN-based and Transformer-based methods have achieved breakthrough progress, they often fail to deliver both superior performance and real-time speed. This paper introduces MamKPD, the first efficient yet effective Mamba-based pose estimation framework for 2D keypoint detection. The conventional Mamba module exhibits limited information interaction between patches. To address this, we propose a lightweight contextual modeling module (CMM) that uses depth-wise convolutions to model inter-patch dependencies and linear layers to distill the pose cues within each patch. Subsequently, by combining Mamba for global modeling across all patches, MamKPD effectively extracts instances' pose information. We conduct extensive experiments on human and animal pose estimation datasets to validate the effectiveness of MamKPD. Our MamKPD-L achieves 77.3% AP on the COCO dataset with 1492 FPS on an NVIDIA RTX 4090 GPU. Moreover, MamKPD achieves state-of-the-art results on the MPII dataset and competitive results on the AP-10K dataset while saving 85% of the parameters compared to ViTPose. Our project page is available at https://mamkpd.github.io/.
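The CMM idea described in the abstract (depth-wise convolutions across patches, then linear layers within each patch) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the 3x3 kernel size, zero padding, the absence of normalization, activation, and residual connections, and every function name below are assumptions made only for illustration.

```python
# Illustrative sketch only; names, kernel size, and padding are assumptions.
# tokens[r][c] is the feature vector (list of floats) of the patch at grid cell (r, c).

def depthwise_conv3x3(tokens, kernels):
    """Mix each patch with its neighbours, one channel at a time (zero padding).

    kernels[ch] is a 3x3 kernel applied to channel ch only, which is what
    makes the convolution depth-wise: channels are never mixed here.
    """
    height, width, channels = len(tokens), len(tokens[0]), len(tokens[0][0])
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for ch in range(channels):
                acc = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < height and 0 <= cc < width:
                            acc += kernels[ch][dr + 1][dc + 1] * tokens[rr][cc][ch]
                out[r][c][ch] = acc
    return out


def linear_per_patch(tokens, weight, bias):
    """Apply one shared linear layer to every patch (mixes channels within a patch)."""
    return [
        [
            [sum(w * x for w, x in zip(row, patch)) + b for row, b in zip(weight, bias)]
            for patch in token_row
        ]
        for token_row in tokens
    ]


def contextual_block(tokens, kernels, weight, bias):
    """Inter-patch mixing (depth-wise conv) followed by intra-patch mixing (linear)."""
    return linear_per_patch(depthwise_conv3x3(tokens, kernels), weight, bias)


# A 2x2 grid of patches with 2 channels each.
tokens = [[[1.0, 10.0], [2.0, 20.0]], [[3.0, 30.0], [4.0, 40.0]]]
sum_kernel = [[1.0] * 3 for _ in range(3)]                           # sums the neighbourhood
identity_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # keeps the centre patch
mixed = contextual_block(
    tokens,
    kernels=[sum_kernel, identity_kernel],
    weight=[[1.0, 0.0], [0.0, 1.0]],  # identity linear layer, for readability
    bias=[0.0, 0.0],
)
print(mixed[0][0])
```

In this toy input, channel 0 of each output patch aggregates channel 0 of all neighbouring patches, while channel 1 passes through unchanged. In a deep learning framework the same pattern is usually written as a grouped convolution followed by a linear (or 1x1 convolution) layer.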

What problem does this paper attempt to address?

The problem this paper addresses is the trade-off between efficiency and accuracy in real-time 2D keypoint detection. Although CNN-based and Transformer-based methods have made remarkable progress in 2D keypoint detection, they often require expensive computing resources and struggle to achieve high accuracy and real-time speed simultaneously. Specifically:

1. **Limitations of existing methods**:
   - **Large network scale**: Many existing 2D keypoint detection methods rely on large neural networks (such as deep convolutional neural networks or Transformers), which leads to high computing costs and low inference speeds.
   - **Trade-off between accuracy and efficiency**: To improve efficiency, some lightweight architectures reduce the number of parameters, but usually at the cost of detection accuracy.
2. **Research objectives**:
   - **Improve model efficiency**: Design an efficient 2D keypoint detection framework that significantly improves inference speed while maintaining high accuracy.
   - **Explore the application of the Mamba module**: Apply the Mamba module to 2D keypoint detection for the first time to exploit its efficient state-space modeling ability.
3. **Proposed new method**:
   - **MamKPD framework**: Introduce a new 2D keypoint detection framework named MamKPD, which is based on the Mamba module and adds a lightweight contextual modeling module (CMM) to enhance information interaction between patches.
   - **CMM module**: Capture the dependencies between image patches with depth-wise convolutions and distill the pose cues within each patch with linear layers, thereby enhancing multi-scale feature extraction.
4. **Experimental verification**:
   - **Datasets**: Extensive experiments were carried out on COCO, MPII, and AP-10K to verify the effectiveness of MamKPD.
   - **Performance comparison**: MamKPD is fast at inference (for example, MamKPD-L reaches 1492 FPS on an NVIDIA RTX 4090 GPU) while remaining competitive in accuracy, and it reports state-of-the-art results on MPII.

In summary, this paper introduces the MamKPD framework to address the difficulty of balancing efficiency and accuracy in existing 2D keypoint detection methods, especially in real-time application scenarios.
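The "state-space modeling" mentioned above can be made concrete with a minimal sketch of a discretized linear state-space recurrence, the kind of computation that Mamba-style layers build on. This is a generic scalar illustration, not Mamba's selective scan and not the paper's code: Mamba makes the parameters input-dependent and uses vector-valued states, whereas here `a`, `b`, and `c` are fixed scalars chosen for illustration, and `ssm_scan` is a hypothetical name.

```python
# Illustrative sketch only; a fixed-parameter scalar recurrence, not Mamba itself.

def ssm_scan(inputs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence.

    One pass over the sequence, so the cost grows linearly with its length.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = a * state + b * x   # fold the new input into the running state
        outputs.append(c * state)   # read the output from the state
    return outputs


# An impulse at the first position decays geometrically through the state.
response = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
print(response)
```

Because each step only updates a running state, information from earlier patches reaches later ones without comparing every pair of patches, which is the efficiency argument usually made for state-space models relative to standard self-attention.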

MamKPD: A Simple Mamba Baseline for Real-Time 2D Keypoint Detection
