Abstract: Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have begun exploring the use of text-based queries, where the need for support keypoints is eliminated. However, the optimal use of textual descriptions for keypoints remains an underexplored area. In this work, we introduce CapeLLM, a novel approach that leverages a text-based multimodal large language model (MLLM) for CAPE. Our method employs only a query image and detailed text descriptions as input to estimate category-agnostic keypoints. We conduct extensive experiments to systematically explore the design space of LLM-based CAPE, investigating factors such as the choice of keypoint descriptions, neural network architectures, and training strategies. Thanks to the advanced reasoning capabilities of the pre-trained MLLM, CapeLLM demonstrates superior generalization and robust performance. Our approach sets a new state-of-the-art on the MP-100 benchmark in the challenging 1-shot setting, marking a significant advancement in the field of category-agnostic pose estimation.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **How can category-agnostic pose estimation (CAPE) be achieved without relying on support images?** Traditional CAPE methods usually rely on support images with annotated keypoints. This approach is cumbersome and may fail to fully capture the necessary correspondences across different object categories. Text-query-based methods remove the need for support keypoints, but how best to use text to describe keypoints remains underexplored.

### Main contributions of the paper:

1. **Propose the CapeLLM framework**: A CAPE method that needs no support images. It estimates the positions of category-agnostic keypoints with a multimodal large language model (MLLM), using only the query image and detailed text descriptions as inputs.
2. **Establish the optimal instruction configuration**: Extensive experiments determine the best instruction configuration for an MLLM used in CAPE, including the names and descriptions of keypoints for each category and the most effective instruction format.
3. **Achieve state-of-the-art results**: On the MP-100 benchmark, CapeLLM in the 1-shot setting outperforms the 5-shot performance of existing methods.

### Core problems of the paper:

- **Reduce the dependence on support images**: Traditional methods require support images and their keypoint annotations, which increases the data-preparation workload and may limit the model's generalization ability.
- **Improve the generalization ability of the model**: By using a pre-trained MLLM, CapeLLM can better understand text descriptions and therefore performs better on unseen categories.
- **Optimize the role of text descriptions**: With detailed keypoint descriptions, the model can infer keypoint positions more accurately than when it relies on keypoint names alone.

### Experimental verification:

- **Quantitative results**: On the MP-100 dataset, CapeLLM's PCK@0.2 in the 1-shot setting exceeds that of existing methods, including their results in the 5-shot setting.
- **Qualitative results**: Across multiple categories, CapeLLM shows higher accuracy, with marked improvement in keypoint estimation on animal bodies, particularly at joints and extremities such as knees and paws.

In conclusion, this paper addresses the dependence on support images in the CAPE task by introducing an MLLM, and it improves the model's generalization ability and accuracy through detailed text descriptions.
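
The PCK@0.2 metric cited in the quantitative results can be illustrated with a minimal sketch. It assumes the common convention that a predicted keypoint counts as correct when its Euclidean distance to the ground truth is at most 0.2 times the longer side of the object's bounding box. The function name `pck`, its parameters, and the sample coordinates are hypothetical and are not taken from the paper's code.

```python
import math


def pck(pred, gt, bbox_w, bbox_h, threshold=0.2, visible=None):
    """Fraction of visible keypoints predicted within the distance limit.

    Assumption: the limit is threshold * max(bbox_w, bbox_h), a common
    normalization for PCK; the paper's exact protocol may differ.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of keypoints")
    if visible is None:
        visible = [True] * len(gt)

    limit = threshold * max(bbox_w, bbox_h)
    hits = 0
    total = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue  # unannotated or invisible keypoints are not scored
        total += 1
        if math.hypot(px - gx, py - gy) <= limit:
            hits += 1
    return hits / total if total else 0.0


# Hypothetical example: three keypoints inside a 100 x 100 bounding box.
pred = [(10, 10), (50, 50), (90, 90)]
gt = [(12, 11), (50, 80), (90, 95)]
score = pck(pred, gt, bbox_w=100, bbox_h=100)
print(f"PCK@0.2 = {score:.3f}")
```

Here the distance limit is 20 pixels, so the first and third predictions count as correct and the second (30 pixels off) does not. A benchmark score would typically average this value over all test images and categories.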

CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
