RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios

Jie Huang, Ruibing Hou, Jiahe Zhao, Hong Chang, Shiguang Shan
2024-12-19
Abstract: Human-centric perceptions play a crucial role in real-world applications. While recent human-centric works have achieved impressive progress, these efforts are often constrained to the visual domain and lack interaction with human instructions, limiting their applicability in broader scenarios such as chatbots and sports analysis. This paper introduces Referring Human Perceptions, where a referring prompt specifies the person of interest in an image. To tackle the new task, we propose RefHCM (Referring Human-Centric Model), a unified framework to integrate a wide range of human-centric referring tasks. Specifically, RefHCM employs sequence mergers to convert raw multimodal data -- including images, text, coordinates, and parsing maps -- into semantic tokens. This standardized representation enables RefHCM to reformulate diverse human-centric referring tasks into a sequence-to-sequence paradigm, solved using a plain encoder-decoder transformer architecture. Benefiting from a unified learning strategy, RefHCM effectively facilitates knowledge transfer across tasks and exhibits unforeseen capabilities in handling complex reasoning. This work represents the first attempt to address referring human perceptions with a general-purpose framework, while simultaneously establishing a corresponding benchmark that sets new standards for the field. Extensive experiments showcase RefHCM's competitive and even superior performance across multiple human-centric referring tasks. The code and data are publicly available at https://github.com/JJJYmmm/RefHCM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limitations of existing human-centric perception models. These models usually focus on a single task and struggle with multimodal inputs and outputs, which restricts their use in broader scenarios such as chatbots and sports analysis. They also cannot interact with human instructions, which limits their practical applicability. To address these problems, the paper proposes RefHCM (Referring Human-Centric Model), a unified framework that integrates multiple human-centric referring tasks. The main contributions of RefHCM are:

1. **Unified representation space**: Raw multimodal data (images, text, coordinates, and parsing maps) are converted into semantic tokens by sequence mergers. This standardized representation lets RefHCM reformulate different referring tasks as sequence-to-sequence problems and solve them with a plain encoder-decoder Transformer architecture.
2. **Universal network architecture and optimization objective**: RefHCM adopts a unified learning strategy, which promotes knowledge transfer across tasks and gives the model the ability to handle complex reasoning. This architecture simplifies the model design and strengthens the synergy between tasks.
3. **New benchmark**: The paper introduces the ReasonRef Benchmark for evaluating how well a model handles implicit references, which require complex reasoning. The benchmark covers several dimensions: identity recognition, pose/clothing description, social relations, physical relations, and future prediction.
4. **Zero-shot generalization ability**: Although RefHCM is trained only on simple, direct references, it generalizes zero-shot to complex reasoning tasks, which indicates strong adaptability and flexibility.

In conclusion, the RefHCM framework addresses the shortcomings of existing models in multimodal processing and task unification, and provides a more general and efficient solution for human-centric perception tasks.
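As a rough illustration of the unified representation idea, the sketch below shows one common way to turn a bounding box into discrete tokens so that a text prompt and a coordinate output can share a single sequence-to-sequence format. It is not taken from the RefHCM code: the bin count of 1000, the `<bin_i>` token format, and the function names `box_to_tokens`, `tokens_to_box`, and `make_example` are all assumptions made for illustration, and the paper's actual sequence mergers may work differently.

```python
from typing import Dict, List, Sequence, Tuple

NUM_BINS = 1000  # assumed size of the coordinate vocabulary (hypothetical)


def box_to_tokens(box: Sequence[float], width: int, height: int,
                  num_bins: int = NUM_BINS) -> List[str]:
    """Quantize a pixel-space box (x1, y1, x2, y2) into discrete bin tokens."""
    x1, y1, x2, y2 = box
    tokens = []
    for value, extent in ((x1, width), (y1, height), (x2, width), (y2, height)):
        # Normalize to [0, 1], clamp, then map to an integer bin index.
        norm = min(max(value / extent, 0.0), 1.0)
        bin_index = min(int(norm * num_bins), num_bins - 1)
        tokens.append(f"<bin_{bin_index}>")
    return tokens


def tokens_to_box(tokens: Sequence[str], width: int, height: int,
                  num_bins: int = NUM_BINS) -> Tuple[float, ...]:
    """Invert box_to_tokens, returning the centre of each bin in pixels."""
    indices = [int(t[len("<bin_"):-1]) for t in tokens]
    extents = (width, height, width, height)
    return tuple((i + 0.5) / num_bins * e for i, e in zip(indices, extents))


def make_example(prompt: str, box: Sequence[float],
                 width: int, height: int) -> Dict[str, List[str]]:
    """Pair a referring prompt (source) with a tokenized box (target)."""
    return {
        "source": prompt.lower().split(),  # naive whitespace tokenizer
        "target": box_to_tokens(box, width, height),
    }


example = make_example("The player in red", (160, 120, 480, 360), 640, 480)
print(example["source"], example["target"])
```

Because the target is an ordinary token sequence, the same decoder could in principle emit text, boxes, or keypoints depending on the task prompt. Quantization is lossy, so the recovered box is only accurate to within one bin width.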