Abstract: There has been a lot of interest in grounding natural language to physical entities through visual context. While Vision Language Models (VLMs) can ground linguistic instructions to visual sensory information, they struggle with grounding non-visual attributes, like the weight of an object. Our key insight is that non-visual attribute detection can be effectively achieved by active perception guided by visual reasoning. To this end, we present a perception-action programming API that consists of VLMs and Large Language Models (LLMs) as backbones, together with a set of robot control functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify attributes given an input image. Offline testing on the Odd-One-Out dataset demonstrates that our framework outperforms vanilla VLMs in detecting attributes like relative object location, size, and weight. Online testing in realistic household scenes on AI2-THOR and a real robot demonstration on a DJI RoboMaster EP robot highlight the efficacy of our approach.

What problem does this paper attempt to address?

The core problem this paper addresses is **how to accurately identify object attributes by combining vision-language models (VLMs) and large language models (LLMs), especially non-visual attributes such as weight**. Specifically, the authors focus on how a robot can identify and localize specific object attributes through active perception when executing natural-language instructions.

### Problem Background

In the real world, robots need to understand natural-language instructions and perform the corresponding tasks. To complete these tasks, a robot must be able to identify and distinguish the objects in a scene and their attributes. Although VLMs can associate language instructions with visual information, they are limited when dealing with non-visual attributes (such as weight, hardness, etc.). In addition, attribute detection often depends on static images and ignores the relative relationships and changes between objects in a dynamic environment.

### Main Contributions of the Paper

To address these problems, the authors propose a **Perception-Action Programming API**, which combines VLMs and LLMs with robot-control functions. With this API, the model can generate programs from natural-language queries and use the robot's sensors (such as cameras, distance sensors, force/torque sensors, etc.) to actively identify object attributes. Specific contributions include:

1. **Pointing out the limitations of using VLMs for attribute detection**: When VLMs are used alone for attribute detection, context information in the environment may be ignored, leading to misjudgments.
2. **Proposing the Perception-Action API**: By integrating visual reasoning and robot-control functions, the API can guide robots to perform active-perception behaviors and thereby identify object attributes more accurately.
3. **Releasing an end-to-end framework**: The authors provide a complete framework that can be deployed on a real-robot platform, demonstrating its effectiveness in practical applications.

### Experimental Verification

To verify the proposed framework, the authors conducted multiple experiments, including:

- **Spatial-reasoning experiments**: Evaluate the performance of the API on complex spatial queries. The results show that the API is superior to traditional open-vocabulary object-detection (OVD) models on queries about relative position and size.
- **Non-visual-attribute experiments**: Taking the weight of an object as an example, evaluate the performance of the model on non-visual attributes. The results show that the framework combining VLMs and LLMs can significantly improve performance.
- **Real-robot demonstrations**: Tests were carried out in the AI2-THOR simulation environment and on a real robot (DJI RoboMaster EP), demonstrating the feasibility of the framework in practical application scenarios.

### Conclusion

By introducing the Perception-Action Programming API, this paper addresses the limitations of traditional VLMs in attribute detection, especially for non-visual attributes. The method not only improves the accuracy of attribute detection but also offers a new approach for robots to understand and execute natural-language instructions.
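To make the idea of a perception-action program concrete, the following is a minimal Python sketch of the kind of program an LLM might generate for a query such as "find the heaviest object". It is not the paper's actual API: `SimulatedRobot`, `detect_objects`, `go_to`, `lift_and_measure_weight`, and `find_heaviest` are all hypothetical names, the VLM call is replaced by a stub, and the force/torque reading is simulated with a lookup table.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    """An object proposal from the (stubbed) vision model."""
    name: str
    x: float  # horizontal position in the image, in pixels


class SimulatedRobot:
    """Hypothetical stand-in for robot-control functions; not the paper's API."""

    def __init__(self, true_weights_kg):
        self._true_weights_kg = true_weights_kg
        self.actions = []  # log of active-perception actions taken

    def go_to(self, obj):
        self.actions.append(f"go_to({obj.name})")

    def lift_and_measure_weight(self, obj):
        # A real robot would read a force/torque sensor here;
        # this sketch simply looks up a simulated ground-truth value.
        self.actions.append(f"weigh({obj.name})")
        return self._true_weights_kg[obj.name]


def detect_objects(scene):
    """Hypothetical VLM call: turn an 'image' into object proposals."""
    return [DetectedObject(name, x) for name, x in scene]


def find_heaviest(scene, robot):
    """A program of the kind an LLM might generate for 'the heaviest object'."""
    candidates = detect_objects(scene)  # visual reasoning step
    weights = {}
    for obj in candidates:              # active perception step
        robot.go_to(obj)
        weights[obj.name] = robot.lift_and_measure_weight(obj)
    return max(weights, key=weights.get)


# Assumed toy scene: (object name, x position) pairs standing in for an image.
scene = [("mug", 40.0), ("kettle", 120.0), ("sponge", 200.0)]
robot = SimulatedRobot({"mug": 0.3, "kettle": 1.2, "sponge": 0.05})
heaviest = find_heaviest(scene, robot)
print(heaviest)
```

The point of the sketch is the division of labor: vision proposes candidate objects, and the non-visual attribute is resolved only by acting on each candidate and reading a sensor, which a static image alone cannot provide.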

Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs

AP-VLM: Active Perception Enabled by Vision-Language Models

Physically Grounded Vision-Language Models for Robotic Manipulation

A Survey on Vision-Language-Action Models for Embodied AI

Vision-Language Model-based Physical Reasoning for Robot Liquid Perception

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

OVAL-Prompt: Open-Vocabulary Affordance Localization for Robot Manipulation through LLM Affordance-Grounding

From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models

Grounding Language with Visual Affordances over Unstructured Data

LADEV: A Language-Driven Testing and Evaluation Platform for Vision-Language-Action Models in Robotic Manipulation

Distilling Internet-Scale Vision-Language Models into Embodied Agents

LLM+A: Grounding Large Language Models in Physical World with Affordance Prompting

Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes

Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model

A3VLM: Actionable Articulation-Aware Vision Language Model

Empowering Large Language Models on Robotic Manipulation with Affordance Prompting

A joint model of language and perception for grounded attribute learning

A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Octopus: Embodied Vision-Language Programmer from Environmental Feedback