Abstract: Language is never spoken in a vacuum. It is expressed, comprehended, and contextualized within the holistic backdrop of the speaker's history, actions, and environment. Since humans are used to communicating efficiently with situated language, the practicality of robotic assistants hinges on their ability to understand and act upon implicit and situated instructions. In traditional instruction following paradigms, the agent acts alone in an empty house, leading to language use that is both simplified and artificially "complete." In contrast, we propose situated instruction following, which embraces the inherent underspecification and ambiguity of real-world communication with the physical presence of a human speaker. The meaning of situated instructions naturally unfolds through the past actions and the expected future behaviors of the human involved. Specifically, within our settings we have instructions that (1) are ambiguously specified, (2) have temporally evolving intent, and (3) can be interpreted more precisely with the agent's dynamic actions. Our experiments indicate that state-of-the-art Embodied Instruction Following (EIF) models lack holistic understanding of situated human intention.

What problem does this paper attempt to address?

This paper aims to improve the ability of robot assistants to understand and execute contextualized language instructions. The authors point out that current instruction-following tasks mainly focus on interpreting low-level instructions or on using common sense to achieve unspecified goals, and that these tasks often assume a static environment and fully specified instructions. In the real world, however, human communication is often ambiguous, intentions evolve over time, and the environment changes dynamically. The paper therefore proposes the concept of "Situated Instruction Following (SIF)", which aims to enable robots to better understand and respond to such contextualized instructions.

### Core Problems in the Paper

1. **Ambiguity**: Instructions may have multiple possible interpretations. For example, "Bring me a cup" may have different meanings in different rooms.
2. **Temporal**: The intent behind an instruction may change over time. For example, after the person says "I will go to the bathroom to wash my face", the robot needs to infer the specific location of the bathroom from the person's actions.
3. **Dynamic**: People and objects in the environment may move, and the robot needs to adjust its behavior to these changes.

### Solutions

To evaluate and improve this ability, the authors designed a new dataset and benchmarking framework with the following key elements:

- **Exploration Phase**: The robot explores a static environment to learn its layout.
- **Task Phase**: Some objects are repositioned, the robot receives an instruction (such as "Bring me a cup"), and gets hints about object movement (such as "I took the cup, I'm going to wash my face"). The robot must complete the task using this information and the person's actions.

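
The two-phase setup described above could be represented roughly as follows. This is a minimal sketch, not the benchmark's actual data format: the class names `Hint` and `Episode`, all field names, and the helper `stale_objects` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Hint:
    """A statement from the human about an object move or their plans (hypothetical)."""
    text: str


@dataclass
class Episode:
    """One illustrative SIF-style episode; all field names are assumptions."""
    # Object -> room, as observed by the robot during the exploration phase.
    explored_layout: dict[str, str]
    # Objects repositioned before the task phase: object -> new room.
    relocations: dict[str, str] = field(default_factory=dict)
    instruction: str = ""
    hints: list[Hint] = field(default_factory=list)

    def stale_objects(self) -> list[str]:
        """Objects whose remembered location no longer matches the true one."""
        return sorted(
            obj
            for obj, room in self.relocations.items()
            if self.explored_layout.get(obj) != room
        )


episode = Episode(
    explored_layout={"cup": "kitchen", "towel": "bathroom"},
    relocations={"cup": "bathroom"},
    instruction="Bring me a cup",
    hints=[Hint("I took the cup, I'm going to wash my face")],
)
print(episode.stale_objects())  # ['cup']
```

The point of the sketch is that the robot's map from the exploration phase can be out of date by the time the instruction arrives, so the hints and the person's actions are the only evidence of where a relocated object now is.
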
### Experimental Setup

- **Environment**: The Habitat 3.0 simulator, with both static and dynamic tasks.
- **Task Types**:
  - **Static Task (PnP)**: Objects and people do not move.
  - **Situated Object Task (Sobj)**: Objects are repositioned before the task starts.
  - **Situated Person Task (Shum)**: People start to move after the task starts.

### Baseline Models

- **Reasoner**: A closed-loop system that combines a semantic map, a hint generator, and a large language model (LLM) planner.
- **Prompter**: An open-loop system for performing ALFRED tasks.

### Main Contributions

- **Introducing the SIF Concept**: Emphasizes the challenges robots face when handling instructions that are ambiguous, temporally evolving, and given in dynamic environments.
- **New Dataset**: A dataset containing static and dynamic tasks for evaluating robots' overall ability.
- **Baseline Models**: Two high-performance baseline models, which show the limitations of existing methods on SIF tasks.

Through this work, the authors hope to bring robot assistants closer to human-level understanding of natural language instructions, making them more effective and practical in real applications.
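
The difference between a closed-loop and an open-loop baseline, as mentioned above, can be illustrated with a toy search problem. This is a sketch under stated assumptions and is not the implementation of Reasoner or Prompter: the functions `run_open_loop`, `run_closed_loop`, and `toy_planner` are hypothetical, the "planner" is a hand-written stand-in for an LLM, and the open-loop plan is deliberately built from memory alone.

```python
from typing import Callable, Optional

Observe = Callable[[str], bool]  # True if the target object is seen in the room


def run_open_loop(plan: list[str], observe: Observe) -> Optional[str]:
    """Execute a plan fixed in advance; observations never change it."""
    for room in plan:
        if observe(room):
            return room
    return None


def run_closed_loop(
    replan: Callable[[list[str]], Optional[str]],
    observe: Observe,
    max_steps: int = 5,
) -> Optional[str]:
    """Ask the planner for the next room after every failed observation."""
    failed: list[str] = []
    for _ in range(max_steps):
        room = replan(failed)
        if room is None:
            return None
        if observe(room):
            return room
        failed.append(room)
    return None


# Toy scenario (hypothetical): the cup was remembered in the kitchen,
# but the human hinted that they carried it toward the bathroom.
remembered_room = "kitchen"
hinted_room = "bathroom"
true_room = "bathroom"


def observe(room: str) -> bool:
    return room == true_room


def toy_planner(failed: list[str]) -> Optional[str]:
    """Stand-in for an LLM planner: try memory first, then the hinted room."""
    for candidate in (remembered_room, hinted_room):
        if candidate not in failed:
            return candidate
    return None


open_result = run_open_loop([remembered_room], observe)
closed_result = run_closed_loop(toy_planner, observe)
print(open_result, closed_result)  # None bathroom
```

In this toy case the fixed plan fails because the object has moved, while the replanning loop recovers by using the hint once the remembered location turns out to be empty.
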

Situated Instruction Following

Situated Multimodal Control of a Mobile Robot: Navigation through a Virtual Environment

Embodied Instruction Following in Unknown Environments

ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

Infer Human's Intentions Before Following Natural Language Instructions

Verifiably Following Complex Robot Instructions with Foundation Models

SIFToM: Robust Spoken Instruction Following through Theory of Mind

Pragmatic Instruction Following and Goal Assistance via Cooperative Language-Guided Inverse Planning

Situated Language Learning via Interactive Narratives

FILM: Following Instructions in Language with Modular Methods

Learning Models for Following Natural Language Directions in Unknown Environments

SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following

Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis

Are We There Yet? Learning to Localize in Embodied Instruction Following

Flexibly Instructable Agents

Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control

tagE: Enabling an Embodied Agent to Understand Human Instructions

Teaching Robots Where To Go And How To Act With Human Sketches via Spatial Diagrammatic Instructions

Continual Learning for Instruction Following from Realtime Feedback

Object-Centric Instruction Augmentation for Robotic Manipulation

Accessible Instruction-Following Agent