Abstract: Humans can perform complex tasks with long-term objectives by planning, reasoning, and forecasting the outcomes of actions. For embodied agents to achieve similar capabilities, they must acquire knowledge of the environment that transfers to novel scenarios with a limited budget of additional trial and error. Learning-based approaches, such as deep RL, can discover and exploit inherent regularities and characteristics of the application domain from data and continuously improve their performance, but at the cost of large amounts of training data. This thesis explores the development of data-driven techniques for spatial reasoning and planning tasks, focusing on enhancing learning efficiency, interpretability, and transferability across novel scenarios. Four key contributions are made. 1) CALVIN, a differentiable planner that learns interpretable models of the world for long-term planning. It successfully navigated partially observable 3D environments, such as mazes and indoor rooms, by learning the rewards and state transitions from expert demonstrations. 2) SOAP, an RL algorithm that discovers options without supervision for long-horizon tasks. Options segment a task into subtasks and enable consistent execution of each subtask. SOAP showed robust performance on history-conditional corridor tasks as well as classical benchmarks such as Atari. 3) LangProp, a code optimisation framework that uses LLMs to solve embodied agent problems requiring reasoning, by treating code as learnable policies. The framework successfully generated interpretable code with performance comparable or superior to human-written experts on the CARLA autonomous driving benchmark. 4) Voggite, an embodied agent with a vision-to-action transformer backend that solves complex tasks in Minecraft. It achieved third place in the MineRL BASALT Competition by identifying action triggers to segment tasks into multiple stages.

SAVE: Spatial-Attention Visual Exploration

Learning Efficient Multi-Agent Cooperative Visual Exploration

Learning to Explore using Active Neural SLAM

Learning and Planning with a Semantic Model

Spatial Reasoning and Planning for Deep Embodied Agents

A Partially Supervised Reinforcement Learning Framework for Visual Active Search

Visual Local Path Planning Based on Deep Reinforcement Learning

Learning Exploration Policies for Navigation

Learning to Act with Affordance-Aware Multimodal Neural SLAM

Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation

Deep Reinforcement Learning-based Large-scale Robot Exploration

Multigoal Visual Navigation With Collision Avoidance via Deep Reinforcement Learning

A target-driven visual navigation method based on intrinsic motivation exploration and space topological cognition

Cognitive Model of Agent Exploration with Vision and Signage Understanding

Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding

NAVS: A Neural Attention-Based Visual SLAM for Autonomous Navigation in Unknown 3D Environments

Target-Driven Structured Transformer Planner for Vision-Language Navigation

Pathdreamer: A World Model for Indoor Navigation

Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation

Fast LiDAR Informed Visual Search in Unseen Indoor Environments

ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning