Abstract: Humans can perform complex tasks with long-term objectives by planning, reasoning, and forecasting the outcomes of actions. For embodied agents to achieve similar capabilities, they must gain knowledge of the environment that transfers to novel scenarios with a limited budget of additional trial and error. Learning-based approaches, such as deep RL, can discover and exploit inherent regularities and characteristics of the application domain from data and continuously improve their performance, but at the cost of large amounts of training data. This thesis explores the development of data-driven techniques for spatial reasoning and planning tasks, focusing on enhancing learning efficiency, interpretability, and transferability across novel scenarios. Four key contributions are made. 1) CALVIN, a differentiable planner that learns interpretable models of the world for long-term planning. It successfully navigated partially observable 3D environments, such as mazes and indoor rooms, by learning the rewards and state transitions from expert demonstrations. 2) SOAP, an RL algorithm that discovers options without supervision for long-horizon tasks. Options segment a task into subtasks and enable consistent execution of each subtask. SOAP showed robust performance on history-conditional corridor tasks as well as classical benchmarks such as Atari. 3) LangProp, a code optimisation framework that uses LLMs to solve embodied agent problems requiring reasoning, by treating code as learnable policies. The framework successfully generated interpretable code with comparable or superior performance to human-written experts in the CARLA autonomous driving benchmark. 4) Voggite, an embodied agent with a vision-to-action transformer backend that solves complex tasks in Minecraft. It achieved third place in the MineRL BASALT Competition by identifying action triggers to segment tasks into multiple stages.

Sparse Graphical Memory for Robust Planning

Graph schemas as abstractions for transfer learning, inference, and planning

Modeling Dynamic Environments with Scene Graph Memory

Spiking Reinforcement Learning with Memory Ability for Mapless Navigation

Structured Scene Memory for Vision-Language Navigation

Sparse Multilevel Roadmaps for High-Dimensional Robot Motion Planning

Goal-Space Planning with Subgoal Models

MASP: Scalable GNN-based Planning for Multi-Agent Navigation

Learning Efficient Multi-Agent Cooperative Visual Exploration

Spatial Reasoning and Planning for Deep Embodied Agents

Learning a World Model With Multitimescale Memory Augmentation

Hierarchical Representations and Explicit Memory: Learning Effective Navigation Policies on 3D Scene Graphs using Graph Neural Networks

Sparsified Subgraph Memory for Continual Graph Representation Learning

Combining Subgoal Graphs with Reinforcement Learning to Build a Rational Pathfinder

Learning and Planning with a Semantic Model

SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning

Memory Proxy Maps for Visual Navigation

Scalable Spatial Memory for Scene Rendering and Navigation

Cognitive Mapping and Planning for Visual Navigation

Cognitive Planning for Object Goal Navigation using Generative AI Models

SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning