Abstract: Vision-and-language navigation requires an agent to navigate a photo-realistic environment by following natural language instructions. Mainstream methods employ imitation learning (IL) so that the agent imitates the behavior of a teacher. A model trained this way tends to overfit the teacher's biased behavior and therefore generalizes poorly. Recently, researchers have sought to combine IL and reinforcement learning (RL) to overcome overfitting and improve generalization. However, these methods still depend on expensive trajectory annotation. We propose a hierarchical RL-based method, discovering intrinsic subgoals via hierarchical (DISH) RL, which overcomes the generalization limitations of current methods and removes the need for expensive label annotations. First, the high-level agent (manager) decomposes the complex navigation problem into simple intrinsic subgoals. Then, the low-level agent (worker) uses an intrinsic subgoal-driven attention mechanism to predict actions in a smaller state space. We place no constraints on the semantics that subgoals may convey, allowing the agent to autonomously learn intrinsic, more generalizable subgoals from navigation tasks. Furthermore, we design a novel history-aware discriminator (HAD) for the worker. The discriminator incorporates historical information into subgoal discrimination and provides the worker with additional intrinsic rewards to alleviate reward sparsity. Without labeled actions, our method supervises the worker in a self-supervised manner through the subgoals generated by the manager. Results from multiple comparison experiments on the Room-to-Room (R2R) dataset show that DISH significantly outperforms the baseline in accuracy and efficiency.
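The worker side of this design can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no architectural details, so the scaled dot-product attention, the sigmoid scorer standing in for the history-aware discriminator, and the names `subgoal_attention`, `intrinsic_reward`, `worker_step`, and `beta` are all assumptions made for illustration. Nothing here is learned. The sketch only shows how a subgoal vector from the manager could weight candidate actions and how a history-dependent score could be added to a sparse extrinsic reward.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def subgoal_attention(subgoal, candidates):
    """Assumed form of subgoal-driven attention: scaled dot-product
    between the subgoal and each candidate action feature."""
    scale = math.sqrt(len(subgoal))
    return softmax([dot(c, subgoal) / scale for c in candidates])


def intrinsic_reward(history, subgoal):
    """Hypothetical stand-in for a history-aware discriminator: scores how
    well the mean of the visited features agrees with the subgoal, in (0, 1)."""
    n = len(history)
    mean = [sum(col) / n for col in zip(*history)]
    return 1.0 / (1.0 + math.exp(-dot(mean, subgoal)))


def worker_step(subgoal, candidates, history, extrinsic=0.0, beta=0.5):
    """One greedy worker step: pick the most attended candidate and add a
    weighted intrinsic reward to the (possibly zero) extrinsic reward."""
    weights = subgoal_attention(subgoal, candidates)
    action = max(range(len(candidates)), key=weights.__getitem__)
    new_history = history + [candidates[action]]
    reward = extrinsic + beta * intrinsic_reward(new_history, subgoal)
    return action, reward, new_history


if __name__ == "__main__":
    subgoal = [1.0, 0.0]  # would come from the manager
    candidates = [[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
    action, reward, history = worker_step(subgoal, candidates, [])
    print(action, round(reward, 4))
```

Even when the extrinsic reward is zero at a step, the intrinsic term still gives the worker a nonzero signal, which is the role the abstract attributes to the discriminator.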

Following Instructions by Imagining and Reaching Visual Goals

From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following

Learning Sparse Control Tasks from Pixels by Latent Nearest-Neighbor-Guided Explorations

Image-Based Deep Reinforcement Learning with Intrinsically Motivated Stimuli: On the Execution of Complex Robotic Tasks

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning

Discovering Intrinsic Subgoals for Vision-and-Language Navigation via Hierarchical Reinforcement Learning

End-to-End Robotic Reinforcement Learning without Reward Engineering

Affordance-Guided Reinforcement Learning via Visual Prompting

Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control

Instruction Following with Goal-Conditioned Reinforcement Learning in Virtual Environments

Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following

Learning from Pixels with Expert Observations

Task-Induced Representation Learning

Vision-Language Models Provide Promptable Representations for Reinforcement Learning

Reinforcement Learning Meets Visual Odometry

Multigoal Visual Navigation With Collision Avoidance via Deep Reinforcement Learning

Visual Reinforcement Learning with Self-Supervised 3D Representations

Robot Perception enables Complex Navigation Behavior via Self-Supervised Learning

Example-Driven Model-Based Reinforcement Learning for Solving Long-Horizon Visuomotor Tasks