Abstract: Vision-and-language navigation requires an agent to navigate in a photo-realistic environment by following natural language instructions. Mainstream methods employ imitation learning (IL) to let the agent imitate the behavior of a teacher. The trained model tends to overfit the teacher's biased behavior, which results in poor generalization. Recently, researchers have sought to combine IL and reinforcement learning (RL) to overcome overfitting and improve generalization. However, these methods still require expensive trajectory annotation. We propose a hierarchical RL-based method, discovering intrinsic subgoals via hierarchical (DISH) RL, which overcomes the generalization limitations of current methods and removes the need for expensive label annotations. First, the high-level agent (manager) decomposes the complex navigation problem into simple intrinsic subgoals. Then, the low-level agent (worker) uses an intrinsic subgoal-driven attention mechanism to predict actions in a smaller state space. We place no constraints on the semantics that subgoals may convey, allowing the agent to autonomously learn intrinsic, more generalizable subgoals from navigation tasks. Furthermore, we design a novel history-aware discriminator (HAD) for the worker. The discriminator incorporates historical information into subgoal discrimination and provides the worker with additional intrinsic rewards to alleviate reward sparsity. Without labeled actions, our method supervises the worker in a self-supervised manner, using subgoals generated by the manager. Comparison experiments on the Room-to-Room (R2R) dataset show that DISH significantly outperforms the baseline in accuracy and efficiency.
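The worker's two ingredients described in the abstract, attention driven by a subgoal and an intrinsic reward added to a sparse task reward, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the scaled dot-product form of the attention, the additive reward, and the weight `beta` are all hypothetical stand-ins.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def worker_action_probs(subgoal, candidate_features):
    """Hypothetical subgoal-driven attention.

    Scores each candidate action feature by its scaled dot product with
    the subgoal vector, then normalizes the scores into probabilities.
    """
    scale = math.sqrt(len(subgoal))
    scores = [dot(subgoal, f) / scale for f in candidate_features]
    return softmax(scores)


def worker_reward(extrinsic, discriminator_score, beta=0.5):
    """Hypothetical reward shaping.

    Adds a dense intrinsic bonus (a stand-in for a discriminator's
    output) to the sparse extrinsic reward. `beta` is an assumed weight.
    """
    return extrinsic + beta * discriminator_score


# Toy example: a 2-D subgoal emitted by the manager and three candidate actions.
subgoal = [1.0, 0.0]
candidates = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
probs = worker_action_probs(subgoal, candidates)
best = max(range(len(probs)), key=probs.__getitem__)
```

In this toy setting the first candidate is best aligned with the subgoal, so it receives the highest probability, and the worker still receives a nonzero reward signal on steps where the extrinsic reward is zero.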

Long-Sighted Imitation Learning for Partially Observable Control

HILONet: Hierarchical Imitation Learning from Non-Aligned Observations

Learning Safety-Aware Policy with Imitation Learning for Context-Adaptive Navigation

Robust Visual Imitation Learning with Inverse Dynamics Representations

Keyframe-Focused Visual Imitation Learning

Discovering Intrinsic Subgoals for Vision-and-Language Navigation via Hierarchical Reinforcement Learning

Seeing Differently, Acting Similarly: Heterogeneously Observable Imitation Learning

Imitator Learning: Achieve Out-of-the-Box Imitation Ability in Variable Environments

Imitation Learning with Human Eye Gaze via Multi-Objective Prediction

Off-policy Imitation Learning from Visual Inputs

Extraneousness-Aware Imitation Learning

Online Multi-modal Imitation Learning Via Lifelong Intention Encoding

Hierarchical Interpretable Imitation Learning for End-to-End Autonomous Driving

Yaw-Guided Imitation Learning for Autonomous Driving in Urban Environments

Limited Preference Aided Imitation Learning from Imperfect Demonstrations

VOILA: Visual-Observation-Only Imitation Learning for Autonomous Navigation

Imperative Learning: A Self-supervised Neural-Symbolic Learning Framework for Robot Autonomy

RIRL: A Recurrent Imitation and Reinforcement Learning Method for Long-Horizon Robotic Tasks

Synergistic Reinforcement and Imitation Learning for Vision-driven Autonomous Flight of UAV Along River

Zero-shot Imitation Learning from Demonstrations for Legged Robot Visual Navigation

Deep Recurrent Q-Learning for Partially Observable MDPs