Vision-Language Navigation with Energy-Based Policy

Rui Liu, Wenguan Wang, Yi Yang
2024-10-18
Abstract: Vision-language navigation (VLN) requires an agent to execute actions following human instructions. Existing VLN models are optimized through expert demonstrations by supervised behavioural cloning or incorporating manual reward engineering. While straightforward, these efforts overlook the accumulation of errors in the Markov decision process, and struggle to match the distribution of the expert policy. Going beyond this, we propose an Energy-based Navigation Policy (ENP) to model the joint state-action distribution using an energy-based model. At each step, low energy values correspond to the state-action pairs that the expert is most likely to perform, and vice versa. Theoretically, the optimization objective is equivalent to minimizing the forward divergence between the occupancy measure of the expert and ours. Consequently, ENP learns to globally align with the expert policy by maximizing the likelihood of the actions and modeling the dynamics of the navigation states in a collaborative manner. With a variety of VLN architectures, ENP achieves promising performances on R2R, REVERIE, RxR, and R2R-CE, unleashing the power of existing VLN models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key challenges in the Vision-Language Navigation (VLN) task. Specifically, the authors point out the limitations of existing VLN models and propose a new method to overcome them.

#### 1. **Limitations of Existing Methods**

Existing VLN models are mainly optimized through expert demonstrations, using Behavioral Cloning (BC) or manually designed reward functions. These approaches have the following problems:

- **Error Accumulation**: In partially observable Markov decision processes, BC is prone to quadratic accumulation of errors, especially on long trajectories.
- **Distribution Mismatch**: BC and traditional reinforcement learning methods struggle to fully align with the distribution of the expert policy, especially in unseen environments.
- **Complex Reward Engineering**: Manually designed reward functions require extensive tuning, may not be robust to changes in environment dynamics, and are difficult to generalize to different scenarios.

#### 2. **The Proposed New Method**

To solve the above problems, the authors propose the Energy-based Navigation Policy (ENP). The main contributions of ENP include:

- **Modeling the Joint State-Action Distribution**: ENP models the joint distribution \(P(s, a)\) of state-action pairs through an Energy-based Model (EBM), rather than only optimizing the conditional action distribution \(P(a|s)\).
- **Globally Aligning with Expert Policies**: ENP achieves global alignment with the expert policy by maximizing the likelihood of actions and modeling the dynamics of navigation states.
- **The Optimization Objective is Equivalent to Minimizing the Forward Divergence**: The optimization objective of ENP is equivalent to minimizing the forward KL-divergence between the expert occupancy measure and its own occupancy measure.

#### 3. **Theoretical Advantages**

Theoretically, the optimization objective of ENP is to maximize the expected log-likelihood of the joint distribution \(P(s, a)\). This is equivalent to estimating the unnormalized probability density (i.e., the energy) of the expert occupancy measure, so that optimization targets the entire trajectory rather than single-step decisions only.

#### 4. **Experimental Verification**

The authors conducted experiments on multiple VLN benchmark datasets, including R2R, REVERIE, RxR, and R2R-CE. According to the reported results, ENP improves a variety of existing VLN architectures on these datasets across several metrics.

### Summary

This paper addresses the limitations of existing VLN models in terms of error accumulation, distribution mismatch, and reward engineering complexity by introducing the Energy-based Navigation Policy (ENP). ENP achieves global alignment with the expert policy by modeling the joint state-action distribution and reports improved performance on multiple benchmark datasets.
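
To make the difference between modeling \(P(a|s)\) and \(P(s, a)\) concrete, the following is a minimal sketch, not the authors' implementation. It assumes the per-action scores (logits) of a navigation network are reused as negative energies, so that \(\log P(s, a) = \log P(a|s) + \log P(s)\) up to a normalizing constant. All function names, the `weight` parameter, and the idea of supplying the logits of a model-sampled state (for example, one drawn by Langevin dynamics) are illustrative assumptions and do not come from the text above.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def joint_energy(logits, action):
    """E(s, a) = -f(s)[a]; lower energy means a more expert-like pair."""
    return -logits[action]


def state_energy(logits):
    """E(s) = -logsumexp_a f(s)[a]; energy of the state marginal."""
    return -logsumexp(logits)


def action_log_prob(logits, action):
    """log P(a | s) = E(s) - E(s, a), which is the usual log-softmax."""
    return state_energy(logits) - joint_energy(logits, action)


def joint_loss(expert_logits, expert_action, sampled_logits, weight=1.0):
    """Action likelihood term plus a contrastive state-marginal term.

    The first term is plain behavioural cloning. The second term lowers
    the energy of the expert state and raises the energy of a state
    sampled from the model (hypothetical input; treated as a constant).
    """
    bc_term = -action_log_prob(expert_logits, expert_action)
    marginal_term = state_energy(expert_logits) - state_energy(sampled_logits)
    return bc_term + weight * marginal_term


# Toy scores for three candidate actions at one navigation step.
logits = [2.0, 0.5, -1.0]
probs = [math.exp(action_log_prob(logits, a)) for a in range(len(logits))]
print("P(a|s):", [round(p, 3) for p in probs])
print("loss:", joint_loss(logits, 0, [0.1, 0.0, -0.2]))
```

With `weight=0.0` the loss reduces to ordinary behavioural cloning, which is one way to see the state-marginal term as the additional ingredient in this sketch.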