ENTL: Embodied Navigation Trajectory Learner

Klemen Kotar, Aaron Walsman, Roozbeh Mottaghi
2023-09-29
Abstract: We propose Embodied Navigation Trajectory Learner (ENTL), a method for extracting long sequence representations for embodied navigation. Our approach unifies world modeling, localization and imitation learning into a single sequence prediction task. We train our model using vector-quantized predictions of future states conditioned on current states and actions. ENTL's generic architecture enables sharing of the spatio-temporal sequence encoder for multiple challenging embodied tasks. We achieve competitive performance on navigation tasks using significantly less data than strong baselines while performing auxiliary tasks such as localization and future frame prediction (a proxy for world modeling). A key property of our approach is that the model is pre-trained without any explicit reward signal, which makes the resulting model generalizable to multiple tasks and environments.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses two problems in Embodied AI: handling long sequence data effectively and learning navigation without explicit reward signals. Specifically, the authors propose a method called "Embodied Navigation Trajectory Learner (ENTL)" with the following objectives:

1. **Unify Multiple Tasks**: Cast world modeling, localization, and imitation learning as a single sequence prediction task, so that the same spatio-temporal sequence encoder can be shared across them.
2. **Long Sequence Representation**: Use the Transformer architecture to handle sequences of 50 steps or more, which traditional methods struggle to process.
3. **Self-Supervised Pretraining**: Pretrain the model with future frame prediction as a self-supervised task, which avoids the need for explicit reward signals and helps the model generalize to different tasks and environments.
4. **Efficient Data Utilization**: Achieve competitive performance with significantly less data than strong baseline models.

In summary, ENTL targets long sequence processing in Embodied AI and uses self-supervised learning to improve both generalization and data efficiency.
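To make the idea of a single sequence prediction task more concrete, here is a minimal sketch of how a trajectory might be flattened into one token stream for next-token prediction. It is an illustration under stated assumptions, not the authors' implementation: the vocabulary sizes, the frame/pose/action ordering within a step, and the helper names `flatten_trajectory` and `next_token_pairs` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical vocabulary layout (an assumption, not taken from the paper):
# frame codes, pose bins, and actions occupy disjoint token-id ranges.
NUM_FRAME_CODES = 1024  # assumed size of a vector-quantization codebook
NUM_POSE_BINS = 64      # assumed number of discretized pose bins
NUM_ACTIONS = 4         # assumed size of the discrete action set

POSE_OFFSET = NUM_FRAME_CODES
ACTION_OFFSET = NUM_FRAME_CODES + NUM_POSE_BINS


def flatten_trajectory(
    frames: List[List[int]], poses: List[int], actions: List[int]
) -> Tuple[List[int], List[str]]:
    """Interleave per-step tokens into one sequence.

    Returns the token ids plus a parallel list naming each token's kind
    ("frame", "pose", or "action").
    """
    tokens: List[int] = []
    kinds: List[str] = []
    for frame_codes, pose_bin, action in zip(frames, poses, actions):
        for code in frame_codes:
            tokens.append(code)
            kinds.append("frame")
        tokens.append(POSE_OFFSET + pose_bin)
        kinds.append("pose")
        tokens.append(ACTION_OFFSET + action)
        kinds.append("action")
    return tokens, kinds


def next_token_pairs(
    tokens: List[int], kinds: List[str]
) -> Tuple[List[int], List[int], List[str]]:
    """Shift the sequence by one position for next-token prediction.

    The kind of each target says which task that prediction serves:
    "frame" for world modeling, "pose" for localization,
    "action" for imitation learning.
    """
    return tokens[:-1], tokens[1:], kinds[1:]


# Toy two-step trajectory with two frame codes per step.
tokens, kinds = flatten_trajectory(
    frames=[[3, 7], [5, 1]], poses=[2, 9], actions=[1, 0]
)
inputs, targets, target_kinds = next_token_pairs(tokens, kinds)
```

In this sketch, one sequence model trained on `(inputs, targets)` covers all three tasks, and `target_kinds` could be used to weight or mask the loss per task. No reward signal appears anywhere in the training targets, which matches the reward-free pretraining described above.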