Learning Visual Representation for Autonomous Drone Navigation Via a Contrastive World Model

Jiang Zhao, Yibo Wang, Zhihao Cai, Ningjun Liu, Kun Wu, Yingxun Wang
DOI: https://doi.org/10.1109/tai.2023.3283488
2024-01-01
IEEE Transactions on Artificial Intelligence
Abstract: Visuomotor policy learning for vision-based navigation remains challenging yet essential for autonomous systems. Learning a task-specific policy from scratch simplifies the training pipeline but suffers from poor data efficiency and transferability. The problem tends to become even more intractable in a low-data regime. In this work, we present a self-supervised representation learning architecture that incorporates spatial and temporal information via a contrastive world model (STC) to extract image representations for vision-based navigation tasks. Specifically, STC leverages a dynamics transition model based on a recurrent neural network to construct a joint low-dimensional latent space for spatial and temporal representations. We simultaneously optimize all components of this architecture using a multi-objective contrastive training loss. The resulting pretrained encoder acts as a standalone feature extractor that supports the subsequent policy learning procedure. We evaluate the final optimized visuomotor policy in both a simulated drone navigation environment and on an out-of-domain dataset. Experimental results demonstrate that the proposed method outperforms task-specific and representative contrastive learning baselines in challenging, complex visual environments, improving data efficiency by more than half and providing significant gains in both learning speed and final performance. Code and video are available at: https://github.com/yibow-wang/cwm4drone.
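The abstract does not spell out the form of the multi-objective contrastive loss, so the following is only an illustrative sketch of a generic InfoNCE-style contrastive term, not the authors' implementation. It assumes each anchor latent has exactly one matching positive (for a spatial term, a second view of the same frame; for a temporal term, the latent predicted by a dynamics model paired with the encoding of the actual next frame). The names `info_nce_loss` and `multi_objective_loss`, the temperature, and the weights are all hypothetical.

```python
import math


def _normalize(vector):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (illustrative, not taken from the paper).

    anchors[i] and positives[i] form a matching pair; every other
    positives[j] in the batch serves as a negative for anchors[i].
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Temperature-scaled cosine similarity to every candidate in the batch.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature
                  for cand in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(a)


def multi_objective_loss(spatial_term, temporal_term,
                         spatial_weight=1.0, temporal_weight=1.0):
    """Hypothetical weighted sum of two contrastive terms (weights are assumed)."""
    return spatial_weight * spatial_term + temporal_weight * temporal_term
```

In this sketch the same `info_nce_loss` function would be called once on spatial pairs and once on temporal pairs, with the two results combined by `multi_objective_loss`. The actual objectives and their weighting should be taken from the paper and the linked repository.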