Racing with Vision Transformer Architecture

Chengwen Tian, Liang Song
DOI: https://doi.org/10.1109/ishc56805.2022.00055
2022-01-01
Abstract: Vision-based reinforcement learning (RL) holds great potential for complex decision-making problems and has benefited numerous research domains, such as game intelligence, medical diagnosis, and autonomous driving. In previous work, deep reinforcement learning (DRL) frameworks that combine CNNs with value-based algorithms were applied to Atari games and achieved strong performance. However, extending these applications with more recent deep models remains an open problem. In this work, we introduce the Vision Transformer (ViT) as a feature extraction network and compare it with a traditional CNN to explore the potential of ViT in RL. Both models are tested in the CarRacing-v0 pixel environment and analyzed in terms of sample efficiency and algorithmic stability. The results show that ViT, which processes images through its self-attention mechanism, is a promising solution for RL and opens up new avenues for future work.
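As a rough illustration of how a ViT consumes a pixel observation, the sketch below splits one frame into the flattened patches that a ViT would then embed and pass through self-attention. It is not the authors' implementation: CarRacing-v0 observations are 96x96 RGB frames, but the patch size of 16 and the helper name `split_into_patches` are assumptions made here for illustration, since the abstract does not give these details.

```python
# Sketch only: patch size and helper names are assumptions, not from the paper.
FRAME_SIZE = 96   # CarRacing-v0 observations are 96x96 RGB frames
PATCH_SIZE = 16   # assumed; the abstract does not state the patch size
CHANNELS = 3


def split_into_patches(frame, patch_size=PATCH_SIZE):
    """Flatten an H x W x C nested-list frame into a list of patch vectors."""
    height, width = len(frame), len(frame[0])
    if height % patch_size or width % patch_size:
        raise ValueError("frame dimensions must be divisible by patch_size")
    patches = []
    # Walk the frame in row-major order, one non-overlapping patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                value
                for row in frame[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Dummy all-zero frame standing in for an environment observation.
frame = [[[0] * CHANNELS for _ in range(FRAME_SIZE)] for _ in range(FRAME_SIZE)]
patches = split_into_patches(frame)
print(len(patches), len(patches[0]))
```

Under these assumptions a frame yields 36 patches of 768 values each (6 x 6 patches, each 16 x 16 x 3). A CNN would instead slide convolutional filters over the full frame, which is the contrast the comparison in the abstract rests on.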