Deep Reinforcement Learning with Swin Transformers

Li Meng, Morten Goodwin, Anis Yazidi, Paal Engelstad
2024-06-24
Abstract: Transformers are neural network models that utilize multiple layers of self-attention heads and have exhibited enormous potential in natural language processing tasks. Meanwhile, there have been efforts to adapt transformers to visual tasks of machine learning, including Vision Transformers and Swin Transformers. Although some researchers use Vision Transformers for reinforcement learning tasks, their experiments remain at a small scale due to the high computational cost. This article presents the first online reinforcement learning scheme that is based on Swin Transformers: Swin DQN. In contrast to existing research, our novel approach demonstrates superior performance with experiments on 49 games in the Arcade Learning Environment. The results show that our approach achieves significantly higher maximal evaluation scores than the baseline method in 45 of all the 49 games (92%), and higher mean evaluation scores than the baseline method in 40 of all the 49 games (82%).
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to apply Swin Transformers to deep reinforcement learning (DRL), with the goal of improving model performance in image-based reinforcement learning environments. It introduces a new online reinforcement learning scheme, Swin DQN, which replaces the visual backbone of the traditional Deep Q-Network (DQN) with a Swin Transformer. In experiments on 49 Atari games, the paper reports that Swin DQN improves on the traditional DQN method in both maximum and mean evaluation scores, especially in games that require processing high-complexity features or fine-grained world modeling.

The main contributions of the paper are as follows:

1. **Proposing Swin DQN**: Swin Transformers are applied to online reinforcement learning for the first time, addressing the high computational cost that has kept Vision Transformer experiments in reinforcement learning at a small scale.
2. **Performance improvement**: Across the 49 Atari games, Swin DQN achieves a higher maximum evaluation score than the baseline method in 45 games (92%) and a higher mean evaluation score in 40 games (82%).
3. **Local self-attention mechanism**: Swin Transformers group image pixels into small tokenized patches and apply self-attention only within fixed-size local windows, which reduces computational complexity and improves model efficiency.

In conclusion, the paper aims to improve the performance of deep reinforcement learning models on visual tasks by introducing Swin Transformers, and it shows that Swin DQN outperforms the baseline in most of the Atari games tested.
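
To make the local self-attention idea concrete, here is a minimal pure-Python sketch of attention restricted to fixed-size windows. It is not the authors' implementation. The function names (`window_partition`, `self_attention`, `window_attention`, `attention_pairs`) are hypothetical, and the sketch assumes a single attention head with identity query/key/value projections. It omits the learned projections, shifted windows, and relative position bias that a real Swin Transformer uses.

```python
import math


def window_partition(grid, window_size):
    """Split an H x W grid of token vectors into non-overlapping windows.

    Returns a list of windows; each window is a flat list of
    window_size * window_size token vectors.
    """
    height, width = len(grid), len(grid[0])
    if height % window_size or width % window_size:
        raise ValueError("grid size must be divisible by window_size")
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            windows.append([
                grid[r][c]
                for r in range(top, top + window_size)
                for c in range(left, left + window_size)
            ])
    return windows


def self_attention(tokens):
    """Scaled dot-product self-attention over one list of token vectors.

    Assumption for brevity: queries, keys, and values are the tokens
    themselves (identity projections), with a single head.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return outputs


def window_attention(grid, window_size):
    """Apply self-attention independently inside each window."""
    return [self_attention(w) for w in window_partition(grid, window_size)]


def attention_pairs(height, width, window_size=None):
    """Count query-key score computations for global or windowed attention."""
    tokens = height * width
    if window_size is None:
        return tokens * tokens  # every token attends to every token
    return tokens * window_size * window_size  # only tokens in the same window
```

In this sketch, an 8 x 8 token grid needs 4096 query-key scores under global attention (`attention_pairs(8, 8)`) but only 1024 with 4 x 4 windows (`attention_pairs(8, 8, 4)`). The windowed cost grows linearly with the number of tokens for a fixed window size, whereas the global cost grows quadratically.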