Abstract: Transformers are neural network models that utilize multiple layers of self-attention heads and have exhibited enormous potential in natural language processing tasks. Meanwhile, there have been efforts to adapt transformers to visual machine learning tasks, including Vision Transformers and Swin Transformers. Although some researchers use Vision Transformers for reinforcement learning tasks, their experiments remain at a small scale due to the high computational cost. This article presents the first online reinforcement learning scheme based on Swin Transformers: Swin DQN. In contrast to existing research, our novel approach demonstrates superior performance in experiments on 49 games in the Arcade Learning Environment. The results show that our approach achieves significantly higher maximal evaluation scores than the baseline method in 45 of the 49 games (92%), and higher mean evaluation scores than the baseline method in 40 of the 49 games (82%).

What problem does this paper attempt to address?

This paper addresses how to apply Swin Transformers to deep reinforcement learning (DRL), with the goal of improving model performance in image-based reinforcement learning environments. It introduces a new online reinforcement learning scheme, Swin DQN, which builds on Swin Transformers to improve the traditional Deep Q-Network (DQN) method. In experiments on 49 Atari games, the paper reports that Swin DQN improves on the traditional DQN method in both maximal and mean evaluation scores, especially in games that require processing high-complexity features or fine-grained world modeling.

The main contributions of the paper are as follows:

1. **Proposing Swin DQN**: Swin Transformers are applied to online reinforcement learning for the first time, addressing the high computational cost that has kept Vision Transformer experiments in reinforcement learning at a small scale.
2. **Performance improvement**: Across the 49 Atari games, Swin DQN achieves a higher maximal evaluation score than the baseline method in 45 games (92%) and a higher mean evaluation score in 40 games (82%).
3. **Local self-attention mechanism**: Swin Transformers reduce computational complexity and improve efficiency by grouping image pixels into small tokenized patches and applying self-attention locally within fixed-size windows.

In conclusion, the paper aims to improve the performance of deep reinforcement learning models on visual tasks by introducing Swin Transformers, and it shows that Swin DQN outperforms the baseline in most of the Atari games tested.
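The local self-attention idea described above can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names `window_partition` and `window_self_attention`, the 8x8 feature map, the window size of 4, and the use of identity query/key/value projections are all assumptions chosen for illustration. The sketch also omits the window shifting and learned projections that a full Swin Transformer block uses.

```python
import numpy as np


def window_partition(x, window_size):
    """Split an (H, W, C) token grid into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    Hypothetical helper for illustration only.
    """
    H, W, C = x.shape
    assert H % window_size == 0 and W % window_size == 0
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Bring the two window-index axes to the front, then flatten each window.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window_size * window_size, C)


def window_self_attention(windows):
    """Single-head self-attention computed independently inside each window.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which keeps the sketch short.
    """
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the tokens of the same window.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ windows


rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 8, 4))   # 8x8 grid of 4-dim patch tokens
windows = window_partition(feature_map, window_size=4)
out = window_self_attention(windows)

# Token pairs scored by global attention vs. windowed attention.
num_tokens = feature_map.shape[0] * feature_map.shape[1]
global_pairs = num_tokens ** 2
local_pairs = windows.shape[0] * windows.shape[1] ** 2
```

In this toy setting, global attention over all 64 tokens would score 64 × 64 = 4096 token pairs, while attention restricted to four 4x4 windows scores 4 × 16 × 16 = 1024. This gap is the kind of saving that makes window-based attention cheaper on image inputs.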

Deep Reinforcement Learning with Swin Transformers

Self-Supervised Learning with Swin Transformers

Transformer Based Reinforcement Learning For Games

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Swin Transformer V2: Scaling Up Capacity and Resolution

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

IO Transformer: Evaluating SwinV2-Based Reward Models for Computer Vision

Multi-Game Decision Transformers

SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations

On Transforming Reinforcement Learning With Transformers: The Development Trajectory

Racing with Vision Transformer Architecture

On Transforming Reinforcement Learning by Transformer: The Development Trajectory

Improved deep learning image classification algorithm based on Swin Transformer V2

Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels

AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning

Mastering Chess with a Transformer Model

SwinVI: 3D Swin Transformer Model with U-net for Video Inpainting