Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov
Abstract: Offline Reinforcement Learning (ORL) is a promising approach to reduce the high sample complexity of traditional Reinforcement Learning (RL) by eliminating the need for continuous environmental interactions. ORL exploits a dataset of pre-collected transitions and thus expands the range of application of RL to tasks in which excessive environment queries increase training time and decrease efficiency, such as in modern AAA games. This paper introduces OfflineMania, a novel environment for ORL research. It is inspired by the iconic TrackMania series and developed using the Unity 3D game engine. The environment simulates a single-agent racing game in which the objective is to complete the track through optimal navigation. We provide a variety of datasets to assess ORL performance. These datasets, created from policies of varying ability and in different sizes, aim to offer a challenging testbed for algorithm development and evaluation. We further establish a set of baselines for a range of Online RL, ORL, and hybrid Offline to Online RL approaches using our environment.
What problem does this paper attempt to address?
This paper addresses the high sample complexity and low training efficiency of Reinforcement Learning (RL) when it is applied to modern AAA games. Traditional Online Reinforcement Learning (Online RL) needs a large number of environment interactions to collect data, which is impractical in game development, where simulation is slow and computationally expensive. To overcome this, the paper introduces OfflineMania, a new environment for Offline Reinforcement Learning (ORL), together with a set of datasets, with the aim of providing a standard test platform for developing and evaluating ORL algorithms.
### Main contributions:
1. **New environment**: The authors developed OfflineMania, a single-agent racing environment built with the Unity 3D game engine and inspired by the TrackMania series. The agent's task is to complete the track through optimal navigation.
2. **Diverse datasets**: They provide datasets of different sizes and quality levels, generated by policies of varying ability, as a challenging testbed for algorithm development and evaluation.
3. **Benchmarks**: They benchmark several Online RL, ORL, and hybrid Offline-to-Online RL methods on the environment and report the resulting performance.
### Detailed description of the environment and datasets:
- **Environment**:
- **State space**: The state is a 33-dimensional vector. It contains the readings of 15 raycasts (two values per ray: whether an object lies in the ray's path, and the distance to the detected object), plus the velocity components of the vehicle. The raycasts account for 15 × 2 = 30 entries, which leaves 3 entries for the velocity.
- **Action space**: An action consists of two continuous values. The first controls the steering angle (ranging from -1 to 1); the second controls acceleration or braking (1 is full acceleration, -1 is braking or reversing).
- **Reward function**: The reward function is designed to measure the progress of the vehicle on the track while penalizing collisions. The specific formula is as follows:
\[
r_t =
\begin{cases}
r_{\text{prog}} - \lambda \|v_{\text{car}}\| & \text{if in contact with a wall} \\
r_{\text{prog}} & \text{otherwise}
\end{cases}
\]
where \(r_{\text{prog}}\) is the progress of the current position \(p_t\) relative to the best position reached so far, \(p_{\text{best}}\); \(\|v_{\text{car}}\|\) is the magnitude of the vehicle's velocity at the moment of contact; and \(\lambda\) is a fixed penalty coefficient (set to 50 in this paper).
- **Episode**: At the start of each episode, the position and orientation of the vehicle are randomly initialized. The episode length is fixed at 2000 steps.
- **Datasets**:
- **Basic datasets**: Three datasets (basic, medium, expert) were generated using policies taken from different stages of training. Each dataset contains 100,000 transitions.
- **Mixed datasets**: Two mixed datasets (mix large and mix small) were created. Mix large contains 200,000 transitions and mix small contains 5,000 transitions. In both, 90% of the transitions come from the basic policy, 7% from the medium policy, and 3% from the expert policy.
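
The reward rule and the mixed-dataset proportions described above can be made concrete with a small sketch. This is an illustration only, not the authors' Unity implementation: the function names `step_reward` and `mix_counts`, the constant `WALL_PENALTY`, the representation of the velocity as a tuple of components, and the rounding scheme are all assumptions introduced here.

```python
import math

# Penalty coefficient lambda; 50 is the value quoted in the summary above.
WALL_PENALTY = 50.0


def step_reward(progress, velocity, touching_wall, penalty=WALL_PENALTY):
    """Hypothetical per-step reward: progress, minus a speed-scaled wall penalty.

    progress      -- r_prog, progress relative to the best position so far
    velocity      -- iterable of velocity components (assumed representation)
    touching_wall -- True if the car is in contact with a wall this step
    """
    if not touching_wall:
        return progress
    speed = math.sqrt(sum(c * c for c in velocity))  # ||v_car||
    return progress - penalty * speed


def mix_counts(total, fractions=(0.90, 0.07, 0.03)):
    """Split `total` transitions across the basic, medium, and expert policies.

    Rounding each share and giving any leftover to the first (largest) share
    is an assumption made here so the counts always sum to `total`.
    """
    counts = [round(total * f) for f in fractions]
    counts[0] += total - sum(counts)
    return dict(zip(("basic", "medium", "expert"), counts))


if __name__ == "__main__":
    print(step_reward(1.0, (3.0, 0.0, 4.0), touching_wall=True))
    print(mix_counts(200_000))
    print(mix_counts(5_000))
```

Under these assumptions, mix small holds only 150 expert transitions out of 5,000, which helps explain why the mixed datasets are a demanding setting for offline methods.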
### Benchmark test results:
- **Online RL**:
- PPO reached an average reward of 1183 after 15 million environment interactions, the stronger of the two online results.
- SAC reached an average reward of only 215 after 3 million environment interactions.
- **Offline RL**:
- IQL performed best on all datasets. On the expert dataset it even outperformed the policy that generated the data.
- TD3BC and CQL performed worse than IQL on all datasets, with the largest gap on the expert dataset.