Abstract: Sample efficiency in Reinforcement Learning (RL) has traditionally been driven by algorithmic enhancements. In this work, we demonstrate that scaling can also lead to substantial improvements. We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.
What problem does this paper attempt to address?
The paper asks **how model scaling can substantially improve the sample efficiency of Reinforcement Learning (RL) in continuous control tasks**. The authors challenge the traditional practice of relying on algorithmic improvements alone and argue that scaling model capacity, combined with domain-specific RL enhancements, yields better performance on complex tasks.
### Main problems and goals
1. **Limitations of traditional methods**:
- Previous research has mainly tried to improve sample efficiency through algorithmic improvements, such as mitigating value overestimation and designing better exploration strategies.
- These methods usually rely on small network architectures and may not fully exploit the advantages of large-scale models.
2. **Potential of model scaling**:
- The paper explores whether performance can be significantly improved by increasing the number of model parameters and the number of gradient updates (i.e., scaling model capacity and the replay ratio).
- In particular, it asks whether this approach can surpass existing model-free and model-based methods on continuous control tasks.
3. **Key innovation points**:
- A new algorithm, BRO (Bigger, Regularized, Optimistic), was introduced, which combines a larger critic network, strong regularization, and optimistic exploration.
- Extensive experiments show that BRO significantly outperforms existing methods across the DeepMind Control, MetaWorld, and MyoSuite benchmarks.
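To make the "number of gradient updates" axis concrete: the replay ratio is commonly defined as the number of gradient updates performed per environment step. The minimal sketch below only counts updates for a given budget. The function name, the `warmup_steps` parameter, and the example values are illustrative assumptions, not settings taken from the paper.

```python
def count_gradient_updates(env_steps: int, replay_ratio: int, warmup_steps: int = 0) -> int:
    """Count gradient updates for a run with a fixed replay ratio.

    replay_ratio: gradient updates performed after each environment step.
    warmup_steps: hypothetical number of initial steps with data collection only.
    """
    updates = 0
    for step in range(env_steps):
        # (a real agent would collect one transition into the replay buffer here)
        if step < warmup_steps:
            continue
        for _ in range(replay_ratio):
            # (a real agent would sample a batch and update actor/critic here)
            updates += 1
    return updates


if __name__ == "__main__":
    # Same environment budget, different replay ratios (illustrative values).
    for ratio in (1, 2, 10):
        print(ratio, count_gradient_updates(env_steps=1000, replay_ratio=ratio))
```

At a fixed environment-step budget, a higher replay ratio extracts more updates from the same data, which is why it trades compute for sample efficiency.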
### Core elements of the solution
- **Bigger**: Use a larger critic network (about 5M parameters), roughly 7 times larger than a standard SAC model.
- **Regularized**: Adopt the BroNet architecture together with Layer Normalization, weight decay, and full-parameter resets to keep the scaled-up model stable and performant.
- **Optimistic**: Use dual-policy optimistic exploration and non-pessimistic quantile Q-value approximation to balance exploration and exploitation.
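The ingredients named above can be illustrated with a small, dependency-free sketch. It is not the BroNet architecture or the authors' implementation. It shows generic layer normalization, a gradient step with decoupled weight decay, a full-parameter reset, and one assumed reading of "non-pessimistic" value estimation (averaging two critics instead of taking their minimum, on scalar rather than quantile values). All function names and hyperparameter values are hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and roughly unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def step_with_weight_decay(weights, grads, lr=0.1, weight_decay=0.01):
    """One gradient step plus decoupled weight decay (shrinks weights toward zero)."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


def full_reset(num_params, rng, scale=0.1):
    """Full-parameter reset: discard all learned weights and re-initialize them."""
    return [rng.uniform(-scale, scale) for _ in range(num_params)]


def aggregate_critics(q1, q2, pessimistic):
    """Combine two critic estimates into a single value."""
    if pessimistic:
        return min(q1, q2)   # lower bound, as in clipped double Q-learning
    return 0.5 * (q1 + q2)   # assumed non-pessimistic variant: plain mean


if __name__ == "__main__":
    rng = random.Random(0)
    weights = full_reset(4, rng)
    weights = step_with_weight_decay(weights, grads=[0.0] * 4)
    print(layer_norm([1.0, 2.0, 3.0]))
    print(aggregate_critics(1.0, 3.0, pessimistic=True))
    print(aggregate_critics(1.0, 3.0, pessimistic=False))
```

Taking the minimum biases value estimates downward, while the mean removes that bias, so the choice between the two shifts how conservative the learned values are.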
### Experimental results
- BRO significantly outperforms existing model-free and model-based methods across 40 complex tasks and achieves near-optimal policies on the Dog and Humanoid tasks.
- The BRO (Fast) variant significantly improves computational efficiency and shortens training time while maintaining sample efficiency.
### Summary
This paper systematically studies the impact of model scaling on continuous control tasks and proposes the BRO algorithm, showing that, with appropriate regularization and supporting techniques, scaling the model can significantly improve sample efficiency and performance. This finding opens a new direction for future research, especially for complex physical control tasks.