Abstract: Sample efficiency in Reinforcement Learning (RL) has traditionally been driven by algorithmic enhancements. In this work, we demonstrate that scaling can also lead to substantial improvements. We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to significantly improve the sample efficiency of Reinforcement Learning (RL) in continuous control tasks through model scaling**. Specifically, the authors challenge the traditional practice of relying on algorithmic improvements for sample efficiency and propose that scaling model capacity, combined with domain-specific RL enhancements, yields better performance on complex tasks.

### Main problems and goals

1. **Limitations of traditional methods**:
   - Previous research has mainly focused on improving sample efficiency through algorithmic improvements (such as addressing value overestimation or refining exploration strategies).
   - However, these methods usually rely on small network architectures and may not fully exploit the advantages of large-scale models.
2. **Possibilities of model scaling**:
   - The paper explores whether performance can be significantly improved by increasing the number of model parameters and the number of gradient updates (i.e., scaling the model capacity and the replay ratio).
   - In particular, it asks whether this approach can surpass existing model-free and model-based methods in continuous control tasks.
3. **Key innovations**:
   - A new algorithm, BRO (Bigger, Regularized, Optimistic), is introduced, which combines a larger critic network, strong regularization, and optimistic exploration.
   - Extensive experiments show that BRO significantly outperforms existing methods on multiple complex benchmarks, namely DeepMind Control, MetaWorld, and MyoSuite.

### Core elements of the solution

- **Bigger**: Use a larger critic network (about 5M parameters), which is about 7 times larger than the traditional SAC model.
- **Regularized**: Adopt the BroNet architecture, which includes Layer Normalization, weight decay, and full-parameter resets, to keep the scaled-up model stable and performant.
- **Optimistic**: Use dual-policy optimistic exploration and non-pessimistic quantile Q-value approximation to balance exploration and exploitation.

### Experimental results

- BRO significantly outperforms existing model-free and model-based methods across 40 complex tasks, and notably achieves near-optimal policies in the Dog and Humanoid tasks.
- The BRO (Fast) version significantly improves computational efficiency while maintaining sample efficiency, and completes training in a shorter time.

### Summary

This paper systematically studies the impact of model scaling on continuous control tasks and proposes the BRO algorithm, showing that with appropriate regularization and supporting techniques, model scaling can significantly improve sample efficiency and performance. This finding provides a new direction for future research, especially for complex physical control tasks.
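To make the "Optimistic" element above more concrete, the following is a minimal sketch of how a pessimistic, a non-pessimistic, and an optimistic value estimate can differ when several critics disagree. It is not the paper's implementation: the function names, the use of a plain mean as the non-pessimistic estimate, the `mean + optimism * std` form of the optimistic estimate, and the `optimism` parameter are all assumptions made for illustration. The only established piece is that taking the minimum over two critics is the clipped double-Q rule used by SAC-style methods.

```python
from statistics import mean, pstdev


def pessimistic_q(q_values):
    """Clipped double-Q style estimate: the minimum over the critic ensemble."""
    return min(q_values)


def non_pessimistic_q(q_values):
    """Assumed non-pessimistic estimate: the ensemble mean, with no lower-bound bias."""
    return mean(q_values)


def optimistic_q(q_values, optimism=1.0):
    """Assumed optimistic estimate: ensemble mean plus a bonus for critic disagreement.

    `optimism` is a hypothetical coefficient; larger values reward actions
    whose value the critics are less certain about.
    """
    return mean(q_values) + optimism * pstdev(q_values)


# Two critics scoring the same state-action pair (made-up numbers).
q_estimates = [10.0, 14.0]

print("pessimistic    :", pessimistic_q(q_estimates))
print("non-pessimistic:", non_pessimistic_q(q_estimates))
print("optimistic     :", optimistic_q(q_estimates, optimism=1.0))
```

In a dual-policy setup of the kind the summary mentions, one would expect the optimistic estimate to guide the policy that collects data and a less optimistic estimate to train the policy that is evaluated, though the exact split used by BRO is not stated in this summary.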

Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control

Scaling Population-Based Reinforcement Learning with GPU Accelerated Simulation

REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback

Provably Robust Blackbox Optimization for Reinforcement Learning

Overestimation, Overfitting, and Plasticity in Actor-Critic: the Bitter Lesson of Reinforcement Learning

Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees

Bayesian Optimistic Optimization: Optimistic Exploration for Model-based Reinforcement Learning

Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization

Robust Reinforcement Learning for Continuous Control with Model Misspecification

Robust Predictable Control

Evolving Pareto-Optimal Actor-Critic Algorithms for Generalizability and Stability

Monotonic Robust Policy Optimization with Model Discrepancy

Cautious Bayesian Optimization for Efficient and Scalable Policy Search

Benchmarking Smoothness and Reducing High-Frequency Oscillations in Continuous Control Policies

Conformal Symplectic Optimization for Stable Reinforcement Learning

Efficient Reinforcement Learning via Decoupling Exploration and Utilization

Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

Bigger, Better, Faster: Human-level Atari with human-level efficiency

Towards model-free RL algorithms that scale well with unstructured data

Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales

Bridging RL Theory and Practice with the Effective Horizon