Abstract: Learning policies from high-dimensional visual inputs, such as pixels and point clouds, is crucial in various applications. Visual reinforcement learning is a promising approach that directly trains policies from visual observations, although it faces challenges in sample efficiency and computational costs. This study conducts an empirical comparison of State-to-Visual DAgger, a two-stage framework that initially trains a state policy before adopting online imitation to learn a visual policy, and Visual RL across a diverse set of tasks. We evaluate both methods across 16 tasks from three benchmarks, focusing on their asymptotic performance, sample efficiency, and computational costs. Surprisingly, our findings reveal that State-to-Visual DAgger does not universally outperform Visual RL but shows significant advantages in challenging tasks, offering more consistent performance. In contrast, its benefits in sample efficiency are less pronounced, although it often reduces the overall wall-clock time required for training. Based on our findings, we provide recommendations for practitioners and hope that our results contribute valuable perspectives for future research in visual policy learning.

### What problem does this paper attempt to solve?

This paper examines when, in visual policy learning, State-to-Visual DAgger should be preferred over direct visual reinforcement learning (Visual RL). Specifically, it addresses the following questions:

1. **Sample efficiency and computational cost**: Visual RL can learn policies directly from high-dimensional visual inputs (such as pixels and point clouds), but it usually suffers from low sample efficiency and high computational cost. By comparing the two methods, the paper aims to identify the tasks in which State-to-Visual DAgger has an advantage on these two measures.
2. **Asymptotic performance**: The paper evaluates the asymptotic performance of both methods across tasks to determine which one reaches better final performance.
3. **Impact of task difficulty**: The paper finds that State-to-Visual DAgger shows significant advantages on complex tasks, while on simple tasks its performance is comparable to or slightly below that of Visual RL. It therefore tries to clarify at what level of task difficulty State-to-Visual DAgger becomes the better choice.
4. **Stability and consistency**: The paper also examines how stable and consistent the two methods are during training and finds that State-to-Visual DAgger gives more consistent and stable performance after convergence.

### Research background

Visual reinforcement learning (Visual RL) is a key technique for learning policies from high-dimensional visual inputs (such as images and point clouds), with wide application in robotic manipulation, navigation, and autonomous driving. However, Visual RL suffers from low sample efficiency and high computational cost.
To address these problems, researchers proposed State-to-Visual DAgger, which proceeds in two stages:

- **First stage**: Train a teacher policy using low-dimensional state observations.
- **Second stage**: Transfer the teacher policy's knowledge to a visual policy through online imitation learning.

### Experimental setup

To compare the two methods fairly, the authors selected 16 tasks from three benchmarks:

- **ManiSkill**: Tasks such as fixed-base and mobile robotic arm manipulation and dual-arm coordination.
- **DMControl**: Locomotion and classic control tasks across different robot morphologies.
- **Adroit**: Dexterous-hand manipulation tasks.

### Main findings

1. **Asymptotic performance**:
   - On difficult tasks, State-to-Visual DAgger significantly outperforms Visual RL.
   - On simple tasks, the two perform comparably, or Visual RL is slightly better.
2. **Sample efficiency**:
   - On difficult tasks, State-to-Visual DAgger shows higher sample efficiency, mainly because of its better asymptotic performance.
   - On simple tasks, the sample efficiency of the two is comparable.
3. **Computational cost (wall-clock time)**:
   - State-to-Visual DAgger shows a significant wall-clock advantage on most tasks, even simple ones. This is mainly because Visual RL must train the visual encoder and render visual observations throughout training, while State-to-Visual DAgger performs these operations only in the second stage.
4. **Stability and consistency**:
   - State-to-Visual DAgger gives more consistent and stable performance after convergence, especially on difficult tasks.
### Conclusions and recommendations

Based on these findings, the authors offer the following recommendations for practitioners:

- **When Visual RL struggles to solve the task**: For complex tasks, prefer State-to-Visual DAgger: learn a policy efficiently from low-dimensional state information first, then transfer it to high-dimensional visual inputs.
- **When a state RL implementation already exists**: If state RL is already implemented and low-dimensional state observations can be extracted or simulated, moving to State-to-Visual DAgger is a natural next step.
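
The two-stage procedure described under "Research background" can be illustrated with a minimal sketch. This is not the paper's implementation: the toy dynamics, the linear `render` function standing in for pixels or point clouds, the fixed `teacher_policy` standing in for an already trained state policy, the linear student fitted by least squares, and the schedule in which the teacher drives only the first rollout are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, OBS_DIM, HORIZON = 2, 16, 20
# Hypothetical "renderer": a fixed random linear map from the state to a
# higher-dimensional observation, standing in for pixels or point clouds.
RENDER = rng.normal(size=(OBS_DIM, STATE_DIM))


def render(state):
    return RENDER @ state


def teacher_policy(state):
    # Stage 1 stand-in: assume a state policy has already been trained.
    return -0.5 * state


def step(state, action):
    # Toy dynamics (assumption): the action shifts the state directly.
    return state + action


def rollout(student_weights, beta):
    """Collect (observation, teacher action) pairs along one rollout."""
    state = rng.normal(size=STATE_DIM)
    observations, labels = [], []
    for _ in range(HORIZON):
        obs = render(state)
        expert_action = teacher_policy(state)  # teacher labels every visited state
        observations.append(obs)
        labels.append(expert_action)
        # With probability beta execute the teacher, otherwise the student.
        if rng.random() < beta:
            action = expert_action
        else:
            action = student_weights @ obs
        state = step(state, action)
    return observations, labels


def state_to_visual_dagger(iterations=5):
    """Stage 2: online imitation of the state teacher by a visual student."""
    student_weights = np.zeros((STATE_DIM, OBS_DIM))
    data_obs, data_act = [], []
    for i in range(iterations):
        beta = 1.0 if i == 0 else 0.0  # teacher drives only the first rollout
        obs, act = rollout(student_weights, beta)
        data_obs.extend(obs)  # aggregate data across iterations
        data_act.extend(act)
        # Supervised step: least-squares fit on the aggregated dataset.
        X, Y = np.asarray(data_obs), np.asarray(data_act)
        solution, *_ = np.linalg.lstsq(X, Y, rcond=None)
        student_weights = solution.T
    return student_weights


if __name__ == "__main__":
    weights = state_to_visual_dagger()
    # If imitation succeeded, the student acts like the teacher on rendered states.
    print(np.round(weights @ RENDER, 3))
```

The point of the loop is that, after the first iteration, the student's own rollouts decide which states are visited, while the teacher, which sees the low-dimensional state, supplies the action labels. Rendering is needed only inside this second stage, which is consistent with the wall-clock argument in the findings above.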

When Should We Prefer State-to-Visual DAgger Over Visual Reinforcement Learning?
