Abstract: The option framework, one of the most promising Hierarchical Reinforcement Learning (HRL) frameworks, is developed based on the Semi-Markov Decision Problem (SMDP) and employs a triple formulation of the option (i.e., an action policy, a termination probability, and an initiation set). These design choices, however, mean that the option framework: 1) has low sample efficiency, 2) cannot use more stable Markov Decision Problem (MDP) based learning algorithms, 3) represents abstract actions implicitly, and 4) is expensive to scale up. To overcome these problems, we propose a simple yet effective MDP implementation of the option framework: the Skill-Action (SA) architecture. Derived from the novel discovery that the SMDP option framework has an MDP equivalent, SA hierarchically extracts skills (abstract actions) from primary actions and explicitly encodes this knowledge into skill context vectors (embedding vectors). Although SA is MDP-formulated, skills can still be temporally extended by applying the attention mechanism to skill context vectors. Unlike the option framework, which requires M action policies for M skills, SA's action policy needs only one decoder to decode skill context vectors into primary actions. Under this formulation, SA can be optimized with any MDP-based policy gradient algorithm. Moreover, it is sample efficient, cheap to scale up, and theoretically proven to have lower variance. Our empirical studies on challenging infinite-horizon robot simulation environments demonstrate that SA not only outperforms all baselines by a large margin, but also exhibits smaller variance, faster convergence, and good interpretability. On transfer learning tasks, SA also outperforms the other models and shows its advantage in reusing knowledge across tasks. A potential impact of SA is to pave the way for a large-scale pre-training architecture in the reinforcement learning area.
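
The two structural ideas in the abstract (attention over skill context vectors to pick a skill, and a single shared decoder that maps a skill context vector to a primary action) can be illustrated with a toy sketch. This is not the authors' implementation: the class name `SkillActionSketch`, the weight matrices `W_query` and `W_dec`, the use of the previous skill's vector in the attention query, the `tanh` squashing, and all dimensions are assumptions made only for illustration, and no training is performed.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    return [dot(row, x) for row in W]


class SkillActionSketch:
    """Hypothetical toy model: M skill context vectors, an attention-based
    skill policy, and ONE decoder shared by all skills (untrained)."""

    def __init__(self, num_skills, state_dim, embed_dim, action_dim, seed=0):
        self.rng = random.Random(seed)

        def rand(rows, cols):
            return [[self.rng.gauss(0.0, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # One embedding (skill context vector) per skill.
        self.skills = rand(num_skills, embed_dim)
        # Assumption: the attention query is built from the state and the
        # previously active skill, so a skill can persist across steps.
        self.W_query = rand(embed_dim, state_dim + embed_dim)
        # A single decoder for all skills, instead of M action policies.
        self.W_dec = rand(action_dim, state_dim + embed_dim)

    def skill_probs(self, state, prev_skill):
        """Scaled dot-product attention weights over the skill vectors."""
        query = matvec(self.W_query, state + self.skills[prev_skill])
        scale = math.sqrt(len(query))
        return softmax([dot(query, key) / scale for key in self.skills])

    def act(self, state, prev_skill):
        """Sample a skill, then decode its context vector into an action."""
        probs = self.skill_probs(state, prev_skill)
        skill = self.rng.choices(range(len(probs)), weights=probs)[0]
        raw = matvec(self.W_dec, state + self.skills[skill])
        return skill, [math.tanh(a) for a in raw]


# Usage: roll the untrained model forward for a few steps.
model = SkillActionSketch(num_skills=4, state_dim=3, embed_dim=8, action_dim=2)
skill = 0
for t in range(5):
    state = [0.1 * t, -0.2, 0.5]
    skill, action = model.act(state, skill)
```

Because every step depends only on the current state and the previously selected skill, the pair (state, previous skill) can be treated as an ordinary MDP state in this sketch, which is the kind of setting where standard policy gradient methods apply.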

Action abstractions for amortized sampling

Graph learning-based generation of abstractions for reinforcement learning

Leveraging exploration in off-policy algorithms via normalizing flows

Sample Efficient Deep Reinforcement Learning with Online State Abstraction and Causal Transformer Model Prediction

Guiding the search in continuous state-action spaces by learning an action sampling distribution from off-target samples

FlowPG: Action-constrained Policy Gradient with Normalizing Flows

Learning Action Representations for Reinforcement Learning

GenPlan: Generative sequence models as adaptive planners

Learning Planning Abstractions from Language

Abstract Value Iteration for Hierarchical Reinforcement Learning

Geometric Active Exploration in Markov Decision Processes: the Benefit of Abstraction

Generative Flow Networks as Entropy-Regularized RL

QGFN: Controllable Greediness with Action Values

The Skill-Action Architecture: Learning Abstract Action Embeddings for Reinforcement Learning

Learning Action-Transferable Policy with Action Embedding

Deep RL with Hierarchical Action Exploration for Dialogue Generation

Generalizable Policy Improvement Via Reinforcement Sampling (Student Abstract)

Improving exploration efficiency of deep reinforcement learning through samples produced by generative model

Achieving Sample and Computational Efficient Reinforcement Learning by Action Space Reduction via Grouping

Simple Emergent Action Representations from Multi-Task Policy Training

Trajectory Planning with Deep Reinforcement Learning in High-Level Action Spaces