Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms

Fan Chen, Yu Bai, Song Mei
DOI: https://doi.org/10.48550/arXiv.2209.14990
2022-12-16
Abstract: Partial Observability -- where agents can only observe partial information about the true underlying state of the system -- is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy, where (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects above for Partially Observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called B-stability. B-stable PSRs encompass the vast majority of known tractable subclasses such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with polynomial samples in relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.
Machine Learning, Artificial Intelligence, Statistics Theory
What problem does this paper attempt to address?
This paper addresses three open problems in partially observable reinforcement learning, all in the general setting of Predictive State Representations (PSRs):

1. **Lack of unified structural conditions**: Most existing work analyzes specific subclasses one at a time, without a unified structural condition that enables sample-efficient learning. This paper proposes a new condition, B-stability, which covers most of the known tractable subclasses.
2. **Loose existing sample complexity bounds**: For the known tractable subclasses, existing sample complexity bounds often carry large polynomial factors and are far from sharp. Instantiating the B-stability analysis in these subclasses substantially improves their sample complexities.
3. **Few sample-efficient algorithms**: Fewer sample-efficient algorithms are available under partial observability than in fully observable RL. This paper shows that three algorithms achieve its guarantees: Optimistic Maximum Likelihood Estimation (OMLE), Explorative Estimation-to-Decisions (Explorative E2D), and Model-Based Optimistic Posterior Sampling (MOPS). The latter two are new for sample-efficient learning of POMDPs/PSRs.

### Specific contributions of the paper

1. **Proposing the B-stability condition**:
   - B-stability requires that the B-representation (the observable operators) of the PSR be bounded under an appropriate operator norm.
   - This condition covers most of the known tractable subclasses, such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs.
2. **Sample-efficient algorithm design**:
   - Any B-stable PSR can be learned by each of the three algorithms above with sample complexity polynomial in the relevant problem parameters. When instantiated in specific subclasses, these sample complexities improve substantially over the current best ones.
   - Among them, Explorative E2D and MOPS are new for sample-efficient learning of POMDPs/PSRs.
3. **Improving sample complexity**:
   - For regular PSRs and the known tractable subclasses, the paper improves on the current best results. For example, in m-step α_rev-revealing POMDPs, its algorithms find a near-optimal policy with fewer samples than previous bounds require.
4. **Full-policy model estimation**:
   - A variant of the E2D algorithm is designed for full-policy model estimation of B-stable PSRs, from which a reward-free learning guarantee is derived.
5. **Technical contributions**:
   - A new generalized ℓ2-type Eluder argument is combined with a fine-grained error decomposition for B-stable PSRs, which yields the sharper sample complexity bounds.

### Comparison with related work

- **POMDP learning**: The paper obtains better sample complexities than prior work across a range of structural conditions (such as revealing conditions and low-rank conditions).
- **PSR learning**: Compared with recent work, the paper covers a broader class of conditions and obtains better sample complexities through a sharper analysis.

In conclusion, by introducing the B-stability condition and analyzing three algorithms under it, the paper advances partially observable reinforcement learning on all three fronts listed above: a unified condition, sharper bounds, and a wider choice of algorithms.
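
To make the notion of observable operators (the B-representation mentioned above) more concrete, here is a minimal sketch for a small tabular POMDP. It is an illustration under stated assumptions, not the paper's algorithm. The sizes, the random model, and the names `B`, `prob_operator`, and `prob_forward` are all hypothetical. The sketch assumes a time-homogeneous POMDP whose emission matrix has full column rank, so that its pseudo-inverse is a left inverse. It uses the standard observable-operator construction for such POMDPs and checks it against the usual forward recursion over latent states. It does not compute the specific operator norm used in the B-stability definition.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
S, O, A, H = 2, 3, 2, 3  # hypothetical sizes: states, observations, actions, horizon


def random_stochastic(rows, cols):
    """Random column-stochastic matrix: every column is a probability vector."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=0, keepdims=True)


mu = random_stochastic(S, 1)[:, 0]               # initial state distribution
T = [random_stochastic(S, S) for _ in range(A)]  # T[a][s2, s1] = P(s2 | s1, a)
Em = random_stochastic(O, S)                     # Em[o, s] = P(o | s)
assert np.linalg.matrix_rank(Em) == S            # assumption: full column rank
Em_pinv = np.linalg.pinv(Em)                     # then Em_pinv @ Em = identity


def B(o, a):
    """Observable operator acting on vectors indexed by the next observation."""
    return Em @ T[a] @ np.diag(Em[o]) @ Em_pinv


def prob_operator(obs, acts):
    """P(o_1, ..., o_H | a_1, ..., a_{H-1}) as a product of observable operators."""
    q = Em @ mu                                  # distribution of the first observation
    for o, a in zip(obs[:-1], acts):
        q = B(o, a) @ q
    return float(q[obs[-1]])


def prob_forward(obs, acts):
    """Same probability via the forward recursion over latent states."""
    b = mu.copy()
    for o, a in zip(obs[:-1], acts):
        b = T[a] @ (Em[o] * b)
    return float(Em[obs[-1]] @ b)


acts = (0, 1)                                    # a fixed action sequence of length H - 1
total = sum(prob_operator(obs, acts)
            for obs in itertools.product(range(O), repeat=H))
print(f"sum over all observation sequences: {total:.6f}")
```

The operator form never refers to the latent state at prediction time. Every quantity is indexed by observations, which is what lets PSR-style analyses state their conditions directly on the operators.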