Abstract: Dealing with Partially Observable Markov Decision Processes (POMDPs) is notoriously challenging. We consider an average-reward, infinite-horizon POMDP setting with an unknown transition model and a known observation model. Under this assumption, we propose the Observation-Aware Spectral (OAS) estimation technique, which learns the POMDP parameters from samples collected with a belief-based policy. We then propose the OAS-UCRL algorithm, which implicitly balances the exploration-exploitation trade-off by following the $\textit{optimism in the face of uncertainty}$ principle. The algorithm runs in episodes of increasing length. In each episode, the optimal belief-based policy of the estimated POMDP interacts with the environment and collects samples, which the OAS estimation procedure uses in the next episode to compute a new estimate of the POMDP parameters. Given the estimated model, an optimization oracle computes the new optimal policy. We show the consistency of the OAS procedure and prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm. We compare against an oracle playing the optimal stochastic belief-based policy and show that our approach scales efficiently with the dimensionality of the state, action, and observation spaces. Finally, we conduct numerical simulations to validate the proposed technique and compare it with baseline approaches.
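The belief-based policies mentioned in the abstract act on a belief, that is, a probability distribution over hidden states that is updated after every action and observation. As a minimal sketch of that standard Bayes-filter update (not the OAS estimator or the OAS-UCRL algorithm themselves), assume a tabular POMDP where `transition[a, s, s']` stands for the current estimate of the transition model and `observation_model[s', o]` for the known observation model. The function name, the array layouts, the state-only dependence of the observations, and the numbers in the example are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def belief_update(belief, action, observation, transition, observation_model):
    """One Bayes-filter step for a tabular POMDP (illustrative sketch).

    belief:            shape (S,), current distribution over hidden states
    transition:        shape (A, S, S), transition[a, s, s2] = P(s2 | s, a)
                       (assumed layout; in a learning setting this is an estimate)
    observation_model: shape (S, O), observation_model[s2, o] = P(o | s2)
                       (assumed known and dependent on the state only)
    """
    # Prediction step: push the belief through the transition model.
    predicted = belief @ transition[action]
    # Correction step: weight each next state by the observation likelihood.
    unnormalized = observation_model[:, observation] * predicted
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total


# Hypothetical two-state, one-action, two-observation example.
transition = np.array([[[0.9, 0.1],
                        [0.2, 0.8]]])
observation_model = np.array([[0.8, 0.2],
                              [0.3, 0.7]])
prior = np.array([0.5, 0.5])
posterior = belief_update(prior, action=0, observation=0,
                          transition=transition,
                          observation_model=observation_model)
```

With a known observation model, only `transition` would need to be re-estimated between episodes; the update rule itself stays the same.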

Partially Observable Markov Decision Processes with Reward Information

What should be observed for optimal reward in POMDPs?

Recursively-Constrained Partially Observable Markov Decision Processes

Finding Optimal Memoryless Policies of POMDPs under the Expected Average Reward Criterion

Distributionally Robust Partially Observable Markov Decision Process with Moment-based Ambiguity

Explainable Finite-Memory Policies for Partially Observable Markov Decision Processes

Learning Interpretable Policies in Hindsight-Observable POMDPs through Partially Supervised Reinforcement Learning

Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight

ODE-based Recurrent Model-free Reinforcement Learning for POMDPs

Sublinear Regret for Learning POMDPs

Partially Observable Markov Decision Processes in Robotics: A Survey

Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting

End-to-End Policy Gradient Method for POMDPs and Explainable Agents

Robust Reward Design for Markov Decision Processes

Qualitative Analysis of Partially-observable Markov Decision Processes

Robust Reinforcement Learning in POMDPs with Incomplete and Noisy Observations

Robust Action Selection in Partially Observable Markov Decision Processes with Model Uncertainty

Partially Observable Markov Decision Processes and Performance Sensitivity Analysis

Provably Efficient Partially Observable Risk-Sensitive Reinforcement Learning with Hindsight Observation

Reinforcement learning algorithm for partially observable Markov decision problems

Provable Representation with Efficient Planning for Partial Observable Reinforcement Learning