DROP: Conservative Model-based Optimization for Offline Reinforcement Learning
Jinxin Liu,Hongyin Zhang,Zifeng Zhuang,Yachen Kang,Donglin Wang,Bin Wang,Jianye HAO
2023-01-01
Abstract:In this work, we decouple the iterative (bi-level) offline RL optimization from the offline training phase, forming a non-iterative bi-level learning paradigm that avoids the iterative error propagation over two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization in training (ie, employing policy/value regularization), while performing outer-level optimization in testing (ie, conducting policy inference). Naturally, such paradigm raises three core questions (that are not fully answered by prior non-iterative methods): (Q1) What information should we transfer from inner-level to outer-level? (Q2) What should we pay attention to when using the transferred information in outer-level optimization? (Q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization, we proposed DROP, which fully answered the above three questions. Particularly, in inner-level, DROP decomposes offline data into multiple subsets, and learns a score model (Q1). To keep safe exploitation to score model in outer-level, we explicitly learn a behavior embedding and introduce a conservative regularization (Q2). During testing, we show that DROP permits deployment adaptation, enabling an adaptive inference across states (Q3). Empirically, we evaluate DROP on various benchmarks, showing that DROP gains comparable or better performance compared to prior offline RL methods.