Trajectory Sampling Value Iteration: Improved Dyna Search for MDPs

Yicheng Zhou, Quan Liu, Qiming Fu, Zongzhang Zhang
DOI: https://doi.org/10.5555/2772879.2773385
2015-01-01
Abstract: Traditional online learning algorithms often suffer from slow convergence and limited accuracy. The Dyna-2 framework, which combines learning with search, offers a way to alleviate this problem. Its main idea is to run a simulation-based search that helps the learning process select better actions. The search relies on a simulated model of the environment that is built during learning. However, Dyna-2 does not fully exploit this model. To improve solution quality, this paper applies value iteration, a model-based dynamic programming algorithm, to the search process using a trajectory sampling approach (DynaTSVI). Trajectory sampling reduces the high time complexity that dynamic programming would otherwise incur. Experimentally, we use the Dyna Maze and Windy Grid World tasks to analyze the proposed method from several aspects. Our results show that DynaTSVI outperforms Dyna-2 in both deterministic and stochastic environments in terms of convergence rate and accuracy.
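To make the core idea concrete, the following is a minimal sketch of value iteration restricted to sampled trajectories. It is not the authors' DynaTSVI implementation: the abstract gives no pseudocode, so the function names (`build_chain_model`, `trajectory_sampling_vi`), the deterministic five-state chain, the epsilon-greedy rollout policy, and all parameter values are assumptions chosen for illustration. The model is taken as already learned, and Bellman optimality backups are applied only to the states visited along simulated trajectories instead of sweeping the whole state space.

```python
import random

def build_chain_model(n):
    # Hypothetical learned model: model[s][a] = (reward, next_state).
    # Deterministic chain 0..n-1; reaching state n-1 yields reward 1.
    model = {}
    for s in range(n - 1):
        left, right = max(s - 1, 0), s + 1
        model[s] = {
            "left": (0.0, left),
            "right": (1.0 if right == n - 1 else 0.0, right),
        }
    return model

def greedy_action(model, V, s, gamma, rng):
    # One-step lookahead through the model; ties are broken at random.
    q = {a: r + gamma * V[s2] for a, (r, s2) in model[s].items()}
    best = max(q.values())
    return rng.choice([a for a, v in q.items() if v == best])

def trajectory_sampling_vi(model, V, start, terminal, gamma=0.95,
                           epsilon=0.1, n_trajectories=50, max_steps=100,
                           rng=None):
    rng = rng or random.Random(0)
    for _ in range(n_trajectories):
        s = start
        for _ in range(max_steps):
            if s in terminal:
                break
            # Bellman optimality backup on the visited state only.
            V[s] = max(r + gamma * V[s2] for r, s2 in model[s].values())
            # Epsilon-greedy step through the simulated model.
            if rng.random() < epsilon:
                a = rng.choice(list(model[s]))
            else:
                a = greedy_action(model, V, s, gamma, rng)
            s = model[s][a][1]
    return V

model = build_chain_model(5)
V = trajectory_sampling_vi(model, {s: 0.0 for s in range(5)},
                           start=0, terminal={4})
print(V)
```

Because backups are confined to states that the simulated trajectories actually reach, the cost per search scales with trajectory length and not with the size of the state space, which is the motivation the abstract gives for trajectory sampling. A stochastic model would need an expectation over next states in the backup, which this sketch omits.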