Gain-based Exploration: from Multi-armed Bandits to Partially Observable Environments.

Bailu Si, J. Michael Herrmann, Klaus Pawelzik
DOI: https://doi.org/10.1109/icnc.2007.395
2007-01-01
Abstract: We introduce gain-based policies for exploration in active learning problems. For exploration in multi-armed bandits where the reward variances are known, we describe an ideal gain-maximization exploration policy within a unified framework that also includes error-based and counter-based exploration. For realistic situations without prior knowledge of the reward variances, we establish an upper bound on the gain function, which yields a realistic gain-maximization exploration policy that achieves optimal exploration asymptotically. Finally, we extend the gain-maximization exploration scheme to partially observable environments. Approximating the environment by a set of local bandits, the agent actively selects its actions by maximizing the discounted gain in learning the local bandits. The resulting gain-based exploration not only outperforms random exploration, but also produces curiosity-driven behavior of the kind observed in natural agents.
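To make the idea of gain-maximizing exploration with known reward variances concrete, here is a minimal sketch. The abstract does not define the gain function, so the code assumes one: the expected reduction in the variance of an arm's sample-mean estimate from one additional pull, sigma^2/n - sigma^2/(n+1) = sigma^2/(n(n+1)). The names `gain` and `explore`, the Gaussian zero-mean rewards, and the single initial pull per arm are all illustrative assumptions, not the paper's definitions.

```python
import random


def gain(variance, count):
    """Assumed gain: drop in the variance of a sample mean after one more pull.

    With known reward variance sigma^2 and n pulls so far, the sample mean has
    variance sigma^2 / n, so one more pull reduces it by sigma^2 / (n * (n + 1)).
    """
    return variance / (count * (count + 1))


def explore(variances, steps, seed=0):
    """Greedy exploration that always pulls the arm with the largest assumed gain.

    Rewards are drawn from zero-mean Gaussians (an illustrative assumption).
    Returns the per-arm pull counts and the running mean estimates.
    """
    rng = random.Random(seed)
    k = len(variances)
    # Pull each arm once so every count is at least 1.
    counts = [1] * k
    means = [rng.gauss(0.0, v ** 0.5) for v in variances]
    for _ in range(steps):
        arm = max(range(k), key=lambda a: gain(variances[a], counts[a]))
        reward = rng.gauss(0.0, variances[arm] ** 0.5)
        counts[arm] += 1
        # Incremental update of the sample mean.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


if __name__ == "__main__":
    counts, means = explore([1.0, 4.0], steps=300)
    print("pull counts:", counts)
    print("mean estimates:", means)
```

Under this assumed gain, the greedy rule spends more pulls on noisier arms, and the pull counts end up roughly proportional to each arm's standard deviation. A purely counter-based rule, by contrast, would ignore the variances and spread pulls evenly.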