Abstract: Partial monitoring games are repeated games in which the learner receives feedback that may differ from the adversary's move or even from the reward gained by the learner. Recently, a general model of combinatorial partial monitoring (CPM) games was proposed \cite{lincombinatorial2014}, where the learner's action space can be exponentially large and the adversary samples its moves from a bounded, continuous space according to a fixed distribution. That paper gave a confidence-bound-based algorithm (GCB) that achieves $O(T^{2/3}\log T)$ distribution-independent and $O(\log T)$ distribution-dependent regret bounds. The implementation of their algorithm depends on two separate offline oracles, and the distribution-dependent regret additionally requires the existence of a unique optimal action for the learner. Adopting their CPM model, our first contribution is a Phased Exploration with Greedy Exploitation (PEGE) algorithmic framework for the problem. Different algorithms within the framework achieve $O(T^{2/3}\sqrt{\log T})$ distribution-independent and $O(\log^2 T)$ distribution-dependent regret, respectively. Crucially, our framework needs only the simpler "argmax" oracle from GCB, and the distribution-dependent regret does not require the existence of a unique optimal action. Our second contribution is another algorithm, PEGE2, which combines gap estimation with a PEGE algorithm to achieve an $O(\log T)$ regret bound, matching the GCB guarantee but removing the dependence on the size of the learner's action space. However, like GCB, PEGE2 requires access to both offline oracles and the existence of a unique optimal action. Finally, we discuss how our algorithm can be efficiently applied to a CPM problem of practical interest: online ranking with feedback at the top.

Exploration Analysis in Finite-Horizon Turn-based Stochastic Games

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

The Uncertainty Bellman Equation and Exploration

Bounded Optimal Exploration in MDP

Success Probability of Exploration: a Concrete Analysis of Learning Efficiency

Phased Exploration with Greedy Exploitation in Stochastic Combinatorial Partial Monitoring Games

Temporal Induced Self-Play for Stochastic Bayesian Games

More Efficient Randomized Exploration for Reinforcement Learning via Approximate Sampling

Robust optimal policies for team Markov games

Optimal Control of Robust Team Stochastic Games

Best Action Selection In A Stochastic Environment

Playing Against Fair Adversaries in Stochastic Games with Total Rewards

On convergence rates of game theoretic reinforcement learning algorithms

Exploration for Free: How Does Reward Heterogeneity Improve Regret in Cooperative Multi-agent Bandits?

ApproxED: Approximate exploitability descent via learned best responses

Optimistic Thompson Sampling for No-Regret Learning in Unknown Games

A Unified Perspective on Deep Equilibrium Finding

Convergence to Nash Equilibrium and No-regret Guarantee in (Markov) Potential Games

Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions