Abstract: In decision-making scenarios, \textit{reasoning} can be viewed as an algorithm $P$ that chooses an action $a^* \in \mathcal{A}$ with the aim of optimizing some outcome, such as maximizing the value function of a Markov decision process (MDP). However, executing $P$ itself incurs costs (time, energy, limited capacity, etc.), which must be weighed against the explicit utility obtained by making the choice in the underlying decision problem. Because all physical systems face resource constraints, these costs must be taken into account both to model human behavior accurately and to optimize AI planning. Finding the right $P$ can itself be framed as an optimization problem over the space of reasoning processes, generally referred to as \textit{metareasoning}. Conventionally, human metareasoning models assume that the agent knows the transition and reward distributions of the underlying MDP. This paper generalizes such models by proposing a meta Bayes-Adaptive MDP (meta-BAMDP) framework to handle metareasoning in environments with unknown reward/transition distributions, which encompasses a far larger and more realistic set of planning problems that humans and AI systems face. As a first step, we apply the framework to two-armed Bernoulli bandit (TABB) tasks, which have often been used to study human decision making. Owing to the meta problem's complexity, our solutions are necessarily approximate, but they are nevertheless robust within a range of assumptions that are arguably realistic for human decision-making scenarios. These results offer a normative framework for understanding human exploration under cognitive constraints. This integration of Bayesian adaptive strategies with metareasoning enriches both the theoretical landscape of decision-making research and practical applications in designing AI systems that plan under uncertainty and resource constraints.
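
To make the object-level problem concrete, the following is a minimal sketch of the Bayes-adaptive value of a two-armed Bernoulli bandit, computed by exhaustive finite-horizon dynamic programming over Beta posterior counts. It is not the paper's meta-BAMDP method: it assumes cost-free computation, independent Beta priors on each arm, and an undiscounted finite horizon, and the names `bamdp_value`, `horizon`, and `prior` are hypothetical and chosen for illustration only. A metareasoning model would additionally charge for the planning that this recursion performs for free.

```python
from functools import lru_cache


def bamdp_value(horizon, prior=(1, 1, 1, 1)):
    """Bayes-optimal expected total reward of a two-armed Bernoulli bandit.

    Assumptions (illustrative, not taken from the paper):
    - prior = (a1, b1, a2, b2) are Beta(a, b) parameters for arms 1 and 2;
    - rewards are 0 or 1, undiscounted, over `horizon` pulls;
    - planning itself is free (no metareasoning cost is modeled).
    """

    @lru_cache(maxsize=None)
    def value(a1, b1, a2, b2, steps_left):
        # No pulls left: nothing more can be earned.
        if steps_left == 0:
            return 0.0

        # Arm 1: posterior mean success probability, then Bayesian update.
        p1 = a1 / (a1 + b1)
        q1 = p1 * (1.0 + value(a1 + 1, b1, a2, b2, steps_left - 1)) \
            + (1.0 - p1) * value(a1, b1 + 1, a2, b2, steps_left - 1)

        # Arm 2: same computation on the second arm's counts.
        p2 = a2 / (a2 + b2)
        q2 = p2 * (1.0 + value(a1, b1, a2 + 1, b2, steps_left - 1)) \
            + (1.0 - p2) * value(a1, b1, a2, b2 + 1, steps_left - 1)

        # The Bayes-optimal agent picks the arm with the larger Q-value.
        return max(q1, q2)

    return value(*prior, horizon)


if __name__ == "__main__":
    for h in (1, 2, 5, 10):
        print(h, bamdp_value(h))
```

The belief state here is just the four posterior counts, so the number of reachable states grows polynomially in the horizon for this small task. Under the stated assumptions, the gap between `bamdp_value(h)` and the myopic baseline `0.5 * h` reflects the value of information gained by exploring, which is the quantity that costly reasoning would have to be traded off against.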

Bandit Models of Human Behavior: Reward Processing in Mental Disorders

Unified Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL

Model-based reward prediction in the primate prefrontal cortex

Bad Values but Good Behavior: Learning Highly Misspecified Bandits and MDPs

Model of a striatal circuit exploring biological mechanisms underlying decision-making during normal and disordered states

A Semiparametric Inverse Reinforcement Learning Approach to Characterize Decision Making for Mental Disorders

HMM for Discovering Decision-Making Dynamics Using Reinforcement Learning Experiments

Partially Observable Contextual Bandits with Linear Payoffs

Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges

Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive Clinical Trials

On the Sensitivity of Reward Inference to Misspecified Human Models

Risk-Averse Biased Human Policies in Assistive Multi-Armed Bandit Settings

More widespread and rigid neuronal representation of reward expectation underlies impulsive choices

Modeling sensory-motor decisions in natural behavior

Metareasoning in uncertain environments: a meta-BAMDP framework

Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs

Non-stationary Bandits with Habituation and Recovery Dynamics and Knapsack Constraints

Contextual Multi-armed Bandit Algorithm for Semiparametric Reward Model

Thompson Sampling in Partially Observable Contextual Bandits

A General Framework for Bandit Problems Beyond Cumulative Objectives

Neural Basis of Reward Anticipation and Its Genetic Determinants