Decision Confidence and Outcome Variability Optimally Regulate Separate Aspects of Hyperparameter Setting

Kobe Desender, Tom Verguts
DOI: https://doi.org/10.1101/2024.10.03.616475
2024-10-04
Abstract: Reinforcement learning models describe how agents learn about the world (value learning), and how they interact with their environment based on the learned information (decision policy). As in any optimization problem, it is important to set the hyperparameters of the process, and this setting is itself thought to be learned (meta-learning). Here, we test a key prediction of meta-learning frameworks, namely that there exist one or more meta-signals that govern hyperparameter setting. Specifically, we test whether decision confidence, in a context of varying outcome variability, informs hyperparameter setting. Participants performed a 2-armed bandit task with confidence ratings. Model comparison shows that confidence and outcome variability are differentially involved in hyperparameter setting. A high level of confidence in the previous choice decreased the setting of the decision-noise hyperparameter on the current trial: when a choice was made with low confidence, the choice on the next trial tended to be more explorative (i.e. high decision noise). Outcome variability influenced another hyperparameter, the learning rate for positive prediction errors (thus affecting value learning). Both strategies are rational approaches that maximize earnings at different temporal loci: the modulation by confidence causes more frequent exploration early after a change point, whereas the modulation by outcome variability is advantageous late after a change point. Finally, we show that (reported) confidence in value-based choices reflects the action value of the chosen option (irrespective of the unchosen value). In sum, decision confidence and outcome variability reflect distinct signals that optimally guide the setting of hyperparameters in decision policy and value learning, respectively.
Neuroscience
What problem does this paper attempt to address?
The paper asks whether decision confidence and outcome variability act as meta-signals that set hyperparameters in reinforcement learning, and thereby optimize decision policy and value learning. Participants performed a 2-armed bandit task with confidence ratings under varying outcome variability, which let the authors test how confidence in one choice shapes exploration and exploitation on the next trial. Model comparison indicates that high confidence in the previous choice lowers decision noise on the current trial, so that low-confidence choices tend to be followed by more exploratory ones. Outcome variability instead modulates a different hyperparameter, the learning rate for positive prediction errors, and thereby affects value learning. Both strategies maximize earnings, but at different times relative to an environmental change point: the modulation by confidence produces more frequent exploration early after a change point, whereas the modulation by outcome variability is advantageous late after a change point. Additionally, the study shows that reported confidence in value-based choices reflects the action value of the chosen option, irrespective of the value of the unchosen option. In summary, decision confidence and outcome variability are two distinct signals that optimally guide hyperparameter setting in decision policy and value learning, respectively.
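To make the two modulations concrete, the following is a minimal sketch of a 2-armed bandit learner in which previous-trial confidence adjusts decision noise and outcome variability adjusts the learning rate for positive prediction errors. It is not the authors' fitted model: the softmax policy, the linear mappings, every function name, and all parameter values (including the sign of `sd_slope`, which the abstract does not specify) are assumptions chosen for illustration only.

```python
import math
import random


def p_choose_first(q_values, inverse_temp):
    """Softmax probability of picking arm 0 in a 2-armed bandit."""
    diff = q_values[0] - q_values[1]
    return 1.0 / (1.0 + math.exp(-inverse_temp * diff))


def inverse_temperature(prev_confidence, base=1.0, conf_weight=4.0):
    """Assumed linear mapping: higher previous confidence gives a higher
    inverse temperature, i.e. less decision noise on the current trial."""
    return base + conf_weight * prev_confidence


def positive_learning_rate(outcome_sd, base=0.3, sd_slope=-0.5):
    """Assumed linear mapping from outcome variability to the learning rate
    for positive prediction errors, clipped to [0, 1]. The sign and size of
    sd_slope are placeholders, not values taken from the paper."""
    return min(1.0, max(0.0, base + sd_slope * outcome_sd))


def update_value(q, reward, alpha_pos, alpha_neg):
    """Delta-rule update with separate learning rates by prediction-error sign."""
    prediction_error = reward - q
    alpha = alpha_pos if prediction_error > 0 else alpha_neg
    return q + alpha * prediction_error


def run_block(means, outcome_sd, n_trials=50, seed=0):
    """Simulate one block; returns final values and (choice, reward, confidence) per trial."""
    rng = random.Random(seed)
    q = [0.5, 0.5]
    confidence = 0.5  # arbitrary starting confidence (assumption)
    history = []
    for _ in range(n_trials):
        # Decision policy: noise is set by confidence in the previous choice.
        beta = inverse_temperature(confidence)
        choice = 0 if rng.random() < p_choose_first(q, beta) else 1
        reward = rng.gauss(means[choice], outcome_sd)
        # Confidence tracks the chosen option's value only (clipped to [0, 1]).
        confidence = min(1.0, max(0.0, q[choice]))
        # Value learning: positive-PE learning rate depends on outcome variability.
        alpha_pos = positive_learning_rate(outcome_sd)
        q[choice] = update_value(q[choice], reward, alpha_pos, alpha_neg=0.3)
        history.append((choice, reward, confidence))
    return q, history
```

Calling `run_block(means=[0.7, 0.3], outcome_sd=0.1)` simulates one block under these assumptions. Swapping the two entries of `means` partway through a longer simulation would mimic a change point, the setting in which the abstract reports that the two modulations pay off at different times.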