Abstract: When facing an unfamiliar environment, animals need to explore to gain new knowledge about which actions provide reward, but also put the newly acquired knowledge to use as quickly as possible. Optimal reinforcement learning strategies should therefore assess the uncertainties of these action–reward associations and utilise them to inform decision making. We propose a novel model whereby direct and indirect striatal pathways act together to estimate both the mean and variance of reward distributions, and mesolimbic dopaminergic neurons provide transient novelty signals, facilitating effective uncertainty-driven exploration. We utilised electrophysiological recording data to verify our model of the basal ganglia, and we fitted exploration strategies derived from the neural model to data from behavioural experiments. In simulation, we also compared the performance of directed exploration strategies inspired by our basal ganglia model with that of other exploration algorithms, including classic variants of the upper confidence bound (UCB) strategy. The exploration strategies inspired by the basal ganglia model can achieve superior overall performance in simulation, and fitting the model to behavioural data gave results qualitatively similar to those obtained by fitting more idealised normative models with less implementation-level detail. Overall, our results suggest that transient dopamine levels in the basal ganglia that encode novelty could contribute to an uncertainty representation which efficiently drives exploration in reinforcement learning.

Humans and other animals learn from rewards and losses resulting from their actions to maximise their chances of survival. In many cases, a trial-and-error process is necessary to determine the most rewarding action in a certain context. During this process, deciding how much resource to allocate to acquiring information ("exploration") and how much to allocate to using existing information to maximise reward ("exploitation") is key to overall effectiveness, i.e., the maximisation of total reward obtained with a certain amount of effort. We propose a theory whereby an area within the mammalian brain called the basal ganglia integrates current knowledge about the mean reward, reward uncertainty and novelty of an action in order to implement an algorithm which optimally allocates resources between exploration and exploitation. We verify our theory using behavioural experiments and electrophysiological recording, and show in simulations that the model also achieves good performance in comparison with established benchmark algorithms.
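
For readers unfamiliar with the benchmark class mentioned above, the following is a minimal sketch of the textbook UCB1 rule on a Gaussian multi-armed bandit. It is not the basal ganglia model proposed here, and it does not reproduce the simulations reported in the paper. The function names (`ucb1_choose`, `run_bandit`), the exploration constant `c`, and the reward distribution are all assumptions chosen for illustration.

```python
import math
import random


def ucb1_choose(counts, means, t, c=2.0):
    """Pick an arm with the classic UCB1 score: mean + sqrt(c * ln(t) / n).

    counts[i] is how often arm i was played, means[i] its running mean reward,
    and t the current trial number (1-based). c is an assumed constant.
    """
    # Play every arm once before relying on the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(true_means, n_trials=2000, reward_sd=1.0, seed=0):
    """Simulate UCB1 on a bandit with Gaussian rewards (illustrative setup)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k   # number of times each arm was chosen
    means = [0.0] * k  # running estimate of each arm's mean reward
    total = 0.0
    for t in range(1, n_trials + 1):
        arm = ucb1_choose(counts, means, t)
        reward = rng.gauss(true_means[arm], reward_sd)
        counts[arm] += 1
        # Incremental update of the sample mean.
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return counts, means, total


if __name__ == "__main__":
    counts, means, total = run_bandit([0.0, 0.5, 1.0])
    print("plays per arm:", counts)
    print("estimated means:", [round(m, 2) for m in means])
```

The uncertainty bonus shrinks as an arm is sampled more often, so rarely tried actions are favoured early and the best-estimated action dominates later. This is the exploration and exploitation trade-off described in the summary above, in its simplest algorithmic form.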

How cortico-basal ganglia-thalamic subnetworks can shift decision policies to maximize reward rate

Competing neural representations of choice shape evidence accumulation in humans

Neural circuit models for evidence accumulation through choice-selective sequences

Policy adjustment in a dynamic economic game

Reward Bases: A simple mechanism for adaptive acquisition of multiple reward types

Transient dopamine response on medium spiny neuron subtypes in switching approach-avoidance outcomes against action bias - A framework for exploration in action selection

Decomposed frontal corticostriatal ensemble activity changes across trials, revealing distinct features relevant to outcome-based decision making

Dopamine encoding of novelty facilitates efficient uncertainty-driven exploration

Dynamical model of salience gated working memory, action selection and reinforcement based on basal ganglia and dopamine feedback

Basal ganglia role in learning rewarded actions and executing previously learned choices: Healthy and diseased states

A Computational Theory of Learning Flexible Reward-Seeking Behavior with Place Cells

The Dopaminergic Midbrain Encodes the Expected Certainty about Desired Outcomes

Action-modulated midbrain dopamine activity arises from distributed control policies

The hippocampal-striatal circuit for goal-directed and habitual choice

Dopamine-independent effect of rewards on choices through hidden-state inference

Synaptic and Spiking Dynamics Underlying Reward Reversal in the Orbitofrontal Cortex.

Mechanisms of Hierarchical Reinforcement Learning in Corticostriatal Circuits 1: Computational Analysis

Ventral tegmental area dopamine neural activity switches simultaneously with rule representations in the prefrontal cortex and hippocampus

How Instructed Knowledge Modulates the Neural Systems of Reward Learning

Neural Representations of Post-Decision Accuracy and Reward Expectation in the Caudate Nucleus and Frontal Eye Field

Optimization of decision making in multilayer networks: the role of locus coeruleus