Dynamical model of salience gated working memory, action selection and reinforcement based on basal ganglia and dopamine feedback
Adam Ponzi
DOI: https://doi.org/10.1016/j.neunet.2007.12.040
Abstract:A simple working memory model based on recurrent network activation is proposed and its application to selection and reinforcement of an action is demonstrated as a solution to the temporal credit assignment problem. Reactivation of recent salient cue states is generated and maintained as a type of salience gated recurrently active working memory, while lower salience distractors are ignored. Cue reactivation during the action selection period allows the cue to select an action while its reactivation at the reward period allows the reinforcement of the action selected by the reactivated state, which is necessarily the action which led to the reward being found. A down-gating of the external input during the reactivation and maintenance prevents interference. A double winner-take-all system which selects only one cue and only one action allows the targeting of the cue-action allocation to be modified. This targeting works both to reinforce a correct cue-action allocation and to punish the allocation when cue-action allocations change. Here we suggest a firing rate neural network implementation of this system based on the basal ganglia anatomy with input from a cortical association layer where reactivations are generated by signals from the thalamus. Striatum medium spiny neurons represent actions. Auto-catalytic feedback from a dopamine reward signal modulates three-way Hebbian long term potentiation and depression at the cortical-striatal synapses which represent the cue-action associations. The model is illustrated by the numerical simulations of a simple example--that of associating a cue signal to a correct action to obtain reward after a delay period, typical of primate cue reward tasks. Through learning, the model shows a transition from an exploratory phase where actions are generated randomly, to a stable directed phase where the animal always chooses the correct action for each experienced state. When cue-action allocations change, we show that this is noticed by the model, the incorrect cue-action allocations are punished and the correct ones discovered.
What problem does this paper attempt to address?