Abstract: Classical value-estimation reinforcement learning algorithms perform poorly in dynamic environments. Animal reinforcement learning, by contrast, is flexible: animals adapt quickly to dynamic environments and cope effectively with noisy inputs. One feature that may contribute to this performance is that animals learn and perceive the time to reward. In this research, we learn and perceive the time to reward and explore situations in which the learned time information can improve the performance of a learning agent in dynamic environments. We are interested in switching environments, which stay the same for a long time, change abruptly, and then hold for a long time before the next change. The dynamics we mainly focus on are changes in the time to reward, though we also extend the ideas to learning and perceiving other criteria of optimality, e.g. the discounted return, so that the methods still work when the amount of reward also changes. Specifically, both the mean and the variance of the time to reward are learned and then used to detect changes in the environment and to decide whether the agent should give up a suboptimal action. When a change in the environment is detected, the learning agent responds specifically to that change in order to recover from it quickly. When the agent finds that the current action is still worse than the optimal one, it abandons the current exploration of that action and remakes its decision, avoiding exploration that lasts longer than necessary.
Experiments on two real-world problems show that, compared with classical value-estimation reinforcement learning algorithms, these methods sped up learning, reduced the time taken to recover from environmental changes, and improved the agent's performance after learning converged in most of the test cases. In addition, we have used spiking neurons to implement various phenomena of classical conditioning, the simplest form of animal reinforcement learning in dynamic environments, and we point out a possible implementation of instrumental conditioning and general reinforcement learning using similar models.
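
The abstract states that the mean and variance of the time to reward are learned and then used both to detect environmental changes and to decide when to give up a suboptimal action, but it does not give the update or decision rules. The following is a minimal sketch of one way such a mechanism could look, not the method used in this work. The class `TimeToRewardMonitor`, the function `should_give_up`, the exponential-average update, the k-standard-deviation test, and the parameters `alpha`, `k`, and `min_std` are all assumptions introduced for illustration.

```python
import math


class TimeToRewardMonitor:
    """Running mean and variance of the time to reward for one action.

    Hypothetical helper: the exponential-average update and the
    k-standard-deviation test are illustrative assumptions, not the
    rules used in the source work.
    """

    def __init__(self, alpha=0.1, k=3.0, min_std=1.0):
        self.alpha = alpha      # step size for the running estimates
        self.k = k              # tolerance, in standard deviations
        self.min_std = min_std  # floor so a near-zero variance is not over-trusted
        self.mean = None        # no estimate until the first observation
        self.var = 0.0

    def _std(self):
        return max(math.sqrt(self.var), self.min_std)

    def threshold(self):
        """Longest wait still consistent with the learned estimates."""
        return self.mean + self.k * self._std()

    def is_change(self, elapsed):
        """True if an observed time to reward is more than k std from the mean."""
        if self.mean is None:
            return False
        return abs(elapsed - self.mean) > self.k * self._std()

    def update(self, elapsed):
        """Exponentially weighted update of the mean and variance."""
        if self.mean is None:
            self.mean = float(elapsed)
            return
        delta = elapsed - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)


def should_give_up(elapsed_so_far, best_monitor):
    """Abandon the current action once it has already taken longer than
    the best-known action plausibly would (an assumed decision rule)."""
    if best_monitor.mean is None:
        return False
    return elapsed_so_far > best_monitor.threshold()


# Usage: a stable phase with a time to reward of 10 steps, then checks.
monitor = TimeToRewardMonitor()
for _ in range(50):
    monitor.update(10)
change_detected = monitor.is_change(30)   # a much longer wait than learned
give_up = should_give_up(14, monitor)     # waited past the 13-step threshold
```

In this sketch, a detected change would be the cue for the agent to respond specifically to the switch, for example by resetting the affected estimates, while the give-up test bounds how long a single exploratory trial can run. How the agent actually responds in the source work is not specified in the abstract.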

Optimizing Agent Behavior over Long Time Scales by Transporting Value

Time‐in‐action RL

Credit Assignment: Challenges and Opportunities in Developing Human-like AI Agents

Short-term Memory Traces for Action Bias in Human Reinforcement Learning

Intention Beyond Desire: Commitment in Human Action

Attention or memory? Neurointerpretable agents in space and time

Competitive Multi-agent Deep Reinforcement Learning with Counterfactual Thinking

Reward is not Necessary: How to Create a Modular & Compositional Self-Preserving Agent for Life-Long Learning

Towards Practical Credit Assignment for Deep Reinforcement Learning

Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning

Optimizing the Long-Term Average Reward for Continuing MDPs: A Technical Report

Reinforcement Learning with Time Perception

A Survey of Temporal Credit Assignment in Deep Reinforcement Learning

Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks

Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis

Behavioral decision-making of mobile robots simulating the memory consolidation mechanism of human brain

Humans rationally balance detailed and temporally abstract world models

Episodic Memory for Learning Subjective-Timescale Models

Scale-invariant temporal history (SITH): optimal slicing of the past in an uncertain world

When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment

Episodic Future Thinking Mechanism for Multi-agent Reinforcement Learning