Abstract: A foundational machine-learning architecture is reinforcement learning, where an outstanding problem is achieving an optimal balance between exploration and exploitation. Specifically, exploration enables agents to discover optimal policies in unknown domains of the environment and thereby gain potentially large future rewards, while exploitation relies on already acquired knowledge to maximize immediate rewards. We articulate an approach to this problem, treating the dynamical process of reinforcement learning as a Markov decision process that can be modeled as a nondeterministic finite automaton and defining a subset of states in the automaton to represent the preference for exploring unknown domains of the environment. Exploration is prioritized by assigning higher transition probabilities to these states. We derive a mathematical framework that systematically balances exploration and exploitation by formulating the balance as a mixed-integer programming (MIP) problem, which optimizes the agent's actions and maximizes the discovery of novel preferential states. Solving the MIP problem provides a trade-off point between exploiting known states and exploring unexplored regions. We validate the framework computationally with a benchmark system and argue that the articulated automaton is effectively an adaptive network with a time-varying connection matrix, where the states of the automaton are the nodes and the transitions among the states are the edges. The network is adaptive because the transition probabilities evolve over time. The established connection between the adaptive automaton arising from reinforcement learning and the adaptive network opens the door to applying theories of complex dynamical networks to frontier problems in machine learning and artificial intelligence.
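
As a rough illustration of the idea in the abstract, and not the authors' formulation, the sketch below represents a small automaton as a row-stochastic transition matrix. Transitions into a set of preferential (still unexplored) states are multiplied by a factor and the row is renormalized, so those states receive higher transition probabilities. A state loses its preferential status once visited, which makes the effective matrix time-varying in the sense of an adaptive network. The names `boosted_row`, `step`, `BASE`, and the parameter `boost` are assumptions introduced here for illustration, and the MIP step is not modeled.

```python
import random


def boosted_row(row, preferred, boost):
    """Return a renormalized copy of `row` with preferred targets up-weighted."""
    weights = [p * boost if j in preferred else p for j, p in enumerate(row)]
    total = sum(weights)
    return [w / total for w in weights]


def step(state, matrix, preferred, boost, rng):
    """Sample the next state from the boosted row, then mark it as explored."""
    probs = boosted_row(matrix[state], preferred, boost)
    nxt = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    preferred.discard(nxt)  # a visited state is no longer preferential
    return nxt


# Hypothetical 4-state automaton with uniform base transition probabilities.
BASE = [[0.25] * 4 for _ in range(4)]

if __name__ == "__main__":
    rng = random.Random(0)
    unexplored = {2, 3}  # assumed set of preferential states
    s = 0
    for t in range(5):
        s = step(s, BASE, unexplored, boost=3.0, rng=rng)
        print(t, s, sorted(unexplored))
```

With `boost=3.0` and two preferential states, a uniform row of 0.25 becomes `[0.125, 0.125, 0.375, 0.375]`. In the paper's framework the trade-off point comes from solving the MIP problem; the fixed `boost` factor here is only a stand-in for that step.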

A Preference-based Reinforcement Learning Approach Using Reward Exploration for Decision Making

Reward Uncertainty for Exploration in Preference-based Reinforcement Learning

Preference-Guided Reinforcement Learning for Efficient Exploration

Weak Human Preference Supervision for Deep Reinforcement Learning

Reinforcement Learning from Diverse Human Preferences

Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

Single Trajectory Learning: Exploration Versus Exploitation

Hybrid Reinforcement Learning Based on Human Preference and Advice for Efficient Robot Skill Learning

Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

Models of human preference for learning reward functions

Playing games with reinforcement learning via perceiving orientation and exploring diversity

Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation

Rating-based Reinforcement Learning

Provable Reward-Agnostic Preference-Based Reinforcement Learning

Offline Reward Shaping with Scaling Human Preference Feedback for Deep Reinforcement Learning

Keep Various Trajectories: Promoting Exploration of Ensemble Policies in Continuous Control

Tell me why: Training preferences-based RL with human preferences and step-level explanations

Deep reinforcement learning from human preferences

Adaptive network approach to exploration-exploitation trade-off in reinforcement learning

Hindsight Preference Learning for Offline Preference-based Reinforcement Learning