Abstract: Artificial intelligence (AI), and especially reinforcement learning (RL), has the potential to enable agents to learn and perform tasks autonomously with superhuman performance. However, we consider RL to be fundamentally a Human-in-the-Loop (HITL) paradigm, even when an agent eventually performs its task autonomously. HITL approaches are considered particularly advantageous in cases where the reward function is challenging or impossible to define. The application of Reinforcement Learning from Human Feedback (RLHF) in systems such as ChatGPT demonstrates the effectiveness of optimizing for user experience and integrating user feedback into the training loop. In HITL RL, human input is integrated during the agent's learning process, allowing iterative updates and fine-tuning based on human feedback and thus enhancing the agent's performance. Since the human is an essential part of this process, we argue that human-centric approaches are the key to successful RL, a point that has not been adequately considered in the existing literature. This paper aims to inform readers about current explainability methods in HITL RL. It also shows how the application of explainable AI (xAI) and specific improvements to existing explainability approaches can enable better human-agent interaction in HITL RL for all types of users, whether lay people, domain experts, or machine learning specialists. Accounting for the workflow in HITL RL and building on software and machine learning methodologies, this article identifies four phases of human involvement in creating HITL RL systems: (1) Agent Development, (2) Agent Learning, (3) Agent Evaluation, and (4) Agent Deployment. For each phase, we highlight human involvement, explanation requirements, new challenges, and goals. Furthermore, we identify low-risk, high-return opportunities for explainability research in HITL RL and present long-term research goals to advance the field. Finally, we propose a vision of human-robot collaboration that allows both parties to reach their full potential and cooperate effectively.

Human-in-the-Loop Reinforcement Learning: A Survey and Position on Requirements, Challenges, and Opportunities
