Abstract: While reinforcement learning (RL) algorithms have been successfully applied to numerous tasks, their reliance on neural networks makes their behavior difficult to understand and trust. Counterfactual explanations are human-friendly explanations that offer users actionable advice on how to alter the model inputs to achieve the desired output from a black-box system. However, current approaches to generating counterfactuals in RL ignore the stochastic and sequential nature of RL tasks and can produce counterfactuals that are difficult to obtain or do not deliver the desired outcome. In this work, we propose RACCER, the first RL-specific approach to generating counterfactual explanations for the behavior of RL agents. We first propose and implement a set of RL-specific counterfactual properties that ensure easily reachable counterfactuals with highly probable desired outcomes. We use a heuristic tree search of the agent's execution trajectories to find the most suitable counterfactuals based on the defined properties. We evaluate RACCER in two tasks as well as conduct a user study to show that RL-specific counterfactuals help users better understand agents' behavior compared to the current state-of-the-art approaches.

What problem does this paper attempt to address?

The paper asks how to generate interpretable and feasible counterfactual explanations for reinforcement learning (RL) agents, so that users can better understand and trust their behavior. Existing counterfactual explanation methods have three shortcomings when applied to RL tasks:

1. **Ignoring sequentiality and stochasticity**: Existing methods do not account for the sequential nature of RL tasks or the randomness of the environment, so the counterfactuals they generate may be difficult to obtain or may not deliver the desired outcome.
2. **Insufficient feature similarity**: In RL, two states with similar features may still be far apart in terms of execution. Methods that rely solely on feature similarity may therefore generate infeasible counterfactual explanations.
3. **Lack of specificity**: Existing methods do not distinguish between counterfactual explanations about past causes and those about future actions, which makes the explanations less specific and less useful.

To address these problems, the paper proposes RACCER (Reachable and Certain Counterfactual Explanations for Reinforcement Learning), the first counterfactual explanation method designed specifically for RL. RACCER introduces three RL-specific counterfactual properties (reachability, stochastic certainty, and fidelity) so that the generated counterfactuals are both easy to reach and highly likely to produce the desired outcome.

### Main contributions

1. **Proposing three RL-specific counterfactual properties**: reachability, stochastic certainty, and fidelity, together with evaluation metrics for these properties.
2. **Designing the RACCER algorithm**: The algorithm generates RL-specific counterfactual explanations based on the above properties, can be applied to any RL model, and does not require access to the model's internal parameters.
3. **User study**: User experiments show that the counterfactual explanations generated by RACCER help users understand the behavior of RL agents better than those of existing methods.

### Summary of mathematical formulas

- **Reachability**: \[ R(x, A) = \text{len}(A) \] where \( R(x, A) \) is the length of the action sequence \( A \) required to move from state \( x \) to the counterfactual state \( x' \).
- **Fidelity**: \[ F(x, A) = 1 - \prod_{a \in A} \text{softmax}(Q(x, a))[a] \] where \( Q(x, a) \) is the Q-value of taking action \( a \) in state \( x \), and the product runs over the actions \( a \) in the sequence \( A \).
- **Stochastic Certainty**: \[ S(x, A, a') = 1 - P[M(x') = a' \mid x' = A(x)] \] where \( A(x) \) is the state obtained after applying the action sequence \( A \) to state \( x \), \( M(x') \) is the action the agent chooses in \( x' \), and \( a' \) is the desired action.
- **Loss Function**: \[ L(x, A, a') = \alpha R(x, A) + \beta F(x, A) + \gamma S(x, A, a') \] where \( \alpha \), \( \beta \), and \( \gamma \) are parameters that control the importance of the different properties.

With these properties, RACCER aims to generate counterfactual explanations that users can act on in practice while remaining valid as explanations.
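The formulas summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the `step` and `policy` callables, and the Monte Carlo estimate of the probability in the stochastic certainty term are all hypothetical. The sketch also assumes that the Q-values in the fidelity term are supplied as one vector per step of the action sequence, which the summary's notation does not make explicit.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def reachability(actions):
    """R(x, A) = len(A): number of actions needed to reach the counterfactual."""
    return len(actions)


def fidelity(q_values_per_step, actions):
    """F = 1 - product over steps of softmax(Q)[a].

    Assumption: q_values_per_step[i] holds the Q-values of all actions
    at step i, and actions[i] is the index of the action taken there.
    """
    prob = 1.0
    for q_values, a in zip(q_values_per_step, actions):
        prob *= softmax(q_values)[a]
    return 1.0 - prob


def stochastic_certainty(state, actions, target_action, step, policy,
                         n_samples=200, seed=0):
    """S = 1 - P[M(x') = a' | x' = A(x)], estimated by sampling.

    Assumption: step(state, action, rng) is a hypothetical (possibly
    stochastic) transition function and policy(state) returns the
    agent's chosen action.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        s = state
        for a in actions:
            s = step(s, a, rng)
        hits += policy(s) == target_action
    return 1.0 - hits / n_samples


def raccer_loss(r, f, s, alpha=1.0, beta=1.0, gamma=1.0):
    """L = alpha * R + beta * F + gamma * S (lower is treated as better)."""
    return alpha * r + beta * f + gamma * s


# Toy example: a deterministic 1-D chain where action 1 moves right.
def toy_step(s, a, rng):
    return s + (1 if a == 1 else -1)


def toy_policy(s):
    return 1 if s >= 2 else 0


actions = [1, 1]
r = reachability(actions)
f = fidelity([[0.0, 0.0], [0.0, 0.0]], actions)
s = stochastic_certainty(0, actions, 1, toy_step, toy_policy)
print(r, f, s, raccer_loss(r, f, s))
```

In this sketch all three terms are treated as costs, so a candidate action sequence with a lower loss would be preferred. The weights `alpha`, `beta`, and `gamma` default to 1.0 purely for illustration; the summary does not state which values are used.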

RACCER: Towards Reachable and Certain Counterfactual Explanations for Reinforcement Learning

Redefining Counterfactual Explanations for Reinforcement Learning: Overview, Challenges and Opportunities

Counterfactual State Explanations for Reinforcement Learning Agents via Generative Deep Learning

ACTER: Diverse and Actionable Counterfactual Sequences for Explaining and Diagnosing RL Policies

Counterfactual Explanation Policies in RL

Experiential Explanations for Reinforcement Learning

Explain the Explainer: Interpreting Model-Agnostic Counterfactual Explanations of a Deep Reinforcement Learning Agent

Counterfactual Explainer Framework for Deep Reinforcement Learning Models Using Policy Distillation

Counterfactual Explanations via Locally-guided Sequential Algorithmic Recourse

SAFE-RL: Saliency-Aware Counterfactual Explainer for Deep Reinforcement Learning Policies

Counterfactual Explanations and Algorithmic Recourses for Machine Learning: A Review

Contrastive Explanations for Reinforcement Learning in terms of Expected Consequences

Reinforced Path Reasoning for Counterfactual Explainable Recommendation

Explaining Reinforcement Learning Agents Through Counterfactual Action Outcomes

RICE: Breaking Through the Training Bottlenecks of Reinforcement Learning with Explanation

Causal Counterfactuals for Improving the Robustness of Reinforcement Learning

Learning impartial policies for sequential counterfactual explanations using Deep Reinforcement Learning

A Survey on Explainable Reinforcement Learning: Concepts, Algorithms, Challenges

Counterfactual explanations and how to find them: literature review and benchmarking

Advantage Actor-Critic with Reasoner: Explaining the Agent's Behavior from an Exploratory Perspective

Explaining RL Decisions with Trajectories