Abstract: While reinforcement learning (RL) algorithms have been successfully applied to numerous tasks, their reliance on neural networks makes their behavior difficult to understand and trust. Counterfactual explanations are human-friendly explanations that offer users actionable advice on how to alter the model inputs to achieve the desired output from a black-box system. However, current approaches to generating counterfactuals in RL ignore the stochastic and sequential nature of RL tasks and can produce counterfactuals that are difficult to obtain or do not deliver the desired outcome. In this work, we propose RACCER, the first RL-specific approach to generating counterfactual explanations for the behavior of RL agents. We first propose and implement a set of RL-specific counterfactual properties that ensure easily reachable counterfactuals with highly probable desired outcomes. We use a heuristic tree search of the agent's execution trajectories to find the most suitable counterfactuals based on the defined properties. We evaluate RACCER in two tasks as well as conduct a user study to show that RL-specific counterfactuals help users better understand agents' behavior compared to the current state-of-the-art approaches.
What problem does this paper attempt to address?
This paper addresses how to generate interpretable and feasible counterfactual explanations for reinforcement learning (RL) agents, so that users can better understand and trust their behavior. Existing counterfactual explanation methods have the following shortcomings when applied to RL tasks:
1. **Ignoring sequentiality and stochasticity**: Existing methods do not account for the sequential nature of RL tasks or the stochasticity of the environment, so the counterfactuals they generate may be difficult to obtain or may not deliver the desired outcome.
2. **Feature similarity is insufficient**: In RL, two states can have similar features yet be far apart in terms of execution. Methods that rely solely on feature similarity may therefore generate infeasible counterfactuals.
3. **Lack of specificity**: Existing methods do not distinguish between counterfactuals that explain past causes and those that advise on future actions, which makes the explanations less specific and less useful.
To solve these problems, the paper proposes RACCER (Reachable and Certain Counterfactual Explanations for Reinforcement Learning), the first counterfactual explanation method designed specifically for RL. RACCER introduces three RL-specific counterfactual properties: reachability, stochastic certainty, and fidelity. Together they ensure that the generated counterfactuals are easy to reach and produce the desired outcome with high probability.
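The gap between feature similarity and reachability (problem 2 above) can be illustrated with a small grid world. The following is a minimal sketch, not taken from the paper: the grid layout, the state encoding as `(row, col)`, and the function names `feature_distance` and `steps_to_reach` are all hypothetical choices made for illustration.

```python
from collections import deque

# Hypothetical 3x5 grid world: '#' is a wall, '.' is a free cell.
GRID = [
    "..#..",
    "..#..",
    ".....",
]

def feature_distance(a, b):
    """Manhattan distance between two (row, col) states in feature space."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def steps_to_reach(start, goal):
    """Fewest moves from start to goal via breadth-first search (None if unreachable)."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])
            if inside and GRID[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

x = (0, 1)      # original state
x_cf = (0, 3)   # candidate counterfactual state on the other side of the wall

print(feature_distance(x, x_cf))  # small: the states look similar
print(steps_to_reach(x, x_cf))    # larger: the agent must walk around the wall
```

In this toy layout the two states differ by only two cells in feature space, but the wall forces a longer detour, which is the kind of mismatch a reachability-aware method is meant to catch.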
### Main contributions
1. **Proposing three RL-specific counterfactual properties**: reachability, stochastic certainty, and fidelity, together with evaluation metrics for each.
2. **Designing the RACCER algorithm**: The algorithm generates RL-specific counterfactual explanations based on these properties. It can be applied to any RL model and does not require access to the model's internal parameters.
3. **User study**: A user study shows that the counterfactual explanations generated by RACCER help users understand the behavior of RL agents better than those produced by existing methods.
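To make the black-box aspect of contribution 2 concrete, here is a minimal sketch of searching over action sequences while only querying the agent for its chosen action. It deliberately uses brute-force enumeration rather than the heuristic tree search described in the paper, and it assumes a deterministic toy environment. The names `find_counterfactual`, `step`, and `policy`, as well as the corridor task, are hypothetical.

```python
from itertools import product

def find_counterfactual(state, step, policy, actions, desired_action, max_depth=3):
    """Return (action_sequence, final_state) for a shortest action sequence after
    which the black-box policy picks desired_action, or None if none is found.

    Brute-force enumeration by increasing length; an illustrative stand-in,
    not the heuristic tree search used by RACCER.
    """
    for depth in range(1, max_depth + 1):
        for seq in product(actions, repeat=depth):
            s = state
            for a in seq:
                s = step(s, a)  # assumed deterministic transition
            if policy(s) == desired_action:
                return list(seq), s
    return None

# Toy 1-D corridor with positions 0..5; actions move one cell left or right.
def step(s, a):
    return min(5, max(0, s + a))

# The agent is only queried for its output; its internals are never inspected.
def policy(s):
    return "jump" if s >= 4 else "walk"

result = find_counterfactual(2, step, policy, actions=(-1, +1), desired_action="jump")
print(result)
```

Because candidates are enumerated by increasing length, the first hit is also one of the most reachable ones in this simplified setting. A stochastic environment would additionally require estimating how often the sequence actually leads to the desired outcome.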
### Summary of mathematical formulas
- **Reachability**:
\[
R(x, A)=\text{len}(A)
\]
where \( R(x, A) \) represents the length of the action sequence \( A \) required to move from state \( x \) to the counterfactual state \( x' \).
- **Fidelity**:
\[
F(x, A)=1 - \prod_{a \in A}\text{softmax}(Q(x, a))[a]
\]
where \( Q(x, a) \) is the Q-value of taking action \( a \) in state \( x \), the softmax is taken over the action space of the task, and the product runs over the actions in the sequence \( A \).
- **Stochastic Certainty**:
\[
S(x, A, a') = 1 - P[M(x') = a'|x' = A(x)]
\]
where \( A(x) \) is the state obtained after applying the action sequence \( A \) to state \( x \), \( M \) is the agent being explained, and \( a' \) is the desired output (action).
- **Loss Function**:
\[
L(x, A, a')=\alpha R(x, A)+\beta F(x, A)+\gamma S(x, A, a')
\]
where \( \alpha \), \( \beta \), and \( \gamma \) are weights that control the relative importance of the three properties.
With these properties, RACCER aims to generate counterfactual explanations that users can act on in practice while remaining effective as explanations.
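The formulas above can be sketched in a few lines of Python. This is an illustrative reading of the summary, not the authors' implementation: it assumes that \( A \) is a sequence of action indices, that the product in the fidelity term runs over the actions in that sequence, and that the probability \( P[M(x') = a' \mid x' = A(x)] \) has already been estimated elsewhere (for example by sampling) and is passed in as `p_desired`. All function and parameter names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of Q-values."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def reachability(actions):
    """R(x, A) = len(A): number of actions needed to reach the counterfactual."""
    return len(actions)

def fidelity(q_values, actions):
    """F(x, A) = 1 - prod_{a in A} softmax(Q(x, .))[a].

    q_values holds Q(x, a) for every action index a in state x (assumption).
    """
    probs = softmax(q_values)
    return 1.0 - math.prod(probs[a] for a in actions)

def stochastic_certainty(p_desired):
    """S(x, A, a') = 1 - P[M(x') = a' | x' = A(x)], with the probability given."""
    return 1.0 - p_desired

def loss(q_values, actions, p_desired, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum L = alpha * R + beta * F + gamma * S (lower is better)."""
    return (alpha * reachability(actions)
            + beta * fidelity(q_values, actions)
            + gamma * stochastic_certainty(p_desired))

# Toy example: two actions with equal Q-values, a two-step action sequence,
# and an assumed 80% chance that the agent picks the desired action afterwards.
q_values = [0.0, 0.0]
actions = [0, 1]
total = loss(q_values, actions, p_desired=0.8, alpha=0.1, beta=1.0, gamma=1.0)
print(total)
```

All three terms are written so that smaller values are better, which is why they can be combined into a single loss to minimize. The weights decide how much a longer action sequence may be traded for higher certainty.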