Theory of Mind as Intrinsic Motivation for Multi-Agent Reinforcement Learning

Ini Oguntola, Joseph Campbell, Simon Stepputtis, Katia Sycara
2023-07-19
Abstract: The ability to model the mental states of others is crucial to human social intelligence, and can offer similar benefits to artificial agents with respect to the social dynamics induced in multi-agent settings. We present a method of grounding semantically meaningful, human-interpretable beliefs within policies modeled by deep networks. We then consider the task of 2nd-order belief prediction. We propose that the ability of each agent to predict the beliefs of the other agents can be used as an intrinsic reward signal for multi-agent reinforcement learning. Finally, we present preliminary empirical results in a mixed cooperative-competitive environment.
Machine Learning, Artificial Intelligence, Multiagent Systems
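The abstract's central idea, rewarding an agent for predicting other agents' beliefs accurately, can be illustrated with a minimal sketch. It follows the reward formula quoted in the Formula Summary further down, under assumptions that the text here does not confirm: `K` is taken to be the number of other agents, `predicted[i]` stands for this agent's prediction of agent `i`'s belief, and `actual[i]` for agent `i`'s own belief. The function names (`mse`, `cross_entropy`, `tom_reward`) and the argument order of the cross-entropy are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target) or not pred:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(pred: Sequence[float], target: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy of predicted probabilities against target probabilities.

    Assumption: both arguments are categorical distributions over the same
    discrete belief values; eps guards against log(0).
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("vectors must be non-empty and of equal length")
    return -sum(t * math.log(max(p, eps)) for p, t in zip(pred, target))


def tom_reward(predicted: Sequence[Sequence[float]],
               actual: Sequence[Sequence[float]],
               continuous: bool = True) -> float:
    """Negative average prediction loss over the K other agents.

    predicted[i]: this agent's prediction of agent i's belief (assumed).
    actual[i]:    agent i's own belief (assumed).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need one prediction per other agent")
    loss = mse if continuous else cross_entropy
    k = len(predicted)
    return -sum(loss(p, a) for p, a in zip(predicted, actual)) / k


if __name__ == "__main__":
    # Two other agents: one predicted perfectly, one predicted badly.
    print(tom_reward([[1.0, 1.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]))
```

The reward is zero for perfect predictions and becomes more negative as predictions get worse. In a training loop, a term like this would presumably be added to the environment reward with some weighting coefficient; that weighting is an assumption here, not something the summary specifies.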
What problem does this paper attempt to address?
The problem this paper addresses is: **how to use predictions of other agents' beliefs as intrinsic motivation to improve performance in multi-agent reinforcement learning (MARL)**. Specifically, the authors explore whether an agent's ability to predict the beliefs of other agents can serve as an intrinsic reward signal, thereby improving coordination and deception behaviors in multi-agent environments.

### Problem Background

1. **Human Social Intelligence**: Through "Theory of Mind" (ToM), humans infer the mental states of others (such as beliefs, desires, and intentions) and use these inferences to predict others' behavior, adjust their own behavior, and anticipate social interactions.
2. **Challenges in Multi-Agent Systems**: In multi-agent systems, traditional reinforcement learning methods usually model only external behavior and ignore internal mental states. Although some studies have attempted to introduce ToM into multi-agent systems, the effectiveness of these methods is often difficult to evaluate.

### Core Problems of the Paper

The core problem of the paper is: **Can the performance of multi-agent systems be improved by using predictions of other agents' beliefs as an intrinsic reward signal?** Specifically, the authors pose the following research questions:

- **Can performance in multi-agent settings be improved by using predictions of other agents' beliefs as an intrinsic reward signal?**
- **How can semantically meaningful beliefs be embedded into policies modeled by deep networks while remaining interpretable?**
- **How can second-order belief prediction (that is, one agent predicting the belief of another agent) be used as intrinsic motivation to encourage coordination and deception behaviors between agents?**

### Solutions

To address these problems, the authors propose the following methods:

1. **Belief Modeling**: Using concept learning, semantically meaningful beliefs are embedded into the policies of deep reinforcement learning agents. These beliefs can concern the state of the environment (for example, whether a door is locked) or the behaviors of other agents.
2. **Second-Order Belief Prediction**: Each agent predicts not only its own beliefs about the environment but also the beliefs of other agents. The accuracy of this second-order belief prediction is used as an intrinsic reward signal to encourage agents to learn to predict the behaviors of other agents.
3. **Experimental Verification**: The authors conducted experiments in a mixed cooperative-competitive environment. Preliminary results show that using second-order belief prediction as an intrinsic reward signal can significantly improve the performance of multi-agent systems, especially in coordination and deception tasks.

### Formula Summary

- **Belief Loss Function**:
  \[ L_{\text{belief}} = \begin{cases} \text{MSE}(b, b') & \text{if continuous} \\ \text{CE}(b, b') & \text{if discrete} \end{cases} \]
  where \(b\) is the belief vector of the agent, \(b'\) is the true value, MSE is the mean squared error, and CE is the cross-entropy loss.
- **Mutual Information Minimization**:
  \[ I(B; Z) = D_{\text{KL}}(P_{BZ} \,\|\, P_B \otimes P_Z) \]
  where \(B\) is the belief vector, \(Z\) is the residual vector, and \(D_{\text{KL}}\) is the KL divergence.
- **Second-Order Belief Prediction Reward (negative prediction loss)**:
  \[ r_{\text{tom}} = \begin{cases} -\frac{1}{K}\sum_{i=1}^{K}\text{MSE}(B_i, b(i)) & \text{if continuous} \\ -\frac{1}{K}\sum_{i=1}^{K}\text{CE}(B_i, b(i)) & \text{if discrete} \end{cases} \]
  where \(K\)