We Urgently Need Intrinsically Kind Machines

Joshua T. S. Hewson
2024-10-22
Abstract: Artificial Intelligence systems are rapidly evolving, integrating extrinsic and intrinsic motivations. While these frameworks offer benefits, they risk misalignment at the algorithmic level while appearing superficially aligned with human values. In this paper, we argue that an intrinsic motivation for kindness is crucial for making sure these models are intrinsically aligned with human values. We argue that kindness, defined as a form of altruism motivated to maximize the reward of others, can counteract any intrinsic motivations that might lead the model to prioritize itself over human well-being. Our approach introduces a framework and algorithm for embedding kindness into foundation models by simulating conversations. Limitations and future research directions for scalable implementation are discussed.
Artificial Intelligence
What problem does this paper attempt to address?
The key problem this paper addresses is the misalignment with human values that may arise when current artificial intelligence (AI) systems combine intrinsic motivations with extrinsic rewards. Specifically, the author argues that although existing AI models can be aligned through external rewards, that alignment holds only at a superficial level, and the models' intrinsic motivations may still prioritize self-interest over human well-being. This misalignment may lead to AI behavior that fails to meet human expectations and may even cause harm.

### Main problems:

1. **Risks of combining intrinsic motivation and extrinsic rewards**:
   - Current AI models are aligned through external rewards (such as reinforcement learning from human feedback), but these methods are effective only at a superficial level.
   - Intrinsic motivations (such as curiosity and autonomy) are gradually being introduced into AI systems, and these motivations may cause an AI to prioritize self-interest over human well-being.
   - Combining intrinsic and extrinsic motivations may lead to unforeseen risks, especially in powerful foundation models.
2. **Risks of double misalignment**:
   - Intrinsic motivation shapes the algorithmic level of an AI, while extrinsic rewards shape its functional level, so the AI may behave well on the surface without truly caring about human well-being.
   - This "duplicity" may lead to serious safety risks, especially when the AI is highly intelligent.
3. **Lack of an intrinsic motivation for kindness**:
   - Existing alignment methods cannot ensure that an AI truly cares about human well-being at an intrinsic level.
   - The author proposes that, to ensure an AI is truly aligned with human values, an intrinsic motivation for kindness must be introduced, that is, altruism that aims to maximize the rewards of others.

### Solutions:

- **Introducing an intrinsic motivation for kindness**: The author suggests defining kindness as an intrinsic motivation aimed at maximizing the reward of the target individual. The specific formula is as follows:

  \[ \max_{a_j} \mathbb{E}_{s_t} \left[ R_i(a_i^{t+1} | s_i^{t+1}) \right] \]

  where \( i \) indexes the target individual and \( j \) indexes the model:
  - \( a_i^{t+1} \) and \( s_i^{t+1} \) are the action and state of the target individual at time \( t + 1 \), and \( R_i \) is the target individual's reward function.
  - \( a_j \) is the action of the model, which is chosen to maximize the target individual's expected reward.
- **Simulated dialogue framework**: By simulating conversations, the AI learns to take the other party's perspective, so that it can better understand and maximize the other party's reward.

### Summary:

The main purpose of this paper is to address the value misalignment that may occur when existing AI systems combine intrinsic motivation and extrinsic rewards. It does so by introducing an intrinsic motivation for kindness, so that the AI truly cares about human well-being rather than merely appearing kind on the surface.
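
To make the objective under Solutions concrete, here is a minimal Python sketch of choosing the model's action so as to maximize the target individual's expected reward. It is a toy illustration under stated assumptions, not the paper's algorithm (which embeds kindness into foundation models via simulated conversations). Everything named here is hypothetical: the finite action sets, the `toy_transition` table giving a distribution over the target's next state, the deterministic `toy_policy` for the target's response, and the `toy_reward` values. The sketch also assumes the expectation is taken over the target's next state.

```python
from typing import Callable, Dict, Hashable, Iterable

State = Hashable
Action = Hashable


def expected_target_reward(
    model_action: Action,
    transition: Callable[[Action], Dict[State, float]],
    target_policy: Callable[[State], Action],
    target_reward: Callable[[Action, State], float],
) -> float:
    """Expected R_i(a_i^{t+1} | s_i^{t+1}) for one candidate model action a_j."""
    total = 0.0
    # Assumption: transition(a_j) returns P(s_i^{t+1} | a_j) as a dict.
    for next_state, prob in transition(model_action).items():
        target_action = target_policy(next_state)  # a_i^{t+1}
        total += prob * target_reward(target_action, next_state)
    return total


def kindest_action(
    model_actions: Iterable[Action],
    transition: Callable[[Action], Dict[State, float]],
    target_policy: Callable[[State], Action],
    target_reward: Callable[[Action, State], float],
) -> Action:
    """Return the a_j that maximizes the target's expected reward."""
    return max(
        model_actions,
        key=lambda a: expected_target_reward(
            a, transition, target_policy, target_reward
        ),
    )


# Hypothetical toy world: the model either explains or dismisses a question.
TRANSITIONS = {
    "explain": {"understands": 0.8, "confused": 0.2},
    "dismiss": {"understands": 0.1, "confused": 0.9},
}
REWARDS = {("thank", "understands"): 1.0, ("ask_again", "confused"): -0.5}


def toy_transition(model_action: Action) -> Dict[State, float]:
    return TRANSITIONS[model_action]


def toy_policy(state: State) -> Action:
    return "thank" if state == "understands" else "ask_again"


def toy_reward(target_action: Action, state: State) -> float:
    return REWARDS[(target_action, state)]


best = kindest_action(TRANSITIONS.keys(), toy_transition, toy_policy, toy_reward)
print(best)
```

In this toy setting the model's own reward never enters the selection; only the target's reward function is evaluated, which is the sense in which the objective is altruistic. A practical system would have to estimate the transition, policy, and reward of the other party rather than read them from tables.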