Abstract:Self-correction is one of the most amazing emerging capabilities of Large Language Models (LLMs), enabling LLMs to self-modify an inappropriate output given a natural language feedback which describes the problems of that output. Moral self-correction is a post-hoc approach correcting unethical generations without requiring a gradient update, making it both computationally lightweight and capable of preserving the language modeling ability. Previous works have shown that LLMs can self-debias, and it has been reported that small models, i.e., those with less than 22B parameters, are not capable of moral self-correction. However, there is no direct proof as to why such smaller models fall short of moral self-correction, though previous research hypothesizes that larger models are skilled in following instructions and understanding abstract social norms. In this paper, we empirically validate this hypothesis in the context of social stereotyping, through meticulous prompting. Our experimental results indicate that (i) surprisingly, 3.8B LLMs with proper safety alignment fine-tuning can achieve very good moral self-correction performance, highlighting the significant effects of safety alignment; and (ii) small LLMs are indeed weaker than larger-scale models in terms of comprehending social norms and self-explanation through CoT, but all scales of LLMs show bad self-correction performance given unethical instructions.
What problem does this paper attempt to address?
The problem this paper attempts to address is whether small-scale large language models (with fewer than 22 billion parameters) can perform moral self-correction. Specifically, the authors demonstrate the following points through experiments:
1. **Ability to understand abstract social norms**: Whether small models can understand abstract social norms.
2. **Ability to follow instructions**: Whether small models can effectively follow natural language instructions.
3. **Chain-of-Thought explanation ability**: Whether small models can effectively explain in a Chain-of-Thought (CoT) manner.
Background research indicates that large language models (LLMs) have the ability to self-correct, i.e., modify inappropriate content after receiving natural language feedback. However, previous studies have suggested that small models with fewer than 22 billion parameters do not possess this moral self-correction ability. This paper experimentally verifies this hypothesis and finds that, after appropriate safety alignment fine-tuning, a model with 380 million parameters can also achieve good moral self-correction performance.
### Main Contributions
1. **Experimental Evidence**: Through a series of experiments, it is demonstrated that a model with 380 million parameters can achieve good moral self-correction performance after appropriate fine-tuning.
2. **Importance of Safety Alignment**: Emphasizes the crucial role of safety alignment in improving the moral self-correction performance of models.
3. **Comparison of Different Scale Models**: Compares the performance of models of different scales in moral self-correction tasks, revealing the impact of model scale on moral self-correction ability.
### Experimental Setup
- **Model Selection**: Different scale language models were used, including models with 355M, 774M, 1B, 3.8B, 7B, 13B, and 70B parameters.
- **Datasets**: The Winogender benchmark and BBQ benchmark were used, focusing on gender bias and social bias, respectively.
- **Experimental Design**: Specific instructions were used to test the models' understanding ability, ability to follow instructions, and Chain-of-Thought explanation ability.
### Experimental Results
- **Moral Self-Correction Ability**: All models with more than 380 million parameters were able to achieve positive moral self-correction, outperforming the baseline models.
- **Chain-of-Thought Explanation**: The model with 70 billion parameters performed best in Chain-of-Thought explanations, but the performance of other scale models varied.
- **Response to Negative Instructions**: All considered models failed to completely refuse to execute immoral instructions, indicating a need to further enhance the models' ability to recognize and reject immoral instructions.
### Conclusion
This paper experimentally demonstrates that small language models with no fewer than 380 million parameters, after appropriate safety alignment fine-tuning, indeed possess the ability for moral self-correction. Additionally, increasing the specificity of instructions can significantly enhance the models' self-correction performance. Safety alignment plays a key role in this process.