Abstract:Self-correction is one of the most amazing emerging capabilities of Large Language Models (LLMs), enabling LLMs to self-modify an inappropriate output given a natural language feedback which describes the problems of that output. Moral self-correction is a post-hoc approach correcting unethical generations without requiring a gradient update, making it both computationally lightweight and capable of preserving the language modeling ability. Previous works have shown that LLMs can self-debias, and it has been reported that small models, i.e., those with less than 22B parameters, are not capable of moral self-correction. However, there is no direct proof as to why such smaller models fall short of moral self-correction, though previous research hypothesizes that larger models are skilled in following instructions and understanding abstract social norms. In this paper, we empirically validate this hypothesis in the context of social stereotyping, through meticulous prompting. Our experimental results indicate that (i) surprisingly, 3.8B LLMs with proper safety alignment fine-tuning can achieve very good moral self-correction performance, highlighting the significant effects of safety alignment; and (ii) small LLMs are indeed weaker than larger-scale models in terms of comprehending social norms and self-explanation through CoT, but all scales of LLMs show bad self-correction performance given unethical instructions.

What problem does this paper attempt to address?

The problem this paper attempts to address is whether small-scale large language models (with fewer than 22 billion parameters) can perform moral self-correction. Specifically, the authors demonstrate the following points through experiments: 1. **Ability to understand abstract social norms**: Whether small models can understand abstract social norms. 2. **Ability to follow instructions**: Whether small models can effectively follow natural language instructions. 3. **Chain-of-Thought explanation ability**: Whether small models can effectively explain in a Chain-of-Thought (CoT) manner. Background research indicates that large language models (LLMs) have the ability to self-correct, i.e., modify inappropriate content after receiving natural language feedback. However, previous studies have suggested that small models with fewer than 22 billion parameters do not possess this moral self-correction ability. This paper experimentally verifies this hypothesis and finds that, after appropriate safety alignment fine-tuning, a model with 380 million parameters can also achieve good moral self-correction performance. ### Main Contributions 1. **Experimental Evidence**: Through a series of experiments, it is demonstrated that a model with 380 million parameters can achieve good moral self-correction performance after appropriate fine-tuning. 2. **Importance of Safety Alignment**: Emphasizes the crucial role of safety alignment in improving the moral self-correction performance of models. 3. **Comparison of Different Scale Models**: Compares the performance of models of different scales in moral self-correction tasks, revealing the impact of model scale on moral self-correction ability. ### Experimental Setup - **Model Selection**: Different scale language models were used, including models with 355M, 774M, 1B, 3.8B, 7B, 13B, and 70B parameters. - **Datasets**: The Winogender benchmark and BBQ benchmark were used, focusing on gender bias and social bias, respectively. - **Experimental Design**: Specific instructions were used to test the models' understanding ability, ability to follow instructions, and Chain-of-Thought explanation ability. ### Experimental Results - **Moral Self-Correction Ability**: All models with more than 380 million parameters were able to achieve positive moral self-correction, outperforming the baseline models. - **Chain-of-Thought Explanation**: The model with 70 billion parameters performed best in Chain-of-Thought explanations, but the performance of other scale models varied. - **Response to Negative Instructions**: All considered models failed to completely refuse to execute immoral instructions, indicating a need to further enhance the models' ability to recognize and reject immoral instructions. ### Conclusion This paper experimentally demonstrates that small language models with no fewer than 380 million parameters, after appropriate safety alignment fine-tuning, indeed possess the ability for moral self-correction. Additionally, increasing the specificity of instructions can significantly enhance the models' self-correction performance. Safety alignment plays a key role in this process.

Smaller Large Language Models Can Do Moral Self-Correction

Is Moral Self-correction An Innate Capability of Large Language Models? A Mechanistic Analysis to Self-correction

Large Language Models have Intrinsic Self-Correction Ability

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

Large Language Models Cannot Self-Correct Reasoning Yet

Small Language Models Need Strong Verifiers to Self-Correct Reasoning

Small Language Model Can Self-correct

Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models

Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies

Exploring and steering the moral compass of Large Language Models

On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

A Theoretical Understanding of Self-Correction through In-context Alignment

Small Language Models Improve Giants by Rewriting Their Outputs

Large-scale moral machine experiment on large language models

Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies

When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

Can Large Language Models Reason and Plan?

Large Language Models Can Self-Improve in Long-context Reasoning

The Moral Mind(s) of Large Language Models