A Theoretical Understanding of Self-Correction through In-context Alignment

Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang
2024-05-29
Abstract: Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we also illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step does make a large difference. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper studies the self-correction ability of large language models (LLMs) and seeks a theoretical account of how this ability arises. Its core objectives can be summarized as follows:

1. **Theoretical Analysis of Self-Correction Ability**: Using a simplified setup akin to an alignment task, the authors analyze self-correction from the perspective of in-context learning. They show that when LLMs give relatively accurate self-examinations as rewards, they can refine their responses in context.
2. **Roles of Realistic Transformer Components**: Going beyond earlier theories built on over-simplified linear transformers, the theoretical construction identifies the roles that several key designs of realistic transformers play in self-correction: softmax attention, multi-head attention, and the MLP block.
3. **Experimental Validation**: The authors validate these findings extensively on synthetic datasets and illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step makes a large difference.
4. **Combining Theory and Practice**: The paper pairs its theoretical account with practical applications, and the authors expect the findings to inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.

In summary, the paper aims to establish a principled understanding of how the self-correction ability of LLMs emerges and of the conditions under which it can be exploited and enhanced.
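The self-correction process described above (respond, self-examine, then refine with earlier attempts and their rewards kept in context) can be illustrated with a minimal Python sketch. This is a toy illustration only, not the paper's construction or experimental code: `self_correct`, `toy_generate`, and `toy_critique` are hypothetical names, and the two toy functions are hard-coded stand-ins for an LLM, so no real model is called.

```python
from typing import Callable, List, Tuple

# (response, self-assessed reward) pairs accumulated in context
History = List[Tuple[str, float]]


def self_correct(
    query: str,
    generate: Callable[[str, History], str],
    critique: Callable[[str, str], float],
    rounds: int = 3,
) -> Tuple[str, float]:
    """Run a simple self-correction loop and return the best (response, reward)."""
    history: History = []
    for _ in range(rounds):
        # The new response is conditioned on earlier attempts and their rewards.
        response = generate(query, history)
        # The model's own self-examination is used as the reward signal.
        reward = critique(query, response)
        history.append((response, reward))
    # Pick the highest-reward attempt (the earliest one wins on ties).
    return max(history, key=lambda pair: pair[1])


# Hypothetical stand-ins for an LLM: drafts improve as the context grows.
def toy_generate(query: str, history: History) -> str:
    drafts = ["rough draft", "better draft", "best draft"]
    return drafts[min(len(history), len(drafts) - 1)]


def toy_critique(query: str, response: str) -> float:
    scores = {"rough draft": 0.2, "better draft": 0.6, "best draft": 0.9}
    return scores[response]


best_response, best_reward = self_correct("example query", toy_generate, toy_critique)
print(best_response, best_reward)
```

In this sketch the quality of the final answer depends entirely on `toy_critique` ranking the drafts sensibly, which loosely mirrors the paper's condition that self-examinations be relatively accurate; with an unreliable critic the loop would have no reason to improve.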