Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias

Itay Itzhak, Gabriel Stanovsky, Nir Rosenfeld, Yonatan Belinkov
2024-03-31
Abstract: Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically. While these tuning methods can help align models with human objectives and generate high-quality text, not much is known about their potential adverse effects. In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs, focusing on three cognitive biases - the decoy effect, the certainty effect, and the belief bias - all of which are known to influence human decision-making and reasoning. Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families. Notably, we find a stronger presence of biases in models that have undergone instruction tuning, such as Flan-T5, Mistral-Instruct, GPT3.5, and GPT4. Our work constitutes a step toward comprehending cognitive biases in instruction-tuned LMs, which is crucial for the development of more reliable and unbiased language models.
Artificial Intelligence, Computers and Society, Machine Learning
What problem does this paper attempt to address?
This paper investigates how instruction tuning (IT) and reinforcement learning from human feedback (RLHF) affect decision making and reasoning in large language models (LMs). It focuses on three cognitive biases known to affect human decision making and reasoning: the decoy effect, the certainty effect, and the belief bias. The study finds these biases in models from the GPT-3, Mistral, and T5 families, and finds them more strongly in instruction-tuned models such as Flan-T5, Mistral-Instruct, GPT3.5, and GPT4. This suggests that while IT and RLHF can improve a model's alignment with human objectives and the quality of its generated text, they may also introduce or exacerbate cognitive biases. The paper argues that understanding these biases in instruction-tuned LMs is crucial for developing more reliable and unbiased language models. Its findings indicate that making models behave more consistently with humans may introduce unexpected biased behavior in other respects.