Can LLMs Reliably Simulate Human Learner Actions? A Simulation Authoring Framework for Open-Ended Learning Environments

Amogh Mannekote, Adam Davies, Jina Kang, Kristy Elizabeth Boyer
2024-10-13
Abstract: Simulating learner actions helps stress-test open-ended interactive learning environments and prototype new adaptations before deployment. While recent studies show the promise of using large language models (LLMs) for simulating human behavior, such approaches have not gone beyond rudimentary proof-of-concept stages due to key limitations. First, LLMs are highly sensitive to minor prompt variations, raising doubts about their ability to generalize to new scenarios without extensive prompt engineering. Moreover, apparently successful outcomes can often be unreliable, either because domain experts unintentionally guide LLMs to produce expected results, leading to self-fulfilling prophecies; or because the LLM has encountered highly similar scenarios in its training data, meaning that models may not be simulating behavior so much as regurgitating memorized content. To address these challenges, we propose Hyp-Mix, a simulation authoring framework that allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior. Testing this framework in a physics learning environment, we found that GPT-4 Turbo maintains calibrated behavior even as the underlying learner model changes, providing the first evidence that LLMs can be used to simulate realistic behaviors in open-ended interactive learning environments, a necessary prerequisite for useful LLM behavioral simulation.
Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses several key challenges that large language models (LLMs) face when simulating human learner behavior. Specifically, the authors ask how LLMs can reliably simulate learner behavior in open-ended interactive learning environments, so that developers can test and refine these environments before deployment. The paper identifies the following main problems:

1. **Prompt Sensitivity**:
   - LLMs are highly sensitive to small changes in prompts, which makes it difficult for them to generalize to new scenarios. Even a very small prompt change can substantially alter an LLM's output.
2. **Memorization vs. Reasoning**:
   - LLMs may rely on content memorized from their training data rather than on genuine reasoning. An LLM may therefore be repeating previously seen content rather than simulating behavior.
3. **Self-Fulfilling Prophecies**:
   - Domain experts can unintentionally lead LLMs to produce the expected results, which makes the simulation unreliable. In that case the LLM's behavior reflects the expert's guidance rather than its own reasoning.
4. **Lack of Systematic Evaluation**:
   - There is currently no systematic method for evaluating whether LLMs remain consistent and accurate across different situations, and no clear workflow for avoiding bias.

To address these problems, the authors propose a framework called Hyp-Mix, which allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior. The core idea of Hyp-Mix is to use marginal distribution hypotheses (MDHyps) to define and verify statistical relationships in learner behavior, with the aim of ensuring that LLMs maintain consistent and reliable behavior simulations across different learning environments.
Through this method, Hyp-Mix aims to overcome the limitations of existing LLM simulation methods and provide a basis for developing more reliable and flexible learner behavior simulation tools.
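
To make the idea of a testable hypothesis about learner behavior more concrete, the sketch below shows one way such a check might look. It is not the paper's implementation: this summary does not give the formal definition of an MDHyp, so the code assumes a simple monotonic-trend form ("as a learner attribute increases, the marginal probability of a given action increases"). The attribute (`persistence`), the action labels (`retry`, `quit`), the function names, and the sample data are all hypothetical, and the samples are hard-coded rather than drawn from an LLM.

```python
from collections import Counter
from typing import Dict, List, Sequence


def action_frequency(actions: Sequence[str], target: str) -> float:
    """Empirical marginal probability of `target` among the sampled actions."""
    if not actions:
        raise ValueError("need at least one sampled action")
    return Counter(actions)[target] / len(actions)


def supports_monotonic_hypothesis(
    samples_by_level: Dict[int, List[str]],
    target: str,
    direction: str = "increasing",
    tolerance: float = 0.0,
) -> bool:
    """Check an assumed monotonic-trend hypothesis.

    `samples_by_level` maps a learner-attribute level (hypothetical, e.g.
    persistence 1..3) to the actions a simulator produced at that level.
    Returns True if the frequency of `target` moves in `direction` across
    consecutive levels, allowing a slack of `tolerance`.
    """
    levels = sorted(samples_by_level)
    freqs = [action_frequency(samples_by_level[lvl], target) for lvl in levels]
    pairs = list(zip(freqs, freqs[1:]))
    if direction == "increasing":
        return all(later >= earlier - tolerance for earlier, later in pairs)
    if direction == "decreasing":
        return all(later <= earlier + tolerance for earlier, later in pairs)
    raise ValueError("direction must be 'increasing' or 'decreasing'")


# Hypothetical simulated actions at three persistence levels (assumed data,
# standing in for LLM-generated learner actions).
samples = {
    1: ["quit", "quit", "retry", "quit"],
    2: ["retry", "quit", "retry", "quit"],
    3: ["retry", "retry", "retry", "quit"],
}

if __name__ == "__main__":
    print(supports_monotonic_hypothesis(samples, "retry", "increasing"))
```

In practice a check like this would likely use many more samples and a statistical test rather than a raw comparison of frequencies; the plain comparison is used here only to keep the sketch short.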