Abstract: Simulating learner actions helps stress-test open-ended interactive learning environments and prototype new adaptations before deployment. While recent studies show the promise of using large language models (LLMs) for simulating human behavior, such approaches have not gone beyond rudimentary proof-of-concept stages due to key limitations. First, LLMs are highly sensitive to minor prompt variations, raising doubts about their ability to generalize to new scenarios without extensive prompt engineering. Moreover, apparently successful outcomes can often be unreliable, either because domain experts unintentionally guide LLMs to produce expected results, leading to self-fulfilling prophecies; or because the LLM has encountered highly similar scenarios in its training data, meaning that models may not be simulating behavior so much as regurgitating memorized content. To address these challenges, we propose Hyp-Mix, a simulation authoring framework that allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior. Testing this framework in a physics learning environment, we found that GPT-4 Turbo maintains calibrated behavior even as the underlying learner model changes, providing the first evidence that LLMs can be used to simulate realistic behaviors in open-ended interactive learning environments, a necessary prerequisite for useful LLM behavioral simulation.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses several key challenges that large language models (LLMs) face when simulating human learner behavior. Specifically, the authors ask how LLMs can reliably simulate learner behavior in open-ended interactive learning environments, so that developers can test and optimize these environments before deployment. The main problems identified in the paper are:

1. **Prompt Sensitivity**:
   - LLMs are highly sensitive to small changes in prompts, which makes it difficult for them to generalize to new scenarios. Even minor prompt changes can significantly change their output.
2. **Memorization vs. Reasoning**:
   - LLMs may rely on content memorized from their training data rather than performing true reasoning. They may therefore repeat previously seen content rather than actually simulating behavior.
3. **Self-Fulfilling Prophecies**:
   - Domain experts may unintentionally steer LLMs toward expected results, making simulation outcomes unreliable. In this case, the LLM's behavior reflects the expert's guidance rather than genuine reasoning.
4. **Lack of Systematic Evaluation**:
   - There is currently no systematic method for evaluating whether LLMs remain consistent and accurate across different situations, and no clear workflow for avoiding bias.

To solve these problems, the authors propose a framework called Hyp-Mix, which allows experts to develop and evaluate simulations by combining testable hypotheses about learner behavior. The core idea of Hyp-Mix is to use marginal distribution hypotheses (MDHyps) to define and verify statistical relationships in learner behavior, so that LLMs can maintain consistent and reliable behavior simulations across different learning environments.

Through this method, Hyp-Mix aims to overcome the limitations of existing LLM simulation methods and provide a basis for developing more reliable and flexible learner behavior simulation tools.
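The summary above does not say how an MDHyp is actually checked. As a rough illustration only, the sketch below assumes an MDHyp can be read as a directional claim: the marginal probability of some action should rise (or fall) when a learner characteristic changes. The function names, the `persistence` framing, the action labels, and the toy samples are all hypothetical. They are not taken from the paper and are not results.

```python
from collections import Counter


def marginal(actions):
    """Empirical marginal distribution over a list of action labels."""
    if not actions:
        raise ValueError("need at least one sampled action")
    counts = Counter(actions)
    total = len(actions)
    return {action: count / total for action, count in counts.items()}


def hypothesis_holds(low_actions, high_actions, action,
                     direction="increase", min_shift=0.0):
    """Check a directional claim about one action's marginal probability.

    low_actions / high_actions: actions sampled from the simulator under a
    low and a high setting of some learner characteristic (an assumption).
    """
    p_low = marginal(low_actions).get(action, 0.0)
    p_high = marginal(high_actions).get(action, 0.0)
    shift = p_high - p_low
    if direction == "increase":
        return shift > min_shift
    if direction == "decrease":
        return shift < -min_shift
    raise ValueError("direction must be 'increase' or 'decrease'")


# Toy, hand-written samples (NOT data from the paper).
low_persistence = ["retry"] * 3 + ["quit"] * 7
high_persistence = ["retry"] * 8 + ["quit"] * 2

# Hypothetical hypothesis: higher persistence makes "retry" more likely.
retry_ok = hypothesis_holds(low_persistence, high_persistence, "retry")
# Hypothetical companion hypothesis: it makes "quit" less likely.
quit_ok = hypothesis_holds(low_persistence, high_persistence, "quit",
                           direction="decrease")
print(retry_ok, quit_ok)
```

A real evaluation would likely replace the raw difference with a statistical test over many sampled actions. The threshold `min_shift` is only a placeholder for that step.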

Can LLMs Reliably Simulate Human Learner Actions? A Simulation Authoring Framework for Open-Ended Learning Environments

Dialogue Learning with Human-in-the-Loop

Leveraging generative artificial intelligence to simulate student learning behavior

Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis

Evaluation of Code Generation for Simulating Participant Behavior in Experience Sampling Method by Iterative In-Context Learning of a Large Language Model

Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions

GPT-Based Models Meet Simulation: How to Efficiently Use Large-Scale Pre-Trained Language Models Across Simulation Tasks

Physics simulation capabilities of LLMs

SimulBench: Evaluating Language Models with Creative Simulation Tasks

Towards A Human-in-the-Loop LLM Approach to Collaborative Discourse Analysis

How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

Enhancing LLMs for Power System Simulations: A Feedback-driven Multi-agent Framework

clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

Broadening Access to Simulations for End-Users via Large Language Models: Challenges and Opportunities

User Behavior Simulation with Large Language Model based Agents

PlatoLM: Teaching LLMs in Multi-Round Dialogue via a User Simulator

LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

Large Language Models for In-Context Student Modeling: Synthesizing Student's Behavior in Visual Programming