From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation

Ali Malik, Stephen Mayhew, Chris Piech, Klinton Bicknell
2024-06-05
Abstract: We study the problem of controlling the difficulty level of text generated by Large Language Models (LLMs) for contexts where end-users are not fully proficient, such as language learners. Using a novel framework, we evaluate the effectiveness of several key approaches for this task, including few-shot prompting, supervised finetuning, and reinforcement learning (RL), utilising both GPT-4 and open source alternatives like LLama2-7B and Mistral-7B. Our findings reveal a large performance gap between GPT-4 and the open source models when using prompt-based strategies. However, we show how to bridge this gap with a careful combination of finetuning and RL alignment. Our best model, CALM (CEFR-Aligned Language Model), surpasses the performance of GPT-4 and other strategies, at only a fraction of the cost. We further validate the quality of our results through a small-scale human study.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses the difficulty of controlling the proficiency level of text generated by large language models (LLMs), especially when the target users are not fully proficient, such as language learners, children, or non-native speakers. Specifically, it proposes a new framework, the "Proficiency Control Task" (PCT), to evaluate a model's ability to adjust the language proficiency level of its output while still generating high-quality content that follows the given instructions.

### Research Methods and Findings

1. **Prompt Strategies**:
   - Several prompt-based methods were studied, including directly asking the model to generate text at a specific level, describing the CEFR levels in the prompt, and providing example texts.
   - For GPT-4, more complex prompts significantly reduce ControlError, but the effect is less pronounced for open-source models (such as LLama2-7B and Mistral-7B).
2. **Supervised Fine-Tuning**:
   - Data generated by GPT-4 with its most effective prompt strategies was used to fine-tune the open-source models and improve their performance on the PCT.
   - After fine-tuning, the ControlError of the open-source models (LLama2-7B and Mistral-7B) dropped significantly, but they still lagged slightly behind GPT-4.
3. **Reinforcement Learning (PPO)**:
   - The Proximal Policy Optimization (PPO) algorithm was then used to align the model outputs more closely with the target proficiency level.
   - The model trained with PPO (referred to as CALM) achieved lower ControlError than GPT-4 and was more cost-effective.
4. **Sampling Strategy**:
   - A simple yet effective sampling strategy was proposed: generate multiple samples for each prompt and select the one with the lowest ControlError.
   - With this method, CALM Pareto-dominated every GPT-4 prompt strategy.
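The sampling strategy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes proficiency levels are mapped to numbers (for example A1 = 1 through C2 = 6) and that ControlError is the squared distance between a scorer's predicted level and the target level, which this summary does not spell out. The names `control_error`, `best_of_n`, and `toy_scorer` are hypothetical, and `toy_scorer` (mean word length) is only a stand-in for a real proficiency scorer.

```python
from typing import Callable, Tuple


def control_error(predicted_level: float, target_level: float) -> float:
    # ASSUMPTION: ControlError is modeled here as squared distance between
    # the scorer's predicted level and the target level.
    return (predicted_level - target_level) ** 2


def best_of_n(
    generate: Callable[[str], str],
    score: Callable[[str], float],
    prompt: str,
    target_level: float,
    n: int = 5,
) -> Tuple[str, float]:
    """Draw n samples for one prompt and keep the one with the lowest error."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    errors = [control_error(score(text), target_level) for text in candidates]
    best_index = min(range(n), key=lambda i: errors[i])
    return candidates[best_index], errors[best_index]


def toy_scorer(text: str) -> float:
    # Hypothetical stand-in for a proficiency scorer: mean word length.
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


if __name__ == "__main__":
    # Hypothetical generator that replays fixed samples instead of calling a model.
    samples = iter([
        "a cat sat",
        "the elephant wandered",
        "extraordinary circumstances",
    ])
    text, error = best_of_n(
        lambda _prompt: next(samples),
        toy_scorer,
        "Write a short story.",
        target_level=3.0,
        n=3,
    )
    print(text, round(error, 3))
```

In practice `generate` would wrap a call to the language model and `score` a trained proficiency classifier. Larger `n` gives more chances of a close match but costs proportionally more generations, which is the trade-off behind the Pareto comparison.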
### Summary

The paper addresses how to make large language models generate text at a requested proficiency level and proposes a series of methods that improve the performance of open-source models on this task. The resulting CALM model not only achieved lower ControlError than GPT-4 but also cost much less to run. A small-scale human evaluation further validated the quality of the generated content.