Abstract: We study the problem of controlling the difficulty level of text generated by Large Language Models (LLMs) for contexts where end-users are not fully proficient, such as language learners. Using a novel framework, we evaluate the effectiveness of several key approaches for this task, including few-shot prompting, supervised finetuning, and reinforcement learning (RL), utilising both GPT-4 and open source alternatives like LLama2-7B and Mistral-7B. Our findings reveal a large performance gap between GPT-4 and the open source models when using prompt-based strategies. However, we show how to bridge this gap with a careful combination of finetuning and RL alignment. Our best model, CALM (CEFR-Aligned Language Model), surpasses the performance of GPT-4 and other strategies, at only a fraction of the cost. We further validate the quality of our results through a small-scale human study.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses the inability of large language models (LLMs) to control the difficulty level of the text they generate, especially when the target users are not fully proficient, such as language learners, children, or non-native speakers. Specifically, it proposes a new framework, the "Proficiency Control Task" (PCT), to evaluate a model's ability to adjust the language proficiency level of its output while still generating high-quality content that follows the given instructions.

### Research Methods and Findings

1. **Prompt Strategies**:
   - Several prompt-based methods were studied: directly asking the model to generate text at a specific level, describing the CEFR levels in the prompt, and providing example texts.
   - For GPT-4, more complex prompts significantly reduce ControlError, but the effect is less pronounced for open-source models (such as LLama2-7B and Mistral-7B).
2. **Supervised Fine-Tuning**:
   - Effective GPT-4 prompt strategies were used to generate data for fine-tuning the open-source models on the PCT.
   - After fine-tuning, the ControlError of the open-source models (LLama2-7B and Mistral-7B) dropped significantly, but they still lagged slightly behind GPT-4.
3. **Reinforcement Learning (PPO)**:
   - The Proximal Policy Optimization (PPO) algorithm was then used to align the model outputs more closely with the target proficiency level.
   - The model trained with PPO (referred to as CALM) outperformed GPT-4 in terms of ControlError and was more cost-effective.
4. **Sampling Strategy**:
   - A simple yet effective sampling strategy was proposed: generate multiple samples for each prompt and select the one with the lowest ControlError.
   - With this method, the CALM model strictly outperformed every GPT-4 prompt strategy in the Pareto sense.
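The sampling strategy in item 4 can be illustrated with a minimal sketch. This summary does not define ControlError, so the code assumes it is the squared distance between a scored proficiency level and the target level. The names `control_error`, `best_of_n`, `make_toy_generator`, and `toy_scorer` are hypothetical, and the toy generator and scorer (mean word length as a crude difficulty proxy) are stand-ins for a real LLM and a real proficiency scorer, not the paper's components.

```python
from itertools import cycle
from typing import Callable, Tuple


def control_error(score: float, target: float) -> float:
    """ASSUMPTION: squared distance between scored and target level."""
    return (score - target) ** 2


def best_of_n(
    prompt: str,
    target: float,
    generate: Callable[[str], str],
    scorer: Callable[[str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Draw n candidates and return the one with the lowest control error."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    errors = [control_error(scorer(text), target) for text in candidates]
    best = min(range(n), key=lambda i: errors[i])
    return candidates[best], errors[best]


def make_toy_generator() -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: cycles through fixed texts."""
    texts = cycle([
        "The cat sat.",
        "A feline reposed languidly.",
        "The small cat sat down quietly.",
    ])
    return lambda prompt: next(texts)


def toy_scorer(text: str) -> float:
    """Hypothetical stand-in for a proficiency scorer: mean word length."""
    words = [w.strip(".,") for w in text.split()]
    return sum(len(w) for w in words) / len(words)


if __name__ == "__main__":
    text, err = best_of_n(
        "Describe a cat.", 4.0, make_toy_generator(), toy_scorer, n=3
    )
    print(text, round(err, 4))
```

Larger `n` can only lower or keep the selected error, at the price of `n` generation calls per prompt, which is the cost versus ControlError trade-off behind the Pareto comparison mentioned above.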
### Summary

The paper addresses how to make large language models generate text at a requested proficiency level, and proposes a series of methods to improve the performance of open-source models on this task. The resulting CALM model not only achieved a lower ControlError than GPT-4 but also did so at a much lower cost. A small-scale human study further validated the quality of the generated content.

From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation
