CS4: Measuring the Creativity of Large Language Models Automatically by Controlling the Number of Story-Writing Constraints

Anirudh Atmakuru, Jatin Nainani, Rohith Siddhartha Reddy Bheemreddy, Anirudh Lakkaraju, Zonghai Yao, Hamed Zamani, Haw-Shiuan Chang
2024-10-05
Abstract: Evaluating the creativity of large language models (LLMs) in story writing is difficult because LLM-generated stories could seemingly look creative but be very similar to some existing stories in their huge and proprietary training corpus. To overcome this challenge, we introduce a novel benchmark dataset with varying levels of prompt specificity: CS4 ($\mathbf{C}$omparing the $\mathbf{S}$kill of $\mathbf{C}$reating $\mathbf{S}$tories by $\mathbf{C}$ontrolling the $\mathbf{S}$ynthesized $\mathbf{C}$onstraint $\mathbf{S}$pecificity). By increasing the number of requirements/constraints in the prompt, we can increase the prompt specificity and hinder LLMs from retelling high-quality narratives in their training data. Consequently, CS4 empowers us to indirectly measure the LLMs' creativity without human annotations. Our experiments on LLaMA, Gemma, and Mistral not only highlight the creativity challenges LLMs face when dealing with highly specific prompts but also reveal that different LLMs perform very differently under different numbers of constraints and achieve different balances between the model's instruction-following ability and narrative coherence. Additionally, our experiments on OLMo suggest that Learning from Human Feedback (LHF) can help LLMs select better stories from their training data but has limited influence in boosting LLMs' ability to produce creative stories that are unseen in the training corpora. The benchmark is released at https://github.com/anirudhlakkaraju/cs4_benchmark.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the problem of evaluating the creativity of large language models (LLMs) in story generation. Existing evaluation methods often fail to measure creativity effectively because these models can generate content that appears creative but is actually very similar to stories already present in the training data. To overcome this challenge, the authors introduce a new benchmark dataset, CS4 (Comparing the Skill of Creating Stories by Controlling the Synthesized Constraint Specificity). CS4 raises the specificity of prompts by adding more constraints, which limits the LLMs' ability to reproduce high-quality narratives from their training data. In this way, CS4 can indirectly measure the creativity of LLMs without manual annotation.

### Main Contributions:

1. **New Benchmark Dataset**: CS4 contains prompts with varying levels of specificity, controlled by the number of constraints in each prompt.
2. **Creativity Evaluation**: The creativity of different LLMs is assessed indirectly by comparing their performance under different numbers of constraints.
3. **Experimental Results**: Different LLMs face different creativity challenges when dealing with highly specific prompts and strike different balances between instruction-following ability and narrative coherence.

### Specific Issues:

- **Definition of Creativity**: Creativity is defined as the ability to generate original, unseen, and high-quality stories.
- **Limitations of Existing Methods**: Current evaluation methods typically allow LLMs to perform well by slightly modifying or copying relevant content from the training data, without truly "understanding" the output stories.
- **Challenges**: Directly checking whether the generated stories mainly come from the model's training data is difficult, because LLMs may output text that is semantically similar to but lexically different from the training text, and the training data is usually vast and not publicly available.

### Solution:

- **CS4 Benchmark**: Increasing the number of constraints in the prompts makes it difficult for LLMs to copy content from the training data, so performance under many constraints reflects their creativity.
- **Multi-Metric Evaluation**: In addition to common metrics like instruction-following rate, coherence, and diversity, CS4 uses multiple prompt sets with different numbers of constraints to compare the performance of LLMs under different prompt specificities.

### Experimental Results:

- **Performance Decline**: The quality of stories output by all explored LLMs declines significantly as prompt specificity increases, indicating that more constraints indeed pose a greater challenge to the creativity of LLMs.
- **Performance of Different Models**: Some LLMs deteriorate faster as constraints increase. For example, LLaMA-2 7B shows a greater decline in constraint satisfaction probability than Gemma-7B.
- **Metric-Specific Degradation**: Different metrics degrade at different speeds. For instance, LLaMA-2 maintains similar coherence, but its number of satisfied constraints decreases by 25% as constraints increase.
- **Impact of LHF**: A comparison of different versions of OLMo shows that direct preference optimization (DPO), or more broadly learning from human feedback (LHF), contributes little to performance improvement when constraints are increased.

In summary, this paper introduces the CS4 benchmark dataset, which provides a new method for evaluating the creativity of LLMs in story generation and reveals how differently models perform when faced with highly specific prompts.
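The core idea of the benchmark, varying the number of constraints in a prompt and tracking how constraint satisfaction changes, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the prompt layout, the function names (`build_prompt`, `satisfaction_rate`, `rate_by_level`, `relative_decline`), and the premise that per-constraint satisfaction judgments are already available as booleans. It is not the authors' implementation and does not reproduce their metrics.

```python
from statistics import mean


def build_prompt(instruction, constraints, k):
    """Build a prompt from the first k constraints (hypothetical layout)."""
    if not 0 <= k <= len(constraints):
        raise ValueError("k must be between 0 and len(constraints)")
    lines = [instruction]
    if k:
        lines.append("Constraints:")
        lines.extend(f"{i}. {c}" for i, c in enumerate(constraints[:k], start=1))
    return "\n".join(lines)


def satisfaction_rate(judgments):
    """Fraction of constraints judged satisfied for a single story."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def rate_by_level(results):
    """Map each constraint count to the mean satisfaction rate over its stories.

    results: dict mapping constraint count -> list of per-story boolean lists.
    """
    return {
        k: mean(satisfaction_rate(j) for j in stories)
        for k, stories in sorted(results.items())
    }


def relative_decline(rates):
    """Relative drop in satisfaction from the least to the most constrained level."""
    levels = sorted(rates)
    first, last = rates[levels[0]], rates[levels[-1]]
    if first == 0:
        raise ValueError("baseline rate is zero; relative decline is undefined")
    return (first - last) / first


if __name__ == "__main__":
    # Toy judgments, invented for illustration only (not results from the paper).
    toy = {
        2: [[True, True], [True, False]],
        4: [[True, False, False, False], [True, True, False, False]],
    }
    rates = rate_by_level(toy)
    print(build_prompt("Write a story.", ["Include a lighthouse.", "End at dawn."], 2))
    print(rates, relative_decline(rates))
```

Comparing `relative_decline` across models would be one simple way to see which model deteriorates faster as prompts become more specific. How satisfaction judgments are actually obtained is left open in this sketch.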