Abstract: Evaluating the creativity of large language models (LLMs) in story writing is difficult because LLM-generated stories can appear creative while being very similar to existing stories in the models' huge, proprietary training corpora. To overcome this challenge, we introduce a novel benchmark dataset with varying levels of prompt specificity: CS4 ($\mathbf{C}$omparing the $\mathbf{S}$kill of $\mathbf{C}$reating $\mathbf{S}$tories by $\mathbf{C}$ontrolling the $\mathbf{S}$ynthesized $\mathbf{C}$onstraint $\mathbf{S}$pecificity). By increasing the number of requirements/constraints in the prompt, we can increase the prompt specificity and hinder LLMs from retelling high-quality narratives in their training data. Consequently, CS4 empowers us to indirectly measure the LLMs' creativity without human annotations. Our experiments on LLaMA, Gemma, and Mistral not only highlight the creativity challenges LLMs face when dealing with highly specific prompts but also reveal that different LLMs perform very differently under different numbers of constraints and achieve different balances between the model's instruction-following ability and narrative coherence. Additionally, our experiments on OLMo suggest that Learning from Human Feedback (LHF) can help LLMs select better stories from their training data but has limited influence in boosting LLMs' ability to produce creative stories that are unseen in the training corpora. The benchmark is released at https://github.com/anirudhlakkaraju/cs4_benchmark.

What problem does this paper attempt to address?

This paper addresses the problem of evaluating the creativity of large language models (LLMs) in story generation. Existing evaluation methods often fail to measure creativity effectively because these models can generate content that appears creative but is in fact very similar to stories already present in their training data. To overcome this challenge, the authors introduce a new benchmark dataset, CS4 (Comparing the Skill of Creating Stories by Controlling the Synthesized Constraint Specificity), which raises prompt specificity by adding more constraints, making it harder for LLMs to reproduce high-quality narratives from their training data. In this way, CS4 measures the creativity of LLMs indirectly, without human annotation.

### Main Contributions:

1. **New Benchmark Dataset**: CS4 contains prompts with varying levels of specificity, controlled by the number of constraints in each prompt.
2. **Creativity Evaluation**: Creativity is assessed indirectly by comparing the performance of different LLMs under different numbers of constraints.
3. **Experimental Results**: Different LLMs face different creativity challenges when dealing with highly specific prompts and strike different balances between instruction-following ability and narrative coherence.

### Specific Issues:

- **Definition of Creativity**: Creativity is defined as the ability to generate original, unseen, and high-quality stories.
- **Limitations of Existing Methods**: Current evaluation methods typically allow LLMs to perform well by copying or slightly modifying relevant content from the training data, without truly "understanding" the stories they output.
- **Challenges**: Directly checking whether a generated story mainly comes from the model's training data is difficult, because LLMs may output text that is semantically similar to but lexically different from the training text, and because the training data is usually vast and not publicly available.

### Solution:

- **CS4 Benchmark**: Increasing the number of constraints in the prompts makes it difficult for LLMs to copy content from the training data, so performance under many constraints serves as an indirect measure of creativity.
- **Multi-Metric Evaluation**: In addition to common metrics such as instruction-following rate, coherence, and diversity, CS4 uses multiple prompt sets with different numbers of constraints to compare the performance of LLMs at different levels of prompt specificity.

### Experimental Results:

- **Performance Decline**: The quality of the stories produced by all explored LLMs declines significantly as prompt specificity increases, indicating that more constraints do pose a greater challenge to the creativity of LLMs.
- **Performance of Different Models**: Some LLMs deteriorate faster as constraints are added. For example, LLaMA-2 7B shows a greater decline in constraint satisfaction probability than Gemma-7B.
- **Diversity Changes**: Different metrics degrade at different speeds. For instance, LLaMA-2 maintains similar coherence, but its number of satisfied constraints decreases by 25% as constraints increase.
- **Impact of LHF**: A comparison of different versions of OLMo shows that direct preference optimization (DPO), or more broadly learning from human feedback (LHF), contributes little to performance improvement when constraints are increased.

In summary, this paper introduces the CS4 benchmark dataset, which provides a new method for evaluating the creativity of LLMs in story generation and reveals how differently models perform when faced with highly specific prompts.
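The core idea summarized above (vary the number of constraints in a prompt, then track how constraint satisfaction changes) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `build_prompt`, `satisfaction_rate`, `mean_rate_by_level`, and `relative_decline` are hypothetical, the per-constraint True/False judgments are assumed to come from some external judge that is not shown here, and the numbers in the usage example are made up for illustration rather than taken from the paper.

```python
from typing import Dict, List, Sequence


def build_prompt(instruction: str, constraints: Sequence[str], k: int) -> str:
    """Return a story prompt that includes the first k constraints.

    A larger k means a more specific prompt. (Hypothetical helper.)
    """
    if not 0 <= k <= len(constraints):
        raise ValueError("k must be between 0 and len(constraints)")
    lines = [instruction]
    if k:
        lines.append("The story must satisfy all of the following constraints:")
        lines.extend(f"{i}. {c}" for i, c in enumerate(constraints[:k], start=1))
    return "\n".join(lines)


def satisfaction_rate(judgments: Sequence[bool]) -> float:
    """Fraction of constraints judged as satisfied for a single story."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def mean_rate_by_level(results: Dict[int, List[Sequence[bool]]]) -> Dict[int, float]:
    """Average satisfaction rate for each constraint count (specificity level).

    `results` maps a constraint count to the per-story judgment lists
    collected at that level.
    """
    return {
        k: sum(satisfaction_rate(j) for j in stories) / len(stories)
        for k, stories in sorted(results.items())
    }


def relative_decline(rates: Dict[int, float]) -> float:
    """Relative drop in mean rate from the fewest to the most constraints."""
    lowest, highest = min(rates), max(rates)
    return (rates[lowest] - rates[highest]) / rates[lowest]


if __name__ == "__main__":
    # Toy, made-up judgments: two stories at 2 constraints, one at 4.
    toy_results = {
        2: [[True, True], [True, False]],
        4: [[True, False, False, True]],
    }
    rates = mean_rate_by_level(toy_results)
    print(build_prompt("Write a short story.", ["Set it at sea.", "Include a twin."], 2))
    print(rates)
    print(f"relative decline: {relative_decline(rates):.2f}")
```

Comparing `relative_decline` across models is one simple way to express that some models deteriorate faster than others as constraints are added; the same per-level averaging could be applied to coherence or diversity scores.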

CS4: Measuring the Creativity of Large Language Models Automatically by Controlling the Number of Story-Writing Constraints
