Abstract: LLMs are increasingly being used in workflows involving generating content to be consumed by humans (e.g., marketing) and also in directly interacting with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction, and on the other, they could be misused for spreading misinformation and shaping political opinions. To channel LLMs' impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to measure the persuasion ability of generative models automatically. We investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models' persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California's SB-1047 aim to regulate AI models based on the number of floating point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI's societal impact.
We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at https://bit.ly/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications.

What problem does this paper attempt to address?

This paper addresses how to evaluate and improve the persuasiveness of content generated by large language models (LLMs). Specifically, it focuses on the following points:

1. **Measuring and evaluating persuasiveness**: Existing work usually relies on manual, human-based evaluation to judge the persuasiveness of LLM-generated content. Such evaluation is costly, limited to small samples, and unable to account fully for how the speaker, audience, time, and channel influence the persuasive effect. The paper therefore argues for an automated method of measuring LLM persuasiveness that takes these factors into account.
2. **Developing a new task, Transsuasion**: The paper introduces transsuasion, the task of converting non-persuasive language into more persuasive content while keeping the speaker, audience, time, and channel unchanged. The task tests whether LLMs can make content more persuasive by changing only how it is worded.
3. **Constructing a dataset**: To support these tasks, the paper uses natural experiments to construct a large-scale dataset of tweet pairs. The two tweets in a pair have similar semantic content but different wording and were posted by the same account within a short time of each other, yet they received significantly different numbers of likes. These pairs are used to train and test the transsuasion ability of LLMs.
4. **Developing evaluation tools**: The paper proposes two evaluation tools, PersuasionBench and PersuasionArena, described as the first large-scale automated benchmark and arena for evaluating the persuasiveness of LLMs. They cover both simulation ability and generation ability, so that persuasiveness can be evaluated comprehensively under different conditions.
5. **Challenging the scale-dependence hypothesis**: The paper also explores whether specific training methods can improve the persuasiveness of small LLMs, challenging the view that persuasiveness simply grows with model size. The results indicate that, with targeted training, small models can match or even exceed the persuasiveness of much larger models.

Overall, the paper's main goal is to develop systematic, automated methods and tools for evaluating and improving the persuasiveness of LLM-generated content, while exploring and verifying the key factors that affect it.
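
The dataset construction described above (same account, similar content, close in time, very different like counts) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Tweet` record, the function names, and all thresholds (`max_gap`, `min_similarity`, `min_like_ratio`) are hypothetical, and `difflib.SequenceMatcher` is only a surface-level stand-in for whatever semantic similarity measure the paper actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class Tweet:
    # Hypothetical record layout, chosen for illustration only.
    account: str
    posted_at: datetime
    text: str
    likes: int


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_transsuasion_pairs(
    tweets,
    max_gap=timedelta(days=30),   # assumed meaning of "within a short time"
    min_similarity=0.6,           # assumed cutoff for "similar content"
    min_like_ratio=2.0,           # assumed cutoff for "significantly different likes"
):
    """Return (less_liked, more_liked) pairs that satisfy all four conditions."""
    pairs = []
    for a, b in combinations(tweets, 2):
        if a.account != b.account:
            continue  # speaker (and, roughly, audience) must be held fixed
        if abs(a.posted_at - b.posted_at) > max_gap:
            continue  # time must be roughly held fixed
        if text_similarity(a.text, b.text) < min_similarity:
            continue  # content must be similar, only the wording differs
        low, high = sorted((a, b), key=lambda t: t.likes)
        if high.likes < min_like_ratio * max(low.likes, 1):
            continue  # engagement gap is too small to count as a signal
        pairs.append((low, high))
    return pairs


# Made-up example data.
tweets = [
    Tweet("brand_a", datetime(2023, 1, 1), "Our new running shoes are out now", 12),
    Tweet("brand_a", datetime(2023, 1, 3),
          "Our new running shoes are out now, grab your pair today!", 340),
    Tweet("brand_b", datetime(2023, 1, 2), "Our new running shoes are out now", 500),
    Tweet("brand_a", datetime(2023, 6, 1), "Our new running shoes are out now!!", 900),
]

pairs = find_transsuasion_pairs(tweets)
for low, high in pairs:
    print(f"{low.likes:>4} likes: {low.text}")
    print(f"{high.likes:>4} likes: {high.text}")
```

In this toy data only the two `brand_a` tweets from early January form a pair: the `brand_b` tweet fails the same-account check, and the June tweet falls outside the assumed time window. The pairwise loop is quadratic in the number of tweets, so a real pipeline would presumably group by account first.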

Measuring and Improving Persuasiveness of Large Language Models

Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language

Persuasion with Large Language Models: a Survey

Persuasion Games using Large Language Models

The Persuasive Power of Large Language Models

On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial

Can Language Models Recognize Convincing Arguments?

Evidence of a log scaling law for political persuasion with large language models

Large Language Models are as persuasive as humans, but how? About the cognitive effort and moral-emotional language of LLM arguments

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models

The potential of generative AI for personalized persuasion at scale

Large Language Models Can Enhance Persuasion Through Linguistic Feature Alignment

ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models

Working With AI to Persuade: Examining a Large Language Model's Ability to Generate Pro-Vaccination Messages

Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings

Teaching Models to Balance Resisting and Accepting Persuasion

Are You Human? An Adversarial Benchmark to Expose LLMs