Can Language Models Recognize Convincing Arguments?

Paula Rescala, Manoel Horta Ribeiro, Tiancheng Hu, Robert West
2024-10-04
Abstract: The capabilities of large language models (LLMs) have raised concerns about their potential to create and propagate convincing narratives. Here, we study their performance in detecting convincing arguments to gain insights into LLMs' persuasive capabilities without directly engaging in experimentation with humans. We extend a dataset by Durmus and Cardie (2018) with debates, votes, and user traits and propose tasks measuring LLMs' ability to (1) distinguish between strong and weak arguments, (2) predict stances based on beliefs and demographic characteristics, and (3) determine the appeal of an argument to an individual based on their traits. We show that LLMs perform on par with humans in these tasks and that combining predictions from different LLMs yields significant performance gains, surpassing human performance. The data and code released with this paper contribute to the crucial effort of continuously evaluating and monitoring LLMs' capabilities and potential impact. (https://go.epfl.ch/persuasion-llm)
Computation and Language, Computers and Society
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) can detect content that is persuasive to specific groups of people. The authors explore this through three research questions:

1. **RQ1**: Can LLMs judge the quality of arguments and identify convincing arguments?
2. **RQ2**: Can LLMs predict users' positions on specific issues based on their background information (such as demographic characteristics and basic beliefs)?
3. **RQ3**: Can LLMs judge the appeal of an argument to a specific individual based on that user's background information?

To answer these questions, the authors extended a dataset collected by Durmus and Cardie (2018) from a now-defunct debate platform (debate.org). They annotated 833 political debates, each containing arguments from both the pro and con sides as well as the voting results of the participants. The dataset also includes background information about voters, such as gender and age, and their positions on 48 so-called "big issues".

The authors used this extended dataset to evaluate four LLMs (GPT-3.5, GPT-4, Llama 2, Mistral 7B) on three tasks:

1. **Identifying the more persuasive side** (RQ1)
2. **Predicting users' positions on specific issues before the debate** (RQ2)
3. **Predicting users' positions on specific issues after the debate** (RQ3)

The study found that LLMs exhibit near-human performance on all three tasks. For example, when judging which debater is better (RQ1), GPT-4 reaches an accuracy of 60.50%, comparable to the accuracy of a single voter in the dataset (60.69%). When predicting users' positions on specific issues before and after the debate (RQ2 and RQ3), LLM performance is also similar to that of humans.
The authors also found that combining the predictions of different LLMs significantly improves performance, to the point of exceeding human performance. These results help in evaluating and monitoring the capabilities of LLMs and their potential societal impact, especially with respect to personalized misinformation and propaganda.
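
To make the idea of combining predictions concrete, here is a minimal sketch in Python. It assumes a simple majority vote over per-debate labels, which is only one possible way to combine models; this summary does not say which combination method the authors used. The model names, labels, and toy data are hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the earliest-listed label."""
    counts = Counter(labels)
    top = max(counts.values())
    for label in labels:
        if counts[label] == top:
            return label


def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def ensemble_accuracy(per_model_predictions, gold):
    """Combine each model's per-debate prediction by majority vote, then score."""
    combined = [
        majority_vote(list(votes))
        for votes in zip(*per_model_predictions.values())
    ]
    return accuracy(combined, gold)


# Hypothetical toy data: which side ("Pro" or "Con") won each of four debates.
toy_gold = ["Pro", "Con", "Pro", "Con"]
toy_predictions = {
    "model_a": ["Pro", "Con", "Con", "Con"],
    "model_b": ["Pro", "Pro", "Pro", "Con"],
    "model_c": ["Con", "Con", "Pro", "Con"],
}

if __name__ == "__main__":
    for name, preds in toy_predictions.items():
        print(name, accuracy(preds, toy_gold))
    print("ensemble", ensemble_accuracy(toy_predictions, toy_gold))
```

In this toy setup each model errs on a different debate, so the vote corrects every individual mistake. Gains of this kind depend on the models making partly uncorrelated errors, and real improvements would need to be measured on held-out data.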