Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument Ranking

Mohamed Elaraby, Diane Litman, Xiang Lorraine Li, Ahmed Magooda
2024-06-20
Abstract: Generating free-text rationales is among the emergent capabilities of Large Language Models (LLMs). These rationales have been found to enhance LLM performance across various NLP tasks. Recently, there has been growing interest in using these rationales to provide insights for various important downstream tasks. In this paper, we analyze generated free-text rationales in tasks with subjective answers, emphasizing the importance of rationalization in such scenarios. We focus on pairwise argument ranking, a highly subjective task with significant potential for real-world applications, such as debate assistance. We evaluate the persuasiveness of rationales generated by nine LLMs to support their subjective choices. Our findings suggest that open-source LLMs, particularly Llama2-70B-chat, are capable of providing highly persuasive rationalizations, surpassing even GPT models. Additionally, our experiments show that rationale persuasiveness can be improved by controlling its parameters through prompting or through self-refinement.
Computation and Language
What problem does this paper attempt to address?
This paper evaluates how persuasive the free-text justifications generated by large language models (LLMs) are in subjective decision-making tasks. Specifically, the research focuses on pairwise argument ranking, a highly subjective task with clear potential for practical applications such as debate-aid tools. By analyzing the justifications generated by nine different LLMs, the paper explores how convincingly these justifications support the models' subjective choices and raises four research questions:

1. **What are the differences among different LLMs in generating persuasive justifications?**
2. **Can more persuasive justifications be automatically detected?**
3. **Which features of justifications contribute to their persuasiveness?**
4. **Can the persuasiveness of generated justifications be controlled?**

To answer these questions, the researchers carried out the following work:

- **Dataset**: Two datasets, IBM-9k and IBM-30k, were used, from which pairs of arguments were extracted.
- **Model**: A variety of open-source and closed-source LLMs were considered, including Llama2, Vicuna, GPT-3.5-turbo and GPT4.
- **Experimental setup**: Through zero-shot prompting, the models were asked to perform pairwise argument ranking and to provide supporting justifications.
- **Evaluation method**: The generated justifications were evaluated on basic form, content and persuasiveness, using a combination of manual annotation and automatic evaluation by GPT4.

The main findings of the research include:

- **Performance of open-source LLMs**: Llama2-70B-chat in particular performed excellently in generating persuasive justifications, even surpassing GPT4.
- **Evaluation ability of GPT4**: GPT4's judgments of persuasiveness are highly consistent with human evaluation results, although differences remain in some cases.
- **Importance of contrastive justifications**: Contrastive justifications (i.e., explaining why the unselected argument is not valid) are a key factor in increasing persuasiveness.
- **Impact of prompting strategies**: Adding persuasiveness factors to the prompts can further enhance the persuasiveness of the generated justifications.

In conclusion, this paper systematically analyzes and evaluates the justifications generated by different LLMs, reveals how persuasive these justifications are in subjective decision-making tasks and which factors influence that persuasiveness, and thus provides a basis for improving the usability and reliability of LLMs in practical applications.
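The zero-shot setup described above, in which a model picks one of two arguments and justifies its choice, can be illustrated with a minimal Python sketch. The prompt wording, the `Choice:`/`Rationale:` answer format, and the names `build_prompt`, `parse_response`, and `RankingResult` are all hypothetical and are not taken from the paper; the call to an actual LLM is deliberately left out.

```python
import re
from typing import NamedTuple, Optional


class RankingResult(NamedTuple):
    choice: Optional[str]  # "A", "B", or None if the reply could not be parsed
    rationale: str         # free-text justification, empty if missing


def build_prompt(topic: str, arg_a: str, arg_b: str) -> str:
    """Build a zero-shot prompt (no in-context examples).

    The wording here is an assumption for illustration only.
    """
    return (
        f"Topic: {topic}\n"
        f"Argument A: {arg_a}\n"
        f"Argument B: {arg_b}\n\n"
        "Which argument is more convincing? Answer in the format:\n"
        "Choice: <A or B>\n"
        "Rationale: <free-text explanation>"
    )


def parse_response(text: str) -> RankingResult:
    """Extract the chosen argument and the rationale from a model reply."""
    choice_match = re.search(r"Choice:\s*([AB])\b", text)
    rationale_match = re.search(r"Rationale:\s*(.+)", text, flags=re.DOTALL)
    choice = choice_match.group(1) if choice_match else None
    rationale = rationale_match.group(1).strip() if rationale_match else ""
    return RankingResult(choice, rationale)
```

In use, the string returned by `build_prompt` would be sent to whichever model is under study, and the reply passed to `parse_response`; the extracted rationale is the text whose persuasiveness would then be judged by annotators or by an automatic evaluator.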