Abstract: Generating free-text rationales is among the emergent capabilities of Large Language Models (LLMs). These rationales have been found to enhance LLM performance across various NLP tasks. Recently, there has been growing interest in using these rationales to provide insights for various important downstream tasks. In this paper, we analyze generated free-text rationales in tasks with subjective answers, emphasizing the importance of rationalization in such scenarios. We focus on pairwise argument ranking, a highly subjective task with significant potential for real-world applications, such as debate assistance. We evaluate the persuasiveness of rationales generated by nine LLMs to support their subjective choices. Our findings suggest that open-source LLMs, particularly Llama2-70B-chat, are capable of providing highly persuasive rationalizations, surpassing even GPT models. Additionally, our experiments show that rationale persuasiveness can be improved by controlling its parameters through prompting or through self-refinement.

What problem does this paper attempt to address?

This paper evaluates how persuasive the free-text justifications generated by large language models (LLMs) are in subjective decision-making tasks. Specifically, the research focuses on pairwise argument ranking, a highly subjective task with important potential for practical applications such as debate-aid tools. By analyzing the justifications generated by nine different LLMs, the paper explores how convincingly these justifications support the models' subjective choices and raises several research questions:

1. **What are the differences among different LLMs in generating persuasive justifications?**
2. **Can more persuasive justifications be automatically detected?**
3. **Which features of justifications contribute to their persuasiveness?**
4. **Can the persuasiveness of generated justifications be controlled?**

To answer these questions, the researchers carried out the following work:

- **Dataset**: Two datasets, IBM-9k and IBM-30k, were used, from which pairs of arguments were extracted.
- **Model**: A variety of open-source and closed-source LLMs were considered, including Llama2, Vicuna, GPT-3.5-turbo and GPT4.
- **Experimental setup**: Through zero-shot prompting, these models were made to perform pairwise argument ranking and provide supporting justifications.
- **Evaluation method**: The generated justifications were evaluated in terms of basic form, content and persuasiveness by a combination of manual annotation and automatic evaluation by GPT4.

The main findings of the research include:

- **Performance of open-source LLMs**: Llama2-70B-chat in particular performed excellently in generating persuasive justifications, even surpassing GPT4.
- **Evaluation ability of GPT4**: GPT4 is highly consistent with human evaluation results in evaluating the persuasiveness of justifications, although there are still differences in some cases.
- **Importance of contrastive justifications**: Contrastive justifications (i.e., explaining why the unselected argument is not valid) are a key factor in increasing persuasiveness.
- **Impact of prompting strategies**: By adding persuasiveness factors to the prompts, the persuasiveness of the generated justifications can be further enhanced.

In conclusion, this paper aims to systematically analyze and evaluate the justifications generated by different LLMs, reveal the persuasiveness of these justifications in subjective decision-making tasks and their influencing factors, and thus provide a theoretical basis for improving the usability and reliability of LLMs in practical applications.
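The zero-shot setup and the contrastive-justification finding described above can be illustrated with a small prompt builder. This is only a sketch under stated assumptions: the function name `build_ranking_prompt`, the `contrastive` flag, and the exact prompt wording are hypothetical and are not taken from the paper, whose actual prompts are not reproduced in this summary.

```python
def build_ranking_prompt(topic: str, arg_a: str, arg_b: str,
                         contrastive: bool = False) -> str:
    """Build a zero-shot prompt for pairwise argument ranking.

    The wording is illustrative only (an assumption, not the paper's prompt).
    When `contrastive` is True, the prompt additionally asks the model to
    explain why the argument it did not select is weaker.
    """
    lines = [
        f"Topic: {topic}",
        f"Argument A: {arg_a}",
        f"Argument B: {arg_b}",
        "Which argument is more convincing? Answer with 'A' or 'B', "
        "then give a short rationale for your choice.",
    ]
    if contrastive:
        # Hypothetical extra instruction requesting a contrastive rationale.
        lines.append(
            "In your rationale, also explain why the other argument is weaker."
        )
    return "\n".join(lines)


# Example usage with made-up arguments (not drawn from IBM-9k or IBM-30k).
prompt = build_ranking_prompt(
    topic="School uniforms should be mandatory",
    arg_a="Uniforms reduce visible income differences between students.",
    arg_b="Uniforms are good.",
    contrastive=True,
)
print(prompt)
```

The resulting string would be sent to whichever model is under study. Keeping the contrastive instruction behind a flag makes it straightforward to compare rationales generated with and without it.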

Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument Ranking
