Abstract: Despite the remarkable success of large language models (LLMs) such as GPT-3, they still significantly underperform fine-tuned models on text classification. This is due to (1) a lack of reasoning ability for handling complex linguistic phenomena (e.g., intensification, contrast, irony); and (2) the limited number of tokens allowed in in-context learning. In this paper, we introduce \textbf{C}lue \textbf{A}nd \textbf{R}easoning \textbf{P}rompting (CARP). CARP adopts a progressive reasoning strategy tailored to the complex linguistic phenomena involved in text classification: it first prompts LLMs to find superficial clues (e.g., keywords, tones, semantic relations, references), from which a diagnostic reasoning process is induced to reach the final decision. To further address the limited-token issue, CARP uses a model fine-tuned on the supervised dataset for $k$NN demonstration search in in-context learning, allowing the model to take advantage of both the LLM's generalization ability and the task-specific evidence provided by the full labeled dataset. Remarkably, CARP yields new SOTA performance on 4 out of 5 widely used text-classification benchmarks: 97.39 (+1.24) on SST-2, 96.40 (+0.72) on AGNews, 98.78 (+0.25) on R8, and 96.95 (+0.6) on R52, and performance comparable to SOTA on MR (92.39 vs. 93.3). More importantly, we find that CARP delivers impressive abilities in low-resource and domain-adaptation setups. Specifically, using 16 examples per class, CARP achieves performance comparable to supervised models trained with 1,024 examples per class.
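
The two ingredients described in the abstract, $k$NN demonstration search and a clues-then-reasoning prompt, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names (`knn_demonstrations`, `build_prompt`), the toy embedding vectors, the cosine-similarity ranking, and the prompt wording are hypothetical and are not taken from the paper. In the setup the abstract describes, the embeddings would come from a model fine-tuned on the labeled dataset.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical record layout: (text, label, embedding). The embedding is
# assumed to be produced elsewhere (e.g., by a fine-tuned encoder).
Example = Tuple[str, str, Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_demonstrations(query_vec: Sequence[float],
                       pool: List[Example],
                       k: int) -> List[Example]:
    """Return the k pool examples most similar to the query embedding."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex[2]), reverse=True)
    return ranked[:k]


def build_prompt(query_text: str,
                 demos: List[Example],
                 labels: Sequence[str]) -> str:
    """Assemble a prompt that asks for clues, then reasoning, then a label.

    The wording is an illustrative assumption, not the paper's template.
    """
    lines = [
        "Classify the INPUT as one of: " + ", ".join(labels) + ".",
        "First list CLUES (keywords, tone, semantic relations, references),",
        "then give REASONING based on the clues, then the final LABEL.",
        "",
    ]
    for text, label, _ in demos:
        lines += ["INPUT: " + text, "LABEL: " + label, ""]
    lines += ["INPUT: " + query_text, "CLUES:"]
    return "\n".join(lines)
```

A call such as `build_prompt(text, knn_demonstrations(vec, pool, k=2), ["positive", "negative"])` would produce the string sent to the LLM. Selecting demonstrations by similarity rather than at random is what lets a short prompt carry evidence drawn from the full labeled set.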

RoCar: A Relationship Network-based Evaluation Method for Large Language Models

A Survey on Evaluation of Large Language Models

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

Text Classification via Large Language Models

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets

DyVal: Graph-informed Dynamic Evaluation of Large Language Models

Towards Reasoning in Large Language Models via Multi-Agent Peer Review Collaboration

Reasoning in Conversation: Solving Subjective Tasks through Dialogue Simulation for Large Language Models

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

What is the best model? Application-driven Evaluation for Large Language Models

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods

Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path

Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis

Evaluating Large Language Models at Evaluating Instruction Following