Abstract: LLMs have demonstrated remarkable capability for understanding semantics, but they often struggle with understanding pragmatics. To demonstrate this, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks across four pragmatics phenomena, namely Implicature, Presupposition, Reference, and Deixis. We curated high-quality test sets for each task, consisting of Multiple Choice Question Answers (MCQA). PUB includes a total of 28k data points, 6.1k of which were created by us; the rest are adapted from existing datasets. We evaluated nine models varying in the number of parameters and type of training. Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models. However, for larger models, the base versions perform comparably with their chat-adapted counterparts. Additionally, there is a noticeable performance gap between human capabilities and model capabilities. Furthermore, unlike the consistent performance of humans across various tasks, the models demonstrate variability in their proficiency, with performance levels fluctuating due to different hints and the complexities of tasks within the same dataset. Overall, the benchmark aims to provide a comprehensive evaluation of LLMs' ability to handle real-world language tasks that require pragmatic reasoning.

What problem does this paper attempt to address?

The paper addresses the limited ability of large language models (LLMs) to understand pragmatics. Although LLMs perform well on semantic understanding, they have difficulty with pragmatic phenomena. To examine this, the authors constructed a benchmark dataset named PUB (Pragmatics Understanding Benchmark), which evaluates LLM performance on four main pragmatic phenomena: implicature, presupposition, reference, and deixis.

### Main research questions:

1. **To what extent can LLMs understand human intentions in conversations?**
   - A series of tasks (such as direct/indirect classification and implicature recovery) evaluates whether LLMs correctly understand the implicit meanings and intentions in conversations.
2. **Is there a correlation between the scale of the model and its pragmatic ability?**
   - The study compares LLMs of different sizes on pragmatic tasks and examines how model scale affects pragmatic understanding.
3. **Do LLMs optimized for conversation scenarios show stronger pragmatic abilities?**
   - The study compares conversation-optimized LLMs with their base counterparts on pragmatic tasks and measures how much chat adaptation improves pragmatic understanding.
4. **Even on the same dataset, do LLMs show different task sensitivities?**
   - The study examines how model performance fluctuates across task settings, especially when the prompts or the task order change.
5. **How does the pragmatic ability of LLMs compare with that of humans?**
   - Comparing humans and LLMs on the same tasks reveals the gap between the two and the relative strengths and weaknesses of LLMs.

### Characteristics of the PUB dataset:

- **Covers four major pragmatic phenomena**: Implicature, Presupposition, Reference, Deixis.
- **Includes 14 tasks**: Each task is framed as a multiple-choice question (MCQA) to better simulate question-and-answer situations in conversations.
- **Large data volume**: It contains a total of 28,000 data points, of which 6,100 are newly annotated; the rest are adapted from existing datasets.
- **Diverse evaluation**: Model performance is evaluated with multiple methods (such as Cloze Prompting and Multiple Choice Prompting).

### Main contributions:

1. Provides a comprehensive and unified benchmark dataset covering 14 different pragmatic tasks.
2. Systematically evaluates the performance of multiple LLMs on these tasks.
3. Reveals the gap between LLMs and humans in pragmatic understanding through human evaluation.
4. Provides in-depth insights into the pragmatic ability of LLMs, helping researchers improve the interaction ability of LLMs.

In conclusion, this paper systematically evaluates the pragmatic understanding of LLMs by constructing the PUB dataset, reveals the limitations of current LLMs, and provides a reference point for future research.
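The two evaluation styles named in the summary, Multiple Choice Prompting and Cloze Prompting, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than PUB's actual format or code: the `MCQAItem` fields, the prompt layout, the example item, and the `score_fn` callable, which stands in for any function returning a model's log-likelihood of a continuation given a prompt.

```python
import string
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical signature: (prompt, continuation) -> log-likelihood under some model.
ScoreFn = Callable[[str, str], float]


@dataclass
class MCQAItem:
    """One multiple-choice item; field names are illustrative, not PUB's schema."""
    context: str
    question: str
    options: Sequence[str]
    answer_index: int


def build_mcp_prompt(item: MCQAItem) -> str:
    """Lay out context, question, and lettered options in a single prompt."""
    lines = [item.context, f"Question: {item.question}"]
    for label, option in zip(string.ascii_uppercase, item.options):
        lines.append(f"{label}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


def _argmax(scores: Sequence[float]) -> int:
    # Returns the first index holding the highest score.
    return max(range(len(scores)), key=scores.__getitem__)


def predict_mcp(item: MCQAItem, score_fn: ScoreFn) -> int:
    """Multiple Choice Prompting: the model sees all options and only the
    option letter is scored as the continuation."""
    prompt = build_mcp_prompt(item)
    labels = string.ascii_uppercase[: len(item.options)]
    return _argmax([score_fn(prompt, f" {label}") for label in labels])


def predict_cloze(item: MCQAItem, score_fn: ScoreFn) -> int:
    """Cloze Prompting: options are hidden from the prompt and each full
    option text is scored as the continuation."""
    prompt = f"{item.context}\nQuestion: {item.question}\nAnswer:"
    return _argmax([score_fn(prompt, f" {option}") for option in item.options])


def accuracy(items: Sequence[MCQAItem], predict, score_fn: ScoreFn) -> float:
    """Fraction of items where the predicted option index matches the gold index."""
    correct = sum(predict(item, score_fn) == item.answer_index for item in items)
    return correct / len(items)


# An invented implicature-style item, not taken from the PUB dataset.
example = MCQAItem(
    context="A: Are you coming to the party tonight?\nB: I have an exam tomorrow morning.",
    question="What does B most likely mean?",
    options=["B will come to the party.", "B will not come to the party."],
    answer_index=1,
)
```

To use the sketch, pass a `score_fn` backed by a real model and call `accuracy(items, predict_mcp, score_fn)` or `accuracy(items, predict_cloze, score_fn)`. Running both on the same items is one way to see the prompt sensitivity that the fourth research question describes.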

PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities
