Settaluri Lakshmi Sravanthi, Meet Doshi, Tankala Pavan Kalyan, Rudra Murthy, Pushpak Bhattacharyya, Raj Dabre
Abstract: LLMs have demonstrated remarkable capability for understanding semantics, but they often struggle with understanding pragmatics. To demonstrate this fact, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks in four pragmatics phenomena, namely, Implicature, Presupposition, Reference, and Deixis. We curated high-quality test sets for each task, consisting of Multiple Choice Question Answers (MCQA). PUB includes a total of 28k data points, 6.1k of which have been created by us, and the rest are adapted from existing datasets. We evaluated nine models varying in the number of parameters and type of training. Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models. However, for larger models, the base versions perform comparably with their chat-adapted counterparts. Additionally, there is a noticeable performance gap between human capabilities and model capabilities. Furthermore, unlike the consistent performance of humans across various tasks, the models demonstrate variability in their proficiency, with performance levels fluctuating due to different hints and the complexities of tasks within the same dataset. Overall, the benchmark aims to provide a comprehensive evaluation of LLMs' ability to handle real-world language tasks that require pragmatic reasoning.
What problem does this paper attempt to address?
This paper addresses the limited ability of large language models (LLMs) to understand pragmatics. Although LLMs perform well at semantic understanding, they struggle with pragmatic phenomena. To measure this gap, the authors constructed a benchmark dataset named PUB (Pragmatics Understanding Benchmark), which evaluates LLMs on four main pragmatic phenomena: implicature, presupposition, reference, and deixis.
### Main research questions:
1. **To what extent can LLMs understand human intentions in conversations?**
- A series of tasks (such as direct/indirect classification and implicature recovery) tests whether LLMs correctly understand the implicit meanings and intentions in conversations.
2. **Is there a correlation between the scale of the model and its pragmatic ability?**
- The study compares LLMs of different sizes on pragmatic tasks to determine how model scale affects pragmatic understanding.
3. **Do LLMs optimized for conversation scenarios show stronger pragmatic abilities?**
- The study compares chat-optimized LLMs with their base versions on pragmatic tasks to measure how much chat adaptation improves pragmatic understanding.
4. **Do LLMs show different sensitivities across tasks, even within the same dataset?**
- The study examines how model performance fluctuates across task settings, especially when the prompt wording or task order changes.
5. **How does the pragmatic ability of LLMs compare with that of humans?**
- Comparing humans and LLMs on the same tasks reveals the gap between the two and highlights the strengths and weaknesses of LLMs.
### Characteristics of the PUB dataset:
- **Covering four major pragmatic phenomena**: Implicature, Presupposition, Reference, Deixis.
- **Including 14 tasks**: Each task is framed as multiple-choice question answering (MCQA) to better simulate question-and-answer situations in conversations.
- **Rich in data volume**: It contains a total of 28,000 data points, of which 6,100 were newly created by the authors; the rest are adapted from existing datasets.
- **Diverse evaluation**: Model performance is evaluated with multiple prompting methods (such as Cloze Prompting and Multiple Choice Prompting).
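The two prompting styles named above can be illustrated with a minimal sketch. It assumes the usual reading of the terms: cloze prompting scores each answer option independently as a continuation of the question, while multiple choice prompting shows all options with letter labels and scores only the labels. The prompt templates, the function names, and the `toy_score` stand-in for a model log-likelihood are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

LABELS = "ABCDEFGH"
Scorer = Callable[[str, str], float]  # (prompt, continuation) -> score


def build_mcp_prompt(question: str, options: Sequence[str]) -> str:
    """Multiple Choice Prompting: list every option with a letter label."""
    lines = [f"Question: {question}"]
    lines += [f"{LABELS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def pick_cloze(question: str, options: Sequence[str], score: Scorer) -> int:
    """Cloze Prompting: score each option text on its own, return the best index."""
    prompt = f"Question: {question}\nAnswer:"
    scores = [score(prompt, " " + opt) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)


def pick_mcp(question: str, options: Sequence[str], score: Scorer) -> int:
    """Multiple Choice Prompting: score only the letter labels, return the best index."""
    prompt = build_mcp_prompt(question, options)
    scores = [score(prompt, " " + LABELS[i]) for i in range(len(options))]
    return max(range(len(options)), key=scores.__getitem__)


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of items where the predicted option index matches the gold index."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def toy_score(prompt: str, continuation: str) -> float:
    """Hypothetical stand-in for a model log-likelihood: prefers shorter continuations."""
    return -float(len(continuation))


# Illustrative implicature item (invented for this sketch, not from PUB).
question = "A: Are you coming to the party? B: I have an exam tomorrow. What does B mean?"
options = ["B will attend the party", "B will not attend"]

cloze_choice = pick_cloze(question, options, toy_score)
mcp_choice = pick_mcp(question, options, toy_score)
```

In a real evaluation, `toy_score` would be replaced by the log-probability a language model assigns to the continuation. The same item can then yield different answers under the two styles, which is one reason to report results for both.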
### Main contributions:
1. Provide a comprehensive and unified benchmark dataset covering 14 different pragmatic tasks.
2. Systematically evaluate the performance of multiple LLMs on these tasks.
3. Reveal the gap between LLMs and humans in pragmatic understanding through human evaluation.
4. Provide in-depth insights into the pragmatic ability of LLMs, helping researchers improve the interaction ability of LLMs.
In conclusion, this paper systematically evaluates the ability of LLMs in pragmatic understanding by constructing the PUB dataset, reveals the limitations of current LLMs, and provides valuable references for future research.