Abstract: As an indispensable ingredient of intelligence, commonsense reasoning is crucial for large language models (LLMs) in real-world scenarios. In this paper, we propose CORECODE, a dataset that contains abundant commonsense knowledge manually annotated on dyadic dialogues, to evaluate the commonsense reasoning and commonsense conflict detection capabilities of Chinese LLMs. We categorize commonsense knowledge in everyday conversations into three dimensions: entity, event, and social interaction. For easy and consistent annotation, we standardize the form of commonsense knowledge annotation in open-domain dialogues as "domain: slot = value". A total of 9 domains and 37 slots are defined to capture diverse commonsense knowledge. With these pre-defined domains and slots, we collect 76,787 commonsense knowledge annotations from 19,700 dialogues through crowdsourcing. To evaluate and enhance the commonsense reasoning capability of LLMs on the curated dataset, we establish a series of dialogue-level reasoning and detection tasks: commonsense knowledge filling, commonsense knowledge generation, commonsense conflict phrase detection, domain identification, slot identification, and event causal inference. A wide variety of existing open-source Chinese LLMs are evaluated with these tasks on our dataset. Experimental results demonstrate that these models struggle to predict CORECODE's plentiful reasoning content; even ChatGPT achieves only 0.275 and 0.084 accuracy on the domain identification and slot identification tasks, respectively, under the zero-shot setting. We release the data and code of CORECODE at https://github.com/danshi777/CORECODE to promote commonsense reasoning evaluation and the study of LLMs in the context of daily conversations.
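
To make the "domain: slot = value" annotation form concrete, the following is a minimal sketch of how such a string could be split into its three parts. It is not taken from the CORECODE release: the `Annotation` class, the `parse_annotation` function, and the example domain and slot names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One commonsense annotation in 'domain: slot = value' form (hypothetical type)."""
    domain: str
    slot: str
    value: str


def parse_annotation(text: str) -> Annotation:
    """Split a 'domain: slot = value' string into domain, slot, and value."""
    # Everything before the first ':' is treated as the domain.
    domain, sep, rest = text.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in {text!r}")
    # Everything between ':' and the first '=' is the slot; the rest is the value.
    slot, sep, value = rest.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in {text!r}")
    return Annotation(domain.strip(), slot.strip(), value.strip())


# Hypothetical annotation; "event" and "cause" are illustrative names,
# not necessarily among the 9 domains and 37 slots defined by the dataset.
example = parse_annotation("event: cause = it was raining")
print(example.domain, example.slot, example.value, sep=" | ")
```

Splitting on the first `:` and then the first `=` keeps any later `=` or `:` characters inside the value, which seems the safer choice for free-text values; the actual data files may use a different serialization.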

Robust Commonsense Reasoning Against Noisy Labels Using Adaptive Correction

Two Wrongs Don't Make a Right: Combating Confirmation Bias in Learning with Label Noise.

Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?

What Really is Commonsense Knowledge?

CORECODE: A Common Sense Annotated Dialogue Dataset with Benchmark Tasks for Chinese Large Language Models

Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data

The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation

LINKED: Eliciting, Filtering and Integrating Knowledge in Large Language Model for Commonsense Reasoning

CIKQA: Learning Commonsense Inference with a Unified Knowledge-in-the-loop QA Paradigm

Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for Knowledge-intensive Question Answering

A noise audit of human-labeled benchmarks for machine commonsense reasoning

A Graph-Guided Reasoning Approach for Open-ended Commonsense Question Answering

Error-Bounded Correction of Noisy Labels

Evaluating Commonsense in Pre-trained Language Models

Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning

Noisy Positive-Unlabeled Learning with Self-Training for Speculative Knowledge Graph Reasoning

Right for Right Reasons: Large Language Models for Verifiable Commonsense Knowledge Graph Question Answering

Knowledge-aware adaptive graph network for commonsense question answering

Learning Visual Question Answering on Controlled Semantic Noisy Labels

Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

Joint Reasoning for Multi-Faceted Commonsense Knowledge