Meta Semantic Template for Evaluation of Large Language Models

Yachuan Liu, Liang Chen, Jindong Wang, Qiaozhu Mei, Xing Xie

2023-10-19

Abstract: Do large language models (LLMs) genuinely understand the semantics of the language, or just memorize the training data? The recent concern on potential data contamination of LLMs has raised awareness of the community to conduct research on LLMs evaluation. In this paper, we propose MSTemp, an approach that creates meta semantic templates to evaluate the semantic understanding ability of LLMs. The core of MSTemp is not to perform evaluation directly on existing benchmark datasets, but to generate new out-of-distribution (OOD) evaluation sets using existing datasets as seeds. Specifically, for a given sentence, MSTemp leverages another language model to generate new samples while preserving its semantics. The new samples are called semantic templates to the original sentence. Then, MSTemp generates evaluation samples via sentence parsing and random word replacement on the semantic templates. MSTemp is highly flexible, dynamic, and cost-effective. Our initial experiments show that MSTemp-generated samples can significantly reduce the performance of LLMs using existing datasets as seeds. We hope this initial work can shed light on future research of LLMs evaluation.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper asks whether large language models (LLMs) truly understand the semantics of language or merely memorize their training data. Its concern is data contamination: an LLM may score well on a benchmark because it has seen that benchmark's patterns during training, not because it understands the text. To evaluate semantic understanding more reliably, the authors propose MSTemp, which generates new out-of-distribution (OOD) evaluation samples. Rather than evaluating directly on existing benchmark datasets, MSTemp uses them as seeds. For each seed sentence, another language model produces new sentences that preserve its meaning, called semantic templates, and evaluation samples are then produced from those templates through sentence parsing and random word replacement. Because the resulting samples keep the semantics of the original sentences while differing in surface form, a model is less able to succeed through memorization alone. The authors describe the approach as flexible, dynamic, and cost-effective, and their initial experiments show that MSTemp-generated samples significantly reduce LLM performance relative to the seed datasets.
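The template-then-replace step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the templates are hand-written stand-ins for paraphrases that the paper obtains from another language model, the `LEXICON` lookup is a hypothetical substitute for real sentence parsing, and the choice to replace only nouns is an assumption made here so that the example does not alter the sentiment of the sentence.

```python
import random

# Hypothetical replacement pool; replacing only nouns is an assumption of
# this sketch, not something stated in the paper.
REPLACEMENTS = {"NOUN": ["movie", "novel", "concert", "lecture"]}

# Hypothetical toy lexicon standing in for the sentence-parsing step.
# A real pipeline would obtain word classes from a parser.
LEXICON = {"film": "NOUN"}


def make_samples(template, n, seed=0):
    """Generate n variants of a template by randomly replacing tagged words."""
    rng = random.Random(seed)  # seeded so the generated set is reproducible
    samples = []
    for _ in range(n):
        words = []
        for word in template.split():
            tag = LEXICON.get(word.lower())
            # Replace the word if the lexicon tags it, otherwise keep it.
            words.append(rng.choice(REPLACEMENTS[tag]) if tag else word)
        samples.append(" ".join(words))
    return samples


# Hand-written stand-ins for model-generated semantic templates of the
# seed sentence "The film was boring".
templates = ["I found the film boring", "What a boring film"]
evaluation_set = [s for t in templates for s in make_samples(t, n=3)]
```

Each template yields three variants, so `evaluation_set` holds six sentences that share the seed's structure and sentiment but not its exact wording. Changing `seed` gives a different evaluation set from the same seed data, which is one way to read the "dynamic" property the paper claims.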
