Abstract: Large language models (LLMs) have achieved impressive performance in automated software engineering. Extensive efforts have been made to evaluate the abilities of code LLMs in various aspects, and a growing number of benchmarks and evaluation frameworks have been proposed. Apart from the most sought-after capability of code generation, code comprehension is receiving growing attention. Nevertheless, existing work assessing the code comprehension capability of LLMs has several limitations. Evaluation frameworks such as CRUXEval and REval usually focus on code reasoning tasks over a specific input case, so they cover only a limited range of execution traces. As a result, much of the program's semantics goes unexamined, and these frameworks cannot assess how comprehensively an LLM understands the target program. To address these limitations, we propose SpecEval, a novel black-box evaluation framework that evaluates code comprehension in LLMs via program specifications. Inspired by the idea that specifications can comprehensively articulate program behaviors over all possible execution traces, we employ formal specifications to represent program semantics and perform thorough evaluations. In particular, four specification-related tasks are designed to assess the capability of LLMs from basic to advanced levels. Moreover, counterfactual analysis is conducted to study the performance variance of LLMs under semantics-preserving perturbations, and progressive consistency analysis is performed to study the performance consistency of LLMs over a series of tasks with sequential dependence. Systematic experiments are conducted on six state-of-the-art LLMs. The results show below-satisfactory performance on specification-related tasks, revealing the limitations of existing LLMs in articulating program semantics and pointing to directions for future improvement.

SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages

SeaLLMs -- Large Language Models for Southeast Asia

Compass: Large Multilingual Language Model for South-east Asia

BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models

Sailor: Open Language Models for South-East Asia

SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages

How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models

OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications

SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning

Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

A Survey on Evaluation of Large Language Models (Just Accepted)

GameEval: Evaluating LLMs on Conversational Games

LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models