Abstract: Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, which has increased their adoption in software development. Despite this success, significant concerns remain about the actual capabilities and reliability of these models: do they really learn the semantics of code from the training data and leverage that knowledge to perform SE tasks? In this paper, we introduce EMPICA, a comprehensive framework designed to systematically and empirically evaluate the capability of code LLMs to understand code semantics. EMPICA introduces controlled modifications/transformations into the input code and examines the models' responses. For all SE tasks, code LLMs must be robust to semantically equivalent code inputs and sensitive to non-equivalent ones. Specifically, for every SE task, given an input code snippet c, a code LLM is expected to produce consistent/equivalent outputs for c and its semantically equivalent variants, and different outputs for c and its semantically non-equivalent variants. Our experimental results on three representative code understanding tasks (code summarization, method name prediction, and output prediction) reveal that the robustness and sensitivity of state-of-the-art code LLMs to code transformations vary significantly across tasks and transformation operators. In addition, the code LLMs are more robust to semantic-preserving transformations than they are sensitive to semantic non-preserving transformations. These results highlight the need to enhance the models' capability to understand code semantics, especially with respect to the sensitivity property.
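The robustness and sensitivity properties described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from EMPICA itself: the two transformation helpers (`rename_variable`, `flip_operator`), the exact-match scoring in `robustness` and `sensitivity`, and the two stand-in "models". `oracle_output_predictor` simply executes the snippet, so it behaves like an ideal output-prediction model, while `line_count_model` looks only at the surface form of the code.

```python
import re
from typing import Callable, List


def rename_variable(code: str, old: str, new: str) -> str:
    """Semantic-preserving (hypothetical operator): rename an identifier."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def flip_operator(code: str, old: str, new: str) -> str:
    """Semantic non-preserving (hypothetical operator): replace the first
    occurrence of a token, e.g. '+' with '-'."""
    return code.replace(old, new, 1)


def robustness(model: Callable[[str], str], original: str,
               equivalent_variants: List[str]) -> float:
    """Fraction of equivalent variants whose output matches the original's."""
    base = model(original)
    same = sum(model(v) == base for v in equivalent_variants)
    return same / len(equivalent_variants)


def sensitivity(model: Callable[[str], str], original: str,
                non_equivalent_variants: List[str]) -> float:
    """Fraction of non-equivalent variants whose output differs from the original's."""
    base = model(original)
    changed = sum(model(v) != base for v in non_equivalent_variants)
    return changed / len(non_equivalent_variants)


def oracle_output_predictor(code: str) -> str:
    """Stand-in for an ideal model: runs the snippet and reports `result`."""
    scope: dict = {}
    exec(code, scope)  # acceptable here only because the snippet is our own
    return repr(scope["result"])


def line_count_model(code: str) -> str:
    """Stand-in for a model that only sees surface form, not semantics."""
    return str(len(code.splitlines()))


SNIPPET = (
    "total = 0\n"
    "for i in range(5):\n"
    "    total = total + i\n"
    "result = total\n"
)

equivalent = [
    rename_variable(SNIPPET, "total", "acc"),
    rename_variable(SNIPPET, "i", "k"),
]
non_equivalent = [
    flip_operator(SNIPPET, "+", "-"),
    flip_operator(SNIPPET, "range(5)", "range(4)"),
]

for name, model in [("oracle", oracle_output_predictor),
                    ("line-count", line_count_model)]:
    print(name,
          "robustness:", robustness(model, SNIPPET, equivalent),
          "sensitivity:", sensitivity(model, SNIPPET, non_equivalent))
```

In this toy setup the oracle scores 1.0 on both properties, whereas the line-count model is fully robust but not sensitive at all, which mirrors the kind of gap the abstract reports. A real evaluation would replace the stand-ins with calls to a code LLM and would likely need a task-specific notion of output equivalence (for example, for summaries) rather than exact string equality.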

Quantifying Semantic Emergence in Language Models

Improving Uncertainty Quantification in Large Language Models via Semantic Embeddings

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training

A Quantum Expectation Value Based Language Model with Application to Question Answering

Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics

Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Emergent Representations of Program Semantics in Language Models Trained on Programs

Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis

Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space

Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities

Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models

Language Models As Semantic Indexers

An Empirical Study on Capability of Large Language Models in Understanding Code Semantics

Beyond the Veil of Similarity: Quantifying Semantic Continuity in Explainable AI

Large Language Models Are In-Context Semantic Reasoners Rather Than Symbolic Reasoners

Semantic Sensitivities and Inconsistent Predictions: Measuring the Fragility of NLI Models

Meta Semantic Template for Evaluation of Large Language Models

The Information of Large Language Model Geometry

Quantitative knowledge retrieval from large language models

Voluminous yet Vacuous? Semantic Capital in an Age of Large Language Models

Uncertainty Quantification for In-Context Learning of Large Language Models