Abstract: Over the past decade, Artificial Intelligence (AI) has achieved great success and is now used across a wide range of academic and industrial fields. More recently, rapid advances in Large Language Models (LLMs) have propelled AI to a new level, bringing intelligence to even more diverse applications and industrial domains, particularly software engineering and natural language processing. Nevertheless, a number of trustworthiness concerns exhibited by LLMs, e.g., robustness and hallucination, have received much attention; unless they are properly addressed, the widespread adoption of LLMs could be greatly hindered in practice. The distinctive characteristics of LLMs, such as the self-attention mechanism, extremely large neural network scale, and autoregressive generation contexts, differ from those of classic AI software based on Convolutional Neural Networks and Recurrent Neural Networks and present new challenges for quality analysis. To date, universal and systematic analysis techniques for LLMs are still lacking, despite urgent industrial demand across diverse domains. Towards bridging this gap, we initiate an early exploratory study and propose a universal analysis framework for LLMs, named LUNA, which is designed to be general and extensible and enables versatile analysis of LLMs from multiple quality perspectives in a human-interpretable manner. In particular, we first leverage data from the desired trustworthiness perspectives to construct an abstract model as an auxiliary analysis asset and proxy, supported by the various abstract model construction methods built into LUNA. To assess the quality of the abstract model, we collect and define a number of evaluation metrics at both the abstract model level and the semantics level. Then, the semantics, i.e., the degree to which the LLM satisfies the trustworthiness perspective, is bound to the abstract model, enriching it and enabling more detailed analysis applications for diverse purposes, e.g., abnormal behavior detection. To better understand the potential usefulness of LUNA, we conduct a large-scale evaluation, the results of which demonstrate that 1) the abstract model has the potential to distinguish normal from abnormal behavior in LLMs, 2) LUNA is effective for the real-world analysis of LLMs, and its hyperparameter settings influence performance, and 3) different evaluation metrics correlate differently with the analysis performance. To encourage further studies in the quality assurance of LLMs, we make all of the code and more detailed experimental results available on the supplementary website of this paper: https://sites.google.com/view/llm-luna.
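To make the pipeline described in the abstract more concrete, the following is a minimal illustrative sketch, not LUNA's actual implementation. The abstract does not name the specific construction methods, so this sketch assumes a simple grid discretization of per-token hidden-state vectors into abstract states, a first-order transition model over those states, and semantics bound as the mean per-state score. All function names, the `cell` size, and the `threshold` are hypothetical.

```python
import math
from collections import defaultdict

def abstract_state(vec, cell=0.5):
    """Map a concrete vector to a grid cell (assumed abstraction, not LUNA's)."""
    return tuple(math.floor(x / cell) for x in vec)

def build_abstract_model(traces, scores, cell=0.5):
    """Build transition probabilities and per-state semantics.

    traces: list of traces, each a list of numeric vectors (one per step).
    scores: per-step trustworthiness scores in [0, 1], aligned with traces.
    """
    counts = defaultdict(lambda: defaultdict(int))
    sem_sum = defaultdict(float)
    sem_n = defaultdict(int)
    for trace, trace_scores in zip(traces, scores):
        states = [abstract_state(v, cell) for v in trace]
        # Bind semantics: accumulate the score observed in each abstract state.
        for s, sc in zip(states, trace_scores):
            sem_sum[s] += sc
            sem_n[s] += 1
        # Count transitions between consecutive abstract states.
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    transitions = {
        a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
        for a, nxt in counts.items()
    }
    semantics = {s: sem_sum[s] / sem_n[s] for s in sem_sum}
    return transitions, semantics

def trace_semantics(trace, semantics, cell=0.5, default=0.0):
    """Average bound semantics along a trace; unseen states get `default`."""
    states = [abstract_state(v, cell) for v in trace]
    return sum(semantics.get(s, default) for s in states) / len(states)

def is_abnormal(trace, semantics, threshold=0.5, cell=0.5):
    """Flag a trace whose average semantics falls below a hypothetical threshold."""
    return trace_semantics(trace, semantics, cell) < threshold

# Toy data: two short traces of 2-D "hidden states" with per-step scores.
traces = [[[0.1, 0.1], [0.6, 0.1]], [[0.1, 0.2], [0.1, 0.7]]]
scores = [[1.0, 1.0], [1.0, 0.0]]
transitions, semantics = build_abstract_model(traces, scores)
```

In this toy setup, a new trace that passes only through states with high bound semantics would be treated as normal, while one that dwells in low-scoring states would be flagged. A real analysis would need a principled abstraction and threshold, which the abstract leaves to the paper body.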

LUNA: A Model-Based Universal Analysis Framework for Large Language Models
