Abstract: The rapid development of large language models (LLMs) has shown promising practical results. However, their low interpretability often leads to errors in unforeseen circumstances, limiting their utility. Many works have focused on creating comprehensive evaluation systems, but previous benchmarks have primarily assessed problem-solving abilities while neglecting the uncertainty of responses, which may result in unreliability. Recent methods for measuring LLM reliability are resource-intensive and unable to test black-box models. To address this, we propose UBENCH, a comprehensive benchmark for evaluating LLM reliability. UBENCH includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities. Experimental results show that UBENCH achieves state-of-the-art performance, while its single-sampling method significantly saves computational resources compared to baseline methods that require multiple samplings. Additionally, based on UBENCH, we evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding, closely followed by GPT-4. We also explore the impact of Chain-of-Thought prompts, role-playing prompts, option order, and temperature on LLM reliability, analyzing their varying effects on different LLMs.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses how to assess the reliability of large language models (LLMs) under uncertainty. Although LLMs perform well on many tasks, their opaque internal mechanisms make them hard to interpret and prone to errors such as hallucinations, biases, and misinformation. Beyond evaluating a model's accuracy, it is therefore also necessary to assess the confidence of its outputs in order to decide whether to trust the information or suggestions it provides.

### Main Contributions

1. **Proposing UBENCH**: A new systematic, automated uncertainty evaluation benchmark containing 3,978 multiple-choice questions in four categories: knowledge, language, understanding, and reasoning.
2. **Performance Comparison**: Compared to existing uncertainty evaluation methods, UBENCH achieves the best performance on multiple metrics and requires only a single sampling, significantly reducing computational cost.
3. **Model Testing**: Using UBENCH, 15 popular LLMs were tested. GLM4 performed best, followed by GPT-4 and Llama3. Open-source and closed-source models are generally comparable in reliability, and reliability tends to improve with model upgrades.
4. **Influence Factor Analysis**: The paper examines how chain-of-thought (CoT) prompts, role-playing prompts, option order, and the temperature parameter affect LLM reliability, with detailed analysis and explanations.

### Background and Motivation

- **Limitations of Existing Benchmarks**: Previous benchmarks focused mainly on model accuracy and neglected the confidence of outputs, which could lead to misunderstandings or even harm.
- **Importance of Uncertainty Estimation**: Uncertainty estimation is an effective risk assessment method: it reflects how well a model is calibrated and provides a basis for judging the reliability of its responses.
- **Challenges of Traditional Methods**: Traditional uncertainty estimation methods need training data and intermediate outputs, which are difficult to obtain, especially for closed-source models. New methods and benchmarks suited to LLMs are therefore needed.

### Methods and Experiments

- **Data Construction**: Questions are randomly sampled from multiple public datasets, then processed and quality-controlled to form positive and negative samples.
- **Prompt Design**: Prompts are zero-shot and include a role-playing prompt, a task statement, and step-by-step problem decomposition.
- **Evaluation Metrics**: Model reliability is measured with expected calibration error (ECE), average calibration error (ACE), maximum calibration error (MCE), and thresholded average calibration error (TACE).
- **Experimental Setup**: 15 popular LLMs were tested, including GPT-4, GLM4, and Llama3.

### Experimental Results

- **Overall Performance**: GLM4 performed the best on all metrics, followed by GPT-4 and Llama3. Earlier LLMs such as Baichuan2 and Llama2 were less reliable than later models such as GLM4 and Llama3.
- **Influence Factors**: CoT can reduce ECE but increase MCE; role-playing prompts have inconsistent effects across models; changes in option order have a significant impact on GPT-4; and the temperature parameter also affects model reliability.

### Conclusion

By proposing UBENCH, this paper provides a comprehensive, accurate, and efficient uncertainty evaluation benchmark that helps researchers better understand and improve the reliability of LLMs. The findings are relevant both to LLM researchers and to the improvement of evaluation systems.
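
The calibration metrics listed under Evaluation Metrics compare a model's stated confidence with how often it is actually correct. The following is a minimal sketch of ECE, ACE, and MCE, assuming equal-width confidence bins and commonly used definitions. The exact binning used by UBENCH is not given here, and TACE is left out because its threshold is not specified. The names `calibration_errors` and `n_bins` are hypothetical, chosen only for illustration.

```python
import numpy as np


def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, ACE, MCE) using equal-width confidence bins.

    Assumed definitions (not taken from the UBENCH paper):
      ECE: bin gaps weighted by the fraction of samples in each bin
      ACE: unweighted mean of the gaps over non-empty bins
      MCE: largest gap over non-empty bins
    where a bin's gap is |accuracy - mean confidence|.
    """
    conf = np.asarray(confidences, dtype=float)  # values in [0, 1]
    hit = np.asarray(correct, dtype=float)       # 1 if correct, else 0

    # Map each confidence to a bin index; a confidence of 1.0 goes in the last bin.
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)

    gaps, weights = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gaps.append(abs(hit[mask].mean() - conf[mask].mean()))
        weights.append(mask.mean())  # fraction of all samples in this bin

    gaps = np.array(gaps)
    weights = np.array(weights)
    ece = float(np.sum(weights * gaps))
    ace = float(gaps.mean())
    mce = float(gaps.max())
    return ece, ace, mce
```

For example, with confidences 0.95, 0.95, 0.95, and 0.65, where only the first three answers are correct, the two occupied bins have gaps of 0.05 and 0.65. This gives an ECE of about 0.2, an ACE of about 0.35, and an MCE of about 0.65. The large MCE despite a moderate ECE shows why the metrics are reported together: a single badly calibrated bin can be hidden by a weighted average.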

UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions

SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

Benchmarking LLMs via Uncertainty Quantification

Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

SafetyBench: Evaluating the Safety of Large Language Models

Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Benchmarking Large Language Model Uncertainty for Prompt Optimization

Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models

TUBench: Benchmarking Large Vision-Language Models on Trustworthiness with Unanswerable Questions

Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models

Uncertainty Estimation of Large Language Models in Medical Question Answering

Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models

NLPBench: Evaluating Large Language Models on Solving NLP Problems

Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

A User-Centric Benchmark for Evaluating Large Language Models

Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models