Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications. However, concerns have arisen regarding the trustworthiness of LLM outputs, particularly in closed-book question-answering tasks, where non-experts may struggle to identify inaccuracies due to the absence of contextual or ground truth information. This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLM's response aligns with its intrinsic knowledge. Additionally, TrustScore can seamlessly integrate with fact-checking methods, which assess alignment with external knowledge sources. The experimental results show that TrustScore achieves strong correlations with human judgments, surpassing existing reference-free metrics and achieving results on par with reference-based metrics.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the problem of evaluating the trustworthiness of large language model (LLM) outputs in closed-book question-answering tasks. Although LLMs perform well across many tasks, in closed-book question answering non-expert users find it difficult to identify inaccurate answers because no context or ground-truth information is available. This raises concerns about the trustworthiness of LLM outputs.

### Background and Challenges

1. **Capabilities and Applications of LLMs**:
   - LLMs have shown strong performance in natural language processing (NLP) tasks, driving their widespread use in practical applications.
   - However, these models sometimes generate responses that seem reasonable but are actually incorrect, a problem that is particularly prominent in closed-book question-answering tasks.
2. **Challenges of Closed-Book Question-Answering Tasks**:
   - In closed-book question-answering tasks, LLMs rely solely on their parametric knowledge to generate answers, without the support of external context or ground-truth information.
   - This makes it very difficult to evaluate the trustworthiness of LLM outputs, especially for non-expert users.

### Solution

To address these challenges, the paper introduces the **TrustScore** framework, which is based on the concept of **behavioral consistency** and evaluates whether an LLM's response is consistent with its internal knowledge. Additionally, TrustScore can integrate fact-checking methods to assess the consistency of responses with external knowledge sources.

### Main Contributions

1. **Behavioral Consistency Evaluation**:
   - Multiple-choice tests check whether the LLM keeps selecting its own response when that response is presented alongside distractor options.
   - If the LLM consistently chooses the same answer across multiple tests, its response is considered consistent with its internal knowledge, which increases its trustworthiness.
2. **Fact-Checking Integration**:
   - When external knowledge bases are available, TrustScore can be combined with fact-checking modules to further verify the accuracy of responses.
   - This dual approach evaluates LLM responses on both internal consistency and external factual consistency.
3. **Experimental Results**:
   - Experimental results show that TrustScore correlates strongly with human judgment, surpassing existing reference-free metrics and performing close to reference-based metrics.

### Conclusion

TrustScore provides a novel reference-free evaluation framework for assessing the trustworthiness of LLM responses in closed-book question-answering tasks. It can evaluate behavioral consistency on its own, and it can also be integrated with fact-checking methods to provide a more comprehensive evaluation.
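As an illustration of the behavioral consistency check listed under Main Contributions, the following Python sketch repeatedly presents a model's own response among distractors and measures how often the model selects it again. This is a minimal sketch under stated assumptions, not the paper's implementation: the function `behavioral_consistency`, the `choose` callable (a stand-in for an LLM answering a multiple-choice question), the toy models, and the scoring rule (fraction of rounds in which the original response is re-selected) are all hypothetical.

```python
import random
from typing import Callable, Sequence


def behavioral_consistency(
    question: str,
    response: str,
    distractors: Sequence[str],
    choose: Callable[[str, Sequence[str]], str],
    rounds: int = 5,
    options_per_round: int = 4,
    seed: int = 0,
) -> float:
    """Return the fraction of rounds in which `choose` re-selects `response`.

    Assumed scoring rule for illustration only; `choose` is a hypothetical
    stand-in for an LLM that picks one option for a multiple-choice question.
    """
    if rounds <= 0:
        raise ValueError("rounds must be positive")
    if len(distractors) < options_per_round - 1:
        raise ValueError("not enough distractors for the requested option count")

    rng = random.Random(seed)
    hits = 0
    for _ in range(rounds):
        # Draw a fresh set of distractors and shuffle the option order so that
        # a model relying on option position is not rewarded.
        options = rng.sample(list(distractors), options_per_round - 1)
        options.append(response)
        rng.shuffle(options)
        if choose(question, options) == response:
            hits += 1
    return hits / rounds


def steadfast_model(question: str, options: Sequence[str]) -> str:
    """Toy model that always picks 'Canberra' when it is offered."""
    return "Canberra" if "Canberra" in options else options[0]


def never_own_model(question: str, options: Sequence[str]) -> str:
    """Toy model that always picks the first option other than 'Canberra'."""
    return next(o for o in options if o != "Canberra")


def position_biased_model(question: str, options: Sequence[str]) -> str:
    """Toy model that ignores content and always picks the first option."""
    return options[0]


if __name__ == "__main__":
    q = "What is the capital of Australia?"
    wrong = ["Sydney", "Melbourne", "Perth", "Brisbane"]
    for model in (steadfast_model, never_own_model, position_biased_model):
        score = behavioral_consistency(q, "Canberra", wrong, model)
        print(model.__name__, score)
```

Shuffling the options each round is a design choice in this sketch: it separates a model that genuinely prefers its own answer from one that merely favors a fixed position. When an external knowledge base is available, a score like this could be combined with a fact-checking signal, as the summary describes, but the combination rule is not specified here.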

TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness

Position: TrustLLM: Trustworthiness in Large Language Models

TrustLLM: Trustworthiness in Large Language Models

Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs

XTRUST: On the Multilingual Trustworthiness of Large Language Models

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Tell me the truth: A system to measure the trustworthiness of Large Language Models

Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

When to Trust LLMs: Aligning Confidence with Response Quality

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control

On Verbalized Confidence Scores for LLMs

More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

How Reliable are LLMs as Knowledge Bases? Re-thinking Facutality and Consistency