Abstract: The recent popularity of large language models (LLMs) has had a significant impact on many fields, particularly through their open-ended ecosystem of APIs, open-source models, and plugins. However, despite their widespread deployment, little research thoroughly discusses and analyzes the risks they conceal. We therefore conduct a preliminary but pioneering study covering the robustness, consistency, and credibility of LLM systems. Because most of the related literature in the LLM era is uncharted, we propose an automated workflow that copes with a large number of queries and responses. Overall, we conduct over a million queries to mainstream LLMs including ChatGPT, LLaMA, and OPT. The core of our workflow consists of a data primitive, followed by an automated interpreter that evaluates these LLMs under different adversarial metrical systems. As a result, we draw several, perhaps unfortunate, conclusions that are quite uncommon in this trendy community. Briefly, they are: (i) minor but inevitable errors in user-generated query input may, by chance, cause the LLM to respond unexpectedly; (ii) LLMs show poor consistency when processing semantically similar query inputs. In addition, as a side finding, we find that ChatGPT is still capable of yielding the correct answer even when the input is polluted at an extreme level. While this phenomenon demonstrates the powerful memorization of LLMs, it raises serious concerns about using such data for LLM-involved evaluation in academic development. To deal with it, we propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation. Extensive empirical studies are provided to support the aforementioned claims.
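To make findings (i) and (ii) more concrete, the following is a minimal sketch of what a query-perturbation and answer-consistency check might look like. It is not the paper's workflow: the adjacent-letter swap used as a typo model, the majority-agreement consistency score, and the names `perturb` and `consistency` are all assumptions made for illustration.

```python
import random
from collections import Counter


def perturb(query: str, rate: float, rng: random.Random) -> str:
    """Swap adjacent letters with probability `rate` to mimic small typos.

    Hypothetical noise model, chosen only for illustration.
    """
    chars = list(query)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def consistency(answers: list[str]) -> float:
    """Fraction of answers that match the most common (normalized) answer.

    Assumed metric: 1.0 means every answer agrees, lower means disagreement.
    """
    if not answers:
        return 0.0
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)
```

In use, one would send the original query plus several perturbed or paraphrased variants to a model, collect the answers, and pass them to `consistency`; a score well below 1.0 would suggest the kind of instability the abstract describes.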

How Reliable are LLMs as Knowledge Bases? Re-thinking Factuality and Consistency

Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

Assessing the Reliability of Large Language Model Knowledge

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs?

Statistical Knowledge Assessment for Large Language Models

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Can Language Models Act as Knowledge Bases at Scale?

Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge

A Survey on LLM-as-a-Judge

Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution

Evaluating Language Models for Knowledge Base Completion

The Factuality of Large Language Models in the Legal Domain

Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators

Multi-Model Consistency for LLMs’ Evaluation

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge

Knowledge-based Consistency Testing of Large Language Models

How Proficient Are Large Language Models in Formal Languages? An In-Depth Insight for Knowledge Base Question Answering

Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer