Abstract: The recent popularity of large language models (LLMs) has had a significant impact on many fields, particularly through their open ecosystem of APIs, open-source models, and plugins. Despite this widespread deployment, little research has thoroughly discussed and analyzed the risks these systems conceal. We therefore conduct a preliminary but pioneering study of the robustness, consistency, and credibility of LLM systems. Because most of the related literature in the LLM era remains uncharted, we propose an automated workflow that handles a large number of queries and responses. Overall, we issue over a million queries to mainstream LLMs including ChatGPT, LLaMA, and OPT. The core of our workflow is a data primitive, followed by an automated interpreter that evaluates these LLMs under different adversarial metrical systems. As a result, we draw several, perhaps unfortunate, conclusions that are quite uncommon in this trendy community. Briefly, they are: (i) minor but inevitable errors in user-generated query input may, by chance, cause the LLM to respond unexpectedly; (ii) LLMs show poor consistency when processing semantically similar query inputs. In addition, as a side finding, we find that ChatGPT can still yield the correct answer even when the input is polluted to an extreme level. While this phenomenon demonstrates the powerful memorization of LLMs, it raises serious concerns about using such data for LLM-involved evaluation in academic development. To address this, we propose a novel index associated with a dataset that roughly indicates the feasibility of using such data for LLM-involved evaluation. Extensive empirical studies are provided to support the aforementioned claims.
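To make findings (i) and (ii) concrete, the following is a minimal Python sketch of the two kinds of measurement the abstract alludes to: injecting small errors into a query, and scoring how consistently a model answers a set of semantically similar queries. It is an illustration under stated assumptions, not the paper's workflow: the function names `perturb` and `consistency`, the adjacent-letter swap as the error model, and majority agreement as the consistency score are all hypothetical choices made here.

```python
import random
from collections import Counter


def perturb(query: str, rate: float, seed: int = 0) -> str:
    """Swap adjacent letters with probability `rate` per position.

    Hypothetical typo model: the abstract does not specify how
    query errors are generated.
    """
    rng = random.Random(seed)
    chars = list(query)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


def consistency(answers: list[str]) -> float:
    """Fraction of answers that agree with the most common answer.

    Assumed metric: answers are compared after trimming whitespace
    and lowercasing; 1.0 means every response agreed.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)
```

In use, one would send `perturb(q, rate)` for increasing values of `rate` to the model under test and compare the responses with the response to the clean query, and pass the responses to several paraphrases of the same question to `consistency`. The model call itself is deliberately left out, since it depends on the API or checkpoint being evaluated.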

Asking Again and Again: Exploring LLM Robustness to Repeated Questions

Leveraging Large Language Models for Multiple Choice Question Answering

Ask Again, Then Fail: Large Language Models' Vacillations in Judgment

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions

Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models

Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

I Could've Asked That: Reformulating Unanswerable Questions

Noisy Exemplars Make Large Language Models More Robust: A Domain-Agnostic Behavioral Analysis

Don't Trust ChatGPT when Your Question is not in English: A Study of Multilingual Abilities and Types of LLMs

Intuitive or Dependent? Investigating LLMs' Behavior Style to Conflicting Prompts

PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

A Study on Large Language Models' Limitations in Multiple-Choice Question Answering

Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves

SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts

Set-Based Prompting: Provably Solving the Language Model Order Dependency Problem

Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options

On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method

Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

On the Robustness of Editing Large Language Models