Abstract: One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trustworthy LLMs.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper focuses on the consistency and reliability of large language models (LLMs) when dealing with entity type ambiguity. Specifically, it addresses the following key issues:

1. **Resolving Entity Type Ambiguity**:
   - **Research Question 1 (RQ1)**: How well can large language models implicitly resolve entity type ambiguity in a given prompt context?
   - The researchers designed a series of experiments to evaluate whether the models can identify and apply the correct reading of an ambiguous entity.
2. **Model Preference for "Preferred Interpretations"**:
   - **Research Question 2 (RQ2)**: To what extent does the model's ability to infer the correct entity type depend on "preferred interpretations"?
   - The study found that models exhibit a significant preference when dealing with ambiguous entities, tending to choose more common interpretations (e.g., company names), which leads to inconsistencies and errors in some cases.
3. **Model's Self-Verification Ability**:
   - **Research Question 3 (RQ3)**: Can the model self-verify its answers after successfully resolving ambiguity?
   - Further experiments revealed that even when the models correctly identify ambiguous entities, they still perform poorly when subsequently verifying those answers, showing a lack of internal knowledge consistency.

### Main Findings

1. **Model Performance in Handling Entity Type Ambiguity**:
   - Although all models demonstrated knowledge of the different entity readings, their accuracy in choosing the correct reading in practice was only 85.3%. Even for clearly prompted, non-ambiguous questions, the models' accuracy reached only 90.5%.
2. **Model Preference for "Preferred Interpretations"**:
   - Models showed a clear bias towards more common interpretations when dealing with ambiguous entities, such as interpreting "Boeing" as a company rather than an individual. This preference led to inconsistencies and errors in handling certain entities.
3. **Model's Self-Verification Ability Deficiency**:
   - Even when the models correctly identified ambiguous entities, their performance in subsequently verifying those answers was still poor. For example, when asked, "Is December 5, 1901, Disney's birth date?" the model might answer "No," despite having previously provided the correct information.

### Conclusion

The paper points out that current large language models perform poorly in handling entity type ambiguity and exhibit significant preference biases. Additionally, the models show notable deficiencies in self-verifying their answers. These findings highlight the importance of improving model self-consistency to enhance their reliability and credibility in practical applications.
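To make the distinction between applying knowledge (RQ1) and self-verifying it (RQ3) concrete, the following is a minimal Python sketch under stated assumptions. It is not the paper's protocol, prompts, or scoring. The `Probe` class, the `evaluate` function, the substring-based correctness check, and the `toy_model` stub with its canned replies are all hypothetical and exist only for illustration; a real study would call an actual LLM and use a more careful answer matcher.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Probe:
    """One ambiguous-entity test case (hypothetical structure, not from the paper)."""
    entity: str         # ambiguous surface form, e.g. "Disney"
    intended_type: str  # reading the question is meant to trigger
    question: str       # question that presupposes the intended reading
    gold_answer: str    # string expected in a correct answer
    verification: str   # yes/no follow-up about the gold answer


def evaluate(model: Callable[[str], str], probes: List[Probe]) -> Dict[str, float]:
    """Score how often the model applies the intended reading, and how often
    it then confirms its own correct answer in a yes/no follow-up."""
    correct = 0
    consistent = 0
    for probe in probes:
        answer = model(probe.question)
        # Assumption: a case-insensitive substring match counts as correct.
        if probe.gold_answer.lower() in answer.lower():
            correct += 1
            verdict = model(probe.verification).strip().lower()
            if verdict.startswith("yes"):
                consistent += 1
    return {
        "accuracy": correct / len(probes),
        # Self-consistency is measured only over correctly answered probes.
        "self_consistency": consistent / correct if correct else 0.0,
    }


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM; the replies are made up for illustration."""
    canned = {
        "When was Disney born?": "Disney was born on December 5, 1901.",
        "Is December 5, 1901, Disney's birth date?": "No.",
        # Falls back to the company reading instead of the person.
        "When was Boeing born?": "Boeing was founded in 1916.",
    }
    return canned.get(prompt, "I don't know.")


PROBES = [
    Probe("Disney", "person", "When was Disney born?", "December 5, 1901",
          "Is December 5, 1901, Disney's birth date?"),
    Probe("Boeing", "person", "When was Boeing born?", "1881",
          "Is 1881 Boeing's birth year?"),
]

if __name__ == "__main__":
    print(evaluate(toy_model, PROBES))
```

With this stub, the "Boeing" probe fails because the reply follows the company reading, and the "Disney" probe is answered correctly but then denied in the follow-up, which mirrors the two failure modes described in the findings. Reporting the two scores separately is a design choice that keeps a wrong reading from being confused with a failure to confirm a correct one.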

To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity

Self-Consistency of Large Language Models under Ambiguity

Do Large Language Models Know What They Don't Know?

CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?

We're Afraid Language Models Aren't Modeling Ambiguity

Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

Don't Just Say "I don't know"! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations

Knowledge-based Consistency Testing of Large Language Models

Statistical Knowledge Assessment for Large Language Models

Evaluating Consistency and Reasoning Capabilities of Large Language Models

Distinguishing the Knowable from the Unknowable with Language Models

Resolving Knowledge Conflicts in Large Language Models

Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors

"I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust

Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty

Aligning Language Models to Explicitly Handle Ambiguity

Semantic Consistency for Assuring Reliability of Large Language Models

Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence?

Enhancing Knowledge Graph Consistency through Open Large Language Models: A Case Study