Abstract: With the rapid development of NLP, large-scale language models (LLMs) now excel in various tasks across multiple domains. However, existing benchmarks may not adequately measure these models' capabilities, especially when faced with new knowledge. In this paper, we address the lack of benchmarks to evaluate LLMs' ability to handle new knowledge, an important and challenging aspect in the rapidly evolving world. We propose an approach called KnowGen that generates new knowledge by altering existing entity attributes and relationships, resulting in artificial entities that are distinct from real-world entities. With KnowGen, we introduce a benchmark named ALCUNA to assess LLMs' abilities in knowledge understanding, differentiation, and association. We benchmark several LLMs and find that their performance in the face of new knowledge is not satisfactory, particularly in reasoning between new and internal knowledge. We also explore the impact of entity similarity on the model's understanding of entity knowledge and the influence of contextual entities. We urge caution when using LLMs in new scenarios or with new knowledge, and hope that our benchmark can help drive the development of LLMs in the face of new knowledge.

What problem does this paper attempt to address?

This paper addresses how to evaluate large-scale language models (LLMs) when they face new knowledge. Existing benchmarks may not fully measure this ability, even though models in a rapidly changing world frequently encounter knowledge they were not trained on. The authors point out that many benchmarks evaluate LLMs on various tasks, but none specifically targets the handling of new knowledge. The paper therefore proposes KnowGen, a method for generating new knowledge, and uses it to construct a benchmark named ALCUNA that evaluates LLMs' abilities in understanding, differentiating, and associating new knowledge.

### Main Contributions

1. **Proposing a New Method**: The paper proposes KnowGen, a method for generating new knowledge. It creates artificial entities by changing the attributes and relationships of existing entities, so that the resulting entities differ from those in the real world.
2. **Constructing a New Benchmark**: The authors apply KnowGen to generate ALCUNA, a dataset in the biological domain that serves as a benchmark for evaluating models in the face of new knowledge.
3. **Evaluating Multiple Models**: The authors use ALCUNA to evaluate and analyze several popular large language models, including ChatGPT, Alpaca, Vicuna, and ChatGLM. The results show that these models perform poorly when handling new knowledge, especially when reasoning about the relationship between new knowledge and internal knowledge.

### Research Background

- **Advances in Large-scale Language Models**: In recent years, large-scale language models have made significant progress in natural language processing (NLP) and perform well on a variety of tasks.
- **Limitations of Existing Benchmarks**: Existing benchmarks may not be sufficient to measure how these models handle new knowledge, because they are usually built from known data and tasks.
- **Importance of New Knowledge**: In a rapidly changing world, models often need to handle new knowledge, and retraining them to adapt to it is very expensive and impractical.

### Method

- **KnowGen Method**: Generates new entities by changing the attributes and relationships of existing entities. The new entities differ from those in the real world but have reasonable attributes and relationships.
- **ALCUNA Benchmark**: Built from the generated entities, the dataset contains 3,554 new entities and 84,351 questions. The questions fall into three categories: knowledge understanding (KU), knowledge differentiation (KD), and knowledge association (KA).

### Experimental Results

- **Overall Performance**: ChatGPT performs best in all settings, followed by Vicuna. Overall, LLMs perform poorly when handling new knowledge, especially in knowledge association.
- **Influencing Factors**: Entity similarity and the knowledge provided in context both affect model performance. For example, the more similar a new entity is to an existing entity, the harder it is for the model to tell them apart, and providing knowledge about the parent entity exacerbates the confusion.

### Conclusion

Current large-scale language models have clear deficiencies in handling new knowledge, especially when reasoning about the relationship between new knowledge and internal knowledge. The ALCUNA benchmark proposed in this paper can better evaluate these models in the face of new knowledge and provides a valuable tool for future research.
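
To make the KnowGen idea from the Method section more concrete, the following is a minimal Python sketch of deriving an artificial entity by altering the attributes and relationships of an existing one. It is an illustration under stated assumptions, not the authors' actual procedure: the `Entity` class, the `make_artificial_entity` function, the choice of a "donor" entity as the source of changed values, and the example species data are all hypothetical.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical entity record: a name, attributes, and relations."""
    name: str
    attributes: dict = field(default_factory=dict)  # e.g. {"habitat": "savanna"}
    relations: dict = field(default_factory=dict)   # e.g. {"eats": ["zebra"]}


def make_artificial_entity(parent, donor, new_name, rng):
    """Copy `parent`, then alter one attribute and add missing relations.

    Assumption: changed values are borrowed from a second existing entity
    (`donor`) so that the artificial entity stays plausible.
    """
    child = copy.deepcopy(parent)  # leave the parent entity untouched
    child.name = new_name

    # Attributes both entities define but with different values.
    differing = sorted(
        key for key, value in child.attributes.items()
        if key in donor.attributes and donor.attributes[key] != value
    )
    if differing:
        key = rng.choice(differing)
        child.attributes[key] = donor.attributes[key]

    # Add relation types that the donor has and the parent lacks.
    for relation, targets in donor.relations.items():
        if relation not in child.relations:
            child.relations[relation] = list(targets)
    return child


# Illustrative data only; not taken from the ALCUNA dataset.
lion = Entity("lion", {"diet": "carnivore", "habitat": "savanna"},
              {"eats": ["zebra"]})
tiger = Entity("tiger", {"diet": "carnivore", "habitat": "forest"},
               {"eats": ["deer"], "competes_with": ["leopard"]})

new_entity = make_artificial_entity(lion, tiger, "Panthera novus",
                                    random.Random(0))
print(new_entity)
```

The resulting entity keeps most of its parent's knowledge while differing in a few attributes and relations, which is the kind of near-duplicate that the summary reports as hardest for models to distinguish from existing entities.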

ALCUNA: Large Language Models Meet New Knowledge