Multi-Model Consistency for LLMs’ Evaluation

Qinrui Zhu, Derui Lyu, Xi Fan, Xiangyu Wang, Qiang Tu, Yibin Zhan, Huanhuan Chen
DOI: https://doi.org/10.1109/ijcnn60899.2024.10651158
2024-01-01
Abstract: This paper introduces an evaluation method for large language models (LLMs) based on multi-model factual cognition consistency. Traditional evaluation methods, especially in terms of factuality assessments, face challenges in constructing extensive domain-specific question sets and relying on specific model answers. These methods fall short in the face of dynamic and diverse model development. To overcome these limitations, the proposed approach does not depend on a fixed set of standard answers. Instead, it utilizes the responses of multiple models to construct a dynamic, relative evaluation benchmark. We first developed a framework to capture and compare the cognitive consistency of different models when addressing specific questions. Subsequently, a dynamic iterative algorithm was designed to evaluate models based on these sets of answers. Experiments across multiple domains demonstrated the effectiveness of this method. This innovative evaluation strategy not only provides a more comprehensive and flexible approach to understanding and assessing the performance of LLMs in various scenarios but also offers practical guidance for future model development and improvement.
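The abstract does not spell out the iterative algorithm, so the following is only a minimal sketch of one way a reference-free, consistency-based score could be computed. It is not the authors' method. It assumes each model's answers have already been normalized to comparable strings, and the function name `consistency_scores`, its parameters, and the weighted-agreement update rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def consistency_scores(answers, iterations=20):
    """Score models by how well their answers agree with other models.

    answers: dict mapping model name -> list of normalized answers,
             one per question (all lists the same length).
    Returns a dict mapping model name -> weight; weights sum to 1.

    Assumption (not from the paper): a model's score on a question is the
    total current weight of all models that gave the same answer, and the
    weights are re-estimated from those scores on every iteration.
    """
    models = list(answers)
    n_questions = len(answers[models[0]])
    # Start from uniform trust in every model.
    weights = {m: 1.0 / len(models) for m in models}

    for _ in range(iterations):
        scores = {m: 0.0 for m in models}
        for q in range(n_questions):
            # Weighted support for each distinct answer to question q.
            support = defaultdict(float)
            for m in models:
                support[answers[m][q]] += weights[m]
            # Each model earns the support of the answer it gave.
            for m in models:
                scores[m] += support[answers[m][q]]
        # Renormalize so the weights stay comparable across iterations.
        total = sum(scores.values())
        weights = {m: s / total for m, s in scores.items()}

    return weights


# Hypothetical toy data: models "a" and "b" agree, "c" disagrees.
toy = {"a": ["x", "y"], "b": ["x", "y"], "c": ["z", "w"]}
result = consistency_scores(toy)
```

In this toy run, the two agreeing models end up with equal, higher weights than the dissenting one. A scheme like this rewards consensus rather than correctness, so it is only as reliable as the assumption that agreement among models tracks factual accuracy.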