Abstract: The continuous advancement of large language models (LLMs) has brought increasing attention to the critical issue of developing fair and reliable methods for evaluating their performance. In particular, the emergence of subjective or non-subjective cheating phenomena, such as test set leakage and prompt format overfitting, poses significant challenges to the reliable evaluation of LLMs. Since evaluation frameworks often use Regular Expressions (RegEx) for answer extraction, some models may adjust their responses to comply with specific formats that are easily extractable by RegEx. Nevertheless, the RegEx-based key answer extraction module frequently suffers from extraction errors. This paper conducts a comprehensive analysis of the entire LLM evaluation chain, demonstrating that optimizing the key answer extraction module can improve extraction accuracy, reduce LLMs' reliance on specific answer formats, and enhance the reliability of LLM evaluation. To address these issues, we propose xFinder, a model specifically designed for key answer extraction. As part of this process, we create a specialized dataset, the Key Answer Finder (KAF) dataset, to ensure effective model training and evaluation. In generalization testing and evaluation in real-world scenarios, the smallest xFinder model, with only 500 million parameters, achieves an average answer extraction accuracy of 93.42%. In contrast, RegEx accuracy in the best evaluation framework is 74.38%. xFinder exhibits stronger robustness and higher accuracy than existing evaluation frameworks.
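The kind of RegEx extraction error described above can be illustrated with a minimal sketch. The pattern `ANSWER_PATTERN` and the helper `extract_choice` are hypothetical, written only for this illustration; they are not taken from xFinder or from any particular evaluation framework.

```python
import re
from typing import Optional

# Hypothetical rule in the style of "answer is (X)" extractors.
# Case-insensitive matching is an assumption made for this sketch.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-D])\)?", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the first option letter matched by the pattern, or None."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None


# A response in the expected format is extracted correctly.
print(extract_choice("The answer is (B)."))           # B

# A correct response in a different format yields no extraction.
print(extract_choice("I would go with option B."))    # None

# The "d" of "definitely" is mistaken for option D; the model chose C.
print(extract_choice("The answer is definitely C."))  # D
```

The second and third cases show two distinct failure modes: a missed extraction when the response departs from the expected format, and a wrong extraction when the pattern matches unintended text. A model that always replies in the first format would avoid both, which is the format dependence the abstract refers to.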

Automatic Extraction and Filtration of Multiword Units

Automatic Filtration of Multiword Units

Association Measures for Collocation Extraction

Two-Character Chinese Word Extraction Based on Hybrid of Internal and Contextual Measures

Automatic Extraction of Chinese-English Phrase Translation Pairs

Automatic Extraction of Multiword Expressions Combining Statistical and Similarity Approaches

A study on the classification of stylistic and formal features in English based on corpus data testing

Chinese Word Extraction Based on the Internal Associative Strength of Character Strings

Do Multi-Sense Embeddings Improve Natural Language Understanding?

Finite State Automata on Multi-Word Units for Efficient Text-Mining

Research on Automatic Chinese Multi-word Term Extraction Based on Term Component

Automatic Keywords Extraction Based on Co-Occurrence and Semantic Relationships Between Words

Research on Automatic Chinese Multi-word Term Extraction Based on Integration of Web Information and Term Component

New Word Extraction from Chinese Financial Documents

Chinese Multi-word Chunks Extraction for Computer Aided Translation

Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling

Extracting terminologically relevant collocations in the translation of Chinese monograph

Collocation Extraction Using Monolingual Word Alignment Method

xFinder: Robust and Pinpoint Answer Extraction for Large Language Models

MICRank: Multi-information interconstrained keyphrase extraction

A Synergetic Approach to the Relationship Between the Length and Frequency among English Multiword Formulaic Sequences