Abstract: The quality of responses generated by different modern large language models (LLMs) is hard to evaluate and compare automatically. Recent studies propose, and now predominantly use, LLMs for reference-free evaluation of open-ended question answering. More specifically, they use the recognized "strongest" LLM as the evaluator, which conducts pairwise comparisons of candidate models' answers and provides a ranking score. However, this intuitive method has several problems, such as self-enhancement bias (favoring its own answers) and positional bias. We draw insights and lessons from the educational domain (Cho & MacArthur, 2011; Walsh, 2014) to improve LLM-based evaluations. Specifically, we propose (1) the peer rank (PR) algorithm, which takes into account each peer LLM's pairwise preferences over all answer pairs and outputs a final ranking of models; and (2) peer discussion (PD), in which we prompt two LLMs to discuss and try to reach a mutual agreement on their preference between two answers. We conduct experiments on two benchmark datasets. We find that our approaches achieve higher accuracy and align better with human judgments. Interestingly, PR can induce a relatively accurate self-ranking of models under the anonymous setting, where each model's name is not revealed. Our work opens space for exploring the evaluation of models that are hard for humans to compare.
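To make the peer rank idea concrete, the following is a minimal sketch of one way to aggregate peer preferences into a ranking. The abstract does not specify the weighting scheme, so the details here are assumptions for illustration only: each model's score is its weighted win rate over all opponents, and each reviewer's weight is set to its own normalized score and updated iteratively. The names `win`, `peer_rank`, and the `prefs` dictionary layout are hypothetical and do not come from the paper.

```python
def win(prefs, reviewer, a, b):
    """Return the reviewer's preference for a over b (1.0 win, 0.0 loss, 0.5 tie).

    Assumption: each pair is stored once per reviewer; the reverse
    order is derived as the complement.
    """
    if (reviewer, a, b) in prefs:
        return prefs[(reviewer, a, b)]
    return 1.0 - prefs[(reviewer, b, a)]


def peer_rank(models, prefs, iterations=50):
    """Rank models by weighted win rate, with reviewer weights tied to scores.

    This is an illustrative weighting scheme, not necessarily the
    algorithm used in the paper.
    """
    # Start with every reviewer trusted equally.
    weights = {m: 1.0 / len(models) for m in models}
    scores = {m: 0.0 for m in models}
    for _ in range(iterations):
        for contestant in models:
            opponents = [m for m in models if m != contestant]
            total = 0.0
            for opponent in opponents:
                # Weighted vote of all reviewers on this answer pair.
                total += sum(
                    weights[r] * win(prefs, r, contestant, opponent)
                    for r in models
                )
            scores[contestant] = total / len(opponents)
        # Reviewers with higher scores get more say in the next round.
        norm = sum(scores.values())
        weights = {m: scores[m] / norm for m in models}
    ranking = sorted(models, key=lambda m: scores[m], reverse=True)
    return ranking, scores


# Toy example: A and B agree on A > B > C, while C favors itself (C > B > A).
models = ["A", "B", "C"]
prefs = {}
for reviewer in ["A", "B"]:
    prefs[(reviewer, "A", "B")] = 1.0
    prefs[(reviewer, "A", "C")] = 1.0
    prefs[(reviewer, "B", "C")] = 1.0
prefs[("C", "A", "B")] = 0.0
prefs[("C", "A", "C")] = 0.0
prefs[("C", "B", "C")] = 0.0

ranking, scores = peer_rank(models, prefs)
print(ranking)  # ['A', 'B', 'C']
```

In this toy case the self-favoring reviewer C loses influence with each iteration, because its weight shrinks along with its score. This shows how a peer-weighted scheme could dampen self-enhancement bias under the stated assumptions.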

Reranking Answers for Definitional QA Using Language Modeling

Learning To Rank Answers For Definitional Question Answering

Term Selection and Result Reranking for Question Retrieval by Exploiting Hierarchical Classification

On the order of the operators in the Douglas–Rachford algorithm

Addressing Community Question Answering in English and Arabic

Multimodal Reranking for Knowledge-Intensive Visual Question Answering

Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG

A List Question Answering Method Based on Phrase-retrieval and Answer-ranking

Using Multiple Combined Ranker for Answering Definitional Questions

LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs

Learning To Rank For Question-Oriented Software Text Retrieval

Optimal Answerer Ranking for New Questions in Community Question Answering

Answering Definition Question: Ranking For Top-K

Learning to Combine Answer Boundary Detection and Answer Re-ranking for Phrase-Indexed Question Answering

CQArank: jointly model topics and expertise in community question answering

A Definitional Question Answering System Based on Dependency Relation

Answering Definitional Question By Dependency-Based Knowledge

Discriminate and Reconstruct: Learning from Language Model to Answer Keyword Questions

MrRank: Improving Question Answering Retrieval System through Multi-Result Ranking Model

PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations

QCG-Rerank: Chunks Graph Rerank with Query Expansion in Retrieval-Augmented LLMs for Tourism Domain