Abstract: Evaluation plays a critical role in deep learning as a fundamental block of any prediction-based system. However, the vast number of Natural Language Processing (NLP) tasks and the development of various metrics have led to challenges in evaluating different systems with different metrics. To address these challenges, we introduce jury, a toolkit that provides a unified evaluation framework with standardized structures for performing evaluation across different tasks and metrics. The objective of jury is to standardize and improve metric evaluation for all systems and aid the community in overcoming the challenges in evaluation. Since its open-source release, jury has reached a wide audience and is available at

What problem does this paper attempt to address?

The paper primarily addresses several key issues in model evaluation within the field of Natural Language Processing (NLP):

1. **Standardized Evaluation Framework**: Existing evaluation methods for Natural Language Generation (NLG) tasks lack a unified standard. Different tasks and metrics lead to complexity and inconsistency in the evaluation process. The paper proposes a toolkit named "jury" that aims to provide a unified evaluation framework, simplifying the evaluation process across different tasks and metrics.
2. **Comprehensive Multi-Metric Evaluation**: The paper points out that in practical applications, comprehensive evaluation using multiple metrics is common. However, existing evaluation libraries usually do not support concurrent calculation of multiple metrics or do not provide a convenient way to do so. Therefore, the "jury" toolkit supports concurrent calculation of multiple metrics, significantly improving evaluation efficiency.
3. **Support for Multiple References and Predictions**: In tasks like machine translation, a single input may correspond to multiple reasonable outputs (i.e., reference translations). Existing libraries often do not support such multi-reference evaluations. "jury" not only supports multi-reference evaluation but also supports the evaluation of multiple prediction results for a single input, which helps improve the correlation between evaluation results and human judgment.
4. **Task Mapping**: The output types of different NLP tasks vary (e.g., the output of NLG tasks is text, while the output of NLI tasks is label IDs). "jury" uses a task mapping mechanism to automatically adjust the evaluation method based on different tasks, reducing the customization workload for users.

In summary, the main goal of this paper is to develop a comprehensive evaluation toolkit "jury" to improve and standardize the model evaluation process in the NLP field, especially in NLG tasks. By introducing features such as a unified interface, concurrent calculation, support for multiple references and predictions, and task mapping, "jury" aims to address the shortcomings of existing evaluation tools and provide researchers with a more efficient and convenient evaluation solution.
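
To make the multi-metric, multi-reference, and multi-prediction ideas above concrete, here is a minimal sketch in plain Python. It does not use or reproduce jury's API. The names `exact_match`, `token_f1`, `score_corpus`, and `evaluate` are hypothetical, the two toy metrics stand in for real ones, and taking the best score over every prediction/reference pair is an assumed reduction chosen only for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A metric here is any function scoring one prediction against one reference.
Metric = Callable[[str, str], float]


def exact_match(prediction: str, reference: str) -> float:
    """Toy metric: 1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Toy metric: F1 over whitespace tokens, counting repeated tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score_corpus(
    metric: Metric,
    predictions: List[List[str]],
    references: List[List[str]],
) -> float:
    """Average, over inputs, of the best score among all
    (prediction, reference) pairs for that input (assumed reduction)."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    per_item = [
        max(metric(p, r) for p in preds for r in refs)
        for preds, refs in zip(predictions, references)
    ]
    return sum(per_item) / len(per_item)


def evaluate(
    metrics: Dict[str, Metric],
    predictions: List[List[str]],
    references: List[List[str]],
) -> Dict[str, float]:
    """Submit every metric to a thread pool and collect the results by name."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(score_corpus, fn, predictions, references)
            for name, fn in metrics.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Each input has a list of candidate predictions and a list of references.
predictions = [["the cat sat", "a cat sat"], ["hello world"]]
references = [["the cat sat", "the cat is sitting"], ["hello there world", "hi world"]]

scores = evaluate(
    {"exact_match": exact_match, "token_f1": token_f1},
    predictions,
    references,
)
print(scores)
```

Passing predictions and references as lists of lists is what lets one input carry several candidates and several acceptable answers. The thread pool only illustrates the dispatch pattern: pure-Python metrics like these gain little from threads, and a real toolkit may schedule the work differently.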

Jury: A Comprehensive Evaluation Toolkit
