Abstract: Automatic side-by-side evaluation has emerged as a promising approach to evaluating the quality of responses from large language models (LLMs). However, analyzing the results from this evaluation approach raises scalability and interpretability challenges. In this paper, we present LLM Comparator, a novel visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. The tool supports interactive workflows for users to understand when and why a model performs better or worse than a baseline model, and how the responses from two models are qualitatively different. We iteratively designed and developed the tool by closely working with researchers and engineers at a large technology company. This paper details the user challenges we identified, the design and development of the tool, and an observational study with participants who regularly evaluate their models.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses the scalability and interpretability challenges that arise when analyzing evaluation results for large language models (LLMs). Specifically, the authors propose a new visual analytics tool called **LLM Comparator** for interactively analyzing the results of automatic side-by-side evaluations. With this tool, users can better understand when a model performs better or worse than a baseline and why those differences occur.

### Background and Motivation

1. **Evaluation Challenges**:
   - Traditional machine learning models can be evaluated by comparing their outputs to ground truth answers, but for LLMs that generate long texts, defining ground truth answers is impractical.
   - Manual evaluation, while effective, is costly and difficult to scale.
   - Automatic side-by-side evaluation (i.e., having another LLM judge the output quality of two models) is a promising method, but the interpretability and actionability of its results still need improvement.
2. **Problems with Existing Workflows**:
   - There are no dedicated tools for analyzing the results of automatic side-by-side evaluations; results typically have to be loaded manually into spreadsheets or computational notebooks.
   - It is difficult to quickly identify and compare behavioral differences between models.
   - Analyzing performance in specific categories (e.g., email writing, programming) requires switching between different tools.

### Solution

The main features of the **LLM Comparator** tool include:

1. **Interactive Tables**:
   - Display each prompt and the corresponding responses, scores, and rationale summaries for the two models.
   - Highlight overlapping words to facilitate quick comparison of the two responses.
   - Allow viewing detailed scoring results, so users can examine individual examples in depth.
2. **Visual Summaries**:
   - **Score Distribution**: Display histograms of scores to help users understand how the scores are distributed.
   - **Win Rates by Category**: Show performance across different prompt categories, helping users identify performance differences within specific categories.
   - **Rationale Clustering**: Summarize a large number of rationales into several representative themes, helping users understand the reasons behind the scores.
   - **N-grams and Custom Functions**: Provide frequently occurring phrases and custom functions to help users analyze the nuances of the responses.

### Applications and Effects

- **User Feedback**: The tool has been successfully integrated into Google's large-scale evaluation pipeline, attracting over 400 users and supporting more than 1000 automatic side-by-side evaluation experiments.
- **Usage Patterns**:
  - **Case-First Deep Dive**: Users form hypotheses about model behavior differences by examining individual examples in depth.
  - **Experience-Based Testing**: Users leverage prior knowledge to identify poor model behaviors.
  - **Rationale-Centric Top-Down Exploration**: Through the rationale clustering view, users can discover new data analysis methods.

### Future Directions

- **LLM-Based Custom Metrics**: Further develop advanced attribute evaluation methods based on LLMs.
- **Preconfigured Bad Patterns**: Preconfigure common bad patterns to reduce the need for manually defining new functions.
- **Improved Rationale Clustering**: Optimize the computation method of rationale clustering to improve accuracy and efficiency.

In summary, **LLM Comparator** enhances the interpretability and actionability of large language model evaluations by providing interactive and visual analysis tools.
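As a rough illustration of the "Score Distribution" and "Win Rates by Category" summaries listed under Solution, the following Python sketch computes both from a handful of judged examples. It is not the tool's implementation: the record layout, the sample data, the `WIN_THRESHOLD` cutoff, the rule that a tie counts as half a win, and the function names `win_rates_by_category` and `score_histogram` are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical records: (prompt category, judge score).
# Assumption: a positive score means the judge preferred model A over baseline B.
RESULTS = [
    ("email", 1.0),
    ("email", 0.5),
    ("email", -1.0),
    ("coding", -1.5),
    ("coding", 0.0),
    ("coding", 1.0),
    ("coding", -0.5),
]

WIN_THRESHOLD = 0.3  # assumed cutoff; scores within +/- this value count as ties


def win_rates_by_category(results, threshold=WIN_THRESHOLD):
    """Return {category: win rate of model A}, counting each tie as half a win."""
    wins = defaultdict(float)
    totals = defaultdict(int)
    for category, score in results:
        totals[category] += 1
        if score > threshold:
            wins[category] += 1.0   # clear win for model A
        elif score >= -threshold:
            wins[category] += 0.5   # tie, split evenly (assumed convention)
    return {category: wins[category] / totals[category] for category in totals}


def score_histogram(results, bin_width=0.5):
    """Count scores per bin; each bin is keyed by its lower edge."""
    bins = Counter()
    for _, score in results:
        lower_edge = (score // bin_width) * bin_width
        bins[lower_edge] += 1
    return dict(sorted(bins.items()))


if __name__ == "__main__":
    for category, rate in win_rates_by_category(RESULTS).items():
        print(f"{category}: win rate {rate:.3f}")
    for lower_edge, count in score_histogram(RESULTS).items():
        print(f"[{lower_edge:+.1f}, {lower_edge + 0.5:+.1f}): {count}")
```

Slicing the same records by category is what lets a user spot, for example, that a model wins on email-style prompts while losing on coding prompts, even when the overall win rate looks close to even.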

LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models

LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models

LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language Models

EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs

iScore: Visual Analytics for Interpreting How Language Models Automatically Score Summaries

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction

Evaluating Large Language Models at Evaluating Instruction Following

Revisiting Multi-Modal LLM Evaluation

Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs

LLM-Assisted Visual Analytics: Opportunities and Challenges

A Survey on Evaluation of Large Language Models

Supporting Sensemaking of Large Language Model Outputs at Scale

LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis

Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Benchmarking Cognitive Biases in Large Language Models as Evaluators