Abstract: Foundation models (FM), such as large language models (LLMs), which are large-scale machine learning (ML) models, have demonstrated remarkable adaptability in various downstream software engineering (SE) tasks, such as code completion, code understanding, and software development. As a result, FM leaderboards, especially those hosted on cloud platforms, have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to perform effective FM selection. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations") and identifying potential leaderboard pitfalls and areas for improvement ("leaderboard smells"). In this regard, we perform a multivocal literature review to collect up to 721 FM leaderboards, after which we examine their documentation and engage in direct communication with leaderboard operators to understand their workflow patterns. Using card sorting and negotiated agreement, we identify 5 unique workflow patterns and develop a domain model that outlines the essential components and their interaction within FM leaderboards. We then identify 8 unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.

What problem does this paper attempt to address?

The paper addresses the lack of standardization and transparency in how current foundation model (FM) leaderboards are operated and how they evaluate models. Specifically:

1. **Insufficient transparency and standardization in leaderboard operations**: Existing FM leaderboards lack a standardized way to evaluate and compare the risks and limitations of top-tier foundation models, which limits stakeholders' ability to choose the most suitable FM.
2. **Problems in the operation process ("leaderboard smells")**: Each leaderboard has its own mode of operation, which can bring resource constraints and operational challenges that take considerable effort to resolve. For example, evaluating large language models (LLMs) is time-consuming, and some leaderboards have even had to suspend certain benchmarks because of the high evaluation cost.
3. **Promoting responsible FM comparison**: Improving the reliability and transparency of FM leaderboards requires identifying and resolving common problems in leaderboard operations, so that a sounder and more responsible FM comparison ecosystem can be established.

### Main research questions of the paper

To address these challenges, the paper poses two main research questions (RQs):

- **RQ1: How do FM leaderboards operate?**
  - Analyze the operating processes and workflow patterns of leaderboards to understand how they maintain functionality and relevance.
- **RQ2: What are the problems or "smells" that affect FM leaderboard operations?**
  - Identify and classify common problems in leaderboard operations, namely "leaderboard smells", to provide actionable insights that help leaderboard operators anticipate and mitigate similar challenges in future deployments.

### Overview of the methodology

The research adopts a three-stage methodology to investigate these problems systematically:

1. **Multivocal Literature Collection**:
   - Conduct a comprehensive literature review through sources such as Google Scholar and GitHub Awesome Lists, and identify 721 FM leaderboards.
2. **Leaderboard Collection**:
   - Use the Iterative Backward Snowball Sampling algorithm to collect as many leaderboards and their related benchmarks as possible.
3. **Leaderboard Analysis**:
   - Conduct a qualitative analysis of the collected FM leaderboards, extract common workflow patterns, and identify 8 unique "leaderboard smells" through card sorting and consensus-building techniques.

Through these methods, the research aims to provide a more transparent, reliable, and standardized framework for FM leaderboard operations, thereby promoting responsible foundation model comparison and selection.
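
The summary names an Iterative Backward Snowball Sampling algorithm but does not describe it. The following is a minimal sketch of generic backward snowballing, not the authors' implementation. It assumes that each collected item exposes a list of the items it references, and that a relevance check decides which items are kept and expanded. The names `backward_snowball`, `get_references`, `is_relevant`, `max_rounds`, and the toy data are all hypothetical.

```python
from typing import Callable, Iterable, Set


def backward_snowball(
    seeds: Iterable[str],
    get_references: Callable[[str], Iterable[str]],
    is_relevant: Callable[[str], bool],
    max_rounds: int = 10,
) -> Set[str]:
    """Collect relevant items by repeatedly following references backward.

    Assumption: only relevant items are expanded, and sampling stops when a
    round discovers nothing new or when max_rounds is reached.
    """
    collected: Set[str] = set()
    seen: Set[str] = set(seeds)
    frontier = list(seen)
    for _ in range(max_rounds):
        if not frontier:
            break
        next_frontier = []
        for item in frontier:
            if not is_relevant(item):
                continue  # irrelevant items are neither kept nor expanded
            collected.add(item)
            for ref in get_references(item):
                if ref not in seen:
                    seen.add(ref)
                    next_frontier.append(ref)
        frontier = next_frontier
    return collected


# Hypothetical toy reference graph: each key lists the items it references.
TOY_REFERENCES = {
    "survey_A": ["leaderboard_X", "benchmark_Y"],
    "leaderboard_X": ["benchmark_Y", "leaderboard_Z"],
    "benchmark_Y": [],
    "leaderboard_Z": ["unrelated_paper"],
    "unrelated_paper": ["leaderboard_hidden"],
}

found = backward_snowball(
    seeds=["survey_A"],
    get_references=lambda item: TOY_REFERENCES.get(item, []),
    is_relevant=lambda item: not item.startswith("unrelated"),
)
print(sorted(found))
# ['benchmark_Y', 'leaderboard_X', 'leaderboard_Z', 'survey_A']
```

In this toy run, `leaderboard_hidden` is never reached because it is referenced only by an item judged irrelevant. This illustrates why the choice of seeds and the relevance criterion bound what such a sampling procedure can find.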

On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards

Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards

Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models

Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

Exploring the Latest LLMs for Leaderboard Extraction

Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

ORKG-Leaderboards: A Systematic Workflow for Mining Leaderboards as a Knowledge Graph

Effective Context Selection in LLM-based Leaderboard Generation: An Empirical Study

Self-Improving-Leaderboard(SIL): A Call for Real-World Centric Natural Language Processing Leaderboards

Instruction Finetuning for Leaderboard Generation from Empirical AI Research

OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models

Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard

Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Utility is in the Eye of the User: A Critique of NLP Leaderboards

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

Automated Mining of Leaderboards for Empirical AI Research

Industrial Code Quality Benchmarks: Toward Gamification of Software Maintainability

Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking