On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards

Zhimin Zhao, Abdul Ali Bangash, Filipe Roseiro Côgo, Bram Adams, Ahmed E. Hassan
2024-07-13
Abstract: Foundation models (FM), such as large language models (LLMs), which are large-scale machine learning (ML) models, have demonstrated remarkable adaptability in various downstream software engineering (SE) tasks, such as code completion, code understanding, and software development. As a result, FM leaderboards, especially those hosted on cloud platforms, have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to perform effective FM selection. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations") and identifying potential leaderboard pitfalls and areas for improvement ("leaderboard smells"). In this regard, we perform a multivocal literature review to collect up to 721 FM leaderboards, after which we examine their documentation and engage in direct communication with leaderboard operators to understand their workflow patterns. Using card sorting and negotiated agreement, we identify 5 unique workflow patterns and develop a domain model that outlines the essential components and their interaction within FM leaderboards. We then identify 8 unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.
Software Engineering, Machine Learning
What problem does this paper attempt to address?
The paper addresses the lack of standardization and transparency in how current foundation model (FM) leaderboards are operated and how they evaluate models. Specifically:

1. **Insufficient transparency and standardization in leaderboard operations**: Existing FM leaderboards lack a standardized way to evaluate and compare the risks and limitations of top-tier foundation models, which limits stakeholders' ability to choose the most suitable FM.
2. **Problems in the operation process (leaderboard smells)**: Leaderboards follow different operating modes, which can come with resource constraints and operational challenges that take considerable effort to resolve. For example, evaluating large language models (LLMs) is time-consuming, and some leaderboards have had to suspend certain benchmarks because of the high evaluation cost.
3. **Promoting responsible FM comparison**: Improving the reliability and transparency of FM leaderboards requires identifying and resolving common problems in leaderboard operations, which in turn supports a sounder and more responsible ecosystem for FM comparison.

### Main research questions of the paper

To address these challenges, the study poses two main research questions (RQs):

- **RQ1: How do FM leaderboards operate?**
  - Analyze the operating processes and workflows of leaderboards to understand how they maintain functionality and relevance.
- **RQ2: What are the problems or "smells" that affect FM leaderboard operations?**
  - Identify and classify common problems in leaderboard operations ("leaderboard smells") to give leaderboard operators actionable insights for anticipating and mitigating similar challenges in future deployments.

### Overview of the methodology

The study uses a three-stage methodology to investigate these questions systematically:

1. **Multivocal Literature Collection**:
   - Conduct a comprehensive literature review through sources such as Google Scholar and GitHub Awesome Lists, and identify 721 FM leaderboards.
2. **Leaderboard Collection**:
   - Use the Iterative Backward Snowball Sampling algorithm to collect as many leaderboards and their related benchmarks as possible.
3. **Leaderboard Analysis**:
   - Conduct a qualitative analysis of the collected FM leaderboards, extract common workflow patterns, and identify 8 unique "leaderboard smells" through card sorting and negotiated agreement.

Through these methods, the study aims to provide a more transparent, reliable, and standardized framework for FM leaderboard operations, thereby promoting responsible foundation model comparison and selection.
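
The summary names an Iterative Backward Snowball Sampling step but does not describe it. The following is a minimal Python sketch of generic backward snowballing, not the authors' implementation: it starts from a seed set, follows the references of each included item, and repeats until a round adds nothing new. The function name `backward_snowball`, the callbacks `get_references` and `is_relevant`, and the toy reference graph are all hypothetical and exist only for illustration.

```python
from typing import Callable, Iterable, Set


def backward_snowball(
    seeds: Iterable[str],
    get_references: Callable[[str], Iterable[str]],
    is_relevant: Callable[[str], bool],
) -> Set[str]:
    """Generic backward snowballing (illustrative assumption, not the paper's code).

    Each round inspects the references of the items added in the previous
    round and keeps those that pass the inclusion check. The loop stops
    when a round adds no new items.
    """
    included: Set[str] = set(seeds)
    frontier: Set[str] = set(included)
    while frontier:
        next_frontier: Set[str] = set()
        for source in frontier:
            for ref in get_references(source):
                if ref not in included and is_relevant(ref):
                    included.add(ref)
                    next_frontier.add(ref)
        frontier = next_frontier
    return included


# Hypothetical reference graph: each key cites the items in its list.
toy_refs = {
    "survey-A": ["leaderboard-1", "paper-B"],
    "paper-B": ["leaderboard-2", "blog-X"],
    "leaderboard-1": [],
    "leaderboard-2": ["leaderboard-1"],
    "blog-X": ["leaderboard-3"],
}

result = backward_snowball(
    seeds=["survey-A"],
    get_references=lambda item: toy_refs.get(item, []),
    # Hypothetical inclusion rule: drop anything labelled as a blog.
    is_relevant=lambda item: not item.startswith("blog"),
)
print(sorted(result))
```

In this toy run, `blog-X` fails the inclusion check, so its references are never followed and `leaderboard-3` is not reached. A real study would replace the inclusion rule with its own documented criteria.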