Abstract: Properly understanding the performances of classifiers is essential in various scenarios. However, the literature often relies only on one or two standard scores to compare classifiers, which fails to capture the nuances of application-specific requirements, potentially leading to suboptimal classifier selection. Recently, a paper on the foundations of the theory of performance-based ranking introduced a tool, called the Tile, that organizes an infinity of ranking scores into a 2D map. Thanks to the Tile, it is now possible to evaluate and compare classifiers efficiently, displaying all possible application-specific preferences instead of having to rely on a pair of scores. In this paper, we provide a first hitchhiker's guide for understanding the performances of two-class classifiers by presenting four scenarios, each showcasing a different user profile: a theoretical analyst, a method designer, a benchmarker, and an application developer. Particularly, we show that we can provide different interpretative flavors that are adapted to the user's needs by mapping different values on the Tile. As an illustration, we leverage the newly introduced Tile tool and the different flavors to rank and analyze the performances of 74 state-of-the-art semantic segmentation models in two-class classification through the eyes of the four user profiles. Through these user profiles, we demonstrate that the Tile effectively captures the behavior of classifiers in a single visualization, while accommodating an infinite number of ranking scores.

What problem does this paper attempt to address?

The problem this paper addresses is how to understand and compare the performance of binary (two-class) classifiers more comprehensively. The literature usually relies on only one or two standard scores to compare classifiers. This practice cannot capture the nuances of application-specific requirements and may lead to suboptimal classifier selection. The paper builds on a recently introduced tool, the Tile, which organizes an infinite number of ranking scores into a two-dimensional map, so that classifiers can be evaluated and compared across all possible application-specific preferences instead of through a single pair of scores. It gives usage guidelines for four user roles (theoretical analyst, method designer, benchmarker, and application developer) and shows how to use the Tile to analyze and rank classifiers in each scenario.

### Main contributions of the paper:

1. **Provide a practical guide to understanding the performance of binary classifiers**, based on a rigorous theoretical foundation.
2. **For the specific needs of four common user roles**, detail the tools to be used, construction methods, and result interpretations.
3. **By analyzing and ranking 74 state-of-the-art semantic segmentation models**, provide practical application cases for the computer vision community.

### Four user roles and their needs:

1. **Theoretical analyst**: Focuses on the theoretical relationships between different scores, ensuring that the selected scores provide unique and non-redundant information.
2. **Method designer**: Needs to evaluate the performance of new methods, compare them with existing methods, understand the performance under different importance settings, and optimize hyperparameters.
3. **Benchmarker**: Organizes challenges in the scientific community and needs to rank the participating methods.
4. **Application developer**: Selects the most appropriate classification method according to application requirements.

### Tools used:

- **Correlation Tile**: Shows the linear or rank correlation between a given reference score and other classic scores.
- **Value Tile**: Displays the score value of a given entity at each point on the Tile.
- **Baseline Value Tile**: Displays the minimum value of each score over a given set of entities.
- **State-of-the-art Value Tile**: Displays the maximum value of each score over a given set of entities.

With these tools, the paper gives each type of user a systematic method for evaluating and selecting binary classifiers, so that the needs of a specific application scenario can be better met.
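The Value Tile, Baseline Value Tile, and State-of-the-art Value Tile described above can be illustrated with a minimal Python sketch. The summary does not state how a point on the Tile maps to a score, so the sketch assumes the parameterization commonly associated with the Tile: a position (a, b) in the unit square assigns importances 1-a, 1-b, b, and a to true negatives, false positives, false negatives, and true positives, and the score is the importance-weighted share of correct outcomes. Under that assumption the center (0.5, 0.5) gives accuracy and (1, 0.5) gives F1. The function names and the two example confusion matrices are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, Iterable, List

Confusion = Dict[str, float]  # counts or rates for "tn", "fp", "fn", "tp"
Tile = List[List[float]]      # rows indexed by b, columns indexed by a


def ranking_score(c: Confusion, a: float, b: float) -> float:
    """Score at Tile position (a, b).

    ASSUMPTION (not stated in the summary): importances are
    I(tn)=1-a, I(fp)=1-b, I(fn)=b, I(tp)=a.
    """
    good = (1 - a) * c["tn"] + a * c["tp"]   # weighted correct outcomes
    bad = (1 - b) * c["fp"] + b * c["fn"]    # weighted incorrect outcomes
    total = good + bad
    if total == 0:
        raise ValueError("score undefined at this Tile position")
    return good / total


def value_tile(c: Confusion, n: int = 5) -> Tile:
    """Value Tile: scores of one entity sampled on an n x n grid (n >= 2)."""
    grid = [i / (n - 1) for i in range(n)]
    return [[ranking_score(c, a, b) for a in grid] for b in grid]


def pointwise(tiles: List[Tile],
              reduce_fn: Callable[[Iterable[float]], float]) -> Tile:
    """Combine several Value Tiles position by position."""
    rows, cols = len(tiles[0]), len(tiles[0][0])
    return [[reduce_fn(t[i][j] for t in tiles) for j in range(cols)]
            for i in range(rows)]


# Hypothetical confusion matrices for two entities (illustrative numbers only).
entities = {
    "model_a": {"tn": 50, "fp": 10, "fn": 5, "tp": 35},
    "model_b": {"tn": 55, "fp": 5, "fn": 12, "tp": 28},
}

tiles = [value_tile(c) for c in entities.values()]
baseline_tile = pointwise(tiles, min)  # Baseline Value Tile: minimum per score
sota_tile = pointwise(tiles, max)      # State-of-the-art Value Tile: maximum per score
```

Taking the minimum and maximum position by position means the baseline and state-of-the-art tiles need not correspond to any single entity: a different model may hold the best value at each point of the Tile.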

A Hitchhiker's Guide to Understanding Performances of Two-Class Classifiers

The Tile: A 2D Map of Ranking Scores for Two-Class Classification

Foundations of the Theory of Performance-Based Ranking
