Abstract: In the computer vision and machine learning communities, as well as in many other research domains, rigorous evaluation of any new method, including classifiers, is essential. One key component of the evaluation process is the ability to compare and rank methods. However, ranking classifiers and accurately comparing their performances, especially when taking application-specific preferences into account, remains challenging. For instance, commonly used evaluation tools like Receiver Operating Characteristic (ROC) and Precision/Recall (PR) spaces display performances based on two scores. Hence, they are inherently limited in their ability to compare classifiers across a broader range of scores and lack the capability to establish a clear ranking among classifiers. In this paper, we present a novel versatile tool, named the Tile, that organizes an infinity of ranking scores in a single 2D map for two-class classifiers, including common evaluation scores such as the accuracy, the true positive rate, the positive predictive value, Jaccard's coefficient, and all F-beta scores. Furthermore, we study the properties of the underlying ranking scores, such as the influence of the priors or the correspondences with the ROC space, and depict how to characterize any other score by comparing them to the Tile. Overall, we demonstrate that the Tile is a powerful tool that effectively captures all the rankings in a single visualization and allows interpreting them.

What problem does this paper attempt to address?

The main problem this paper addresses is how to rank binary classifiers so that the ranking reflects the requirements of a specific application. Existing evaluation tools such as the ROC (Receiver Operating Characteristic) and PR (Precision/Recall) spaces are limited when it comes to comparing and ranking classifiers, especially when application-specific preferences must be taken into account. The authors therefore propose a new visualization tool, the Tile (Tile Map), which organizes an infinite number of ranking scores for binary classifiers and presents them on a single two-dimensional map.

### Specific Background of the Problem

1. **Limitations of Existing Evaluation Tools**:
   - Although ROC and PR spaces are widely used, each displays performance in terms of only two scores, so neither can compare classifiers across a wider range of scores.
   - These tools cannot establish a clear ranking among classifiers, especially when the requirements of a specific application matter (for example, minimizing false negatives in medical diagnosis, or maximizing true negatives in security systems).
2. **Diversity of Requirements in Application Scenarios**:
   - Different application scenarios tolerate different types of classifier error. For example:
     - In medical diagnosis, false negatives can have serious consequences and therefore need to be minimized.
     - In security systems, false positives may cost no more than an occasional false alarm, whereas ensuring security matters more.
     - In quality control, false positives may cause unnecessary production interruptions and increase costs.
3. **Difficulty in Selecting Appropriate Evaluation Scores**:
   - The literature contains a large number of evaluation scores, each emphasizing different error types. Selecting a score that suits the requirements of a specific application is very challenging.
   - Ranking by a single score may lead to a sub-optimal choice of classifier, and evaluation spaces that combine two scores (such as ROC and PR) cannot rank classifiers directly.

### Proposed Solution

To overcome these problems, the authors propose a new tool named "Tile", whose main features include:

- **Unified Two-Dimensional Map**: The Tile organizes an infinite number of ranking scores on a two-dimensional map, so that the performance of different classifiers can be compared visually.
- **Parametric Settings**: The Tile reflects application-specific preferences through two parameters. The first parameter controls the trade-off between true positives and true negatives, and the second balances false positives against false negatives.
- **Correspondence Analysis**: The paper studies the correspondence between the Tile and common evaluation spaces (such as ROC and PR), in particular how iso-performance lines are drawn.
- **Enhanced Explanatory Power**: The Tile makes the relationships between different evaluation scores easier to explain and allows classifiers to be ranked directly.

### Summary

The main contribution of the paper is a new visualization tool, the Tile, which organizes and compares an infinite number of ranking scores and helps researchers select the most appropriate classifier for the requirements of a specific application. This provides a powerful new method for classifier evaluation and selection.
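
To make the two-parameter idea from the Proposed Solution concrete, here is a minimal Python sketch. The summary above does not give the formula, so the weighted ratio used in `ranking_score` is an assumption chosen only because it matches the description (one parameter trading true positives against true negatives, the other trading false negatives against false positives). The names `Confusion`, `ranking_score`, `rank`, the parameters `a` and `b`, and the two example classifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of a two-class confusion matrix."""
    tp: int
    fp: int
    fn: int
    tn: int


def ranking_score(c: Confusion, a: float, b: float) -> float:
    """Assumed two-parameter score (illustrative, not taken from the paper).

    a in [0, 1]: weight on true positives; (1 - a) goes to true negatives.
    b in [0, 1]: weight on false negatives; (1 - b) goes to false positives.
    """
    good = a * c.tp + (1 - a) * c.tn
    bad = b * c.fn + (1 - b) * c.fp
    if good + bad == 0:
        raise ValueError("score undefined for these counts and weights")
    return good / (good + bad)


def rank(classifiers: dict[str, Confusion], a: float, b: float) -> list[str]:
    """Return classifier names sorted from best to worst for preference (a, b)."""
    return sorted(
        classifiers,
        key=lambda name: ranking_score(classifiers[name], a, b),
        reverse=True,
    )


# Hypothetical classifiers evaluated on the same 100 positives and 100 negatives.
classifiers = {
    "cautious": Confusion(tp=60, fp=5, fn=40, tn=95),
    "eager": Confusion(tp=95, fp=30, fn=5, tn=70),
}

# a = 1, b = 1: only TP and FN count, so the score reduces to the true positive rate.
recall_order = rank(classifiers, a=1.0, b=1.0)
# a = 1, b = 0: only TP and FP count, so the score reduces to the positive predictive value.
precision_order = rank(classifiers, a=1.0, b=0.0)

print(recall_order, precision_order)
```

Under this assumed form, `a = b = 0.5` gives the accuracy and `a = 1, b = 0.5` gives the F1 score, so several familiar scores become points in the same two-parameter square. The two hypothetical classifiers swap places between the recall-oriented and precision-oriented settings, which is the kind of preference-dependent ranking the summary describes.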

The Tile: A 2D Map of Ranking Scores for Two-Class Classification

A Hitchhiker's Guide to Understanding Performances of Two-Class Classifiers

Learning to Rank for Maps at Airbnb

Class Maps for Visualizing Classification Results

Multiclass ROC

Not All the Same: Understanding and Informing Similarity Estimation in Tile-Based Video Games

Testing the Consistency of Performance Scores Reported for Binary Classification Problems

Foundations of the Theory of Performance-Based Ranking

DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval

Interpretable3D: an Ad-Hoc Interpretable Classifier for 3D Point Clouds

Ranking evaluation metrics from a group-theoretic perspective

TexTile: A Differentiable Metric for Texture Tileability

Visualization of Tradeoff in Evaluation: from Precision-Recall & PN to LIFT, ROC & BIRD

The Treatment of Ties in Rank-Biased Overlap

Ranking by Aggregating Referees: Evaluating the Informativeness of Explanation Methods for Time Series Classification

Virtual Ground Truth, and Pre-selection of 3D Interest Points for Improved Repeatability Evaluation of 2D Detectors

BenchMetrics: a systematic benchmarking method for binary classification performance metrics

Top-K Pairwise Ranking: Bridging the Gap Among Ranking-Based Measures for Multi-Label Classification

IntTower: the Next Generation of Two-Tower Model for Pre-Ranking System

A novel evaluation methodology for supervised Feature Ranking algorithms