Abstract: Item Response Theory (IRT) has been widely used in educational psychometrics to assess student ability, as well as the difficulty and discrimination of test questions. In this context, discrimination specifically refers to how effectively a question distinguishes between students of different ability levels, and it does not carry any connotation related to fairness. In recent years, IRT has been successfully used to evaluate the predictive performance of Machine Learning (ML) models, but this paper marks its first application in fairness evaluation. In this paper, we propose a novel Fair-IRT framework to evaluate a set of predictive models on a set of individuals, while simultaneously eliciting specific parameters, namely, the ability to make fair predictions (a feature of predictive models), as well as the discrimination and difficulty of individuals that affect the prediction results. Furthermore, we conduct a series of experiments to comprehensively understand the implications of these parameters for fairness evaluation. Detailed explanations of item characteristic curves (ICCs) are provided for particular individuals. We propose the flatness of ICCs to disentangle the unfairness between individuals and predictive models. The experiments demonstrate the effectiveness of this framework as a fairness evaluation tool. Two real-world case studies illustrate its potential application in evaluating fairness in both classification and regression tasks. Our paper aligns well with the Responsible Web track by proposing a Fair-IRT framework to evaluate fairness in ML models, which directly contributes to the development of more inclusive, equitable, and trustworthy AI.

What problem does this paper attempt to address?

This paper addresses fairness evaluation for machine learning (ML) models. The authors propose Fair-IRT, a framework based on Item Response Theory (IRT) that evaluates how fairly a set of predictive models treats a set of individuals while simultaneously estimating the following parameters:

1. **Fair prediction ability of predictive models**: the ability of each predictive model to make fair predictions across individuals.
2. **Discrimination and difficulty of individuals**: properties of individuals that affect the prediction results. "Discrimination" describes how effectively an individual distinguishes between predictive models of different ability, and it carries no fairness connotation. "Difficulty" describes how hard it is for predictive models to treat an individual fairly.

#### Specific problem background

In recent years, IRT has been successfully applied to evaluate the predictive performance of machine learning models, but this is the first time it has been applied to fairness evaluation. Fairness is becoming increasingly important in real-world applications that involve decisions about people. For example:

- The COMPAS model shows racial bias when predicting defendants' recidivism risk.
- Facebook's recommendation algorithm has exhibited racial discrimination.
- The Meta AI image generator cannot correctly depict certain racial combinations.

These cases indicate that data sets or predictive models can become sources of unfairness and lead to serious social problems. A tool is therefore needed to evaluate the fairness of both data sets and predictive models.

Existing research usually reports only pairwise comparisons between predictive models under various fairness metrics. Such comparisons often fail to reveal where and how a predictive model fails, and they do not identify the particular strengths and weaknesses of each model.

#### Main contributions of the Fair-IRT framework

1. **Applying IRT to fairness evaluation for the first time**: The Fair-IRT framework evaluates fairness at the level of both individuals and predictive models, explains the ability of each predictive model, and identifies individuals who are treated unfairly.
2. **Separating unfairness due to individuals from unfairness due to predictive models**: Two methods are proposed. The first interprets the flatness of the item characteristic curve (ICC). The second introduces the Rasch beta IRT model as the backbone of the Fair-IRT framework to provide a quantitative measure of unfairness.
3. **Verification in practical applications**: The framework was evaluated on two real-world data sets. The results show that Fair-IRT provides a comprehensive explanation for fairness evaluation and can support the development of more inclusive, fairer, and more trustworthy AI systems.

In summary, the Fair-IRT framework fills gaps in existing fairness evaluation methods and provides a new tool for evaluating the fairness of machine learning models.
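
To make the ICC flatness idea concrete, here is a minimal Python sketch. It rests on several assumptions: it uses the standard two-parameter logistic (2PL) ICC from classical IRT rather than the beta IRT backbone that the paper adopts, and `icc_flatness` is a hypothetical score (one minus the spread of the curve over an ability grid) invented for illustration, not the paper's measure. All function and variable names are likewise illustrative.

```python
import math


def icc_2pl(ability, difficulty, discrimination):
    """Standard 2PL item characteristic curve from classical IRT.

    In the Fair-IRT analogy, `ability` belongs to a predictive model and
    `difficulty` / `discrimination` belong to an individual. This is an
    assumption for illustration; the paper itself uses a beta IRT model.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def icc_flatness(difficulty, discrimination, abilities):
    """Hypothetical flatness score in [0, 1]; 1.0 means a perfectly flat curve.

    Defined here as one minus the range of the ICC over the ability grid.
    This definition is illustrative only.
    """
    values = [icc_2pl(a, difficulty, discrimination) for a in abilities]
    return 1.0 - (max(values) - min(values))


# Ability grid from -3.0 to 3.0 in steps of 0.1.
abilities = [x / 10.0 for x in range(-30, 31)]

# Two hypothetical individuals with the same difficulty:
# a low-discrimination one (nearly flat ICC) and a high-discrimination one.
flat = icc_flatness(difficulty=0.0, discrimination=0.1, abilities=abilities)
steep = icc_flatness(difficulty=0.0, discrimination=3.0, abilities=abilities)

print(f"flatness (low discrimination):  {flat:.3f}")
print(f"flatness (high discrimination): {steep:.3f}")
```

Under this sketch, a nearly flat curve means the expected response for that individual barely changes as model ability varies, which is one way to read the claim that flatness helps separate individual-level effects from model-level effects. A steep curve means the outcome depends strongly on which model is used.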

Fairness Evaluation with Item Response Theory

FairRec: Fairness Testing for Deep Recommender Systems

Item Response Theory -- A Statistical Framework for Educational and Psychological Measurement

fl-IRT-ing with Psychometrics to Improve NLP Bias Measurement

A flexible framework for evaluating user and item fairness in recommender systems

AutoIRT: Calibrating Item Response Theory Models with Automated Machine Learning

Histological patterns of the metastases in pulmonary adenomatosis of sheep (jaagsiekte).

Understanding and improving fairness in cognitive diagnosis

Evaluation Measures of Individual Item Fairness for Recommender Systems: A Critical Study

A Psychometric Framework for Evaluating Fairness in Algorithmic Decision Making: Differential Algorithmic Functioning

Fairness Evaluation in Text Classification: Machine Learning Practitioner Perspectives of Individual and Group Fairness

Comparing Attitudes Across Groups: An IRT-Based Item-Fit Statistic for the Analysis of Measurement Invariance

Make Fairness More Fair: Fair Item Utility Estimation and Exposure Re-Distribution

Intersectional Two-sided Fairness in Recommendation

FairIF: Boosting Fairness in Deep Learning via Influence Functions with Validation Set Sensitive Attributes

Using Interpretable Machine Learning for Differential Item Functioning Detection in Psychometric Tests

Measuring, Interpreting, and Improving Fairness of Algorithms using Causal Inference and Randomized Experiments

Scalable Learning of Item Response Theory Models

Error Parity Fairness: Testing for Group Fairness in Regression Tasks

Should Fairness be a Metric or a Model? A Model-based Framework for Assessing Bias in Machine Learning Pipelines

Enhancing Item Response Theory for Cognitive Diagnosis