Abstract: In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration, that is, learning the item parameters of a test, is done using AutoIRT, a new method that combines automated machine learning (AutoML) with item response theory (IRT), originally proposed in [Sharpnack et al., 2024]. AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (AutoGluon.tabular, [Erickson et al., 2020]) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring. For administration of our adaptive test, we propose the BanditCAT framework, a methodology motivated by casting the problem in the contextual bandit framework and utilizing IRT. The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about ability. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We report reliability and exposure metrics for the five practice test experiments that utilized this framework.
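
The selection loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a two-parameter logistic (2PL) IRT model, a discretized ability posterior on a fixed grid, and a small hypothetical item bank with known parameters. It also omits the exposure-control randomization step, whose details the abstract does not give. All function and variable names are illustrative.

```python
import math
import random

def p_correct(theta, a, b):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def update_posterior(grid, weights, a, b, correct):
    """Bayesian update of the discretized ability posterior after one response."""
    likes = [p_correct(t, a, b) if correct else 1.0 - p_correct(t, a, b)
             for t in grid]
    new = [w * l for w, l in zip(weights, likes)]
    total = sum(new)
    return [w / total for w in new]

def select_item(grid, weights, items, used, rng):
    """Thompson-style step: draw theta from the posterior, then pick the
    unused item with the highest Fisher information (the reward) at that draw."""
    theta = rng.choices(grid, weights=weights, k=1)[0]
    candidates = [i for i in range(len(items)) if i not in used]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

# Hypothetical item bank: (discrimination a, difficulty b) pairs.
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]

# Standard normal prior on ability, discretized on [-4, 4].
grid = [-4.0 + 0.1 * k for k in range(81)]
prior = [math.exp(-0.5 * t * t) for t in grid]
total = sum(prior)
weights = [w / total for w in prior]

rng = random.Random(0)
true_theta = 0.7  # simulated test taker ability (assumption for the demo)
used = set()
for _ in range(3):
    i = select_item(grid, weights, items, used, rng)
    used.add(i)
    a, b = items[i]
    correct = rng.random() < p_correct(true_theta, a, b)  # simulated response
    weights = update_posterior(grid, weights, a, b, correct)

posterior_mean = sum(t * w for t, w in zip(grid, weights))
```

Sampling the ability from the posterior, rather than plugging in a point estimate, is what produces exploration in this sketch: early in the test the posterior is wide, so different draws favor items of different difficulty, and as responses accumulate the draws concentrate and selection approaches maximum-information selection.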

AutoIRT: Calibrating Item Response Theory Models with Automated Machine Learning

BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration

Item response theory in high-stakes pharmacy assessments

Item Response Theory -- A Statistical Framework for Educational and Psychological Measurement

Scalable Learning of Item Response Theory Models

fl-IRT-ing with Psychometrics to Improve NLP Bias Measurement

The irtQ R package: a user-friendly tool for item response theory-based test data analysis and calibration

Fairness Evaluation with Item Response Theory

Modeling Item-Level Heterogeneous Treatment Effects With the Explanatory Item Response Model: Leveraging Large-Scale Online Assessments to Pinpoint the Impact of Educational Interventions

Implicit assessment of language learning during practice as accurate as explicit testing

Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales

Variational Item Response Theory: Fast, Accurate, and Expressive

Modeling Item Response Theory with Stochastic Variational Inference

Bayesian Item Response Modeling in R with brms and Stan

A Comparative Study of Item Response Theory Models for Mixed Discrete-Continuous Responses

Enhancing Item Response Theory for Cognitive Diagnosis

Redefining Item Response Models for Small Samples

A neural network paradigm for modeling psychometric data and estimating IRT model parameters: Cross estimation network

A short tutorial on item response theory in rheumatology

KernSmoothIRT: An R Package for Kernel Smoothing in Item Response Theory

A Bayesian Nonparametric IRT Model