Abstract: Pre-trained language models (PLMs) serve as backbones for various real-world systems. For high-stakes applications, it is equally essential to have reasonable confidence estimations in predictions. While the vanilla confidence scores of PLMs can already be effectively utilized, PLMs consistently become overconfident in their wrong predictions, which is not desirable in practice. Previous work shows that introducing an extra calibration task can mitigate this issue. The basic idea involves acquiring additional data to train models to predict the confidence of their initial predictions. However, that work only demonstrates the feasibility of this kind of method, assuming that abundant extra samples are available for the introduced calibration task. In this work, we consider the practical scenario in which we need to effectively utilize training samples to make PLMs both task-solvers and self-calibrators. Three challenges are presented: limited training samples, data imbalance, and distribution shifts. We first conduct pilot experiments to quantify various decisive factors in the calibration task. Based on the empirical analysis results, we propose a training algorithm, LM-TOAST, to tackle the challenges. Experimental results show that LM-TOAST can effectively utilize the training data to give PLMs reasonable confidence estimations while maintaining the original task performance. Further, we consider three downstream applications, namely selective classification, adversarial defense, and model cascading, to show the practical usefulness of LM-TOAST. The code will be made public at https://github.com/Yangyi-Chen/LM-TOAST.
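The calibration task mentioned in the abstract can be illustrated with a minimal sketch. This is not the LM-TOAST algorithm and is not taken from the authors' code. It only assumes that calibration examples are built by labelling each initial prediction as correct or wrong, and that selective classification means abstaining below a confidence threshold. The names `CalibrationExample`, `build_calibration_set`, and `selective_accuracy` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class CalibrationExample:
    """One sample for the (assumed) calibration task."""
    text: str
    predicted_label: int
    is_correct: int  # 1 if the initial prediction matched the gold label, else 0


def build_calibration_set(
    texts: Sequence[str],
    predictions: Sequence[int],
    gold_labels: Sequence[int],
) -> List[CalibrationExample]:
    """Turn task predictions into binary 'was the prediction right?' examples."""
    if not (len(texts) == len(predictions) == len(gold_labels)):
        raise ValueError("texts, predictions and gold_labels must have equal length")
    return [
        CalibrationExample(t, p, int(p == g))
        for t, p, g in zip(texts, predictions, gold_labels)
    ]


def selective_accuracy(
    confidences: Sequence[float],
    correct: Sequence[int],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, accuracy) when abstaining below `threshold`."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    # Keep only the predictions whose confidence reaches the threshold.
    kept = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(correct) if correct else 0.0
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return coverage, accuracy


# Toy data (made up for illustration): a mostly accurate model yields few
# negative calibration examples, which is the data imbalance the abstract names.
examples = build_calibration_set(["a", "b", "c"], [1, 0, 1], [1, 1, 1])
labels = [ex.is_correct for ex in examples]
coverage, accuracy = selective_accuracy([0.9, 0.4, 0.8], labels, threshold=0.5)
```

In this toy setup, a well-calibrated confidence estimator lets the system trade coverage for accuracy by raising the threshold, which is the intuition behind the selective classification application.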

Self-calibration for Language Model Quantization and Pruning

On the Impact of Calibration Data in Post-training Quantization and Pruning

Beware of Calibration Data for Pruning Large Language Models

Investigating Language-Specific Calibration For Pruning Multilingual Large Language Models

Self-Data Distillation for Recovering Quality in Pruned Large Language Models

Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models

Pruning Pre-trained Language Models with Principled Importance and Self-regularization

Making Pre-trained Language Models both Task-solvers and Self-calibrators

On the Impact of Quantization and Pruning of Self-Supervised Speech Models for Downstream Speech Recognition Tasks "In-the-Wild"

A Close Look into the Calibration of Pre-trained Language Models

Calibrating Long-form Generations from Large Language Models

On Calibration of Pre-trained Code Models

Gradient-based Intra-attention Pruning on Pre-trained Language Models

Compact Language Models via Pruning and Knowledge Distillation

Deep Neural Compression Via Concurrent Pruning and Self-Distillation

Fill In The Gaps: Model Calibration and Generalization with Synthetic Data

Calibrate Before Use: Improving Few-Shot Performance of Language Models

Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization

Bag of Tricks for In-Distribution Calibration of Pretrained Transformers

How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark