Abstract: Pre-trained language models (PLMs) serve as backbones for various real-world systems. For high-stakes applications, it is equally essential to have reasonable confidence estimations in predictions. While the vanilla confidence scores of PLMs can already be effectively utilized, PLMs consistently become overconfident in their wrong predictions, which is not desirable in practice. Previous work shows that introducing an extra calibration task can mitigate this issue. The basic idea involves acquiring additional data to train models to predict the confidence of their initial predictions. However, that work only demonstrates the feasibility of this kind of method, assuming that abundant extra samples are available for the introduced calibration task. In this work, we consider the practical scenario in which we need to effectively utilize training samples to make PLMs both task-solvers and self-calibrators. Three challenges are presented: limited training samples, data imbalance, and distribution shifts. We first conduct pilot experiments to quantify various decisive factors in the calibration task. Based on the empirical analysis results, we propose a training algorithm, LM-TOAST, to tackle the challenges. Experimental results show that LM-TOAST can effectively utilize the training data to give PLMs reasonable confidence estimations while maintaining the original task performance. Further, we consider three downstream applications, namely selective classification, adversarial defense, and model cascading, to show the practical usefulness of LM-TOAST. The code will be made public at https://github.com/Yangyi-Chen/LM-TOAST.
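
The calibration task mentioned in the abstract trains a model to predict whether its own initial prediction is right. The sketch below shows one plausible way to build such samples from task predictions and gold labels. It is an illustration under stated assumptions, not the LM-TOAST algorithm: the function name `build_calibration_samples`, the field names, and the toy sentiment data are all hypothetical.

```python
from collections import Counter


def build_calibration_samples(texts, predictions, gold_labels):
    """Pair each input and its predicted label with a binary correctness target.

    Hypothetical helper: the target is 1 when the task prediction matches
    the gold label and 0 otherwise.
    """
    samples = []
    for text, pred, gold in zip(texts, predictions, gold_labels):
        samples.append(
            {
                "text": text,
                "predicted_label": pred,
                "is_correct": int(pred == gold),
            }
        )
    return samples


# Toy data (assumed for illustration only).
texts = ["great movie", "dull plot", "fine acting", "not bad at all"]
predictions = ["pos", "neg", "pos", "neg"]
gold_labels = ["pos", "neg", "pos", "pos"]

calibration_samples = build_calibration_samples(texts, predictions, gold_labels)

# Count correct vs. incorrect targets to expose the class balance.
label_counts = Counter(sample["is_correct"] for sample in calibration_samples)
print(label_counts)
```

Because a reasonably accurate task model is right most of the time, the correctness targets tend to be skewed toward 1, which is one way to picture the data imbalance challenge the abstract names.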

Accelerating Pretrained Language Model Inference Using Weighted Ensemble Self-distillation

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

Accelerating Pre-trained Language Models via Calibrated Cascade

COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models

Exploring Extreme Parameter Compression for Pre-trained Language Models

SmartBERT: A Promotion of Dynamic Early Exiting Mechanism for Accelerating BERT Inference

Accelerating Large Language Model Inference with Self-Supervised Early Exits

One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services

Interpreting & Improving Pretrained Language Models: A Probabilistic Conceptual Approach

Joint Dual Feature Distillation and Gradient Progressive Pruning for BERT compression

TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference

NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application

Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective

Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method

Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared Pre-trained Language Models

Patient Knowledge Distillation for BERT Model Compression

Self-Data Distillation for Recovering Quality in Pruned Large Language Models

BADGE: Speeding Up BERT Inference after Deployment Via Block-wise Bypasses and Divergence-based Early Exiting

Length-Adaptive Distillation: Customizing Small Language Model for Dynamic Token Pruning

Making Pre-trained Language Models both Task-solvers and Self-calibrators