Advancing neural network calibration: The role of gradient decay in large-margin Softmax optimization

Siyuan Zhang, Linbo Xie
DOI: https://doi.org/10.1016/j.neunet.2024.106457
Abstract: This study introduces a novel hyperparameter in the Softmax function to regulate the rate of gradient decay, which depends on sample probability. Our theoretical and empirical analyses reveal that both model generalization and calibration are significantly influenced by the gradient decay rate, particularly as confidence probability increases. Notably, the gradient decay varies in a convex or concave manner with rising sample probability. When employing a smaller gradient decay, we observe a curriculum learning sequence. This sequence highlights hard samples only after easy samples are adequately trained, and allows well-separated samples to receive a higher gradient, effectively reducing intra-class distances. However, this approach has a drawback: small gradient decay tends to exacerbate model overconfidence, shedding light on the calibration issues prevalent in modern neural networks. In contrast, a larger gradient decay addresses these issues effectively, surpassing even models that utilize post-calibration methods. Our findings provide substantial evidence that large-margin Softmax can influence the local Lipschitz constraint by manipulating the probability-dependent gradient decay rate. This research contributes a fresh perspective on the interplay between large-margin Softmax, curriculum learning, and model calibration through an exploration of gradient decay rates. Additionally, we propose a novel warm-up strategy that dynamically adjusts the gradient decay for a smoother L-constraint in early training, thereby mitigating overconfidence in the final model.
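The abstract does not state the modified Softmax formula, so the following is only a minimal sketch under an assumed form: a hypothetical hyperparameter `beta` scales the true-class term in the denominator, J = -log(e^{z_c} / (sum_{j != c} e^{z_j} + beta * e^{z_c})). Under that assumption, the gradient magnitude with respect to the true-class logit is (1 - p) / (1 - p + beta * p), where p is the ordinary Softmax probability of the true class. The names `scaled_softmax_loss`, `gradient_magnitude`, and `beta` are illustrative and not taken from the paper.

```python
import math


def scaled_softmax_loss(logits, target, beta):
    """Assumed loss: -log(e^{z_c} / (sum_{j != c} e^{z_j} + beta * e^{z_c})).

    With beta = 1 this is the ordinary Softmax cross-entropy.
    The exact form used in the paper is NOT given in the abstract.
    """
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - m) for z in logits]
    others = sum(e for i, e in enumerate(exps) if i != target)
    return -math.log(exps[target] / (others + beta * exps[target]))


def gradient_magnitude(p, beta):
    """|dJ/dz_c| of the assumed loss, written in terms of the ordinary
    Softmax probability p of the true class."""
    return (1.0 - p) / (1.0 - p + beta * p)


if __name__ == "__main__":
    # Print how the gradient magnitude falls off as confidence p rises,
    # for a small, a standard, and a large value of the assumed beta.
    for beta in (0.1, 1.0, 10.0):
        row = [round(gradient_magnitude(p, beta), 3) for p in (0.1, 0.5, 0.9, 0.99)]
        print(f"beta={beta}: {row}")
```

In this sketch, a small `beta` keeps the gradient large until p is close to 1 (a concave curve, i.e. small gradient decay), while a large `beta` makes it drop quickly (a convex curve, i.e. large gradient decay), which matches the qualitative behavior the abstract describes.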
What problem does this paper attempt to address?