Abstract: Softmax with the cross-entropy loss is the standard configuration for current neural text classification models. The gold score for the target class is supposed to be 1, but this value is never reachable under the softmax scheme. As a result, training can continue indefinitely, which leads to overfitting. Moreover, the “target-approach-1” training goal forces the model to keep learning from all samples, wasting time on samples that are already classified correctly with high confidence, whereas the test goal only requires the target class of each sample to hold the maximum score. To address these weaknesses, we propose Adaptive Sparse softmax (AS-Softmax), which applies a reasonable, test-matching transformation on top of softmax. For more purposeful learning, we discard the classes whose scores are far smaller than that of the actual class during training. The model can then focus on learning to distinguish the target class from its strong opponents, which is also the main challenge at test time. In addition, since the training losses of easy samples gradually drop to 0 in AS-Softmax, we develop an adaptive gradient accumulation strategy based on the masked sample ratio to speed up training. We evaluate AS-Softmax on a variety of multi-class, multi-label, and token classification tasks with class sizes ranging from 5 to 5000+. The results show that AS-Softmax consistently outperforms softmax and its variants, and that the loss of AS-Softmax is remarkably correlated with classification performance on validation. Furthermore, the adaptive gradient accumulation strategy brings about a 1.2× training speedup compared with standard softmax while maintaining classification effectiveness.
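
The masking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the margin `delta`, the rule "mask a non-target class when its softmax probability is at least `delta` below the target's", and the function names `as_softmax_loss` and `masked_sample_ratio` are assumptions made here for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def as_softmax_loss(logits, target, delta=0.1):
    """Cross-entropy over the target and its 'strong opponents' only.

    Assumption (not taken from the abstract): a non-target class is
    discarded when its probability is at least `delta` below the
    target's probability.
    """
    probs = softmax(logits)
    p_t = probs[target]
    # Keep the target itself and every class that is not far below it.
    kept = [z for i, z in enumerate(logits)
            if i == target or p_t - probs[i] < delta]
    # Log-sum-exp over the kept logits, minus the target logit.
    m = max(kept)
    lse = m + math.log(sum(math.exp(z - m) for z in kept))
    return lse - logits[target]


def masked_sample_ratio(batch_logits, targets, delta=0.1):
    """Fraction of samples whose loss is exactly 0 (all opponents masked)."""
    losses = [as_softmax_loss(l, t, delta)
              for l, t in zip(batch_logits, targets)]
    return sum(1 for x in losses if x == 0.0) / len(losses)


if __name__ == "__main__":
    easy = [5.0, 1.0, 0.0]    # target 0 wins by a wide margin
    hard = [2.0, 1.9, -3.0]   # class 1 is a strong opponent of target 0
    print(as_softmax_loss(easy, 0))
    print(as_softmax_loss(hard, 0))
    print(masked_sample_ratio([easy, hard], [0, 0]))
```

Under these assumptions, an easy sample contributes zero loss once every opponent is masked, while a hard sample is still trained against its close competitor. The resulting masked sample ratio is the quantity the abstract says drives the adaptive gradient accumulation; how that ratio maps to an accumulation schedule is not specified in the abstract and is not sketched here.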

Regularization and Iterative Initialization of Softmax for Fast Training of Convolutional Neural Networks

Batch-Normalization-based Soft Filter Pruning for Deep Convolutional Neural Networks

Structured Pruning for Efficient Convolutional Neural Networks Via Incremental Regularization

$\mathcal{G}$-softmax: Improving Intra-class Compactness and Inter-class Separability of Features

Learning deep discriminative embeddings via joint rescaled features and log-probability centers

Softmax Dissection: Towards Understanding Intra- and Inter-class Objective for Embedding Learning

Class-Variant Margin Normalized Softmax Loss for Deep Face Recognition

An Aggressive Reduction on the Complexity of Optimization for Non-Strongly Convex Objectives

Imbalance Robust Softmax for Deep Embeeding Learning

Improving Classification Performance of Softmax Loss Function Based on Scalable Batch-Normalization

On the Learning Property of Logistic and Softmax Losses for Deep Neural Networks

Isomorphic Model-Based Initialization for Convolutional Neural Networks

Regularizing Deep Convolutional Neural Networks with a Structured Decorrelation Constraint

SubFace: learning with softmax approximation for face recognition

Attention Scheme Inspired Softmax Regression

Effective Domain Knowledge Transfer with Soft Fine-tuning

Improving Training of Deep Neural Networks Via Singular Value Bounding

Integrating Convolution and Sparse Coding for Learning Low-Dimensional Discriminative Image Representations

Softmax-Free Linear Transformers

Adaptive Sparse Softmax: An Effective and Efficient Softmax Variant for Text Classification

ConvBLS: An Effective and Efficient Incremental Convolutional Broad Learning System for Image Classification