Abstract: We consider the problem of learning a loss function which, when minimized over a training dataset, yields a model that approximately minimizes a validation error metric. Though learning an optimal loss function is NP-hard, we present an anytime algorithm that is asymptotically optimal in the worst case, and is provably efficient in an idealized "easy" case. Experimentally, we show that this algorithm can be used to tune loss function hyperparameters orders of magnitude faster than state-of-the-art alternatives. We also show that our algorithm can be used to learn novel and effective loss functions on-the-fly during training.
What problem does this paper attempt to address?
The paper asks how to learn a loss function such that minimizing it on the training set yields a model whose error metric on the validation set is as small as possible. Specifically, the author focuses on finding a linear loss function whose minimizer achieves low test error.
### Problem Background
In machine learning, most models are obtained by minimizing some loss function, but a low training loss is usually not the ultimate goal. A model is typically evaluated on unseen test data, using a performance metric that may be only loosely related to the training loss (for example, top-1 error rate vs. log-loss). The choice of loss function therefore strongly affects the final quality of the model.
### Main Challenges
Although the choice of loss function matters, it is still unknown whether commonly used loss functions (such as log-loss) are close to optimal. For example, in the ImageNet classification task, state-of-the-art models are trained by minimizing log-loss on the training data, but are evaluated by top-1 or top-5 accuracy. Are there other loss functions that would yield better values of these evaluation metrics?
### Core Problem of the Paper
The goal of the paper is to learn a loss function such that a model obtained by approximately minimizing it on the training data performs well on the test data according to a given error metric. The error metric does not have to be differentiable and may be only loosely related to the loss function.
### Mathematical Representation
Suppose there is a set of models \(\Theta\subseteq\mathbb{R}^n\) and a test error \(e:\Theta\rightarrow\mathbb{R}_{\geq0}\). The goal is to find a training loss function \(\ell:\Theta\rightarrow\mathbb{R}_{\geq0}\), drawn from a set \(L\) of candidate loss functions, such that the model \(\hat{\theta}(\ell)\) obtained by minimizing \(\ell\) has the lowest test error \(e\). That is, we want to solve the bilevel minimization problem:
\[
\min_{\ell\in L}e(\hat{\theta}(\ell))
\]
where
\[
\hat{\theta}(\ell)=\arg\min_{\theta\in\Theta}\ell(\theta)
\]
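To make the two levels concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it brute-forces the outer problem over a small finite set \(L\) of ridge-regularized squared losses for a one-parameter model, where the inner minimizer \(\hat{\theta}(\ell)\) has a closed form. The data generator, the grid `lams`, and all function names are hypothetical, chosen only for illustration. The outer metric is mean absolute error, which is not differentiable in \(\theta\).

```python
import random

def inner_minimizer(xs, ys, lam):
    """theta_hat(l) for l(theta) = sum((theta*x - y)^2) + lam * theta^2.

    For this one-parameter ridge loss the minimizer is available in closed form.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def validation_error(theta, xs, ys):
    """e(theta): mean absolute error on held-out data."""
    return sum(abs(theta * x - y) for x, y in zip(xs, ys)) / len(xs)

def outer_search(train, valid, lams):
    """Brute-force the outer minimization over a finite set of candidate losses."""
    best = None
    for lam in lams:
        theta = inner_minimizer(*train, lam)      # inner level: minimize l on training data
        err = validation_error(theta, *valid)     # outer level: score the minimizer with e
        if best is None or err < best[0]:
            best = (err, lam, theta)
    return best

def make_data(n, rng, true_theta=2.0, noise=1.0):
    """Hypothetical synthetic data: y = true_theta * x + Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_theta * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

rng = random.Random(0)
train = make_data(10, rng)
valid = make_data(200, rng)
lams = [0.0, 0.01, 0.1, 1.0, 10.0]  # the finite candidate set L, indexed by lam

best_err, best_lam, best_theta = outer_search(train, valid, lams)
print(best_lam, best_theta, best_err)
```

Each candidate loss requires a full inner minimization, so the cost of this exhaustive approach grows with the size of \(L\); that cost is what motivates a more efficient search.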
### Application Scenarios
This problem has multiple application scenarios, including but not limited to:
1. **Adjusting hyperparameters of the loss function**: For example, when performing softmax regression, using L1 and L2 regularization simultaneously.
2. **Learning data augmentation strategies**: For example, randomly applying image transformations in the ImageNet classification task.
3. **Learning new regularizers**: For example, using the learned convex function as a regularization term.
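The first scenario shows what a "linear" loss function can look like. The sketch below assumes one reading of the term: the loss is a weighted sum \(\ell_\lambda(\theta)=\lambda\cdot f(\theta)\) of a fixed feature vector \(f(\theta)\), here the softmax log-loss, the L1 norm, and the squared L2 norm of the weights. The weights \(\lambda\) are the quantities to be learned. The function names and the choice of features are illustrative assumptions, not taken from the paper.

```python
import math

def log_loss(theta, examples):
    """Mean softmax cross-entropy; theta is a list of per-class weight vectors."""
    total = 0.0
    for x, label in examples:
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in theta]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[label]
    return total / len(examples)

def features(theta, examples):
    """f(theta): [log-loss, L1 norm of weights, squared L2 norm of weights]."""
    flat = [w for row in theta for w in row]
    return [
        log_loss(theta, examples),
        sum(abs(w) for w in flat),
        sum(w * w for w in flat),
    ]

def linear_loss(lam, theta, examples):
    """l_lam(theta) = lam . f(theta): linear in lam, not in theta."""
    return sum(l * f for l, f in zip(lam, features(theta, examples)))

# Two classes, two input features; lam = (1, 0.5, 0.1) weights the three terms.
theta = [[1.0, -2.0], [0.0, 0.0]]
examples = [([1.0, 0.0], 0)]
print(linear_loss([1.0, 0.5, 0.1], theta, examples))
```

Under this reading, tuning the L1 and L2 strengths amounts to choosing the last two entries of `lam`, so hyperparameter tuning becomes a search over loss functions in \(L\).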
### Conclusion
Although computing the optimal linear loss function is NP-hard, the author proposes an asymptotically optimal algorithm, LearnLoss, which provably finds an approximately optimal loss function efficiently in an idealized "easy" case. Experimental results show that the algorithm tunes loss function hyperparameters several orders of magnitude faster than existing methods, and that it can learn effective loss functions during a single training run, reducing overfitting and improving the model's generalization.