Abstract: The amount of training data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Györfi, and Lugosi (1996) ask whether there exists a monotone Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm.
We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye et al. (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021).
Our transformation readily implies monotone learners in a variety of contexts: for example it extends Pestov's result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov's work which is tailored to binary classification.
In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021).
### What problem does this paper attempt to address?
The core question of this paper is whether a learning algorithm's error can be guaranteed to decrease monotonically as the amount of training data increases. Specifically, the paper asks whether there exists a learning algorithm whose expected loss (error rate) never increases when more training data is obtained. This question was posed by Devroye, Györfi, and Lugosi (1996) and remained open until Pestov (2021) resolved it for binary classification.
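One common way to formalize this requirement is sketched below. The notation is assumed here for illustration rather than quoted from the paper: \(A\) is the learner, \(D\) is the data distribution, \(S\) is an i.i.d. sample, and \(L_D\) is the population error.

\[
\mathbb{E}_{S\sim D^{n+1}}\bigl[L_D(A(S))\bigr] \;\le\; \mathbb{E}_{S\sim D^{n}}\bigl[L_D(A(S))\bigr] \quad \text{for every distribution } D \text{ and every sample size } n.
\]

Under this reading, monotonicity is a statement about the expected error over the draw of the sample (and any internal randomness of \(A\)), not about every individual sample.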
The main contributions of the paper are as follows:
1. **Proposing a general result**: The paper proves that for multiclass classification tasks, any learning algorithm \(A\) can be converted into a monotone learning algorithm \(M\) with similar performance. The conversion is efficient and requires only black-box access to \(A\). This shows that non-monotonic behavior can be avoided without sacrificing performance, thus answering the questions raised by Devroye et al. (1996), Viering et al. (2019, 2021) and Mhammedi (2021).
2. **Extending Pestov's result**: Applying this conversion generalizes Pestov's result from binary classification to classification tasks with an arbitrary number of labels.
3. **Providing a uniform bound on the error**: The paper also provides uniform bounds on the error of the monotone algorithm, which makes the conversion applicable in distribution-free settings. For example, in the PAC learning framework, it implies that every learnable class admits a monotone PAC learner.
### Main technical contributions of the paper
1. **Constructing a general framework**: The paper develops a general axiomatic framework for converting any learner into a monotone learner with similar guarantees. The core of this framework is to construct, for each hypothesis \(h\), a small and symmetric hypothesis class \(B_h\) such that \(h\in B_h\) and \(B_h\) can be learned by a monotone learner. For example, in binary classification tasks, \(B_h = \{h, 1 - h\}\), while in multiclass classification tasks, \(B_h=\{s_i\circ h:i\in[k]\}\), where \(s_i\) is a cyclic permutation of the labels.
2. **Proving the main theorem**: The paper uses the above framework to prove the main theorem (Theorem 1.2) in Sections 3 and 4. Section 3 treats binary classification as a warm-up for the more general multiclass setting, which is discussed in Section 4. The most involved part of the argument is the proof of Proposition 4.1, especially Lemma 4.2, which asserts that the randomized empirical risk minimizer (ERM) is monotone on \(B_h\).
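The class \(B_h\) and an ERM over it can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's construction in full. It assumes labels are encoded as integers `0..k-1`, that \(s_i\) is the cyclic shift \(y \mapsto (y+i) \bmod k\), and that the randomization in the ERM is uniform tie-breaking among empirical minimizers. The function names `cyclic_class`, `empirical_error`, and `randomized_erm` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], int]
Sample = Sequence[Tuple[float, int]]


def cyclic_class(h: Hypothesis, k: int) -> List[Hypothesis]:
    """Build B_h = {s_i o h : i in [k]} with s_i(y) = (y + i) mod k (assumed encoding)."""
    def shifted(i: int) -> Hypothesis:
        # Bind i explicitly so each closure keeps its own shift.
        return lambda x: (h(x) + i) % k
    return [shifted(i) for i in range(k)]


def empirical_error(g: Hypothesis, sample: Sample) -> float:
    """Fraction of sample points that g misclassifies (0.0 on an empty sample)."""
    if not sample:
        return 0.0
    return sum(g(x) != y for x, y in sample) / len(sample)


def randomized_erm(candidates: List[Hypothesis], sample: Sample,
                   rng: random.Random) -> Hypothesis:
    """Return an empirical risk minimizer, breaking ties uniformly at random (assumption)."""
    errors = [empirical_error(g, sample) for g in candidates]
    best = min(errors)
    minimizers = [g for g, e in zip(candidates, errors) if e == best]
    return rng.choice(minimizers)
```

With `k = 2` the shift by one is `1 - h`, so `cyclic_class` reduces to the binary class \(\{h, 1-h\}\). The shift `i = 0` is the identity, which gives \(h \in B_h\).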
### Related work
- **The concept of monotone learning curves**: The question was originally raised by Devroye, Györfi, and Lugosi (1996), but it did not attract wide attention until recent years.
- **Other research**: Viering, Mey, and Loog (2019, 2020) and Mhammedi (2021) studied methods for converting a given learner into a monotonic learner and proposed some weak forms of monotonicity.
- **Pestov's work**: Pestov (2021) solved the problem for binary classification, and this paper extends his result to multiclass classification.
### Conclusion
By providing a general conversion method, this paper proves that any learning algorithm can be converted into a monotone learning algorithm without sacrificing performance. This not only answers long-standing theoretical questions but also provides new tools and methods for practical applications.