Abstract: The amount of training data is one of the key factors that determine the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Györfi, and Lugosi (1996) ask whether there exists a monotone Bayes-consistent algorithm. This question remained open for over 25 years, until Pestov (2021) recently resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm. We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed into a monotone one with similar performance. Further, the transformation is efficient and uses only black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye et al. (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and Mhammedi (2021). Our transformation readily implies monotone learners in a variety of contexts: for example, it extends Pestov's result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov's work, which is tailored to binary classification. In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions posed by Viering, Mey, and Loog (2019); Viering and Loog (2021); and Mhammedi (2021).

Monotone Learning
