Constrained Labeling for Weakly Supervised Learning

Chidubem Arachie, Bert Huang
DOI: https://doi.org/10.48550/arXiv.2009.07360
2021-05-30
Abstract: Curation of large fully supervised datasets has become one of the major roadblocks for machine learning. Weak supervision provides an alternative to supervised learning by training with cheap, noisy, and possibly correlated labeling functions from varying sources. The key challenge in weakly supervised learning is combining the different weak supervision signals while navigating misleading correlations in their errors. In this paper, we propose a simple data-free approach for combining weak supervision signals by defining a constrained space for the possible labels of the weak signals and training with a random labeling within this constrained space. Our method is efficient and stable, converging after a few iterations of gradient descent. We prove theoretical conditions under which the worst-case error of the randomized label decreases with the rank of the linear constraints. We show experimentally that our method outperforms other weak supervision methods on various text- and image-classification tasks.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how, in weakly supervised learning, multiple weak supervision signals from different sources can be combined reliably to train an accurate model. Specifically, it identifies three main problems:

1. **High cost of data annotation**: Collecting large-scale, high-quality annotated datasets is crucial for training deep learning models, but the process is usually expensive and time-consuming.
2. **Error correlation of weak signals**: Different weak supervision signals may have misleadingly correlated errors, so naively combining them can degrade the quality of the model.
3. **Coverage and accuracy of weak signals**: Each weak signal has its own bias and may annotate only part of the data, resulting in incomplete coverage or insufficient accuracy.

To address these problems, the authors propose a method named **Constrained Label Learning (CLL)**. The core idea of CLL is to define a constrained space of labelings that is consistent with the weak signals and to draw a random labeling from that space for use as training labels. The aim is to improve the quality of the training labels and thereby the performance of the model.

### Specific method

CLL is implemented through the following steps:

- **Define the constraint space**: Use the expected error of each weak signal to define a constrained space that contains the true labels.
- **Randomly select labels**: Randomly select a label vector from this constrained space as the training labels.
- **Optimize label selection**: Find a label vector that satisfies the constraints by minimizing a quadratic penalty on constraint violations.

### Theoretical analysis

The authors also provide a theoretical analysis, proving that under certain conditions the worst-case error of the random labels decreases as the rank of the linear constraints increases. This indicates that CLL can still work effectively even when the weak signals are redundant.

### Experimental verification

The experimental results show that CLL outperforms other weak supervision methods on multiple text- and image-classification tasks, especially when the weak signals are highly dependent.

In conclusion, the paper targets label quality and model performance in weakly supervised learning and offers CLL as an efficient and stable solution.
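
The three steps listed under "Specific method" can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes binary labels, weak signals given as soft votes in [0, 1], and a user-supplied upper bound on each signal's error rate. The linear constraint form is derived from the usual definition of expected error for a soft binary vote. The function names, step size, and iteration count are hypothetical choices made for this example.

```python
import numpy as np


def build_constraints(weak_signals, error_bounds):
    """Return (A, c) so that A @ y <= c encodes the error bounds.

    Assumption: the expected error of a soft signal q against labels y is
    (1/n) * sum(q * (1 - y) + (1 - q) * y), which is linear in y.
    weak_signals: (m, n) array of votes in [0, 1].
    error_bounds: (m,) array of assumed upper bounds on each error rate.
    """
    q = np.asarray(weak_signals, dtype=float)
    eps = np.asarray(error_bounds, dtype=float)
    n = q.shape[1]
    A = 1.0 - 2.0 * q              # coefficient of y in the expected error
    c = n * eps - q.sum(axis=1)    # bound minus the part that is constant in y
    return A, c


def penalty(A, c, y):
    """Quadratic penalty on violated constraints: ||max(A y - c, 0)||^2."""
    return float(np.sum(np.maximum(A @ y - c, 0.0) ** 2))


def constrained_labels(weak_signals, error_bounds, steps=200, seed=0):
    """Start from a random labeling and reduce the penalty by projected
    gradient descent, keeping every entry of y inside [0, 1]."""
    A, c = build_constraints(weak_signals, error_bounds)
    rng = np.random.default_rng(seed)
    y = rng.uniform(size=A.shape[1])        # random starting labels
    # Step size 1/L, where L = 2 * ||A||_2^2 bounds the gradient's Lipschitz constant.
    lr = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12)
    for _ in range(steps):
        violation = np.maximum(A @ y - c, 0.0)
        grad = 2.0 * A.T @ violation
        y = np.clip(y - lr * grad, 0.0, 1.0)  # project back onto the unit box
    return y
```

In this sketch, a call such as `constrained_labels(q, eps)` takes an `(m, n)` array `q` of weak votes and a length-`m` array `eps` of error bounds, and returns soft labels in [0, 1] that could then be used to train a downstream classifier. The random starting point is what makes the result a random labeling within the constrained region rather than a fixed one.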