ActiveDP: Bridging Active Learning and Data Programming

Naiqing Guan, Nick Koudas
2024-02-09
Abstract: Modern machine learning models require large labelled datasets to achieve good performance, but manually labelling large datasets is expensive and time-consuming. The data programming paradigm enables users to label large datasets efficiently but produces noisy labels, which deteriorates the downstream model's performance. The active learning paradigm, on the other hand, can acquire accurate labels but only for a small fraction of instances. In this paper, we propose ActiveDP, an interactive framework bridging active learning and data programming together to generate labels with both high accuracy and coverage, combining the strengths of both paradigms. Experiments show that ActiveDP outperforms previous weak supervision and active learning approaches and consistently performs well under different labelling budgets.
Machine Learning, Databases
### What problem does this paper attempt to solve?

This paper addresses the tension between modern machine-learning models' need for large labeled datasets and the high cost and time required to label data manually. Specifically, it proposes an interactive framework named ActiveDP, which combines two paradigms, active learning and data programming, to generate labels with both high accuracy and wide coverage.

#### Background problems

1. **Requirement for large-scale labeled data**: Modern machine-learning models need a large amount of labeled data to perform well, but labeling this data manually is both expensive and time-consuming.
2. **Limitations of data programming**: Data programming can label a large amount of data quickly through weak supervision, but the generated labels are usually noisy, which hurts the performance of downstream models.
3. **Limitations of active learning**: Active learning can obtain high-quality labels, but only for a small fraction of instances, so it is difficult to cover the entire dataset.

#### Proposed solutions

To address these problems, the paper proposes ActiveDP, whose main goals are:

- **Combining the advantages of active learning and data programming**: By combining the two paradigms, ActiveDP improves label coverage while maintaining label quality.
- **Balancing label quality and quantity**: ActiveDP uses the predictions of both the weak-supervision model and the active-learning model to balance the quality and quantity of labels.
- **Improving label quality**: In the post-training stage, ActiveDP further improves label quality by aggregating the predictions of the label model and the active-learning model through the ConFusion method.

#### Main contributions

1. **Proposing a new interactive framework, ActiveDP**, which explores the design space between active learning and data programming and combines the advantages of both.
2. **Designing several novel strategies**, including the ConFusion method for label aggregation, the ADP sampler for query instance selection, and the LabelPick method for labeling function (LF) filtering, to improve the efficiency of ActiveDP and the quality of the generated labels.
3. **Verifying the effectiveness of ActiveDP through extensive experiments**, which show that it outperforms baseline methods in providing high-coverage, high-precision labels and in improving the performance of downstream models.

In summary, the paper proposes ActiveDP, a framework that combines active learning and data programming to ease the tension between the need for large labeled datasets and the cost of manual labeling, while improving both label quality and coverage.
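To make the idea of combining the two label sources concrete, the following is a minimal Python sketch. It is not the paper's ConFusion algorithm, whose details this summary does not give. It assumes a simple confidence-threshold rule: use the active-learning model's prediction when that model is confident, otherwise fall back to a majority vote over labeling functions, and abstain if neither applies. The labeling functions, the `fuse` helper, the threshold value, and the example probabilities are all hypothetical.

```python
from collections import Counter

ABSTAIN = -1  # returned by a labeling function that does not fire


# Hypothetical keyword-based labeling functions for a toy spam task
# (1 = spam, 0 = not spam). Real LFs would be written by the user.
def lf_contains_free(text):
    return 1 if "free" in text.lower() else ABSTAIN


def lf_contains_meeting(text):
    return 0 if "meeting" in text.lower() else ABSTAIN


def majority_vote(votes):
    """Stand-in for a label model: majority vote over the LFs that fired.

    Returns ABSTAIN when no LF fired. Ties resolve to the label seen first.
    """
    fired = [v for v in votes if v != ABSTAIN]
    if not fired:
        return ABSTAIN
    return Counter(fired).most_common(1)[0][0]


def fuse(label_model_label, al_probs, threshold=0.8):
    """Assumed fusion rule (not taken from the paper).

    Use the active-learning model's prediction when its top class
    probability reaches `threshold`; otherwise use the label model's
    output, which may itself be ABSTAIN.
    """
    al_label = max(range(len(al_probs)), key=lambda c: al_probs[c])
    if al_probs[al_label] >= threshold:
        return al_label
    return label_model_label


texts = ["Free prize inside", "Meeting moved to 3pm", "See you tomorrow", "Lunch?"]
# Hypothetical class probabilities [P(not spam), P(spam)] from an
# active-learning model trained on a few manually labeled instances.
al_probs = [[0.55, 0.45], [0.95, 0.05], [0.10, 0.90], [0.50, 0.50]]

lfs = [lf_contains_free, lf_contains_meeting]
fused = [
    fuse(majority_vote([lf(t) for lf in lfs]), probs)
    for t, probs in zip(texts, al_probs)
]
print(fused)
```

In this toy run, the first text is labeled by the labeling functions because the active-learning model is unsure, the third is labeled by the active-learning model because no labeling function fires, and the last is left unlabeled. This illustrates how the two sources can complement each other on coverage and accuracy.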