Abstract: Active learning (AL) improves data labeling efficiency by selecting the most informative instances for annotation. A key component of this procedure is the acquisition function, which guides selection by identifying suitable instances for labeling from the unlabeled pool. However, acquisition functions incur high computational costs on large unlabeled pools, which limits their applicability to large datasets. To address this challenge, we introduce ActivePrune, a novel plug-and-play data pruning strategy that leverages language models to prune the unlabeled pool. ActivePrune implements a two-stage pruning process: an initial fast evaluation using perplexity scores from an n-gram language model, followed by a high-quality selection using data quality metrics computed with a quantized LLM. Additionally, to enhance diversity in the unlabeled pool, we propose a novel perplexity reweighting method that systematically brings forward underrepresented instances for selection in subsequent labeling iterations. Experiments on translation, sentiment analysis, topic classification, and summarization tasks, spanning four diverse datasets and four active learning strategies, demonstrate that ActivePrune outperforms existing data pruning methods. Finally, we compare the trade-off between selection quality and efficiency across the data pruning methods and demonstrate that ActivePrune is computationally more efficient than other LLM score-based pruning methods, providing up to a 74% reduction in the end-to-end time required for active learning.
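
The two-stage process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the add-one-smoothed bigram model, the choice to keep the lowest-perplexity instances in stage one, the function names, and `toy_quality` (a hypothetical placeholder for the quality score that the paper obtains from a quantized LLM) are all assumptions made for illustration. The perplexity reweighting step is not shown.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace tokens."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        tokens = ["<s>"] + text.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(text, unigrams, bigrams):
    """Per-token perplexity under an add-one-smoothed bigram model."""
    tokens = ["<s>"] + text.lower().split()
    vocab = len(unigrams) + 1  # one extra slot for unseen tokens
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    n = max(len(tokens) - 1, 1)
    return math.exp(-log_prob / n)


def toy_quality(text):
    """Hypothetical stand-in for an LLM quality score:
    the fraction of distinct tokens in the text."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def two_stage_prune(pool, reference, keep_stage1, keep_stage2,
                    quality_fn=toy_quality):
    """Stage 1: cheap perplexity filter. Stage 2: costlier quality ranking."""
    unigrams, bigrams = train_bigram(reference)
    # Assumption: lower perplexity is preferred in the fast first pass.
    by_ppl = sorted(pool, key=lambda t: perplexity(t, unigrams, bigrams))
    survivors = by_ppl[:keep_stage1]
    # Only the survivors are scored by the expensive quality function.
    by_quality = sorted(survivors, key=quality_fn, reverse=True)
    return by_quality[:keep_stage2]


reference = ["the cat sat on the mat", "the dog sat on the rug"]
pool = [
    "the cat sat on the rug",
    "zzz qqq xxx yyy",
    "colorless green ideas sleep furiously",
    "the dog sat on the mat",
]
pruned = two_stage_prune(pool, reference, keep_stage1=2, keep_stage2=1)
```

The point of the ordering is cost: the n-gram pass touches every instance in the pool, while the expensive scorer only sees the `keep_stage1` survivors. The pruned pool would then be handed to whichever acquisition function the active learning loop uses.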

Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy

Improving Language Model Size Reduction Using Better Pruning Criteria

Infrared interferometric observations of young stellar objects

Text Quality-Based Pruning for Efficient Training of Language Models

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

Neural Language Model Pruning for Automatic Speech Recognition

Pruning Pre-trained Language Models with Principled Importance and Self-regularization

Large Language Model Pruning

Dissecting Language Models: Machine Unlearning via Selective Pruning

How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark

Language Model-Driven Data Pruning Enables Efficient Active Learning

Beware of Calibration Data for Pruning Large Language Models

LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models

Investigating Language-Specific Calibration For Pruning Multilingual Large Language Models

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling

Pruning as a Domain-specific LLM Extractor

SparseLLM: Towards Global Pruning for Pre-trained Language Models

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models

Layer-wise Model Pruning Based on Mutual Information