Abstract: Data sampling is an effective method to improve the training speed of neural networks, with recent results demonstrating that it can even break the neural scaling laws. These results critically rely on high-quality scores to estimate the importance of an input to the network. We observe that there are two dominant strategies: static sampling, where the scores are determined before training, and dynamic sampling, where the scores can depend on the model weights. Static algorithms are computationally inexpensive but less effective than their dynamic counterparts, which can cause end-to-end slowdown due to their need to explicitly compute losses. To address this problem, we propose a novel sampling distribution based on nonparametric kernel regression that learns an effective importance score as the neural network trains. However, nonparametric regression models are too computationally expensive to accelerate end-to-end training. Therefore, we develop an efficient sketch-based approximation to the Nadaraya-Watson estimator. Using recent techniques from high-dimensional statistics and randomized algorithms, we prove that our Nadaraya-Watson sketch approximates the estimator with exponential convergence guarantees. Our sampling algorithm outperforms the baseline in terms of wall-clock time and accuracy on four datasets.
What problem does this paper attempt to address?
The paper addresses the computational bottleneck of training deep neural networks on large-scale datasets, focusing on how to accelerate training without sacrificing final model performance. The authors propose a method, Adaptive Sampling for Deep Learning via Efficient Nonparametric Proxies, that estimates the importance of each sample using efficient nonparametric proxies.
Specifically, the key contributions and technical points presented in the paper are as follows:
1. **Problem Definition**: The paper first defines the problem background, stating that with the exponential growth of data, training large-scale neural networks faces challenges in terms of time, energy, and storage. Traditional full-dataset training methods become impractical, especially in scientific and industrial scenarios. Therefore, data selection has become a popular method to address this issue.
2. **Issues with Existing Methods**: Current data sampling strategies can be divided into static sampling and dynamic sampling. Static sampling methods determine the sampling scores before training begins; they are computationally inexpensive but less effective. Dynamic sampling methods let the scores depend on the current model weights and are more effective, but they can slow end-to-end training because they must explicitly compute losses.
3. **Proposed Solution**: To overcome the above limitations, the paper proposes a novel sampling distribution based on nonparametric kernel regression. This method can dynamically learn effective sample importance scores during neural network training while reducing computational complexity through efficient sketch-based approximation techniques.
4. **Key Technologies**:
- **Nadaraya-Watson Sketch (NWS)**: The paper develops a sketch-based approximation to the Nadaraya-Watson estimator, the Nadaraya-Watson Sketch (NWS), and proves that it approximates the estimator with exponential convergence guarantees.
- **Importance Sampling**: Using NWS as a subroutine, the paper proposes an online algorithm to predict the importance of samples for model training. This algorithm can dynamically estimate the loss value of each sample based on the observed loss sequence during training.
5. **Experimental Validation**: The paper evaluates the proposed method on four datasets. The sampling algorithm outperforms the baseline in both wall-clock time and accuracy.
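To make the idea in points 3 and 4 concrete, the following is a minimal Python sketch, not the paper's implementation. It uses the exact Nadaraya-Watson estimator, which predicts a value at a query as the kernel-weighted average of observed values, in place of the paper's sketch-based approximation. The Gaussian kernel, the bandwidth, the uniform smoothing term, the toy features and losses, and the function names `nadaraya_watson` and `sampling_distribution` are all assumptions made for illustration.

```python
import numpy as np

def nadaraya_watson(queries, points, values, bandwidth=1.0):
    """Exact Nadaraya-Watson estimate with a Gaussian kernel (assumed kernel choice)."""
    # Pairwise squared Euclidean distances, shape (n_queries, n_points).
    sq_dists = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    # Kernel-weighted average of observed values; the small constant guards against 0/0.
    return (weights @ values) / (weights.sum(axis=1) + 1e-12)

def sampling_distribution(predicted_losses, smoothing=0.1):
    """Turn predicted losses into probabilities, mixed with uniform so none is zero."""
    scores = np.clip(predicted_losses, 0.0, None)
    total = scores.sum()
    n = len(scores)
    proportional = scores / total if total > 0 else np.full(n, 1.0 / n)
    return (1.0 - smoothing) * proportional + smoothing / n

rng = np.random.default_rng(0)
num_examples = 200
features = rng.normal(size=(num_examples, 5))   # hypothetical per-example features
true_losses = np.abs(features[:, 0])            # toy stand-in for per-example loss

# Losses are observed only for the examples visited so far in training.
seen = rng.choice(num_examples, size=50, replace=False)
predicted = nadaraya_watson(features, features[seen], true_losses[seen])

# Sample a batch in proportion to predicted loss, without computing losses
# for the unseen examples.
probs = sampling_distribution(predicted)
batch = rng.choice(num_examples, size=32, p=probs)

# Importance weights 1 / (N * p_i) keep the weighted batch loss an unbiased
# estimate of the mean loss over the full dataset.
batch_weights = 1.0 / (num_examples * probs[batch])
```

The exact estimator here compares every query against every observed point, which is the cost that, according to the summary above, the paper's sketch is designed to avoid.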
In summary, the main goal of this paper is to accelerate the training process of deep learning models by proposing an efficient and adaptive sampling method, while maintaining or improving the final model performance.