iCache: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training

Weijian Chen, Shuibing He, Yaowen Xu, Xuechen Zhang, Siling Yang, Shuang Hu, Xian-He Sun, Gang Chen
DOI: https://doi.org/10.1109/hpca56546.2023.10070964
2023-01-01
Abstract: Fetching a large amount of DNN training data from storage systems incurs long I/O latency and causes GPUs to stall while waiting for data. Importance sampling in DNN training can reduce the amount of data processed on GPUs while maintaining similar model accuracy. However, existing DNN training frameworks do not have a cache layer that reduces the number of data fetches and manages cached items according to sample importance, resulting in unnecessary data fetches, poor cache hit ratios, and random I/Os when importance sampling is used. In this paper, we design a new importance-sampling-informed cache, namely iCache, to accelerate I/O-bound DNN training jobs. iCache fetches only a subset of the samples rather than every sample in the dataset. The cache is partitioned into two regions, H-cache and L-cache, which store samples of high importance and low importance, respectively. Rather than using recency or frequency, we manage data items in H-cache according to their corresponding sample importance. When there is a cache miss in L-cache, we use sample substitutability and dynamic packaging to improve the cache hit ratio and reduce the number of random I/Os. When multiple concurrent jobs access the same datasets in H-cache, we design a model that assigns relative importance values to cached samples to avoid cache thrashing, which may happen when there is no coordination among the concurrent training jobs. Our experimental results show that iCache has a negligible impact on training accuracy and speeds up DNN training by up to 2.0× compared to state-of-the-art caching systems.
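The two-region idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class name `TwoRegionCache`, its methods, the capacities, and the random choice of a substitute are all hypothetical, chosen only to show importance-based eviction in a high-importance region and substitution on a miss in a low-importance region. Dynamic packaging and the multi-job importance model are not modeled.

```python
import random


class TwoRegionCache:
    """Toy cache with an importance-managed H region and a substitutable L region.

    All names and policies here are illustrative assumptions, not iCache's API.
    """

    def __init__(self, h_capacity, l_capacity, seed=0):
        self.h_capacity = h_capacity
        self.l_capacity = l_capacity
        self.h_items = {}  # sample_id -> (importance, data)
        self.l_items = {}  # sample_id -> data
        self.rng = random.Random(seed)

    def put_high(self, sample_id, importance, data):
        """Insert into the H region; evict the least important item if full."""
        if sample_id in self.h_items or len(self.h_items) < self.h_capacity:
            self.h_items[sample_id] = (importance, data)
            return True
        victim = min(self.h_items, key=lambda k: self.h_items[k][0])
        if self.h_items[victim][0] >= importance:
            return False  # newcomer is no more important than anything cached
        del self.h_items[victim]
        self.h_items[sample_id] = (importance, data)
        return True

    def put_low(self, sample_id, data):
        """Insert into the L region; evict a random item if full (assumed policy)."""
        if sample_id not in self.l_items and len(self.l_items) >= self.l_capacity:
            del self.l_items[self.rng.choice(sorted(self.l_items))]
        self.l_items[sample_id] = data

    def get(self, sample_id):
        """Return (served_id, data), or None if storage would have to be read."""
        if sample_id in self.h_items:
            return sample_id, self.h_items[sample_id][1]
        if sample_id in self.l_items:
            return sample_id, self.l_items[sample_id]
        if self.l_items:
            # Miss: serve another cached low-importance sample as a substitute.
            substitute = self.rng.choice(sorted(self.l_items))
            return substitute, self.l_items[substitute]
        return None
```

In this sketch, recency and frequency play no role in the H region: an item stays cached only as long as no more important sample needs its slot. A miss is answered from the L region whenever it holds anything, which trades exact sample identity for a higher hit ratio.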