PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models

Yunjae Lee, Hyeseong Kim, Minsoo Rhu
2024-06-11
Abstract: Training recommendation systems (RecSys) faces several challenges as it requires the "data preprocessing" stage to preprocess an ample amount of raw data and feed them to the GPU for training in a seamless manner. To sustain high training throughput, state-of-the-art solutions reserve a large fleet of CPU servers for preprocessing which incurs substantial deployment cost and power consumption. Our characterization reveals that prior CPU-centric preprocessing is bottlenecked on feature generation and feature normalization operations as it fails to reap out the abundant inter-/intra-feature parallelism in RecSys preprocessing. PreSto is a storage-centric preprocessing system leveraging In-Storage Processing (ISP), which offloads the bottlenecked preprocessing operations to our ISP units. We show that PreSto outperforms the baseline CPU-centric system with a $9.6\times$ speedup in end-to-end preprocessing time, $4.3\times$ enhancement in cost-efficiency, and $11.3\times$ improvement in energy-efficiency on average for production-scale RecSys preprocessing.
Hardware Architecture, Artificial Intelligence, Machine Learning
### What problems does this paper attempt to solve?

This paper addresses the challenges of the data preprocessing stage in training recommendation systems (RecSys). Specifically, it focuses on the following issues:

1. **High computational cost and energy consumption**:
   - Current solutions reserve a large number of CPU servers for data preprocessing so that the GPU receives enough preprocessed data during training. This approach is effective, but it incurs significant deployment cost and power consumption.
2. **Performance bottleneck**:
   - Existing CPU-based preprocessing is bottlenecked on feature generation and feature normalization. These operations account for most of the preprocessing time (about 79%), and performance is low because the parallelism between features is not fully utilized.
   - In online preprocessing, if the CPUs cannot generate training-ready tensors fast enough, GPU utilization becomes extremely low (less than 20%), which hurts overall training efficiency.
3. **Storage and bandwidth issues**:
   - Offline preprocessing requires a large amount of storage space to hold the preprocessed data and adapts poorly to newly developed recommendation models.
   - Online preprocessing avoids the additional storage, but it increases network traffic because raw data and preprocessed tensors must be transferred over the network frequently.

### Proposed solutions

To solve the above problems, the paper proposes PreSto, a data preprocessing system based on In-Storage Processing (ISP).
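The feature-level parallelism that the CPU baseline is said to leave unused can be illustrated with a small sketch. It is not the paper's implementation: the function names `bucketize` and `preprocess_features`, the example columns, and the convention that a value equal to a boundary falls into the lower bucket are all assumptions made for illustration. The point is only that each feature column, and each element within a column, can be processed independently.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def bucketize(values, boundaries):
    """Map each dense value to a bucket index via binary search.

    Assumption: boundaries are sorted ascending, and a value equal to a
    boundary falls into the lower bucket (bisect_left convention).
    """
    return [bisect_left(boundaries, v) for v in values]


def preprocess_features(columns, boundaries_per_feature, max_workers=4):
    """Bucketize every feature column independently.

    Each column depends only on its own boundaries, so columns can be
    handled concurrently (inter-feature parallelism). Within a column,
    every element is independent as well (intra-feature parallelism).
    """
    names = list(columns)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda name: bucketize(columns[name], boundaries_per_feature[name]),
            names,
        )
        return dict(zip(names, results))


if __name__ == "__main__":
    # Hypothetical dense features and bucket boundaries.
    columns = {"age": [3.0, 25.0, 70.0], "price": [0.5, 9.9, 100.0]}
    boundaries = {"age": [18.0, 40.0, 65.0], "price": [1.0, 10.0]}
    print(preprocess_features(columns, boundaries))
```

The thread pool here only expresses that the per-feature tasks have no dependencies on one another. In CPython, pure-Python threads do not run this work in parallel, so the sketch shows the structure of the workload rather than a speedup.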
The main features of PreSto include:

- **Offloading preprocessing tasks to storage devices**: Using ISP, PreSto performs preprocessing operations directly inside the storage devices, which reduces the network overhead caused by data transfers.
- **Accelerating feature generation and normalization**: The accelerator in each ISP unit performs feature generation and normalization efficiently, exploiting the parallelism across features to significantly improve preprocessing performance.
- **Reducing costs and energy consumption**: Compared with the traditional CPU-centric preprocessing scheme, PreSto significantly reduces deployment cost and energy consumption while maintaining high performance.

### Experimental results

Experiments show that, compared with the baseline CPU-centric system, PreSto achieves on average a 9.6× speedup in end-to-end preprocessing time, a 4.3× improvement in cost-efficiency, and an 11.3× improvement in energy efficiency.

### Formula representation

The formulas and algorithms involved in the paper are as follows:

#### Feature generation (Bucketize)

```markdown
\[ \text{Algorithm 1: Bucketize for feature generation} \]
1. Input: Dense features \( a[1 \dots n] \), bucket boundaries \( b[1 \dots m] \); Output \( c[1 \dots n] \)
2. /* Discretize the input dense features according to the bucket boundaries */
3. for \( i \leftarrow 1 \) to \( n \) do
4.     /* Use the binary search algorithm to find the bucket index to which the input value belongs */
5.     \( c[i] \leftarrow \text{SearchBucketID}(a[i], b[1 \dots m]) \)
6. end for
```

#### Feature normalization (SigridHash)

```markdown
\[ \text{Algorithm 2: SigridHash for feature normalization} \]
1. Input: Sparse features \( a[1 \dots n] \), seed \( s \), maximum value \( d \); Output \( c[1 \dots n] \)
2. /* Apply the hash function to the input sparse features and limit their values */
3. for \( i \leftarrow 1 \) to \( n \) do
4. /*