CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation Models

Hailin Zhang, Zirui Liu, Boxuan Chen, Yikai Zhao, Tong Zhao, Tong Yang, Bin Cui
2024-03-27
Abstract: Recently, the growing memory demands of embedding tables in Deep Learning Recommendation Models (DLRMs) pose great challenges for model training and deployment. Existing embedding compression solutions cannot simultaneously meet three key design requirements: memory efficiency, low latency, and adaptability to dynamic data distribution. This paper presents CAFE, a Compact, Adaptive, and Fast Embedding compression framework that addresses the above requirements. The design philosophy of CAFE is to dynamically allocate more memory resources to important features (called hot features), and allocate less memory to unimportant ones. In CAFE, we propose a fast and lightweight sketch data structure, named HotSketch, to capture feature importance and report hot features in real time. For each reported hot feature, we assign it a unique embedding. For the non-hot features, we allow multiple features to share one embedding by using the hash embedding technique. Guided by our design philosophy, we further propose a multi-level hash embedding framework to optimize the embedding tables of non-hot features. We theoretically analyze the accuracy of HotSketch, and analyze the model convergence against deviation. Extensive experiments show that CAFE significantly outperforms existing embedding compression methods, yielding 3.92% and 3.68% superior testing AUC on the Criteo Kaggle and CriteoTB datasets at a compression ratio of 10000x. The source code of CAFE is available on GitHub.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the excessive memory requirements of embedding tables in large-scale Deep Learning Recommendation Models (DLRMs). With the exponential growth in the number of categorical features in DLRMs, the memory required for embedding tables also increases dramatically, which creates storage challenges for both training and deployment. Existing embedding compression methods cannot simultaneously meet the following three key design requirements:

1. **Memory efficiency**: Compress the embedding table effectively within limited storage space while preserving model accuracy.
2. **Low latency**: Ensure that the compression method does not significantly increase the latency of inference and serving.
3. **Adaptability to dynamic data distribution**: Adapt to changes in the data distribution during online training.

To meet these requirements, the authors propose CAFE (Compact, Adaptive, and Fast Embedding), an embedding compression framework. CAFE addresses each requirement as follows:

- **Memory efficiency**: CAFE dynamically allocates memory according to feature importance. Important features (called hot features) are each assigned a unique embedding vector, while non-hot features share embedding vectors through hash embedding. A lightweight sketch data structure, HotSketch, captures feature importance and reports hot features in real time, which allows a high compression ratio.
- **Low latency**: CAFE involves only a few hash operations and one additional embedding lookup, so its time cost is negligible and latency stays low during serving.
- **Adaptability to dynamic data distribution**: CAFE includes an embedding migration process. When the importance score of a feature changes, a migration is triggered so that important features are still identified after the data distribution shifts.

In addition, to further optimize the embedding tables of non-hot features, CAFE proposes a multi-level hash embedding framework. It divides non-hot features into multiple levels according to their importance scores and assigns each level a different number of hash embedding vectors, which improves model performance.

Experimental results show that, at a compression ratio of 10,000x, CAFE achieves 3.92% and 3.68% higher test AUC than existing embedding compression methods on the Criteo Kaggle and CriteoTB datasets, respectively.

### Summary

CAFE addresses the excessive memory requirements of embedding tables in large-scale DLRMs by dynamically allocating memory to important features and by using a lightweight sketch data structure together with an embedding migration strategy. In doing so, it meets the requirements of memory efficiency, low latency, and adaptability to dynamic data distribution.
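The routing idea described above (unique embeddings for hot features, shared hash embeddings for the rest, and migration when importance changes) can be illustrated with a toy sketch. This is not the paper's HotSketch or its migration algorithm: the class name `HotColdEmbedding`, the exact per-feature score dictionary (a real sketch would use fixed memory), the fixed `threshold`, the evict-the-coldest rule, and the choice to initialize a promoted feature from its shared row are all assumptions made for illustration.

```python
import random
import zlib


class HotColdEmbedding:
    """Toy hot/non-hot embedding router (illustrative; not the paper's HotSketch)."""

    def __init__(self, num_hot, num_shared, dim, threshold, seed=0):
        rng = random.Random(seed)
        self.threshold = threshold          # assumed fixed promotion threshold
        self.scores = {}                    # exact scores; a real sketch is fixed-size
        self.hot_slot = {}                  # feature id -> row in hot_table
        self.free_slots = list(range(num_hot))
        self.hot_table = [[rng.gauss(0.0, 0.01) for _ in range(dim)]
                          for _ in range(num_hot)]
        self.shared_table = [[rng.gauss(0.0, 0.01) for _ in range(dim)]
                             for _ in range(num_shared)]

    def shared_row(self, fid):
        # Deterministic hash so many features map onto few shared rows.
        return zlib.crc32(str(fid).encode()) % len(self.shared_table)

    def is_hot(self, fid):
        return fid in self.hot_slot

    def update(self, fid, importance):
        """Accumulate an importance score and promote the feature if warranted."""
        score = self.scores.get(fid, 0.0) + importance
        self.scores[fid] = score
        if self.is_hot(fid) or score < self.threshold:
            return
        if not self.free_slots:
            if not self.hot_slot:
                return                      # no hot capacity at all
            coldest = min(self.hot_slot, key=self.scores.__getitem__)
            if self.scores[coldest] >= score:
                return                      # newcomer is not more important
            # Demote the coldest hot feature; it falls back to its shared row.
            self.free_slots.append(self.hot_slot.pop(coldest))
        slot = self.free_slots.pop()
        # Migration (assumed policy): start the unique embedding from the shared one.
        self.hot_table[slot] = list(self.shared_table[self.shared_row(fid)])
        self.hot_slot[fid] = slot

    def lookup(self, fid):
        """Return the unique row for hot features, the shared hashed row otherwise."""
        if self.is_hot(fid):
            return self.hot_table[self.hot_slot[fid]]
        return self.shared_table[self.shared_row(fid)]


emb = HotColdEmbedding(num_hot=1, num_shared=4, dim=3, threshold=1.0)
emb.update(1, 2.0)   # feature 1 crosses the threshold and becomes hot
emb.update(2, 3.0)   # feature 2 is more important and displaces feature 1
print(emb.is_hot(1), emb.is_hot(2))
```

The lookup path costs one hash plus one table read in either case, which mirrors the low-latency argument in the text. How importance is measured and how the thresholds are set are left open here, since the summary above does not specify them.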