Optimizing Inference Quality with SmartNIC for Recommendation System

Ruixin Shi, Ming Yan, Jie Wu
DOI: https://doi.org/10.1109/iwqos61813.2024.10682873
2024-01-01
Abstract: Embedding-based recommendation systems are now widely used to recommend content to users and have strict latency and throughput requirements. However, the latest recommendation models often exceed GPU HBM capacity, so the system is often split across computing nodes for GPU computation and Parameter Servers that store the embedding tables. This architecture leads to a significant amount of network I/O during inference and reduces GPU utilization. In this paper, we propose SmartEmb, an inference framework that accelerates the network I/O of embedding table lookups through a specialized control plane for task reordering, prefetching, and cache management. We offload this control plane to the SmartNIC to avoid contention with the host CPU and gain better performance. We implemented the SmartEmb prototype on BlueField-2 and evaluated its performance. Our evaluation demonstrates that, compared to the Nvidia HugeCTR HPS, SmartEmb can improve the quality of service by achieving up to 217% improvement in throughput and reducing latency by up to 190% for overall embedding layer lookups in inference scenarios.
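To make the cache-management and prefetching ideas in the abstract concrete, the following is a minimal sketch of an LRU cache placed in front of a remote embedding table. It is not SmartEmb's implementation: the class `EmbeddingCache`, the `fetch_remote` callable standing in for a Parameter Server round trip, and the LRU eviction policy are all assumptions made for illustration, and the code runs as ordinary host Python rather than on a SmartNIC.

```python
from collections import OrderedDict


class EmbeddingCache:
    """Hypothetical LRU cache in front of a remote embedding table."""

    def __init__(self, fetch_remote, capacity):
        # fetch_remote: assumed callable, list of ids -> {id: vector};
        # it stands in for one network round trip to a parameter server.
        self.fetch_remote = fetch_remote
        self.capacity = capacity
        self.rows = OrderedDict()  # id -> vector, least recently used first
        self.remote_calls = 0      # number of simulated round trips

    def _missing(self, ids):
        # Deduplicate while keeping order, then keep only uncached ids.
        return [i for i in dict.fromkeys(ids) if i not in self.rows]

    def _admit(self, fetched):
        # Insert fetched rows, evicting the least recently used on overflow.
        for key, vec in fetched.items():
            self.rows[key] = vec
            self.rows.move_to_end(key)
            if len(self.rows) > self.capacity:
                self.rows.popitem(last=False)

    def prefetch(self, ids):
        # Pull rows for an upcoming request before they are needed,
        # batching all misses into a single remote call.
        missing = self._missing(ids)
        if missing:
            self.remote_calls += 1
            self._admit(self.fetch_remote(missing))

    def lookup(self, ids):
        # Serve hits locally; fetch all misses in one batched remote call.
        missing = self._missing(ids)
        fetched = {}
        if missing:
            self.remote_calls += 1
            fetched = self.fetch_remote(missing)
        out = []
        for i in ids:
            if i in self.rows:
                self.rows.move_to_end(i)  # refresh recency on a hit
                out.append(self.rows[i])
            else:
                out.append(fetched[i])
        # Admit new rows only after reading, so hits are not evicted mid-lookup.
        self._admit(fetched)
        return out
```

In this sketch, calling `prefetch` with the ids of a queued request before its `lookup` means the lookup itself is served without a remote call, which is the effect that prefetching aims for; how SmartEmb orders tasks or chooses what to prefetch is not specified in the abstract.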