Cache Antagonists Identification: A Practice from Alibaba Colocation Datacenter

Kangjin Wang, Chuanjia Hou, Ying Li, Yaoyong Dou, Cheng Wang, Yang Wen, Jie Yao, Liping Zhang
DOI: https://doi.org/10.1109/ISSREW55968.2022.00031
2022-01-01
Abstract: Colocating latency-critical (LC) jobs and best-effort (BE) jobs on a host effectively improves resource efficiency in modern datacenters, but it also increases resource contention between jobs, which seriously affects job performance. In Alibaba's real-world LC-BE colocation datacenters, we observed that cache is one of the most contended resources in the CPU. When cache contention occurs, the first step toward mitigating it is identifying the antagonists that caused it, a task called cache antagonists identification (CAI). However, identifying cache antagonists is challenging because cache contention is difficult to observe and quantify. In this paper, we first propose the cache usage graph (CUG) to finely characterize the cache usage of jobs across the multiple CPU microarchitectural hierarchies and locations, and we provide a monitoring tool that collects "per-container-per-logic CPU" L1/L2/L3 cache misses and builds the CUG. We then propose a CUG-based CAI approach, μTactic. μTactic leverages machine learning models to quantify the cache contention on every cache hierarchy, then reasons out the cache antagonists with the CUG. Experiments in production datacenters show that μTactic achieves high precision (85+%) at low cost (32 ms), outperforming state-of-the-art approaches.
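To make the idea of a cache usage graph more concrete, the following is a minimal sketch of how per-container, per-logical-CPU miss counts might be aggregated onto cache nodes. It is not the paper's implementation: the topology table, the sample tuples, the container names, and the functions `build_cug` and `sharers` are all hypothetical, and ordering containers by raw miss count merely stands in for the machine learning models that μTactic uses to quantify contention.

```python
from collections import defaultdict

# Hypothetical topology: logical CPU -> (socket, physical core).
# Assumption: L1 and L2 are private to a physical core, L3 is shared per socket.
TOPOLOGY = {0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (0, 1)}


def build_cug(samples, topology=TOPOLOGY):
    """Aggregate (container, logical_cpu, level, misses) samples into a
    mapping from each cache node to the containers that miss in it."""
    cug = defaultdict(lambda: defaultdict(int))
    for container, cpu, level, misses in samples:
        socket, core = topology[cpu]
        # An L3 node is identified by socket; L1/L2 nodes by socket and core.
        node = (level, socket) if level == "L3" else (level, socket, core)
        cug[node][container] += misses
    return {node: dict(users) for node, users in cug.items()}


def sharers(cug, node):
    """Containers using a cache node, ordered by miss count (largest first).
    A naive ordering for illustration only, not the paper's contention model."""
    return sorted(cug.get(node, {}).items(), key=lambda kv: kv[1], reverse=True)


# Made-up samples for illustration; not measurements from the paper.
samples = [
    ("lc-web", 0, "L2", 120),
    ("be-batch", 1, "L2", 900),
    ("lc-web", 0, "L3", 40),
    ("be-batch", 1, "L3", 300),
    ("be-batch", 2, "L3", 200),
]
cug = build_cug(samples)
print(sharers(cug, ("L2", 0, 0)))
```

In this sketch, two containers running on sibling logical CPUs of the same physical core end up on the same L2 node, while all containers on a socket meet at its L3 node, which is the kind of hierarchy- and location-aware view the abstract attributes to the CUG.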