Fast and Scalable VMM Live Upgrade in Large Cloud Infrastructure

Xiantao Zhang, Xiao Zheng, Zhi Wang, Qi Li, Junkang Fu, Yang Zhang, Yibin Shen
DOI: https://doi.org/10.1145/3297858.3304034
2019-01-01
Abstract: High availability is the most important and challenging problem for cloud providers. However, the virtual machine monitor (VMM), a crucial component of the cloud infrastructure, has to be frequently updated and restarted to apply security patches and add new features, which undermines high availability. There are two existing live update methods for improving cloud availability: kernel live patching and virtual machine (VM) live migration. However, both have serious drawbacks that impair their usefulness in a large cloud infrastructure: kernel live patching cannot handle complex changes (e.g., changes to persistent data structures), and VM live migration may incur unacceptably long delays when migrating millions of VMs across the whole cloud, for example, to deploy urgent security patches. In this paper, we propose a new method, VMM live upgrade, that can promptly upgrade the whole VMM (KVM & QEMU) without interrupting customer VMs. Timely upgrade of the VMM is essential to the cloud because the VMM is both the main attack surface for malicious VMs and the component that integrates new features. We have built a VMM live upgrade system called Orthus. Orthus features three key techniques: dual KVM, VM grafting, and device handover. Together, they enable the cloud provider to load an upgraded KVM instance while the original one is running and "cut-and-paste" the VM to this new instance. In addition, Orthus can seamlessly hand over passthrough devices to the new KVM instance without losing any ongoing (DMA) operations. Our evaluation shows that Orthus can reduce the total migration time and downtime by more than 99% and 90%, respectively. We have deployed Orthus in one of the largest cloud infrastructures for a long time. It has become the most effective and indispensable tool in our daily maintenance of hundreds of thousands of servers and millions of VMs.