Privilege-Escalation Vulnerability Discovery for Large-scale RPC Services: Principle, Design, and Deployment

Zhuotao Liu,Hao Zhao,Sainan Li,Qi Li,Tao Wei,Yu Wang
DOI: https://doi.org/10.1145/3433210.3453076
2021-01-01
Abstract:ABSTRACTRPCs are fundamental to our large-scale distributed system. From a security perspective, the blast radius of RPCs is worryingly big since each RPC often interacts with tens of internal system components. Thus, discovering RPC vulnerabilities is often a top priority in the software quality assurance process for production systems. In this paper, we present the design, implementation, and deployment experiences of PAIR, a fully automated system for privilege-escalation vulnerability discovery in Ant Group's large-scale RPC system. The design of PAIR centers around the live replay design principle where the vulnerability discovery is driven by the live RPC requests collected from production, rather than relying on any engineered testing requests. This ensures that PAIR is able to provide complete coverage to our production RPC requests in a privacy-preserving manner, despite the manifest of scale (billions of daily requests), complexity (hundreds of system-services involved) and heterogeneity (RPC protocols are highly customized). However, the live replay design principle is not a panacea. We made a couple of critical design decisions (and addressed their corresponding challenges) along the way to realize the principle in production. First, to avoid inspecting the responses of user-facing RPCs (due to privacy concerns), PAIR designs a universal and privacy-preserving mechanism, via profiling the end-to-end system invocation, to represent the RPC handling logic. Second, to ensure that PAIR provides proactive defense (rather than reactive defense that is often limited by known vulnerabilities), PAIR designs an empirical vulnerability labeling mechanism to effectively identify a group of potentially insecure RPCs while safely excluding other RPCs. During the course of three-year production development, PAIR in total helped locate 133 truly insecure RPCs, from billions of requests, while maintaining a zero false negative rate per our production observations.
What problem does this paper attempt to address?