Microsoft CloudMine: Data Mining for the Executive Order on Improving the Nation's Cybersecurity

Kim Herzig, Hitesh Sajnani, Varsha Vadaga, Myles McLeroy, Maximilian Grothusmann, Yashasvini Ramkumar, Kivanç Muslu, Sascha Just, Nora Huang, Luke Ghostling, Alan Klimowski
DOI: https://doi.org/10.1145/3524842.3528514
2022-05-01
Abstract: Like any other US software maker, Microsoft is bound by the “Executive Order on Improving the Nation's Cybersecurity” [2], which sets a clear mandate to “enhance the software supply chain security” and to improve cybersecurity practices in general. However, this is much easier written down than enforced. The executive order imposes new rules and requirements that will affect engineering practices and evidence collection for most projects and engineering teams within a relatively short period of time. Part of the response is the requirement to build comprehensive inventories of the software artifacts contributing to US government systems. This is a massive task; done manually, it would be tedious and fragile, because software ecosystems change rapidly. What is required is a system that constantly monitors and updates the inventory of software artifacts and contributors so that, at any given point in time, the scope of any software security incident can be identified, the involved teams notified, and response plans activated. The front line of this security battle includes data mining platforms that provide the security and compliance teams with engineering artifacts and with insights into artifact dependencies and the engineering practices of the corresponding engineering teams. The data provided not only allows Microsoft to build an accurate engineering artifact inventory, but also enables Microsoft's teams to start so-called “get-clean” initiatives that begin issue remediation before proper policy tools and pipelines (“stay-clean”) can be developed, tested, and deployed. In this talk we will present CloudMine, one of Microsoft's main data mining platforms, serving data sets and dependency graphs of more than 270 different engineering artifacts (e.g., builds, releases, commits, and pull requests) gathered on an hourly basis. During the talk we will provide some insights into CloudMine, its engineering team, and its operational costs, which are significant.
We will then highlight the benefits and opportunities that a data mining framework like CloudMine provides the company, including insights into how inventory and automation bots use CloudMine data to impact thousands of Microsoft engineers daily, saving the company significant cost and shortening response times to security incidents. Examples include the ability to scan more than 100,000 code repositories across the enterprise within hours; building an engineering artifact inventory that enables us to flag any known security vulnerability in any software component within hours; and spotting non-compliant build and release pipelines among Microsoft's 500,000 pipelines. In addition, we will present open challenges the CloudMine engineering team faces in operating and growing CloudMine as a platform, which we hope will motivate and inspire researchers and other companies to start a dialog with us about these challenges and about recent research results that may help us solve them. From the talk it should become clear that running enterprise-scale systems is not cheap but is worth the effort, as it enables Microsoft and its engineering teams to respond to current cybersecurity threats even before we can build and test best-in-class built-in defense systems.
Engineering, Computer Science