Abstract: The coronavirus pandemic has dramatically disrupted the retail industry, as many stores have been forced to close and people across the world shelter in place, with online shopping as the inevitable choice. To meet the rapidly increasing demand for e-commerce, more data centers are expected to provide new cloud services, or significantly improve existing ones, to better support hybrid workloads (e.g., online purchase jobs and batch jobs that support ranking or recommendation systems). Successful cloud systems need to efficiently handle and quickly respond to a huge volume of traffic with such hybrid workloads. Meanwhile, it is critical to reduce the total cost of ownership (TCO) for profitability. Improving system utilization is one of the effective techniques to achieve the twin goals of high performance and low TCO. This paper conducts a comprehensive analysis of the 2017 and 2018 cluster traces released by Alibaba, which provide a case study of Alibaba's best practices in improving the performance and cost efficiency of its large-scale cloud systems by consolidating time-sensitive online service jobs with time-insensitive batch jobs. Our investigation indicates that the over-subscription (causing resource waste and low utilization) and under-subscription (causing performance degradation) problems co-exist in the current Alibaba system. We develop a simulator that allows us to evaluate possible solutions to this problem and their impact on performance, energy consumption, and TCO. Our experiments show that the estimated TCO can be reduced by $600,000 for the 2018 trace running on over 4,000 machines without compromising performance. The TCO can decrease by nearly $68 million if a similar strategy is extrapolated to Alibaba's 432,000 web-facing servers.

To store or not: Online cost optimization for running big data jobs on the cloud

Moving Big Data to The Cloud: An Online Cost-Minimizing Approach

A Cost-Effective Strategy for Storing Scientific Datasets with Multiple Service Providers in the Cloud

Cost-Efficient VM Configuration Algorithm in the Cloud Using Mix Scaling Strategy

Probability-Based Online Algorithm for Switch Operation of Energy Efficient Data Center

Towards Optimizing Storage Costs on the Cloud

To Reserve or Not to Reserve: Optimal Online Multi-Instance Acquisition in IaaS Clouds

Saving Money for Analytical Workloads in the Cloud

Moving big data to the cloud

Online Cost Minimization for Operating Geo-Distributed Cloud CDNs

Towards Cost-Optimal Policies for DAGs to Utilize IaaS Clouds with Online Learning

Rethinking the Cloudonomics of Efficient I/O for Data-Intensive Analytics Applications

Improving the cost efficiency of large-scale cloud systems running hybrid workloads - A case study of Alibaba cluster traces

SA-LSM: Optimize Data Layout for LSM-tree Based Storage using Survival Analysis

Hedge Your Bets: Optimizing Long-term Cloud Costs by Mixing VM Purchasing Options

Online Algorithms for Uploading Deferrable Big Data to the Cloud

PackCache: An Online Cost-Driven Data Caching Algorithm in the Cloud

Cost-effective Data Analytics Across Multiple Cloud Regions

Cutting Your Cloud Computing Cost for Deadline-Constrained Batch Jobs

Dynamic Pricing and Profit Maximization for the Cloud with Geo-Distributed Data Centers