Abstract: Large-scale data-intensive applications provide services to users by routing service requests to geographically distributed data centers interconnected by Internet links. In order to achieve good reliability and data access latency performance, cloud service providers often simultaneously place multiple copies of the data in different data centers. The network communication required for updating the multiple data copies incurs an operational cost. At the same time, the penalty incurred by Service Level Agreement (SLA) violations for data access from the data centers also imposes an operational cost on the service providers. In this paper, we tackle the problem of data placement in distributed data centers with the aim of minimizing the operational cost incurred by the delay SLA violation penalty and inter-data center network communication, assuming each data item has \(K\) replicas. We propose a K-level Cluster-based Data Placement algorithm (K-CDP) for the problem. The algorithm solves the linear programming relaxation and dual programming problems corresponding to the problem of minimizing the SLA violation penalty cost caused by placing a replica of each data item in a data center. Based on the obtained solutions, the algorithm clusters the data so that data items with similar placeable data centers form a data cluster. For the data in each cluster, the algorithm selects \(K\) data centers to minimize the operational cost. We prove that K-CDP is a 2-approximation algorithm for the data placement problem. Our simulation results demonstrate that the proposed algorithm can effectively reduce the penalty cost incurred by delay SLA violation, the network communication cost, and the operational cost of data centers.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve the data placement problem in distributed data centers, with the goal of minimizing the operating costs arising from latency service-level agreement (SLA) violations and data synchronization network communications. Specifically, the paper focuses on the following two main problems:

1. **Cost of latency SLA violations**:
   - When users access data from a data center, if the response latency exceeds the maximum tolerable latency specified in the SLA, the cloud service provider faces an SLA violation penalty.
   - To minimize the cost of latency SLA violations, cloud service providers usually place data in multiple data centers close to users.
2. **Network communication costs**:
   - When data is synchronized between multiple data centers, network communication costs are incurred.
   - To reduce network communication costs, cloud service providers tend to place data in geographically close data centers.

However, there is a trade-off between these two goals. Placing data in data centers close to users can reduce the cost of latency SLA violations, but it will increase network communication costs, and vice versa. Therefore, the paper proposes a K-level cluster-based data placement algorithm (K-CDP) that considers both goals together and minimizes the total operating costs.

### Overview of the solution

The paper proposes the K-CDP algorithm, which is divided into three stages:

1. **Single-copy data placement**:
   - Assuming that each piece of data has only one copy, the cost of latency SLA violations is minimized through linear programming relaxation and dual programming problems.
   - Solve the linear programming relaxation problem and the dual programming problem to obtain the optimal data placement variable \(x^*_{m,j}\) and the dual variable \(\alpha^*_m\).
2. **Data clustering**:
   - According to the optimal solutions \(x^*_{m,j}\) and \(\alpha^*_m\), cluster data with similar placeable data centers.
   - Implement data clustering through Algorithm 1, select the central data of each cluster, and update the data center sets of other data.
3. **Clustering-based data placement**:
   - For each data cluster, select K data centers to place K copies of the data in order to minimize the operating costs.
   - This is implemented through Algorithm 2. First, select a data center to place the first copy, and then iteratively select the remaining K-1 data centers, each time trying to minimize the increase in operating costs.

### Algorithm performance analysis

The paper proves that the K-CDP algorithm is a 2-approximation algorithm for the data placement problem. Specifically, by solving the linear programming relaxation problem and the dual programming problem, the K-CDP algorithm can effectively reduce the cost of latency SLA violations and network communication costs, thereby minimizing the total operating costs.

### Experimental results

The paper verifies the effectiveness of the K-CDP algorithm through simulation experiments. The experimental results show that, compared with the random algorithm (Random) and the existing algorithm (TOPR), the K-CDP algorithm performs better in reducing the cost of latency SLA violations and network communication costs. As the amount of data increases, the advantages of the K-CDP algorithm become more obvious, and it can significantly reduce the total operating costs under different amounts of data.

### Conclusion

The paper addresses the data placement problem in distributed data centers. Through the K-CDP algorithm, it achieves a balance between minimizing the cost of latency SLA violations and network communication costs, and effectively reduces the total operating costs. This research result has important reference value for cloud service providers seeking to optimize data placement strategies.
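
The greedy selection described for the third stage (place the first copy, then repeatedly add the data center that increases the operating cost the least) can be illustrated with a minimal Python sketch. This is not the paper's Algorithm 2: the cost model (a flat penalty per user region whose nearest replica exceeds the latency limit, plus pairwise synchronization cost between chosen data centers), the function names `operating_cost` and `greedy_select`, and all numbers below are assumptions made purely for illustration.

```python
from itertools import combinations

# Hypothetical inputs (not from the paper): latency from each user region
# to each data center, a flat SLA penalty, and pairwise sync link costs.
LATENCY = {
    "u1": {"A": 10, "B": 50, "C": 80},
    "u2": {"A": 90, "B": 40, "C": 15},
}
SLA_LIMIT = 30   # assumed maximum tolerable latency
PENALTY = 100    # assumed penalty per region that violates the SLA
LINK_COST = {
    frozenset({"A", "B"}): 5,
    frozenset({"A", "C"}): 20,
    frozenset({"B", "C"}): 8,
}


def operating_cost(chosen):
    """Assumed cost model: SLA violation penalty plus replica sync cost."""
    # A region violates the SLA if even its nearest replica is too slow.
    sla = sum(
        PENALTY
        for per_dc in LATENCY.values()
        if min(per_dc[dc] for dc in chosen) > SLA_LIMIT
    )
    # Every pair of replicas must be kept in sync over its link.
    sync = sum(LINK_COST[frozenset(pair)] for pair in combinations(chosen, 2))
    return sla + sync


def greedy_select(candidates, k, cost_fn=operating_cost):
    """Pick k data centers, each step adding the cheapest marginal choice."""
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and the number of candidates")
    # First copy: the single data center with the lowest cost on its own.
    chosen = [min(candidates, key=lambda dc: cost_fn([dc]))]
    # Remaining k-1 copies: minimize the increase in operating cost.
    while len(chosen) < k:
        base = cost_fn(chosen)
        remaining = [dc for dc in candidates if dc not in chosen]
        best = min(remaining, key=lambda dc: cost_fn(chosen + [dc]) - base)
        chosen.append(best)
    return chosen


placement = greedy_select(["A", "B", "C"], k=2)
print(placement, operating_cost(placement))  # ['A', 'C'] 20
```

In this toy instance the second replica goes to `C` rather than the cheaper-to-sync `B`, because `C` removes the SLA penalty for region `u2`, which is the trade-off between violation cost and communication cost that the summary describes.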

Data placement in distributed data centers for improved SLA and network cost

GCplace: geo-cloud based correlation aware data replica placement.

Performance-Driven Task and Data Co-scheduling Algorithms for Data-Intensive Applications in Grid Computing

Cooperative Data Caching for Cloud Data Servers.

Clustered K-Center: Effective Replica Placement in Peer-to-Peer Systems

QoS-Aware Data Placement for MapReduce Applications in Geo-Distributed Data Centers

Data Center Network Design for Internet-Related Services and Cloud Computing

Data Center Network Placement and Service Protection in All-Optical Mesh Networks

An Algorithm for Network and Data-aware Placement of Multi-Tier Applications in Cloud Data Centers

Scalable Data Center Network with distributed placement of component sets in optical networks

Optimal Data Placement for Data-Sharing Scientific Workflows in Heterogeneous Edge-Cloud Computing Environments

Probabilistic Region Failure-Aware Data Center Network and Content Placement.

A Novel Data Placement Strategy for Data-Sharing Scientific Workflows in Heterogeneous Edge-Cloud Computing Environments

Efficient Data Replica Placement for Sensor Clouds.

Data Placement for Multi-Tenant Data Federation on the Cloud

A Global Cost-Aware Container Scheduling Strategy in Cloud Data Centers

Let's stay together: Towards traffic aware virtual machine placement in data centers

Online Cost Minimization for Operating Geo-Distributed Cloud CDNs

A Data Placement Strategy for Scientific Workflow in Hybrid Cloud

Cost-minimizing Dynamic Migration of Content Distribution Services into Hybrid Clouds
