A New Approach for Evaluating the Performance of Distributed Latency-Sensitive Services

Theodoros Theodoropoulos, John Violos, Antonios Makris, Konstantinos Tserpes
2024-05-01
Abstract: Conventional latency metrics are formulated based on a broad definition of traditional monolithic services, and hence lack the capacity to address the complexities inherent in modern services and distributed computing paradigms. Consequently, their effectiveness in identifying areas for improvement is restricted, falling short of providing a comprehensive evaluation of service performance within the context of contemporary services and computing paradigms. More specifically, these metrics do not offer insights into two critical aspects of service performance: the frequency of latency surpassing specified Service Level Agreement (SLA) thresholds and the time required for latency to return to an acceptable level once the threshold is exceeded. This limitation is quite significant in the frame of contemporary latency-sensitive services, and especially immersive services that require deterministic low latency that behaves in a consistent manner. Towards addressing this limitation, the authors of this work propose 5 novel latency metrics that when leveraged alongside the conventional latency metrics manage to provide advanced insights that can be potentially used to improve service performance. The validity and usefulness of the proposed metrics in the frame of providing advanced insights into service performance is evaluated using a large-scale experiment.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses the inadequacy of existing latency metrics for evaluating the performance of modern distributed latency-sensitive services. Conventional metrics are formulated around a broad definition of traditional monolithic services, so they do not capture the complexities of modern services and distributed computing paradigms. Specifically, they fall short in two key respects:

1. **Inability to reflect the frequency of latency exceeding specified Service Level Agreement (SLA) thresholds**: they do not show how often latency surpasses the agreed threshold.
2. **Inability to reflect the time required to recover to an acceptable level after latency exceeds the threshold**: they do not show how long latency takes to return to an acceptable level once the threshold has been exceeded.

These two shortcomings matter most for modern latency-sensitive services, especially immersive services that require deterministic low latency, such as Extended Reality (XR) and Massively Multiplayer Mobile Games (MMG). The paper therefore proposes five new latency metrics based on fault tolerance, intended to be used alongside the conventional metrics, to provide deeper insight into service performance and help improve it. These metrics evaluate the stability and response time of services at a more granular level, especially under high load and sudden demand.
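To make the two missing aspects concrete, the following Python sketch computes them from a timestamped latency trace: how often latency crosses above an SLA threshold, and how long each excursion lasts before latency returns to an acceptable level. It is illustrative only and does not reproduce the paper's five proposed metrics, whose definitions are not given here. The names `summarize_violations` and `ViolationSummary`, the sample trace, and the 20 ms threshold are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ViolationSummary:
    """Hypothetical container; not a structure defined in the paper."""
    episodes: List[Tuple[float, float]]  # (violation start, recovery time) per completed episode
    violation_count: int                 # times latency crossed above the threshold
    violations_per_second: float         # violation_count / observed duration
    mean_recovery_time: float            # mean length of completed episodes, in seconds


def summarize_violations(
    timestamps: Sequence[float],
    latencies_ms: Sequence[float],
    sla_threshold_ms: float,
) -> ViolationSummary:
    """Summarize SLA threshold violations in a latency trace.

    A violation episode starts at the first sample above the threshold and
    ends at the first later sample at or below it. An episode still open at
    the end of the trace is counted but has no recovery time.
    """
    if len(timestamps) != len(latencies_ms):
        raise ValueError("timestamps and latencies_ms must have equal length")

    episodes: List[Tuple[float, float]] = []
    violation_count = 0
    start = None  # start time of the episode currently in progress, if any

    for t, latency in zip(timestamps, latencies_ms):
        if latency > sla_threshold_ms:
            if start is None:  # upward crossing: a new episode begins
                start = t
                violation_count += 1
        elif start is not None:  # back at or below the threshold: recovered
            episodes.append((start, t))
            start = None

    duration = timestamps[-1] - timestamps[0] if len(timestamps) > 1 else 0.0
    rate = violation_count / duration if duration > 0 else 0.0
    recoveries = [end - begin for begin, end in episodes]
    mean_recovery = sum(recoveries) / len(recoveries) if recoveries else 0.0
    return ViolationSummary(episodes, violation_count, rate, mean_recovery)


if __name__ == "__main__":
    # Made-up trace: one latency sample per second, threshold assumed to be 20 ms.
    ts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    lat = [10, 12, 30, 35, 11, 10, 40, 12, 9, 25, 26]
    print(summarize_violations(ts, lat, sla_threshold_ms=20))
```

A mean or percentile computed over the same trace would not distinguish one long excursion from several short ones, whereas the episode list does. How the final, unrecovered episode should be treated is a design choice. This sketch counts it as a violation but leaves it out of the recovery-time average.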