System support for large-scale internet services

Tao Yang, Jingyu Zhou
2006-01-01
Abstract: Large-scale Internet services are often hosted on giant clusters for high availability, scalability, and performance. Even when system resources are over-provisioned, Internet services can still be overloaded by unexpected events, abnormal client request patterns, or failures. We propose three techniques at the operating system and middleware levels that help Internet applications meet their QoS goals and that improve resource scheduling and the accuracy of fault detection. First, we present a new size-adaptive scheduling algorithm called SRQ. SRQ is a request-aware approach that manages resource usage at the request level instead of the thread or process level. Smaller requests are given higher priority as long as soft deadlines are still met, and this strategy can achieve better resource usage than standard Linux scheduling. The second technique is a load shedding mechanism called selective early request termination, which monitors the running time of requests, accounts for their resource usage, adaptively adjusts the selection threshold, and safely terminates overdue long requests. Finally, we present a topology-adaptive hierarchical membership service that provides fast information propagation with low bandwidth usage. This service is critical for clustered nodes to make well-informed local decisions in tasks such as load balancing, service discovery, and fault tolerance.
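The size-adaptive idea can be illustrated with a minimal Python sketch. This is not the SRQ algorithm itself, which the abstract does not specify in detail; it is one possible reading of "smaller requests first, as long as soft deadlines are still met". The `Request` fields, the `pick_next` and `schedule` functions, and the use of an estimated service time as the request "size" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    size: float      # assumed: estimated service time in seconds
    deadline: float  # assumed: absolute soft deadline in seconds


def pick_next(queue, now):
    """Prefer the smallest request, unless running it first would push
    another request past a soft deadline that it could otherwise meet."""
    if not queue:
        return None
    smallest = min(queue, key=lambda r: r.size)
    # Requests that meet their deadline if started now, but miss it
    # if they have to wait behind the smallest request.
    at_risk = [
        r for r in queue
        if r is not smallest
        and now + r.size <= r.deadline
        and now + smallest.size + r.size > r.deadline
    ]
    if at_risk:
        # Serve the most urgent at-risk request first.
        return min(at_risk, key=lambda r: r.deadline)
    return smallest


def schedule(requests, now=0.0):
    """Return the order in which requests would run on one server."""
    queue = list(requests)
    order = []
    while queue:
        chosen = pick_next(queue, now)
        queue.remove(chosen)
        now += chosen.size  # simulate running the request to completion
        order.append(chosen.name)
    return order
```

With loose deadlines the sketch degenerates to shortest-request-first; a request with a tight deadline is served ahead of smaller ones only when waiting would make it late. A real request-level scheduler would also need size estimation and preemption, which this sketch leaves out.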