Programming Support and Adaptive Checkpointing for High-Throughput Data Services with Log-Based Recovery

Jingyu Zhou, Caijie Zhang, Hong Tang, Jiesheng Wu, Tao Yang
DOI: https://doi.org/10.1109/dsn.2010.5545015
2010-01-01
Abstract: Many applications in large-scale data mining and offline processing are organized as network services that run continuously or for long periods. To sustain high throughput, these services often keep their data in memory, which leaves that data susceptible to loss on failure. Their availability requirements are less stringent than those of online services exposed to millions of users, but these data-intensive offline and mining applications still require data persistence to survive failures. This paper presents SLACH, programming and runtime support for building multi-threaded, high-throughput persistent services. To keep in-memory objects persistent, SLACH employs application-assisted logging and checkpointing for log-based recovery while maximizing throughput and concurrency. SLACH adaptively adjusts checkpointing frequency based on log growth and throughput demand to balance runtime overhead against recovery speed. This paper describes the design and API of SLACH, its adaptive checkpoint control, and our experiences and experiments in using SLACH at Ask.com.
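The trade-off named in the abstract (checkpoint more often for faster recovery, less often for lower runtime overhead) can be illustrated with a minimal sketch. This is not SLACH's API or its control algorithm, which the abstract does not specify. The class `LoggedStore`, the parameters `soft_log_limit`, `hard_log_limit`, and `busy_rate`, and the simple threshold rule are all hypothetical, chosen only to show log-based recovery with a checkpoint decision driven by log growth and current load.

```python
import json
import os


class LoggedStore:
    """Toy in-memory key-value store with an operation log and checkpoints.

    Hypothetical illustration only; keys are assumed to be strings so that
    they survive a JSON round trip unchanged.
    """

    def __init__(self, directory, soft_log_limit=4, hard_log_limit=16, busy_rate=100.0):
        self.log_path = os.path.join(directory, "ops.log")
        self.ckpt_path = os.path.join(directory, "state.ckpt")
        self.soft_log_limit = soft_log_limit  # checkpoint here if load is low
        self.hard_log_limit = hard_log_limit  # checkpoint here regardless of load
        self.busy_rate = busy_rate            # requests/s considered "busy"
        self.data = {}
        self.log_records = 0
        self.checkpoints_taken = 0
        self._recover()

    def _recover(self):
        # Load the last checkpoint, then replay the log written after it.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                self.data = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value
                    self.log_records += 1

    def put(self, key, value, current_rate=0.0):
        # Log the operation durably before applying it in memory.
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value
        self.log_records += 1
        if self.should_checkpoint(current_rate):
            self.checkpoint()

    def should_checkpoint(self, current_rate):
        # A long log means slow recovery, so force a checkpoint at the hard limit.
        if self.log_records >= self.hard_log_limit:
            return True
        # Otherwise checkpoint early only when the service is not busy.
        return self.log_records >= self.soft_log_limit and current_rate < self.busy_rate

    def checkpoint(self):
        # Write the snapshot to a temporary file and rename it into place, so a
        # crash never leaves a partial checkpoint. Replaying an old log over a
        # newer snapshot is harmless here because puts are idempotent.
        tmp_path = self.ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.ckpt_path)
        open(self.log_path, "w").close()  # truncate the log
        self.log_records = 0
        self.checkpoints_taken += 1
```

Under these assumptions, a store under heavy load lets its log grow up to `hard_log_limit` records before paying for a checkpoint, while an idle store checkpoints as soon as `soft_log_limit` is reached, which keeps the replay work after a failure small. A real multi-threaded service would also need to coordinate the snapshot with concurrent updates, which this single-threaded sketch leaves out.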