Data collection for failure prediction toward exascale supercomputers

Wei HU, Yanhuang JIANG, Guangming LIU, Wenrui DONG, Xinwu CUI
DOI: https://doi.org/10.11887/j.cn.201601016
2016-01-01
Abstract: Aimed at exascale supercomputers, a failure prediction data collection framework (FPDC) was introduced to fully collect the data related to the health state of compute nodes. An adaptive multi-layer data aggregation method was presented to aggregate the collected data with less overhead. Extensive experiments with an implementation of FPDC on TH-1A indicate that FPDC is highly efficient and scales well.