Equipment management and system maintenance on HPC platform

Lin Jiao, Xu Weiping, Zhang Wusheng, Yang Guangwen
DOI: https://doi.org/10.3969/j.issn.1002-4956.2013.05.025
2013-01-01
Abstract: The HPC platform of Tsinghua National Laboratory for Information Science and Technology is a common service platform at Tsinghua University. Equipment management and system maintenance are among the basic tasks of an HPC platform. For equipment management, a method combining automatic and manual procedures is adopted to manage and monitor the HPC equipment. For system maintenance, an automatic monitoring and reconstruction system has been developed to keep the cluster system in working order. The equipment management and system maintenance system has been deployed on the HPC platform and provides a stable computing service for users.