The Implementation and Evaluation of In-Band Network Management in Supercomputing System

Ji-Jun CAO, Li-Quan XIAO, Ke-Fei WANG, Zheng-Bin PANG, Lin CHEN
DOI: https://doi.org/10.11897/SP.J.1016.2016.01717
2016-01-01
Chinese Journal of Computers
Abstract: The interconnect network plays an important role in a supercomputing system, and its manageability directly affects the RAS (Reliability, Availability and Serviceability) characteristics of the whole system. The Tianhe-2 supercomputing system uses a proprietary high-speed interconnect network, which includes 5,856 high-radix network router chips (NRC) and 18,304 network interface chips (NIC). For an interconnect network of this scale, managing (for example, configuring and monitoring) the numerous network chips and their ports efficiently is a great challenge. By implementing an in-band management scheme, we achieve very efficient network management for the interconnect network in the Tianhe-2 system. In this paper, we introduce the design of in-band network management in the Tianhe-2 system, with emphasis on several key features, including the basic functionality and architecture of network management, the format of management descriptors, the dataflow and processing of management packets, and the basic framework of the management software. We also evaluate the performance of in-band network management on the supercomputer located at NSCC-GZ. The results demonstrate the efficiency of in-band management for the interconnect network.