Fault Detection and Diagnosis Software of LHAASO

Hangchang Zhang, Minhao Gu, Shaoshuai Fan
DOI: https://doi.org/10.1109/tns.2024.3454806
IF: 1.703
2024-01-01
IEEE Transactions on Nuclear Science
Abstract: The Large High Altitude Air Shower Observatory (LHAASO) is a mega-scale dual-task facility designed to study cosmic rays and γ-rays. The online computing system of LHAASO supports its online operation and computation. Physical phenomena such as cosmic rays occur unpredictably, so the online computing system must run without interruption. Because LHAASO is large and its environment is harsh, the online computing system is prone to failure. When a failure occurs, maintenance personnel must quickly analyze its cause and repair it. The Fault Detection and Diagnosis software (FADD) is designed to detect and analyze system faults quickly. It comprehensively monitors each component of LHAASO’s online computing system (computing nodes, switches, and data flow software) and collects real-time status information. When a fault occurs, FADD analyzes its cause and sends alarm information to the on-call staff as soon as possible. It can also analyze historical data within a specified period and generate data reports as needed. FADD's design takes into account the characteristics of large-scale high-energy physics experiments and meets the requirements of high throughput and high efficiency by using a distributed architecture. The software consists of three layers (an information collection layer, a data analysis layer, and a result layer) and contains metrics detection software, a fault monitoring module, a fault diagnosis module, and other functional modules. FADD has been applied to LHAASO and can diagnose operational faults quickly and accurately, helping to reduce the burden on maintenance personnel.
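The three-layer structure named in the abstract (information collection, data analysis, result) can be illustrated with a minimal sketch. This is not FADD's actual implementation: the abstract gives no interfaces, so every name here (`Metric`, `Alarm`, `THRESHOLDS`, `collect`, `analyze`, `report`), the sample readings, and the simple threshold rule are assumptions chosen only to show how data might flow from one layer to the next.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Metric:
    component: str   # e.g. a computing node, a switch, or a data flow process
    name: str        # hypothetical metric name
    value: float


@dataclass
class Alarm:
    component: str
    message: str


# Hypothetical upper limits; the abstract does not specify any thresholds.
THRESHOLDS: Dict[str, float] = {"cpu_load": 0.95, "packet_loss": 0.01}


def collect() -> List[Metric]:
    """Information collection layer: stand-in for real-time status gathering."""
    return [
        Metric("node-01", "cpu_load", 0.42),
        Metric("switch-03", "packet_loss", 0.05),
    ]


def analyze(metrics: List[Metric]) -> List[Alarm]:
    """Data analysis layer: flag metrics that exceed their assumed limit."""
    alarms = []
    for m in metrics:
        limit = THRESHOLDS.get(m.name)
        if limit is not None and m.value > limit:
            alarms.append(Alarm(m.component, f"{m.name}={m.value} exceeds {limit}"))
    return alarms


def report(alarms: List[Alarm]) -> List[str]:
    """Result layer: format alarm lines for the on-call staff."""
    return [f"[ALARM] {a.component}: {a.message}" for a in alarms]


if __name__ == "__main__":
    for line in report(analyze(collect())):
        print(line)
```

In a distributed deployment such as the one the abstract describes, each layer would presumably run as a separate service rather than as a function call in one process; the sketch keeps them in one file only to make the hand-off between layers visible.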