A Scalable Fault Management Architecture for ccNUMA Server

Yan Yang, Xingjun Zhang, Endong Wang, Nan Wu, Xiaoshe Dong
DOI: https://doi.org/10.1109/incos.2011.35
2011-01-01
Abstract: Linux servers with heterogeneous architectures present a new challenge for fault management. As the number and variety of hardware components increase significantly, managing faults separately becomes more complex and inefficient. A modern fault management system must therefore incorporate centralized management, automatic recovery, and a scalable design. Based on the ccNUMA architecture, this paper proposes a scalable fault management architecture and studies its implementation technologies. The architecture aims to enable computers to automatically detect errors, diagnose them, and handle faults. It uses a modular design and supports distributed environments with good extensibility and scalability. In practice, the architecture is effective and can improve server reliability.