Tools and Techniques for Managing Clusters for SciDAC Lattice QCD at Fermilab

A. Singh, D. Holmgren, R. Rechenmacher, S. Epsteyn
DOI: https://doi.org/10.48550/arXiv.cs/0307021
2003-07-09
Abstract: Fermilab operates several clusters for lattice gauge computing. Minimal manpower is available to manage these clusters. We have written a number of tools and developed techniques to cope with this task. We describe our tools, which use the IPMI facilities of our systems for hardware management tasks such as remote power control, remote system resets, and health monitoring. We discuss our techniques involving network booting for installing and upgrading the operating system on these computers, and for reloading BIOS and other firmware. Finally, we discuss our tools for parallel command processing and their use in monitoring and administering the PBS batch queue system used on our clusters.
Distributed, Parallel, and Cluster Computing