Pin-pointing Node Failures in HPC Systems

Authors
  • Roman, E
  • Das, A
  • Mueller, F
  • Hargrove, PH
Publication Date
May 15, 2021
Source
eScholarship - University of California
License
Unknown

Abstract

Automated fault prediction and diagnosis in HPC systems must be efficient to improve system resilience. With the increasing scale required for exascale, accurate fault prediction that aids quick remedy is hard. As supercomputer architectures change, distilling fault data from noisy raw logs requires substantial effort, and predicting node failures from such voluminous system logs is challenging. To this end, we investigate a way to pin-point node failures in such supercomputing systems. Our study of Cray system data with automated machine learning tools suggests that specific patterns of event messages on node unavailability can be an indicator of node failures. This data extraction, coupled with system and job data correlation, helps in devising a methodology to predict node failures and their location over a specific time frame. This work aims to enable broader applicability for a generic fault prediction framework.
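
The idea that recurring node-unavailability messages may signal an impending node failure can be illustrated with a minimal sketch. This is not the authors' method, and the abstract does not specify log formats, patterns, or thresholds. The log layout, the `flag_nodes` function, the `unavailable` keyword, the five-minute window, and the threshold of two messages are all hypothetical choices made for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Hypothetical log format (an assumption, not a real Cray format):
# "<ISO timestamp> <node id> <free-text message>", sorted by time.
SAMPLE_LOG = """\
2021-05-15T10:00:00 nid00012 node heartbeat fault
2021-05-15T10:00:30 nid00012 node unavailable
2021-05-15T10:01:00 nid00040 link recovered
2021-05-15T10:02:00 nid00012 node unavailable
2021-05-15T10:30:00 nid00040 node unavailable
"""


def flag_nodes(log_text, keyword="unavailable",
               window=timedelta(minutes=5), threshold=2):
    """Return nodes that logged at least `threshold` messages containing
    `keyword` within any sliding `window` (all parameters are illustrative)."""
    recent = defaultdict(deque)  # node id -> timestamps of recent matches
    flagged = set()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # Assumes every non-empty line has all three fields.
        stamp, node, message = line.split(maxsplit=2)
        if keyword not in message:
            continue
        when = datetime.fromisoformat(stamp)
        times = recent[node]
        times.append(when)
        # Drop matches that fell out of the sliding window.
        while when - times[0] > window:
            times.popleft()
        if len(times) >= threshold:
            flagged.add(node)
    return sorted(flagged)


print(flag_nodes(SAMPLE_LOG))
```

A real pipeline would replace the keyword match with learned message patterns and correlate the flagged nodes with system and job data, as the abstract describes. The sliding window here only shows the basic shape of per-node, time-bounded event counting.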
