Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning

Authors:
Tuncer, Ozan; Ates, Emre; Zhang, Yijia; Turk, Ata; Brandt, Jim; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.


Journal:
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS


Issue Date:
2019


Abstract:

As the size and complexity of high performance computing (HPC) systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variations due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, which impact the cost and resilience of HPC systems. To minimize the impact of performance variations, one must quickly and accurately detect and diagnose the anomalies that cause the variations and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine-learning-based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously observed anomalies. We first convert collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. We evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98 percent of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.
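
The pipeline outlined in the abstract (time series to statistical features, features to anomaly signatures, signatures to runtime diagnosis) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the four statistics, the metric names `cpu` and `mem`, the labels `healthy`, `memleak`, and `cpuoccupy`, the toy data, and the nearest-centroid matching, which merely stands in for whatever learned model the authors use. The feature-selection step mentioned in the abstract is omitted.

```python
import math
import statistics


def window_features(series):
    """Summarize one metric's time-series window with cheap statistics (assumed set)."""
    return [min(series), max(series), statistics.fmean(series), statistics.pstdev(series)]


def node_features(window):
    """Concatenate per-metric features; `window` maps metric name -> list of samples."""
    feats = []
    for name in sorted(window):  # fixed metric order keeps vectors comparable
        feats.extend(window_features(window[name]))
    return feats


def learn_signatures(labeled_windows):
    """Average the feature vectors of each label into one signature per label.

    Nearest-centroid stand-in for a trained classifier (an assumption).
    """
    grouped = {}
    for label, window in labeled_windows:
        grouped.setdefault(label, []).append(node_features(window))
    return {
        label: [statistics.fmean(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def diagnose(window, signatures):
    """Return the label whose signature is closest to the window's feature vector."""
    feats = node_features(window)
    return min(signatures, key=lambda label: math.dist(feats, signatures[label]))


# Hypothetical historical data: (label, window) pairs with made-up samples.
history = [
    ("healthy", {"cpu": [0.50, 0.52, 0.49, 0.51], "mem": [0.30, 0.31, 0.30, 0.29]}),
    ("memleak", {"cpu": [0.50, 0.51, 0.50, 0.49], "mem": [0.30, 0.50, 0.70, 0.90]}),
    ("cpuoccupy", {"cpu": [0.95, 0.97, 0.99, 0.96], "mem": [0.30, 0.30, 0.31, 0.30]}),
]
signatures = learn_signatures(history)

# A new window observed at runtime: steady CPU, steadily growing memory use.
observed = {"cpu": [0.51, 0.50, 0.52, 0.50], "mem": [0.32, 0.48, 0.71, 0.88]}
diagnosis = diagnose(observed, signatures)
print(diagnosis)
```

The expensive part (building signatures from history) happens offline in this sketch; the runtime step is only a feature computation and a distance comparison, which is one plausible reading of the "negligible overhead" claim.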


Page:
883-896

