Spatial support vector regression to detect silent errors in the exascale era

Author
Subasi, O.; Di, S.; Bautista Gomez, Leonardo Arturo; Balaprakash, P.; Unsal, O.; Labarta, J.; Cristal, A.; Cappello, F.
Type of activity
Presentation of work at congresses
Name of edition
16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Date of publication
2016
Presentation's date
2016-05
Book of congress proceedings
2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2016: 16-19 May 2016, Cartagena, Colombia: proceedings
First page
413
Last page
424
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
DOI
https://doi.org/10.1109/CCGrid.2016.33
Repository
http://hdl.handle.net/2117/97167
URL
http://ieeexplore.ieee.org/document/7515717/
Abstract
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are one of the major sources that corrupt the execution results of HPC applications without being detected. In this work, we explore a low-memory-overhead SDC detector, by leveraging epsilon-insensitive support vector machine regression, to detect SDCs that o...
Citation
Subasi, O., Di, S., Bautista, L., Balaprakash, P., Unsal, O., Labarta, J., Cristal, A., Cappello, F. Spatial support vector regression to detect silent errors in the exascale era. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. "2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2016: 16-19 May 2016, Cartagena, Colombia: proceedings". Cartagena: Institute of Electrical and Electronics Engineers (IEEE), 2016, p. 413-424.
Keywords
Benchmarking, Budget control, Cluster computing, Detection sensitivity, Distributed computer systems, Errors, Exascale, Fault tolerance, High performance computing systems, Increasing capacities, Silent data corruptions, State-of-the-art techniques, Support vector machine regressions, Support vector machines, Support vector regression (SVR)
Group of research
CAP - High Performance Computing Group

Participants

  • Subasi, Omer (author and speaker)
  • Di, Sheng (author and speaker)
  • Bautista Gomez, Leonardo Arturo (author and speaker)
  • Balaprakash, Prasanna (author and speaker)
  • Unsal, Osman Sabri (author and speaker)
  • Labarta Mancho, Jesus Jose (author and speaker)
  • Cristal Kestelman, Adrian (author and speaker)
  • Cappello, Franck (author and speaker)