Programmer-directed partial redundancy for resilient HPC

Author
Subasi, O.; Arias, F.J.; Unsal, O.; Labarta, J.; Cristal, A.
Type of activity
Presentation of work at congresses
Name of edition
12th ACM International Conference on Computing Frontiers
Date of publication
2015
Presentation's date
2015-05
Book of congress proceedings
Proceedings of the 12th ACM International Conference on Computing Frontiers, CF 2015
First page
1
Last page
2
Publisher
Association for Computing Machinery (ACM)
DOI
https://doi.org/10.1145/2742854.2742903
Repository
http://hdl.handle.net/2117/91299
URL
http://dl.acm.org/citation.cfm?doid=2742854.2742903
Abstract
In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. Because replicating every application task can be prohibitively expensive in resources, we introduce a programmer-directed selective replication mechanism that provides fault tolerance at lower cost. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.
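The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `run_replicated`, `replicate`, and `FlakySum` are hypothetical, and the recovery policy (checkpoint the inputs, run the task twice, and on a mismatch re-execute from the checkpoint and take a majority vote) is an assumption about how replication and checkpointing might be combined. Only tasks the programmer wraps with `replicate` pay the replication cost.

```python
import copy


def run_replicated(task, data):
    """Run `task` twice from a checkpoint of `data`; on mismatch, run a third time and vote.

    Assumed policy for illustration only, not the scheme from the paper.
    """
    checkpoint = copy.deepcopy(data)           # checkpoint of the task inputs
    first = task(copy.deepcopy(checkpoint))    # original execution
    second = task(copy.deepcopy(checkpoint))   # replica execution
    if first == second:
        return first                           # no corruption detected
    third = task(copy.deepcopy(checkpoint))    # re-execute from the checkpoint
    if third == first or third == second:
        return third                           # majority of three agrees
    raise RuntimeError("results disagree three ways; cannot correct")


def replicate(task):
    """Hypothetical programmer annotation: protect only the tasks wrapped with it."""
    def wrapper(data):
        return run_replicated(task, data)
    return wrapper


class FlakySum:
    """Toy task that returns a corrupted sum on its first call (simulated SDC)."""

    def __init__(self):
        self.calls = 0

    def __call__(self, values):
        self.calls += 1
        total = sum(values)
        return total + 1 if self.calls == 1 else total


flaky = FlakySum()
protected_sum = replicate(flaky)
result = protected_sum([1, 2, 3])
```

Here the corrupted first execution is outvoted by the two later ones, at the price of three executions of that one task. Unwrapped tasks run once and are unprotected, which is the trade-off selective replication exposes to the programmer.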
Citation
Subasi, O., Arias, F.J., Unsal, O., Labarta, J., Cristal, A. Programmer-directed partial redundancy for resilient HPC. In: ACM International Conference on Computing Frontiers. "Proceedings of the 12th ACM International Conference on Computing Frontiers, CF 2015". Ischia: Association for Computing Machinery (ACM), 2015.
Keywords
Application tasks, Check pointing, Computer programming, Computer science, Resource costs, Selective replication, Silent data corruption (SDC), Task parallel, Task replications
Group of research
CAP - High Performance Computing Group

Participants