A directive-based approach to perform persistent checkpoint/restart

Author
Maroñas, M.; Mateo, S.; Beltran, V.; Ayguade, E.
Type of activity
Presentation of work at congresses
Name of edition
2017 International Conference on High Performance Computing & Simulation
Date of publication
2017
Presentation's date
2017-07-17
Book of congress proceedings
HPCS 2017: 2017 International Conference on High Performance Computing & Simulation: proceedings: 17-21 July 2017: Genoa, Italy
First page
442
Last page
451
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
DOI
https://doi.org/10.1109/HPCS.2017.72
Repository
http://hdl.handle.net/2117/107925
URL
http://ieeexplore.ieee.org/abstract/document/8035111/
Abstract
Exascale platforms require support for resilience capabilities due to increasing numbers of components and associated error rates. In this paper, we present a new directive-based approach to perform application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, similar to OpenMP, that allows users to easily specify the state of the application that has to be saved and restored. This leaves the tedious and error-prone serialization and d...
Citation
Maroñas, M., Mateo, S., Beltran, V., Ayguade, E. A directive-based approach to perform persistent checkpoint/restart. In: International Conference on High Performance Computing and Simulation. "HPCS 2017: 2017 International Conference on High Performance Computing & Simulation: proceedings: 17-21 July 2017: Genoa, Italy". Genoa: Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 442-451.
Keywords
Checkpoint/restart, Checkpointing, Exascale, Fault tolerance, Fault tolerant systems, Libraries, Programmability, Programming models, Redundancy, Resilience, Resiliency, Tools
Group of research
CAP - High Performance Computing Group

Participants