Asynchronous and exact forward recovery for detected errors in iterative solvers

Author
Jaulmes, L.; Casas, M.; Moreto, M.; Ayguade, E.; Labarta, J.; Valero, M.
Type of activity
Journal article
Journal
IEEE transactions on parallel and distributed systems
Date of publication
2018-03-19
Volume
29
Number
9
First page
1961
Last page
1974
DOI
https://doi.org/10.1109/TPDS.2018.2817524
Project funding
Computación de Altas Prestaciones VII
Repository
http://hdl.handle.net/2117/118042
URL
https://ieeexplore.ieee.org/document/8320336/
Abstract
Current trends and projections show that faults in computer systems become increasingly common. Such errors may be detected, and possibly corrected transparently, e.g. by Error Correcting Codes (ECC). For a program to be fault-tolerant, it needs to also handle the Errors that are Detected and Uncorrected (DUE), such as an ECC encountering too many bit flips in a codeword. While correcting an error has an overhead in itself, it can also affect the progress of a program. The most generic technique...
Citation
Jaulmes, L., Casas, M., Moreto, M., Ayguade, E., Labarta, J., Valero, M. Asynchronous and exact forward recovery for detected errors in iterative solvers. "IEEE transactions on parallel and distributed systems", 19 March 2018, vol. 29, no. 9, pp. 1961-1974
Keywords
Error correction codes, Hardware, Redundancy, Program processors, Programming, Registers
Group of research
CAP - High Performance Computing Group