In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. Because replicating every application task can be prohibitively expensive in resources, we introduce a programmer-directed selective replication mechanism that provides fault tolerance at reduced cost. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.
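The idea of selective replication with checkpointing can be illustrated with a minimal sketch. It is not the implementation from the paper: the function `run_task`, the `replicate` flag, the `inject_fault` hook, and the majority-vote recovery step are all assumptions made for illustration. The sketch assumes tasks are deterministic and that their results can be compared with `==`.

```python
import copy


def run_task(task, inputs, replicate=False, inject_fault=None):
    """Run a task, optionally protected by replication (illustrative only).

    replicate:    hypothetical programmer-supplied flag marking the task
                  as worth protecting.
    inject_fault: hypothetical test hook that corrupts the second
                  replica's result to emulate an SDC.
    """
    if not replicate:
        # Unprotected task: no extra cost, no SDC detection.
        return task(inputs), "unprotected"

    # Checkpoint the inputs so every replica starts from the same state.
    checkpoint = copy.deepcopy(inputs)

    first = task(copy.deepcopy(checkpoint))
    second = task(copy.deepcopy(checkpoint))
    if inject_fault is not None:
        second = inject_fault(second)

    if first == second:
        return first, "verified"

    # Mismatch detected: re-execute from the checkpoint and take a majority vote.
    third = task(copy.deepcopy(checkpoint))
    if third == first or third == second:
        return third, "corrected"
    raise RuntimeError("replicas disagree; error could not be corrected")


def square_sum(xs):
    """Example deterministic task (assumed for the demo)."""
    return sum(x * x for x in xs)


if __name__ == "__main__":
    print(run_task(square_sum, [1, 2, 3]))
    print(run_task(square_sum, [1, 2, 3], replicate=True))
    print(run_task(square_sum, [1, 2, 3], replicate=True,
                   inject_fault=lambda v: v + 1))
```

In this sketch only tasks flagged with `replicate=True` pay the cost of duplicate execution, which is the trade-off the abstract describes: coverage is partial, and the overhead is limited to the tasks the programmer selects.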
Subasi, O., Arias, F.J., Unsal, O., Labarta, J., Cristal, A. Programmer-directed partial redundancy for resilient HPC. In: ACM International Conference on Computing Frontiers. "Proceedings of the 12th ACM International Conference on Computing Frontiers, CF 2015". Ischia: Association for Computing Machinery (ACM), 2015.