Characterizing a Detection Strategy for Transient Faults in HPC
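| Main authors: | Montezanti, Diego Miguel; Rexachs del Rosario, Dolores; Rucci, Enzo; Luque Fadón, Emilio; Naiouf, Marcelo; De Giusti, Armando Eduardo; Feierherd, Guillermo Eugenio; Pesado, Patricia Mabel; Russo, Claudia Cecilia |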
|---|---|
| Format: | Book; Book chapter |
| Language: | English |
| Published: | Editorial de la Universidad Nacional de La Plata (EDULP), 2016 |
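| Subjects: | Computer Science; transient faults; detection; scientific parallel applications; silent data corruption; HPC; fault injection |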
| Online access: | http://sedici.unlp.edu.ar/handle/10915/81217 |
| Contributed by: | |
| id | I19-R120-10915-81217 |
|---|---|
| record_format | dspace |
| institution | Universidad Nacional de La Plata |
| institution_str | I-19 |
| repository_str | R-120 |
| collection | SEDICI (UNLP) |
| language | English |
| topic | Computer Science; transient faults; detection; scientific parallel applications; silent data corruption; HPC; fault injection |
| spellingShingle | Computer Science transient faults detection scientific parallel applications silent data corruption HPC fault injection Montezanti, Diego Miguel Rexachs del Rosario, Dolores Rucci, Enzo Luque Fadón, Emilio Naiouf, Marcelo De Giusti, Armando Eduardo Feierherd, Guillermo Eugenio Pesado, Patricia Mabel Russo, Claudia Cecilia Characterizing a Detection Strategy for Transient Faults in HPC |
| topic_facet | Computer Science; transient faults; detection; scientific parallel applications; silent data corruption; HPC; fault injection |
| description | Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day, and that they will propagate to generate errors that will range from process crashes to corrupted results, with undetected errors in applications that are still running. In this article, we analyze a methodology for transient fault detection (called SMCV) for MPI applications. The methodology is based on software replication, and it assumes that data corruption is made apparent producing different messages between replicas. SMCV allows obtaining reliable executions with correct results, or, at least, leading the system to a safe stop. This work presents a complete characterization, formally defining the behavior in the presence of faults and experimentally validating it in order to show its efficacy and viability to detect transient faults in HPC systems. |
| format | Book; Book chapter |
| author | Montezanti, Diego Miguel; Rexachs del Rosario, Dolores; Rucci, Enzo; Luque Fadón, Emilio; Naiouf, Marcelo; De Giusti, Armando Eduardo; Feierherd, Guillermo Eugenio; Pesado, Patricia Mabel; Russo, Claudia Cecilia |
| author_facet | Montezanti, Diego Miguel; Rexachs del Rosario, Dolores; Rucci, Enzo; Luque Fadón, Emilio; Naiouf, Marcelo; De Giusti, Armando Eduardo; Feierherd, Guillermo Eugenio; Pesado, Patricia Mabel; Russo, Claudia Cecilia |
| author_sort | Montezanti, Diego Miguel |
| title | Characterizing a Detection Strategy for Transient Faults in HPC |
| title_short | Characterizing a Detection Strategy for Transient Faults in HPC |
| title_full | Characterizing a Detection Strategy for Transient Faults in HPC |
| title_fullStr | Characterizing a Detection Strategy for Transient Faults in HPC |
| title_full_unstemmed | Characterizing a Detection Strategy for Transient Faults in HPC |
| title_sort | characterizing a detection strategy for transient faults in hpc |
| publisher | Editorial de la Universidad Nacional de La Plata (EDULP) |
| publishDate | 2016 |
| url | http://sedici.unlp.edu.ar/handle/10915/81217 |
| work_keys_str_mv | AT montezantidiegomiguel characterizingadetectionstrategyfortransientfaultsinhpc AT rexachsdelrosariodolores characterizingadetectionstrategyfortransientfaultsinhpc AT ruccienzo characterizingadetectionstrategyfortransientfaultsinhpc AT luquefadonemilio characterizingadetectionstrategyfortransientfaultsinhpc AT naioufmarcelo characterizingadetectionstrategyfortransientfaultsinhpc AT degiustiarmandoeduardo characterizingadetectionstrategyfortransientfaultsinhpc AT feierherdguillermoeugenio characterizingadetectionstrategyfortransientfaultsinhpc AT pesadopatriciamabel characterizingadetectionstrategyfortransientfaultsinhpc AT russoclaudiacecilia characterizingadetectionstrategyfortransientfaultsinhpc |
| bdutipo_str | Repositories |
| _version_ | 1764820487855144960 |