Characterizing a Detection Strategy for Transient Faults in HPC

Bibliographic Details
Main authors: Montezanti, Diego Miguel; Rexachs del Rosario, Dolores; Rucci, Enzo; Luque Fadón, Emilio; Naiouf, Marcelo; De Giusti, Armando Eduardo; Feierherd, Guillermo Eugenio; Pesado, Patricia Mabel; Russo, Claudia Cecilia
Format: Book, Book chapter
Language: English
Published: Editorial de la Universidad Nacional de La Plata (EDULP), 2016
Subjects:
HPC
Online access: http://sedici.unlp.edu.ar/handle/10915/81217
Contributed by:
id I19-R120-10915-81217
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
transient faults
detection
scientific parallel applications
silent data corruption
HPC
fault injection
description Handling faults is a growing concern in HPC: a greater variety of faults, higher error rates, longer detection intervals, and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day and will propagate, producing failures that range from process crashes to corrupted results, with errors going undetected in applications that are still running. In this article, we analyze a methodology for transient fault detection, called SMCV, for MPI applications. The methodology is based on software replication and assumes that data corruption becomes apparent when replicas produce different messages. SMCV makes it possible to obtain reliable executions with correct results or, at least, to bring the system to a safe stop. This work presents a complete characterization of the methodology, formally defining its behavior in the presence of faults and validating it experimentally to show its efficacy and viability for detecting transient faults in HPC systems.
format Book
Book chapter
author Montezanti, Diego Miguel
Rexachs del Rosario, Dolores
Rucci, Enzo
Luque Fadón, Emilio
Naiouf, Marcelo
De Giusti, Armando Eduardo
Feierherd, Guillermo Eugenio
Pesado, Patricia Mabel
Russo, Claudia Cecilia
title Characterizing a Detection Strategy for Transient Faults in HPC
publisher Editorial de la Universidad Nacional de La Plata (EDULP)
publishDate 2016
url http://sedici.unlp.edu.ar/handle/10915/81217
bdutipo_str Repositories
_version_ 1764820487855144960