HPC-SFI: System-Level Fault Injection for High Performance Computing Systems. Wang, Y., Zhang, Q., Liu, Y., & Qian, D. In IFIP International Conference on Network and Parallel Computing, pages 103–113, 2018. Springer.
@InProceedings{wang18hpc,
  author       = {Wang, Yanqi and Zhang, Qi and Liu, Yi and Qian, Depei},
  title        = {HPC-SFI: System-Level Fault Injection for High Performance Computing Systems},
  booktitle    = {IFIP International Conference on Network and Parallel Computing},
  year         = {2018},
  pages        = {103--113},
  publisher    = {Springer},
  comment      = {\#sfi \#hpc

* motivation: fault tolerance is a key challenge for large-scale parallel
  systems

  * statistics show: MTBF of current supercomputers is several hours

* fault injection to evaluate *effectiveness* of fault-tolerance mechanisms

  * i.e., recovery after a pre-defined time period

* injects system-level failures (see Table 1)

  * in-node faults
  * faults in interconnection network
  * faults in storage / parallel file system
  * pseudo-randomly, according to pre-defined parameters

* they have a "model generation module" which apparently computes
  the fault injection experiment (which nodes to target, etc.)

  * i.e., contains system model
  * "model" apparently not in the sense of fault or dependability model

* measures

  * activation probability
  * detection probability
  * fault injection latency},
  file         = {:wang18hpc - HPC-SFI_ System-Level Fault Injection for High Performance Computing Systems.pdf:PDF},
  groups       = {fault injection},
  timestamp    = {2019-03-01},
}
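
The three measures listed in the notes (activation probability, detection probability, fault injection latency) could be estimated from a log of injection runs roughly as sketched below. This is a minimal illustration, not the paper's method: the record layout, the `summarize` helper, and the exact definitions (activation as activated/injected, detection as detected/activated, latency as the delay between requesting a fault and it being in place) are assumptions made here.

```python
from typing import NamedTuple, Sequence


class InjectionRecord(NamedTuple):
    """One fault injection run (hypothetical record layout)."""
    requested_at: float  # time the injection was requested, in seconds
    injected_at: float   # time the fault was actually in place, in seconds
    activated: bool      # the injected fault produced an observable error
    detected: bool       # the fault-tolerance mechanism reported the fault


def summarize(records: Sequence[InjectionRecord]) -> dict:
    """Estimate the three measures under the assumed definitions."""
    n = len(records)
    if n == 0:
        raise ValueError("no injection records")
    activated = [r for r in records if r.activated]
    detected = [r for r in activated if r.detected]
    latencies = [r.injected_at - r.requested_at for r in records]
    return {
        # share of injected faults that became active
        "activation_probability": len(activated) / n,
        # share of activated faults that were detected (None if none activated)
        "detection_probability": (
            len(detected) / len(activated) if activated else None
        ),
        # mean delay between requesting and placing a fault
        "mean_injection_latency": sum(latencies) / n,
    }


runs = [
    InjectionRecord(0.0, 0.5, True, True),
    InjectionRecord(10.0, 10.5, True, False),
    InjectionRecord(20.0, 21.0, False, False),
    InjectionRecord(30.0, 30.0, True, True),
]
print(summarize(runs))
```

Conditioning detection on activation is one possible choice; dividing by all injected faults instead would give a lower figure whenever some faults stay dormant.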
