HPC-SFI: System-Level Fault Injection for High Performance Computing Systems

HPC-SFI: System-Level Fault Injection for High Performance Computing Systems. Wang, Y., Zhang, Q., Liu, Y., & Qian, D. In IFIP International Conference on Network and Parallel Computing, pages 103–113, 2018. Springer.

@InProceedings{wang18hpc,
  author       = {Wang, Yanqi and Zhang, Qi and Liu, Yi and Qian, Depei},
  title        = {HPC-SFI: System-Level Fault Injection for High Performance Computing Systems},
  booktitle    = {IFIP International Conference on Network and Parallel Computing},
  year         = {2018},
  pages        = {103--113},
  organization = {Springer},
  comment      = {#sfi \#hpc

* motivation: fault tolerance is a key challenge for large-scale parallel
  systems

  * statistics show: MTBF of current supercomputers is several hours

* fault injection to evaluate *effectiveness* of fault-tolerance mechanisms

  * i.e., recovery after pre-defined time period

* injects system-level faults (see Table 1 of the paper)

  * in-node faults
  * faults in interconnection network
  * faults in storage / parallel file system
  * pseudo-randomly, according to pre-defined parameters

* they have a "model generation module" which apparently computes
  the fault injection experiment (which nodes to target, etc.)

  * i.e., contains system model
  * "model" apparently not in the sense of fault or dependability model

* measures

  * activation probability
  * detection probability
  * fault injection latency},
  file         = {:wang18hpc - HPC-SFI_ System-Level Fault Injection for High Performance Computing Systems.pdf:PDF},
  groups       = {fault injection},
  timestamp    = {2019-03-01},
}
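
The notes above mention pseudo-random injection "according to pre-defined parameters" and three measures. A minimal Python sketch of how such an experiment plan and its summary could look is given below. It is not the HPC-SFI implementation: the names `make_plan`, `summarize`, `Injection`, and `Outcome` are hypothetical, and the measure definitions are assumptions (activation probability as activated/injected, detection probability as detected/activated, latency as the delay between requesting and applying a fault), since the notes do not define them.

```python
import random
from dataclasses import dataclass
from statistics import mean

# Fault classes taken from the notes; the string labels are illustrative.
FAULT_CLASSES = ("node", "interconnect", "storage")


@dataclass
class Injection:
    target: str        # hypothetical node identifier
    fault_class: str   # one of FAULT_CLASSES
    start_s: float     # offset into the run at which to inject


@dataclass
class Outcome:
    activated: bool    # the injected fault actually took effect
    detected: bool     # the fault-tolerance mechanism noticed it
    latency_s: float   # assumed: request-to-injection delay


def make_plan(nodes, n_faults, window_s, seed):
    """Draw a reproducible pseudo-random injection plan from fixed parameters."""
    rng = random.Random(seed)  # fixed seed makes the experiment repeatable
    plan = [
        Injection(rng.choice(nodes), rng.choice(FAULT_CLASSES),
                  rng.uniform(0.0, window_s))
        for _ in range(n_faults)
    ]
    return sorted(plan, key=lambda inj: inj.start_s)


def summarize(outcomes):
    """Compute the three measures under the assumed definitions above."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    activated = [o for o in outcomes if o.activated]
    detected = sum(1 for o in activated if o.detected)
    return {
        "activation_probability": len(activated) / len(outcomes),
        "detection_probability": detected / len(activated) if activated else 0.0,
        "mean_injection_latency_s": mean(o.latency_s for o in outcomes),
    }
```

Seeding the generator explicitly keeps a plan reproducible across runs, which matters when comparing fault-tolerance mechanisms against the same fault sequence.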
