{"_id":"ii63SW4FKH83PTj9d","bibbaseid":"wang-zhang-liu-qian-hpcsfisystemlevelfaultinjectionforhighperformancecomputingsystems-2018","author_short":["Wang, Y.","Zhang, Q.","Liu, Y.","Qian, D."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","author":[{"propositions":[],"lastnames":["Wang"],"firstnames":["Yanqi"],"suffixes":[]},{"propositions":[],"lastnames":["Zhang"],"firstnames":["Qi"],"suffixes":[]},{"propositions":[],"lastnames":["Liu"],"firstnames":["Yi"],"suffixes":[]},{"propositions":[],"lastnames":["Qian"],"firstnames":["Depei"],"suffixes":[]}],"title":"HPC-SFI: System-Level Fault Injection for High Performance Computing Systems","booktitle":"IFIP International Conference on Network and Parallel Computing","year":"2018","pages":"103–113","organization":"Springer","comment":"#sfi #hpc * motivation: fault tolerance is key challenge for large-scale parallel systems * statistics show: MTBF of current supercomputers is several hours * fault injection to evaluate *effectiveness* of fault-tolerance mechanisms * i.e., recovery after pre-defined time period * inject system-level failures (see Table 1) * in-node faults * faults in interconnection network * faults in storage / parallel file system * pseudo-randomly, according to pre-defined parameters * they have a \"model generation module\" which apparently computes the fault injection experiment (which nodes to target, etc.) * i.e., contains system model * \"model\" apparently not in the sense of fault or dependability model * measures * activation probability * detection probability * fault injection latency","file":":wang18hpc - HPC-SFI_ System-Level Fault Injection for High Performance Computing Systems.pdf:PDF","groups":"fault injection","timestamp":"2019-03-01","bibtex":"@InProceedings{wang18hpc,\n author = {Wang, Yanqi and Zhang, Qi and Liu, Yi and Qian, Depei},\n title = {HPC-SFI: System-Level Fault Injection for High Performance Computing Systems},\n booktitle = {IFIP International Conference on Network and Parallel Computing},\n year = {2018},\n pages = {103--113},\n organization = {Springer},\n comment = {#sfi \\#hpc\n\n* motivation: fault tolerance is key challenge for large-scale parallel\n systems\n\n * statistics show: MTBF of current supercomputers is several hours\n\n* fault injection to evaluate *effectiveness* of fault-tolerance mechanisms\n\n * i.e., recovery after pre-defined time period\n\n* inject system-level failures (see Table 1)\n\n * in-node faults\n * faults in interconnection network\n * faults in storage / parallel file system\n * pseudo-randomly, according to pre-defined parameters\n\n* they have a \"model generation module\" which apparently computes\n the fault injection experiment (which nodes to target, etc.)\n\n * i.e., contains system model\n * \"model\" apparently not in the sense of fault or dependability model\n\n* measures\n\n * activation probability\n * detection probability\n * fault injection latency},\n file = {:wang18hpc - HPC-SFI_ System-Level Fault Injection for High Performance Computing Systems.pdf:PDF},\n groups = {fault injection},\n timestamp = {2019-03-01},\n}\n\n","author_short":["Wang, Y.","Zhang, Q.","Liu, Y.","Qian, D."],"key":"wang18hpc","id":"wang18hpc","bibbaseid":"wang-zhang-liu-qian-hpcsfisystemlevelfaultinjectionforhighperformancecomputingsystems-2018","role":"author","urls":{},"metadata":{"authorlinks":{}},"downloads":0,"html":""},"bibtype":"inproceedings","biburl":"https://bibbase.org/network/files/AsPiHTmHHGjgy6xSQ","dataSources":["wjZw5s4JL49uLwn3p"],"keywords":[],"search_terms":["hpc","sfi","system","level","fault","injection","high","performance","computing","systems","wang","zhang","liu","qian"],"title":"HPC-SFI: System-Level Fault Injection for High Performance Computing Systems","year":2018}