Co-evolving Tracing and Fault Injection with Box of Pain

Co-evolving Tracing and Fault Injection with Box of Pain. Bittman, D., Miller, E. L., Cui, M., Alvaro, P., He, M., Edupuganti, S., Nayak, N., Sukhomlinov, V., Raphael, R., Shlomo, R., & others. In 11th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 19), 2019.
bibtex

@InProceedings{bittman19co,
  author    = {Bittman, Daniel and Miller, Ethan L and Cui, Michael and Alvaro, Peter and He, Michael and Edupuganti, Saikrishna and Nayak, Naren and Sukhomlinov, Vadim and Raphael, Roger and Shlomo, Roee and others},
  title     = {Co-evolving Tracing and Fault Injection with Box of Pain},
  booktitle = {11th {USENIX} Workshop on Hot Topics in Cloud Computing (HotCloud 19)},
  year      = {2019},
  comment   = {* fault injection in distributed systems to enhance fault tolerance

  * i.e., not only find faults but also support improving fault tolerance

* try to find an approach where systems need injection and tracing
  infrastructure
* so they built a tracing framework

  * by observing, e.g., communication, crashes, and exit conditions
  * assume that any failures and failure conditions can be observed
    in the distributed systems' communication

    * hence, they trace and analyze communication only

      * to form a communication graph, a partial order, etc.
      * on system-call level

    * argue that this is a good trade-off

      * between generality
      * ease of use
      * understanding application-level semantics

* focus on simulating partial failures

  * instead of failures themselves

* describe other tracing approaches in related work
* fault model: timing and crash
* early experiments with Redis

  * injecting in GET/SET requests
  * plot distribution of unique "runs"

    * a "run" is one execution of the experiment, identified
      by its trace of system calls

      * \# I guess

    * i.e., "how often did a certain sequence of syscalls occur"

      * sequence of syscalls maps to communication pattern

        * in distributed system / between nodes

      * within 2000 iterations

  * it shows that most of the time, the system behaves consistently

    * i.e., same injection tends to lead to same sequence of syscalls},
  file      = {:bittman19co - Co-evolving Tracing and Fault Injection with Box of Pain.pdf:PDF},
  groups    = {fault injection, fault injection tools},
  timestamp = {2019-06-20},
}
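
The "distribution of unique runs" described in the notes above can be illustrated with a minimal sketch. It assumes each run is available as an ordered list of syscall names; the function name `run_distribution` and the sample traces are hypothetical and are not taken from Box of Pain itself.

```python
from collections import Counter


def run_distribution(runs):
    """Count how often each distinct syscall sequence occurs.

    `runs` is an iterable of syscall-name sequences, one per experiment
    iteration (an assumed input format, not the tool's own).
    """
    # A run is identified by its ordered syscall trace, so the tuple of
    # syscall names serves as the key.
    counts = Counter(tuple(trace) for trace in runs)
    # Most frequent sequences first.
    return counts.most_common()


# Hypothetical traces from three iterations of the same experiment.
runs = [
    ["connect", "write", "read", "close"],
    ["connect", "write", "read", "close"],
    ["connect", "write", "close"],
]

for trace, count in run_distribution(runs):
    print(count, " -> ".join(trace))
```

If the system behaves consistently under the same injection, as the notes report, one sequence should dominate the resulting counts.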
