High Availability on Jetstream: Practices and Lessons Learned. Lowe, J. M., Fischer, J., Sudarshan, S., Turner, G., Stewart, C. A., & Hancock, D. Y. In Proceedings of the 9th Workshop on Scientific Cloud Computing (ScienceCloud'18), pages 4:1--4:7, 2018. ACM.
Research computing has traditionally relied on high performance computing (HPC) clusters, a service not amenable to high availability without doubling computational and storage capacity. System maintenance such as security patching, firmware updates, and other system upgrades generally meant that the system was unavailable for the duration of the work unless redundant HPC systems and storage were available. While efforts were often made to limit downtime, when maintenance became necessary, windows might last one to two hours or as much as an entire day. As the National Science Foundation (NSF) began funding non-traditional research systems, finding ways to provide higher availability for researchers became one focus for service providers. One of the design elements of Jetstream was geographic dispersion to maximize availability. This was the first of a number of design elements intended to make Jetstream exceed the NSF's availability requirements. We examine the design steps employed, the components of the system and how the availability of each was considered in deployment, how maintenance is handled, and the lessons learned from the design and implementation of the Jetstream cloud.
@inproceedings{Lowe:2018:HAJ:3217880.3217884,
 title = {High Availability on Jetstream: Practices and Lessons Learned},
 author = {Lowe, John Michael and Fischer, Jeremy and Sudarshan, Sanjana and Turner, George and Stewart, Craig A. and Hancock, David Y.},
 booktitle = {Proceedings of the 9th Workshop on Scientific Cloud Computing (ScienceCloud'18)},
 series = {ScienceCloud'18},
 year = {2018},
 pages = {4:1--4:7},
 publisher = {ACM},
 address = {New York, NY, USA},
 doi = {10.1145/3217880.3217884},
 url = {http://doi.acm.org/10.1145/3217880.3217884},
 keywords = {Atmosphere, XSEDE, availability, cloud, HPC, research},
 abstract = {Research computing has traditionally relied on high performance computing (HPC) clusters, a service not amenable to high availability without doubling computational and storage capacity. System maintenance such as security patching, firmware updates, and other system upgrades generally meant that the system was unavailable for the duration of the work unless redundant HPC systems and storage were available. While efforts were often made to limit downtime, when maintenance became necessary, windows might last one to two hours or as much as an entire day. As the National Science Foundation (NSF) began funding non-traditional research systems, finding ways to provide higher availability for researchers became one focus for service providers. One of the design elements of Jetstream was geographic dispersion to maximize availability. This was the first of a number of design elements intended to make Jetstream exceed the NSF's availability requirements. We examine the design steps employed, the components of the system and how the availability of each was considered in deployment, how maintenance is handled, and the lessons learned from the design and implementation of the Jetstream cloud.}
}
