Revisiting the Straggling Problem in GPU-based Distributed Deep Learning Training. Tairin, S., Zhang, Z., & Shen, H. In Proceedings of the 34th International Conference on Computer Communications and Networks (ICCCN 2025), Tokyo, Japan, August, 2025. Code: https://github.com/pcl-projects/STRET
@inproceedings{tairin_revisiting_2025,
	address = {Tokyo, Japan},
	title = {Revisiting the {Straggling} {Problem} in {GPU}-based {Distributed} {Deep} {Learning} {Training}},
	abstract = {The straggler problem has been extensively studied in CPU-based distributed deep learning (DL) training but has received little attention in homogeneous GPU-based distributed training, possibly because GPUs do not typically become bottlenecks in this scenario. In this paper, we conduct experimental measurements and find that the straggler problem persists in this scenario, primarily stemming from communication hurdles compounded by computation delays; stragglers substantially inflate resource consumption and training time by ∼50\%. Existing straggler mitigation methods do not directly address communication stragglers in this scenario, and they suffer from drawbacks such as prolonged latency in straggler removal, high resource consumption, or compromised training accuracy. To tackle these limitations, based on insights derived from thorough measurements, we propose a Straggler-aware Time and Resource Efficient distributed DL Training system (STRET). STRET is tailored for both homogeneous and heterogeneous GPU-based distributed training, encompassing both the parameter server (PS) and all-reduce architectures. It creates a hybrid architecture that connects a straggler to a non-straggler that has high communication bandwidth to it, reducing communication delay. If this method fails to eliminate stragglers, it runs two complementary methods in sequence to remove them. First, it further reduces communication overhead by withholding gradient reports when the accuracy increment is marginal. Second, it conducts one-time batch size tuning to reduce iteration time. Real experimental results on TensorFlow show that STRET reduces training time by up to 56\% and 41\% and saves up to 94\% and 96\% of resources in the heterogeneous and homogeneous scenarios, respectively, compared to state-of-the-art approaches, while preserving accuracy.},
	language = {en},
	booktitle = {Proceedings of the 34th {International} {Conference} on {Computer} {Communications} and {Networks} ({ICCCN} 2025)},
	author = {Tairin, Suraiya and Zhang, Zeyu and Shen, Haiying},
	month = aug,
	year = {2025},
	note = {Code: \url{https://github.com/pcl-projects/STRET}},
	keywords = {Explorable},
}