Evaluating Deep Learning Recommendation Model Training Scalability with the Dynamic Opera Network. Imes, C., Rittenbach, A., Xie, P., Kang, D. I. D., Walters, J. P., & Crago, S. P. In Proceedings of the 4th Workshop on Machine Learning and Systems (EuroMLSys '24), pages 169–175, New York, NY, USA, 2024. Association for Computing Machinery.
Deep learning is commonly used to make personalized recommendations to users for a wide variety of activities. However, deep learning recommendation model (DLRM) training is increasingly dominated by all-to-all and many-to-many communication patterns. While there are a wide variety of algorithms to efficiently overlap communication and computation for many collective operations, these patterns are strictly limited by network bottlenecks. We propose co-designing DLRM model training with the recently proposed Opera network, which is designed to avoid multiple network hops using time-varying source-to-destination circuits. Using measurements from state-of-the-art NVIDIA A100 GPUs, we simulate DLRM model training on networks ranging from 16 to 1024 nodes and demonstrate up to 1.79× improvement using Opera compared with equivalent fat-tree networks. We identify important parameters affecting training time and demonstrate that careful co-design is needed to optimize training latency.
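To make the communication pattern concrete, the following minimal sketch (not taken from the paper) illustrates the kind of all-to-all embedding exchange that the abstract identifies as a bottleneck in DLRM training. It assumes PyTorch's torch.distributed API, a multi-process launch (e.g., via torchrun) with an NCCL or Gloo backend already initialized, and equal-sized per-rank shards; the function name exchange_embeddings is hypothetical.

    import torch
    import torch.distributed as dist

    def exchange_embeddings(local_lookups: torch.Tensor) -> torch.Tensor:
        # local_lookups holds the embedding rows this rank looked up on
        # behalf of every rank, concatenated in rank order with equal-sized
        # shards. all_to_all_single scatters each shard to its destination
        # rank and gathers the shards the other ranks produced for this one.
        received = torch.empty_like(local_lookups)
        dist.all_to_all_single(received, local_lookups)
        return received

On a fat-tree network this exchange traverses multiple switch hops; the paper's premise is that Opera's time-varying source-to-destination circuits can serve it with fewer hops when model training is co-designed with the network schedule.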
@inproceedings{OperaDLRM,
author = {Imes, Connor and Rittenbach, Andrew and Xie, Peng and Kang, Dong In D. and Walters, John Paul and Crago, Stephen P.},
title = {Evaluating Deep Learning Recommendation Model Training Scalability with the Dynamic Opera Network},
year = {2024},
isbn = {9798400705410},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3642970.3655825},
doi = {10.1145/3642970.3655825},
abstract = {Deep learning is commonly used to make personalized recommendations to users for a wide variety of activities. However, deep learning recommendation model (DLRM) training is increasingly dominated by all-to-all and many-to-many communication patterns. While there are a wide variety of algorithms to efficiently overlap communication and computation for many collective operations, these patterns are strictly limited by network bottlenecks. We propose co-designing DLRM model training with the recently proposed Opera network, which is designed to avoid multiple network hops using time-varying source-to-destination circuits. Using measurements from state-of-the-art NVIDIA A100 GPUs, we simulate DLRM model training on networks ranging from 16 to 1024 nodes and demonstrate up to 1.79\texttimes{} improvement using Opera compared with equivalent fat-tree networks. We identify important parameters affecting training time and demonstrate that careful co-design is needed to optimize training latency.},
booktitle = {Proceedings of the 4th Workshop on Machine Learning and Systems},
pages = {169--175},
numpages = {7},
keywords = {deep learning, dynamic networks, machine learning, networks, recommendation models},
location = {Athens, Greece},
series = {EuroMLSys '24},
ISIArea = {ML, CAS, NET}
}
