PeerWave: Exploiting Wavefront Parallelism on GPUs with Peer-SM Synchronization

PeerWave: Exploiting Wavefront Parallelism on GPUs with Peer-SM Synchronization. Belviranli, M. E., Deng, P., Bhuyan, L. N., Gupta, R., & Zhu, Q. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS '15, pages 25–35, New York, NY, USA, June 2015. Association for Computing Machinery.

Nested loops with regular iteration dependencies span a large class of applications ranging from string matching to linear system solvers. Wavefront parallelism is a well-known technique to enable concurrent processing of such applications and is widely used on GPUs to benefit from their massively parallel computing capabilities. Wavefront parallelism on GPUs uses global barriers between processing of tiles to enforce data dependencies. However, such diagonal-wide synchronization causes load imbalance by forcing SMs to wait for the completion of the SM with the longest computation. Moreover, diagonal processing causes loss of locality due to elements that border adjacent tiles. In this paper, we propose PeerWave, an alternative GPU wavefront parallelization technique that improves inter-SM load balance by using peer-wise synchronization between SMs and eliminating global synchronization. Our approach also increases GPU L2 cache locality through row allocation of tiles to the SMs. We further improve PeerWave performance by using flexible hyper-tiles that reduce inter-SM wait time while maximizing intra-SM utilization. We develop an analytical model for determining the optimal tile size. Finally, we present a run-time and a CUDA-based API to allow users to easily implement their applications using PeerWave. We evaluate PeerWave on the NVIDIA K40c GPU using 6 different applications and achieve speedups of up to 2X compared to the most recent hyperplane-transformation-based GPU implementation.
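
The peer-wise synchronization idea in the abstract can be illustrated with a small CPU-side sketch. This is not the PeerWave run-time or its CUDA API: Python threads stand in for SMs, and the per-row progress counters, the round-robin row assignment, and the function names are all assumptions made for illustration. Each worker owns whole rows of tiles and waits only on the row directly above it, so no global barrier per diagonal is needed. Longest common subsequence (LCS) is used as a stand-in for the string-matching class of applications.

```python
import threading


def lcs_length_sequential(a, b):
    """Reference LCS length, computed with a plain nested loop."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def lcs_length_peer_sync(a, b, num_workers=4, tile=8):
    """LCS length with row-owned tiles and peer-wise waits (illustrative only)."""
    rows, cols = len(a), len(b)
    dp = [[0] * (cols + 1) for _ in range(rows + 1)]
    num_tiles = (cols + tile - 1) // tile

    # progress[i] = number of tiles finished in row i (hypothetical bookkeeping).
    progress = [0] * (rows + 1)
    progress[0] = num_tiles  # the zero border row counts as already finished
    cond = threading.Condition()

    def worker(w):
        # Rows are assigned round-robin, mimicking row allocation of tiles.
        for i in range(1 + w, rows + 1, num_workers):
            for t in range(num_tiles):
                # Peer-wise wait: only the row above must have finished tile t.
                with cond:
                    cond.wait_for(lambda: progress[i - 1] > t)
                lo = t * tile + 1
                hi = min(cols, (t + 1) * tile)
                for j in range(lo, hi + 1):
                    if a[i - 1] == b[j - 1]:
                        dp[i][j] = dp[i - 1][j - 1] + 1
                    else:
                        dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
                # Publish progress so the owner of row i + 1 can proceed.
                with cond:
                    progress[i] += 1
                    cond.notify_all()

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return dp[rows][cols]


if __name__ == "__main__":
    x, y = "AGGTAB" * 5, "GXTXAYB" * 5
    print(lcs_length_peer_sync(x, y) == lcs_length_sequential(x, y))
```

Because a worker blocks only until its upstream neighbor has published the tile it needs, a slow tile delays just the rows that depend on it rather than every worker on the same diagonal. On a real GPU the counters would live in global memory and require appropriate memory fences, which this sketch does not model.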

@inproceedings{belviranli_peerwave_2015,
	address = {New York, NY, USA},
	series = {{ICS} '15},
	title = {{PeerWave}: {Exploiting} {Wavefront} {Parallelism} on {GPUs} with {Peer}-{SM} {Synchronization}},
	isbn = {978-1-4503-3559-1},
	shorttitle = {{PeerWave}},
	url = {https://doi.org/10.1145/2751205.2751243},
	doi = {10.1145/2751205.2751243},
	abstract = {Nested loops with regular iteration dependencies span a large class of applications ranging from string matching to linear system solvers. Wavefront parallelism is a well-known technique to enable concurrent processing of such applications and is widely used on GPUs to benefit from their massively parallel computing capabilities. Wavefront parallelism on GPUs uses global barriers between processing of tiles to enforce data dependencies. However, such diagonal-wide synchronization causes load imbalance by forcing SMs to wait for the completion of the SM with the longest computation. Moreover, diagonal processing causes loss of locality due to elements that border adjacent tiles. In this paper, we propose PeerWave, an alternative GPU wavefront parallelization technique that improves inter-SM load balance by using peer-wise synchronization between SMs and eliminating global synchronization. Our approach also increases GPU L2 cache locality through row allocation of tiles to the SMs. We further improve PeerWave performance by using flexible hyper-tiles that reduce inter-SM wait time while maximizing intra-SM utilization. We develop an analytical model for determining the optimal tile size. Finally, we present a run-time and a CUDA-based API to allow users to easily implement their applications using PeerWave. We evaluate PeerWave on the NVIDIA K40c GPU using 6 different applications and achieve speedups of up to 2X compared to the most recent hyperplane-transformation-based GPU implementation.},
	urldate = {2022-05-05},
	booktitle = {Proceedings of the 29th {ACM} on {International} {Conference} on {Supercomputing}},
	publisher = {Association for Computing Machinery},
	author = {Belviranli, Mehmet E. and Deng, Peng and Bhuyan, Laxmi N. and Gupta, Rajiv and Zhu, Qi},
	month = jun,
	year = {2015},
	keywords = {decentralized synchronization, gp-gpu computing, wavefront parallelism},
	pages = {25--35},
}
