On-Chip Traffic Regulation to Reduce Coherence Protocol Cost on a Micro-threaded Many-Core Architecture with Distributed Caches. Yang, Q., Fu, J., Poss, R., & Jesshope, C. ACM Trans. Embed. Comput. Syst., 13(3s):103:1–103:21, ACM, New York, NY, USA, March, 2013.
When hardware cache coherence scales to many cores on chip, the coherence protocol of the shared memory system may offset the benefit from massive hardware concurrency. In this article, we investigate the cost of a write-update policy in terms of on-chip memory network traffic and its adverse effects on the system performance based on a multi-threaded many-core architecture with distributed caches. We discuss possible software and hardware solutions to alleviate the network pressure without changing the protocol. We find that in the context of massive concurrency, by introducing a write-merging buffer with 0.46% area overhead to each core, applications with good locality and concurrency are boosted up by 18.74% in performance on average. Other applications also benefit from this addition and even achieve a throughput increase of 5.93%. In addition, this improvement indicates that higher levels of concurrency per core can be exploited without impacting performance, thus tolerating latency better and giving higher processor efficiencies compared to other solutions.
@article{yang13tecs,
	Abstract = {When hardware cache coherence scales to many cores on chip, the coherence protocol of the shared memory system may offset the benefit from massive hardware concurrency. In this article, we investigate the cost of a write-update policy in terms of on-chip memory network traffic and its adverse effects on the system performance based on a multi-threaded many-core architecture with distributed caches. We discuss possible software and hardware solutions to alleviate the network pressure without changing the protocol. We find that in the context of massive concurrency, by introducing a write-merging buffer with 0.46% area overhead to each core, applications with good locality and concurrency are boosted up by 18.74% in performance on average. Other applications also benefit from this addition and even achieve a throughput increase of 5.93%. In addition, this improvement indicates that higher levels of concurrency per core can be exploited without impacting performance, thus tolerating latency better and giving higher processor efficiencies compared to other solutions.},
	Acmid = {2567931},
	Address = {New York, NY, USA},
	Author = {Qiang Yang and Jian Fu and Raphael Poss and Chris Jesshope},
	Doi = {10.1145/2567931},
	Url = {http://dx.doi.org/10.1145/2567931},
	Issn = {1539-9087},
	Journal = {ACM Trans. Embed. Comput. Syst.},
	Month = {March},
	Number = {3s},
	Pages = {103:1--103:21},
	Publisher = {ACM},
	Title = {On-Chip Traffic Regulation to Reduce Coherence Protocol Cost on a Micro-threaded Many-Core Architecture with Distributed Caches},
	Volume = {13},
	Year = {2013},
	}
