Data transformations enabling loop vectorization on multithreaded data parallel architectures. Jang, B., Mistry, P., Schaa, D., Dominguez, R., & Kaeli, D. In Principles and Practice of Parallel Programming, 2010.
Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4X) by applying vectorization using our data transformation approach.
@inproceedings{Jang2010b,
  author    = {Byunghyun Jang and Perhaad Mistry and Dana Schaa and Rodrigo Dominguez and David Kaeli},
  title     = {Data transformations enabling loop vectorization on multithreaded data parallel architectures},
  abstract  = {Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4X) by applying vectorization using our data transformation approach.},
  booktitle = {Principles and Practice of Parallel Programming},
  url       = {http://portal.acm.org/citation.cfm?id=1693453.1693510},
  year      = {2010}
}