Computing Generalized Matrix Inverse on Spiking Neural Substrate.
Shukla, R.; Khoram^S, S.; Jorgensen, E.; Li, J.; Lipasti, M.; and Wright, S.
Frontiers in Neuroscience: Neuromorphic Engineering, 12: 115. Feb 2018.

@article{shukla2018frontiers,
  author   = {Rohit Shukla and Soroosh Khoram and Erik Jorgensen and Jing Li and Mikko Lipasti and Stephen Wright},
  title    = {Computing Generalized Matrix Inverse on Spiking Neural Substrate},
  journal  = {Frontiers in Neuroscience: Neuromorphic Engineering},
  year     = {2018},
  month    = {Feb},
  date     = {2018-02-14},
  volume   = {12},
  pages    = {115},
  doi      = {10.3389/fnins.2018.00115},
  issn     = {1662-453X},
  keywords = {journal}
}

Abstract: Emerging neural hardware substrates, such as IBM's TrueNorth Neurosynaptic System, can provide an appealing platform for deploying numerical algorithms. For example, a recurrent Hopfield neural network can be used to find the Moore-Penrose generalized inverse of a matrix, thus enabling a broad class of linear optimizations to be solved efficiently, at low energy cost. However, deploying numerical algorithms on hardware platforms that severely limit the range and precision of representation for numeric quantities can be quite challenging. This paper discusses these challenges and proposes a rigorous mathematical framework for reasoning about range and precision on such substrates. The paper derives techniques for normalizing inputs and properly quantizing synaptic weights originating from arbitrary systems of linear equations, so that solvers for those systems can be implemented in a provably correct manner on hardware-constrained neural substrates. The analytical model is empirically validated on the IBM TrueNorth platform, and results show that the guarantees provided by the framework for range and precision hold under experimental conditions. Experiments with optical flow demonstrate the energy benefits of deploying a reduced-precision and energy-efficient generalized matrix inverse engine on the IBM TrueNorth platform, reflecting 10× to 100× improvement over FPGA and ARM core baselines.
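
A minimal floating-point sketch of the idea underlying this paper: gradient descent on the least-squares energy that a recurrent Hopfield-style network relaxes converges to the Moore-Penrose pseudoinverse when started in the row space of A. This is an illustration only (the function name and step-size choice are ours), not the paper's spiking TrueNorth implementation or its range/precision framework.

import numpy as np

def pseudoinverse_recurrent(A, n_iters=5000):
    """Approximate A+ by gradient descent on ||A X - I||_F^2, the
    energy a recurrent (Hopfield-style) network minimizes. Starting
    from a point in the row space of A keeps every iterate there,
    so the fixed point is the minimum-norm solution, i.e., A+."""
    m = A.shape[0]
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # < 2/lambda_max(A^T A): stable
    X = eta * A.T                           # X0 lies in the row space of A
    I = np.eye(m)
    for _ in range(n_iters):
        X = X - eta * A.T @ (A @ X - I)     # "neuron update" = -grad(energy)
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))             # tall, full column rank
print(np.allclose(pseudoinverse_recurrent(A), np.linalg.pinv(A), atol=1e-6))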

Specialization: A New Path towards Low Power (invited).
Zha^S, Y.; and Li, J.
ASP Journal of Low Power Electronics, 14(2). 2018.

@article{zha2018JOLPE,
  author   = {Yue Zha and Jing Li},
  title    = {Specialization: A New Path towards Low Power (invited)},
  journal  = {ASP Journal of Low Power Electronics},
  year     = {2018},
  date     = {2018-02-15},
  volume   = {14},
  number   = {2},
  doi      = {10.1166/jolpe.2018.1559},
  keywords = {journal}
}

An Alternative Analytical Approach to Associative Processing (Best of CAL).
Khoram^S, S.; Zha^S, Y.; and Li, J.
IEEE Computer Architecture Letters, 17(2): 113–116. July 2018.

@article{Khoram2018CAL,
  author   = {Khoram, Soroosh and Zha, Yue and Li, Jing},
  title    = {An Alternative Analytical Approach to Associative Processing (Best of CAL)},
  journal  = {IEEE Computer Architecture Letters},
  year     = {2018},
  month    = {July},
  date     = {2018-01-03},
  volume   = {17},
  number   = {2},
  pages    = {113--116},
  doi      = {10.1109/LCA.2018.2789424},
  issn     = {1556-6056},
  keywords = {journal, Analytical models, Complexity theory, Computational modeling, Computer architecture, Parallel processing, Runtime, Two dimensional displays, Analysis of Algorithms and Problem Complexity, Associative Processors, Modeling techniques, Models of Computation}
}

Abstract: Associative Processing (AP) is a promising alternative to the Von Neumann model as it addresses the memory wall problem through its inherent in-memory computations. However, because of the countless design parameter choices, comparisons between implementations of two so radically different models are challenging for simulation-based methods. To tackle these challenges, we develop an alternative analytical approach based on a new concept called architecturally-determined complexity. Using this method, we asymptotically evaluate the runtime/storage/energy bounds of the two models, i.e., AP and Von Neumann. We further apply the method to gain more insights into the performance bottlenecks of traditional AP and develop a new machine model named Two Dimensional AP to address these limitations. Finally, we experimentally validate our analytical method and confirm that the simulation results match our theoretical projections.
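
As a companion to the abstract, here is a toy Python model of the associative-processing primitive whose cycles such an analysis counts: a masked parallel compare over all memory rows followed by a masked parallel write to the matching rows. Runtime in this model depends on word length, not on the number of rows. The names and bit layout are illustrative; the paper's two-dimensional AP extension is not modeled.

import numpy as np

def ap_cycle(mem, key, key_mask, value, write_mask):
    """One AP cycle on a (rows, bits) 0/1 memory: compare all rows
    against `key` on the key_mask bits, then write `value` into the
    write_mask bits of every matching row, conceptually in parallel."""
    match = ((mem == key) | ~key_mask).all(axis=1)      # parallel compare
    for c in np.where(write_mask)[0]:                   # parallel write
        mem[np.where(match)[0], c] = value[c]
    return match

# Usage: set the last bit of every word whose first two bits are 0,1.
mem = np.array([[0, 1, 0, 0],
                [0, 1, 1, 0],
                [1, 0, 1, 1]], dtype=np.uint8)
ap_cycle(mem,
         key=np.array([0, 1, 0, 0], dtype=np.uint8),
         key_mask=np.array([True, True, False, False]),
         value=np.array([0, 0, 0, 1], dtype=np.uint8),
         write_mask=np.array([False, False, False, True]))
print(mem)   # rows 0 and 1 matched; their last bit is now 1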

PQ-CNN: Accelerating Product Quantized Convolutional Neural Network (Poster).
Zhang^S, J.; and Li, J.
In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2018.

@inproceedings{zhang2018fccm,
  author    = {Zhang, Jialiang and Li, Jing},
  title     = {{PQ-CNN}: {Accelerating} Product Quantized Convolutional Neural Network (Poster)},
  booktitle = {2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)},
  year      = {2018},
  month     = {April},
  date      = {2018-04-29},
  doi       = {10.1109/FCCM.2018.00041},
  keywords  = {conference}
}

Efficient Large-scale Approximate Nearest Neighbor Search on the OpenCL-FPGA.
Zhang^S, J.; Khoram^S, S.; and Li, J.
In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4924–4932, Jun 2018. (Acceptance Rate: 29%, 979 out of over 3300)

@inproceedings{zhang2018cvpr,
  author    = {Zhang, Jialiang and Khoram, Soroosh and Li, Jing},
  title     = {Efficient Large-scale Approximate Nearest Neighbor Search on the {OpenCL-FPGA}},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2018},
  month     = {Jun},
  date      = {2018-06},
  pages     = {4924--4932},
  doi       = {10.1109/CVPR.2018.00517},
  note      = {(Acceptance Rate: 29\%, 979 out of over 3300)},
  keywords  = {conference}
}

Abstract: We present a new method for Product Quantization (PQ) based approximate nearest neighbor search (ANN) in high dimensional spaces. Specifically, we first propose a quantization scheme for the codebook of the coarse quantizer, product quantizer, and rotation matrix, to reduce the cost of accessing these codebooks. Our approach also combines a highly parallel k-selection method, which can be fused with the distance calculation to reduce the memory overhead. We implement the proposed method on the Intel HARPv2 platform using OpenCL-FPGA. The proposed method significantly outperforms state-of-the-art methods on CPU and GPU for high dimensional nearest neighbor queries on billion-scale datasets in terms of query time and accuracy regardless of the batch size. To the best of our knowledge, this is the first work to demonstrate FPGA performance superior to CPU and GPU on high-dimensional, large-scale ANN datasets.
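
A CPU-side numpy sketch of the product-quantization pipeline the accelerator implements: encode database vectors as centroid ids, then answer a query with one small distance lookup table per subspace (asymmetric distance computation) followed by a k-selection. The shapes and the random stand-in codebooks are ours; the paper's codebook quantization, OpenCL kernels, and fused k-selection are not modeled.

import numpy as np

rng = np.random.default_rng(0)
D, M, K = 32, 4, 256                  # dim, subspaces, centroids per subspace
d = D // M
books = rng.standard_normal((M, K, d))     # stand-in for trained codebooks

def pq_encode(X):
    """Store each vector as M one-byte centroid ids (4 bytes here)."""
    codes = np.empty((len(X), M), dtype=np.uint8)
    for m in range(M):
        sub = X[:, m*d:(m+1)*d]
        d2 = ((sub**2).sum(1, keepdims=True)
              - 2 * sub @ books[m].T + (books[m]**2).sum(1))
        codes[:, m] = d2.argmin(1)
    return codes

def pq_search(q, codes, topk=5):
    """Asymmetric distances: M small LUTs, then only gathers and adds."""
    lut = ((books - q.reshape(M, 1, d)) ** 2).sum(-1)    # (M, K) tables
    dists = lut[np.arange(M), codes].sum(-1)             # scan = lookups
    return np.argsort(dists)[:topk]                      # k-selection step

X = rng.standard_normal((10000, D))
codes = pq_encode(X)
print(pq_search(X[42], codes))        # index 42 should appear first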

Adaptive Quantization of Neural Networks.
Khoram^S, S.; and Li, J.
In International Conference on Learning Representations (ICLR), April 2018.

@inproceedings{Khoram2018iclr,
  author    = {Khoram, Soroosh and Li, Jing},
  title     = {Adaptive Quantization of Neural Networks},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2018},
  month     = {April},
  date      = {2018-04},
  url       = {https://openreview.net/forum?id=SyOK1Sg0W},
  keywords  = {conference}
}

Abstract: Despite the state-of-the-art accuracy of Deep Neural Networks (DNN) in various classification problems, their deployment onto resource-constrained edge computing devices remains challenging due to their large size and complexity. Several recent studies have reported remarkable results in reducing this complexity through quantization of DNN models. However, these studies usually do not consider the changes in the loss function when performing quantization, nor do they take the different importances of DNN model parameters to the accuracy into account. We address these issues in this paper by proposing a new method, called adaptive quantization, which simplifies a trained DNN model by finding a unique, optimal precision for each network parameter such that the increase in loss is minimized. The optimization problem at the core of this method iteratively uses the loss function gradient to determine an error margin for each parameter and assigns it a precision accordingly. Since this problem uses linear functions, it is computationally cheap and, as we will show, has a closed-form approximate solution. Experiments on MNIST, CIFAR, and SVHN datasets showed that the proposed method can achieve near or better than state-of-the-art reduction in model size with similar error rates. Furthermore, it can achieve compressions close to floating-point model compression methods without loss of accuracy.
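
The core first-order argument admits a very small sketch: bound the loss change by the sum of |g_i|·|Δw_i|, split the budget evenly across parameters to get a per-parameter error margin, and pick the fewest bits whose quantization step fits inside that margin. This one-shot version (the names and the even budget split are ours) omits the paper's iterative refinement and its closed-form approximation.

import numpy as np

def adaptive_bits(weights, grads, loss_budget):
    """Per-parameter precision from a first-order loss bound:
    |dL| <= sum_i |g_i| |dw_i|  =>  tolerate |dw_i| <= budget / (N |g_i|),
    then choose the fewest bits whose step size fits that tolerance."""
    n = weights.size
    margin = loss_budget / (n * (np.abs(grads) + 1e-12))   # error tolerance
    span = 2.0 * np.abs(weights) + 1e-12                   # range to represent
    bits = np.ceil(np.log2(np.maximum(span / margin, 1.0))).astype(int)
    return bits                                            # 0 bits: prunable

# Toy numbers: parameters with tiny gradients tolerate very coarse steps.
w = np.array([0.50, -0.02, 1.30, 0.07])
g = np.array([1e-4, 2.0, 1e-2, 1e-6])    # d(loss)/dw, e.g. from backprop
print(adaptive_bits(w, g, loss_budget=0.1))   # e.g. [0 2 1 0]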

Nonvolatile Memory Outlook: Technology Driven or Application Driven? (invited).
Li, J.
In 2018 China Semiconductor Technology International Conference (CSTIC), pages 1–4, March 2018.

@inproceedings{li2018CSTIC,
  author    = {Li, Jing},
  title     = {Nonvolatile Memory Outlook: Technology Driven or Application Driven? (invited)},
  booktitle = {2018 China Semiconductor Technology International Conference (CSTIC)},
  year      = {2018},
  month     = {March},
  date      = {2018-03-12},
  pages     = {1--4},
  doi       = {10.1109/CSTIC.2018.8369201},
  keywords  = {conference}
}

Liquid Silicon-Monona: A Reconfigurable Memory-Oriented Computing Fabric with Scalable Multi-Context Support.
Zha^S, Y.; and Li, J.
In 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18), volume 53, pages 214–228, New York, NY, USA, Mar 2018. ACM. (Acceptance Rate: 18.2%, 56 out of 307)

@inproceedings{zha2018asplos,
  author    = {Zha, Yue and Li, Jing},
  title     = {{Liquid Silicon-Monona}: A Reconfigurable Memory-Oriented Computing Fabric with Scalable Multi-Context Support},
  booktitle = {23rd International Conference on Architectural Support for Programming Languages and Operating Systems},
  series    = {ASPLOS '18},
  year      = {2018},
  month     = {Mar},
  date      = {2018-03-19},
  location  = {Williamsburg, VA, USA},
  volume    = {53},
  issue     = {2},
  pages     = {214--228},
  publisher = {ACM},
  address   = {New York, NY, USA},
  url       = {http://doi.acm.org/10.1145/3173162.3173167},
  doi       = {10.1145/3173162.3173167},
  note      = {(Acceptance Rate: 18.2\%, 56 out of 307)},
  keywords  = {conference}
}

Abstract: With the recent trend of promoting Field-Programmable Gate Arrays (FPGAs) to first-class citizens in accelerating compute-intensive applications in networking, cloud services and artificial intelligence, FPGAs face two major challenges in sustaining competitive advantages in performance and energy efficiency for diverse cloud workloads: 1) limited configuration capability for supporting light-weight computations/on-chip data storage to accelerate emerging search-/data-intensive applications; 2) lack of architectural support to hide reconfiguration overhead for assisting virtualization in a cloud computing environment. In this paper, we propose a reconfigurable memory-oriented computing fabric, namely Liquid Silicon-Monona (L-Si), enabled by emerging nonvolatile memory technology, i.e., RRAM, to address these two challenges. Specifically, L-Si addresses the first challenge by virtue of a new architecture comprising a 2D array of physically identical but functionally-configurable building blocks. It, for the first time, extends the configuration capabilities of existing FPGAs from computation to the whole spectrum ranging from computation to data storage. It allows users to better customize hardware by flexibly partitioning hardware resources between computation and memory, greatly benefiting emerging search- and data-intensive applications. To address the second challenge, L-Si provides scalable multi-context architectural support to minimize reconfiguration overhead for assisting virtualization. In addition, we provide compiler support to facilitate the programming of applications written in high-level programming languages (e.g. OpenCL) and frameworks (e.g. TensorFlow, MapReduce) while fully exploiting the unique architectural capability of L-Si. Our evaluation results show L-Si achieves 99.6% area reduction, 1.43× throughput improvement and 94.0% power reduction on search-intensive benchmarks, as compared with the FPGA baseline. For neural network benchmarks, on average, L-Si achieves 52.3× speedup, 113.9× energy reduction and 81% area reduction over the FPGA baseline. In addition, the multi-context architecture of L-Si reduces the context switching time to ∼10 ns, compared with an off-the-shelf FPGA (∼100 ms), greatly facilitating virtualization.
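
The multi-context idea can be pictured with a small data-structure model: every tile is physically identical, each holds several resident configurations, and a context switch only flips an index (hence nanoseconds) instead of streaming in a new bitstream (hence the ∼100 ms FPGA baseline). All names and the mode encoding below are illustrative; this models the concept, not L-Si's hardware.

from dataclasses import dataclass, field
from enum import Enum

class Mode(Enum):                 # what one tile can be configured into
    LOGIC = 1
    MEMORY = 2
    TCAM = 3
    INTERCONNECT = 4

@dataclass
class Tile:
    contexts: list = field(default_factory=list)   # resident configurations
    active: int = 0                                 # currently selected one

class Fabric:
    """A 2D array of identical, functionally-configurable tiles."""
    def __init__(self, rows, cols):
        self.tiles = [[Tile() for _ in range(cols)] for _ in range(rows)]

    def load_context(self, mapping):
        """mapping[(r, c)] = Mode, e.g. produced by a compiler pass.
        Loading is the slow path and happens once per workload."""
        for (r, c), mode in mapping.items():
            self.tiles[r][c].contexts.append(mode)

    def switch(self, ctx):
        """The fast path: select an already-resident context by index."""
        for row in self.tiles:
            for t in row:
                if ctx < len(t.contexts):
                    t.active = ctx

fab = Fabric(2, 2)
fab.load_context({(0, 0): Mode.TCAM, (0, 1): Mode.LOGIC,
                  (1, 0): Mode.MEMORY, (1, 1): Mode.INTERCONNECT})
fab.switch(0)   # no bitstream reload, so switching can be ~ns, not ~ms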

Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform.
Zhang^S, J.; and Li, J.
In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '18), pages 229–238, New York, NY, USA, Feb 2018. ACM. (Acceptance Rate*: 24%)

@inproceedings{zhang2018fpga,
  author    = {Zhang, Jialiang and Li, Jing},
  title     = {Degree-aware Hybrid Graph Traversal on {FPGA-HMC} Platform},
  booktitle = {Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
  series    = {FPGA '18},
  year      = {2018},
  month     = {Feb},
  date      = {2018-02-25},
  pages     = {229--238},
  location  = {Monterey, California, USA},
  publisher = {ACM},
  address   = {New York, NY, USA},
  url       = {http://doi.acm.org/10.1145/3174243.3174245},
  doi       = {10.1145/3174243.3174245},
  note      = {(Acceptance Rate*: 24\%)},
  keywords  = {conference, graph processor, hybrid memory cube, bfs}
}

Abstract: Graph traversal is a core primitive for graph analytics and a basis for many higher-level graph analysis methods. However, irregularities in the structure of scale-free graphs (e.g., social networks) limit our ability to analyze these important and growing datasets. A key challenge is the redundant graph computations caused by the presence of high-degree vertices which not only increase the total amount of computations but also incur unnecessary random data access. In this paper, we present a graph processing system on an FPGA-HMC platform, based on software/hardware co-design and co-optimization. For the first time, we leverage an inherent graph property, i.e., vertex degree, to co-optimize algorithm and hardware architecture. In particular, we first develop two algorithm optimization techniques: degree-aware adjacency list reordering and degree-aware vertex index sorting. The former can reduce the number of redundant graph computations, while the latter can create a strong correlation between vertex index and data access frequency, which can be effectively applied to guide the hardware design. We further implement the optimized hybrid graph traversal algorithm on an FPGA-HMC platform. By leveraging the strong correlation between vertex index and data access frequency created by degree-aware vertex index sorting, we develop two platform-dependent hardware optimization techniques, namely degree-aware data placement and degree-aware adjacency list compression. These two techniques together substantially reduce the amount of access to external memory. Finally, we conduct extensive experiments on an FPGA-HMC platform to verify the effectiveness of the proposed techniques. To the best of our knowledge, our implementation achieves the highest performance (45.8 billion traversed edges per second) among existing FPGA-based graph processing systems.
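
The algorithmic half of the design fits in a few lines: renumber vertices by descending degree so that hot vertices get small, contiguous ids (the correlation the hardware exploits for data placement), and run a direction-optimizing BFS that switches to bottom-up passes when the frontier grows large. The switching threshold alpha and the dict-of-sets graph are illustrative; the request-merging hardware and HMC specifics are not modeled.

def degree_sort(adj):
    """Renumber vertices so high-degree (hot) vertices get small ids."""
    order = sorted(adj, key=lambda v: -len(adj[v]))
    remap = {v: i for i, v in enumerate(order)}
    return {remap[v]: {remap[u] for u in adj[v]} for v in adj}

def hybrid_bfs(adj, src, alpha=4):
    """Top-down while the frontier is small, bottom-up once it is large."""
    n = len(adj)
    depth, frontier, level = {src: 0}, {src}, 0
    while frontier:
        level += 1
        if sum(len(adj[v]) for v in frontier) > n // alpha:   # bottom-up
            nxt = {v for v in adj if v not in depth
                     and not frontier.isdisjoint(adj[v])}
        else:                                                 # top-down
            nxt = {u for v in frontier for u in adj[v] if u not in depth}
        for v in nxt:
            depth[v] = level
        frontier = nxt
    return depth

# Usage: a tiny undirected graph as {vertex: set of neighbors}.
g = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}, 4: {5}, 5: {4}}
print(hybrid_bfs(degree_sort(g), src=0))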

Liquid Silicon: A Data-Centric Reconfigurable Architecture enabled by RRAM Technology.
Zha^S, Y.; and Li, J.
In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '18), pages 51–60, New York, NY, USA, Feb 2018. ACM. (Acceptance Rate*: 24%, Ranked #1 among 100+ submissions)

@inproceedings{zha2018fpga,
  author    = {Zha, Yue and Li, Jing},
  title     = {{Liquid Silicon}: A Data-Centric Reconfigurable Architecture enabled by {RRAM} Technology},
  booktitle = {Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
  series    = {FPGA '18},
  year      = {2018},
  month     = {Feb},
  date      = {2018-02-25},
  pages     = {51--60},
  location  = {Monterey, California, USA},
  publisher = {ACM},
  address   = {New York, NY, USA},
  url       = {http://doi.acm.org/10.1145/3174243.3174244},
  doi       = {10.1145/3174243.3174244},
  note      = {(Acceptance Rate*: 24\%, Ranked \#1 among 100+ submissions)},
  keywords  = {conference, monolithic stacking, non-volatile memory, processing-in-memory, reconfigurable architecture, tcam}
}

Abstract: This paper presents a data-centric reconfigurable architecture, namely Liquid Silicon, enabled by emerging non-volatile memory, i.e., RRAM. Compared to the heterogeneous architecture of commercial FPGAs, Liquid Silicon is inherently a homogeneous architecture comprising a two-dimensional (2D) array of identical 'tiles'. Each tile can be configured into one or a combination of four modes: TCAM, logic, interconnect, and memory. Such flexibility allows users to partition resources based on applications' needs, in contrast to the fixed hardware design using dedicated hard IP blocks in FPGAs. In addition to better resource usage, its 'memory friendly' architecture effectively addresses the limitations of commercial FPGAs, i.e., scarce on-chip memory resources, making it an effective complement to FPGAs. Moreover, its coarse-grained logic implementation results in shallower logic depth, less inter-tile routing overhead, and thus smaller area and better performance, compared with its FPGA counterpart. Our study shows that, on average, for both traditional and emerging applications, we achieve 62% area reduction, 27% speedup and 31% improvement in energy efficiency when mapping applications onto Liquid Silicon instead of FPGAs.
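
Of the four tile modes, TCAM is the least familiar; the sketch below shows the lookup semantics a tile in that mode provides: stored words carry don't-care positions, and one lookup compares the key against every word in parallel. The patterns and mask encoding are illustrative, not L-Si's circuit-level design.

import numpy as np

# Stored TCAM words: `bits` holds the cared-about values, `care` marks
# which positions matter (False = don't-care, matches either key bit).
bits = np.array([[1, 0, 1, 0],
                 [0, 0, 1, 1]], dtype=np.uint8)
care = np.array([[True, True, True, False],    # word 0: pattern 101X
                 [True, False, True, True]])   # word 1: pattern 0X11

def tcam_lookup(key):
    """One lookup: all stored words are compared in parallel."""
    key = np.asarray(key, dtype=np.uint8)
    return ((bits == key) | ~care).all(axis=1)

print(tcam_lookup([1, 0, 1, 1]))   # [ True False ]: word 0 matches via X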

Accelerating Graph Analytics By Co-Optimizing Storage and Access on an FPGA-HMC Platform.
Khoram^S, S.; Zhang^S, J.; Strange^S, M.; and Li, J.
In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '18), pages 239–248, New York, NY, USA, Feb 2018. ACM. (Acceptance Rate*: 24%)

@inproceedings{Khoram2018fpga,
  author    = {Khoram, Soroosh and Zhang, Jialiang and Strange, Maxwell and Li, Jing},
  title     = {Accelerating Graph Analytics By Co-Optimizing Storage and Access on an {FPGA-HMC} Platform},
  booktitle = {Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
  series    = {FPGA '18},
  year      = {2018},
  month     = {Feb},
  date      = {2018-02-25},
  pages     = {239--248},
  location  = {Monterey, California, USA},
  publisher = {ACM},
  address   = {New York, NY, USA},
  url       = {http://doi.acm.org/10.1145/3174243.3174260},
  doi       = {10.1145/3174243.3174260},
  note      = {(Acceptance Rate*: 24\%)},
  keywords  = {conference, graph analytics, graph clustering, hardware accelerators, hybrid memory cube, reconfigurable logic, bfs}
}

Abstract: Graph analytics, which explores the relationships among interconnected entities, is becoming increasingly important due to its broad applicability, from machine learning to social sciences. However, due to the irregular data access patterns in graph computations, one major challenge for graph processing systems is performance. The algorithms, software, and hardware that have been tailored for mainstream parallel applications are generally not effective for massive, sparse graphs from real-world problems, due to their complex and irregular structures. To address the performance issues in large-scale graph analytics, we leverage the exceptional random access performance of the emerging Hybrid Memory Cube (HMC) combined with the flexibility and efficiency of modern FPGAs. In particular, we develop a collaborative software/hardware technique to perform a level-synchronized Breadth First Search (BFS) on an FPGA-HMC platform. From the software perspective, we develop an architecture-aware graph clustering algorithm that exploits the FPGA-HMC platform's capability to improve data locality and memory access efficiency. From the hardware perspective, we further improve the FPGA-HMC graph processor architecture by designing a memory request merging unit to take advantage of the increased data locality resulting from graph clustering. We evaluate the performance of our BFS implementation using the AC-510 development kit from Micron and achieve 2.8× average performance improvement compared to the latest FPGA-HMC based graph processing system over a set of benchmarks from a wide range of applications.
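
The hardware half of the co-design can be glossed with a toy request-merging model: after clustering improves locality, vertex reads that land in the same memory block coalesce into one outstanding request. The 64-byte block size and the addresses are illustrative assumptions, not AC-510/HMC parameters.

from collections import OrderedDict

BLOCK = 64   # illustrative access granule, in bytes

def merge_requests(addresses):
    """Coalesce reads that fall in the same block into one request."""
    pending = OrderedDict()                  # block id -> offsets wanted
    for a in addresses:
        pending.setdefault(a // BLOCK, []).append(a % BLOCK)
    return pending                           # one memory request per key

reqs = merge_requests([0x100, 0x108, 0x110, 0x400, 0x118])
print(len(reqs), "requests instead of 5")    # 2: good locality pays off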