Costea, A., Ionescu, A., Raducanu, B., Świtakowski, M., Bârca, C., Sompolski, J., Łuszczak, A., Szafrański, M., De Nijs, G., & Boncz, P. VectorH: Taking SQL-on-Hadoop to the next level. In SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data, volume 26-June-2016, pages 1105–1117. Association for Computing Machinery (ACM), June 2016.
Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only file-system, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration. We evaluate VectorH against HAWQ, Impala, SparkSQL and Hive, showing orders of magnitude better performance.
@inbook{81f07f72c43d466a8c80eb41d581cc13,
  title     = "VectorH: Taking SQL-on-Hadoop to the next level",
  abstract  = "Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only file-system, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. We describe the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimization and execution, HDFS storage, transaction processing and Spark integration. We evaluate VectorH against HAWQ, Impala, SparkSQL and Hive, showing orders of magnitude better performance.",
  author    = "Andrei Costea and Adrian Ionescu and Bogdan Raducanu and Michał Świtakowski and Cristian Bârca and Juliusz Sompolski and Alicja Łuszczak and Michał Szafrański and {De Nijs}, Giel and Peter Boncz",
  year      = "2016",
  month     = "6",
  doi       = "10.1145/2882903.2903742",
  volume    = "26-June-2016",
  pages     = "1105--1117",
  booktitle = "SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data",
  publisher = "Association for Computing Machinery (ACM)",
}
