Big provenance stream processing for data intensive computations. Suriarachchi, I., Withana, S., & Plale, B. In Proceedings - IEEE 14th International Conference on eScience, e-Science 2018, pages 245-255, December 2018. Institute of Electrical and Electronics Engineers Inc.
In the business and research landscape of today, data analysis consumes public and proprietary data from numerous sources, and utilizes any one or more of popular data-parallel frameworks such as Hadoop, Spark and Flink. In the Data Lake setting these frameworks co-exist. Our earlier work has shown that data provenance in Data Lakes can aid with both traceability and management. The sheer volume of fine-grained provenance generated in a multi-framework application motivates the need for on-the-fly provenance processing. We introduce a new parallel stream processing algorithm that reduces fine-grained provenance while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out-of-order. It is evaluated using several strategies for partitioning a provenance stream. The evaluation shows that the parallel algorithm performs well in processing out-of-order provenance streams, with good scalability and accuracy.
@inproceedings{Suriarachchi2018,
 title = {Big provenance stream processing for data intensive computations},
 type = {inproceedings},
 year = {2018},
 keywords = {Big Data,Big Provenance,Stream Processing},
 pages = {245--255},
 month = {12},
 publisher = {Institute of Electrical and Electronics Engineers Inc.},
 day = {24},
 id = {02faffe9-74a8-370f-9c62-67aef6795cd5},
 created = {2020-04-22T21:44:56.973Z},
 accessed = {2020-04-21},
 file_attached = {false},
 profile_id = {42d295c0-0737-38d6-8b43-508cab6ea85d},
 last_modified = {2020-05-11T14:43:32.831Z},
 read = {false},
 starred = {false},
 authored = {true},
 confirmed = {false},
 hidden = {false},
 citation_key = {Suriarachchi2018},
 private_publication = {false},
 abstract = {In the business and research landscape of today, data analysis consumes public and proprietary data from numerous sources, and utilizes any one or more of popular data-parallel frameworks such as Hadoop, Spark and Flink. In the Data Lake setting these frameworks co-exist. Our earlier work has shown that data provenance in Data Lakes can aid with both traceability and management. The sheer volume of fine-grained provenance generated in a multi-framework application motivates the need for on-the-fly provenance processing. We introduce a new parallel stream processing algorithm that reduces fine-grained provenance while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out-of-order. It is evaluated using several strategies for partitioning a provenance stream. The evaluation shows that the parallel algorithm performs well in processing out-of-order provenance streams, with good scalability and accuracy.},
 bibtype = {inproceedings},
 author = {Suriarachchi, Isuru and Withana, Sachith and Plale, Beth},
 doi = {10.1109/eScience.2018.00039},
 booktitle = {Proceedings - IEEE 14th International Conference on eScience, e-Science 2018}
}
