Why-Diff: Exploiting Provenance to Understand Outcome Differences from non-identical Reproduced Workflows. Thavasimani, P., Cala, J., & Missier, P. IEEE Access, 2019.
@article{thavasimani_why-diff_2019,
	title = {Why-{Diff}: {Exploiting} {Provenance} to {Understand} {Outcome} {Differences} from non-identical {Reproduced} {Workflows}},
	issn = {2169-3536},
	url = {https://ieeexplore.ieee.org/document/8662612/},
	doi = {10.1109/ACCESS.2019.2903727},
	abstract = {Data analytics processes such as scientific workflows tend to be executed repeatedly, with varying dependencies and input datasets. The case has been made in the past for tracking the provenance of the final information products through the workflow steps, to enable their reproducibility. In this work, we explore the hypothesis that provenance traces recorded during execution are also instrumental to answering questions about the observed differences between sets of results obtained from similar but not identical workflow configurations. Such differences in configurations may be introduced deliberately, i.e., to explore process variations, or accidentally, typically as the result of porting efforts or of changes in the computing environment. Using a commonly used workflow programming model as a reference, we consider both structural variations in the workflows as well as variations within their individual components. Our whydiff algorithm compares the graph representations of two provenance traces derived from two workflow variations. It produces a delta graph that can be used to produce human-readable explanations of the impact of workflow differences on observed output differences. We report on our Neo4j graph database. We also report explanations of difference between workflow results using a suite of synthetic workflows as well as real-world workflows.},
	journal = {IEEE Access},
	author = {Thavasimani, Priyaa and Cala, Jacek and Missier, Paolo},
	year = {2019},
	keywords = {eScience Central, Provenance, Big Data, Why-Diff, Workflow, Reproducibility, Software, Alzheimer's disease, Databases, Genetics, Libraries, Sentiment analysis},
	pages = {1--1},
}
