Design and Development of a Provenance Capture Platform for Data Science. Gregori, L., Missier, P., Stidolph, M., Torlone, R., & Wood, A. In Procs. 3rd DATAPLAT workshop, co-located with ICDE 2024, Utrecht, NL, May, 2024. IEEE.
As machine learning and AI systems become more prevalent, understanding how their decisions are made is key to maintaining their trust. To solve this problem, it is widely accepted that fundamental support can be provided by the knowledge of how data are altered in the pre-processing phase, using data provenance to track such changes. This paper focuses on the design and development of a system for collecting and managing data provenance of data preparation pipelines in data science. An investigation of publicly available machine learning pipelines is conducted to identify the most important features required for the tool to achieve impact on a broad selection of pre-processing data manipulation. This reveals that the operations that are used in practice can be implemented by combining a rather limited set of basic operators. We then illustrate and test implementation choices aimed at supporting the provenance capture for those operations efficiently and with minimal effort for data scientists.
@inproceedings{gregori_design_2024,
	address = {Utrecht, NL},
	title = {Design and {Development} of a {Provenance} {Capture} {Platform} for {Data} {Science}},
	abstract = {As machine learning and AI systems become more prevalent, understanding how their decisions are made is key to maintaining their trust. To solve this problem, it is widely accepted that fundamental support can be provided by the knowledge of how data are altered in the pre-processing phase, using data provenance to track such changes.
This paper focuses on the design and development of a system for collecting and managing data provenance of data preparation pipelines in data science. An investigation of publicly available machine learning pipelines is conducted to identify the most important features required for the tool to achieve impact on a broad selection of pre-processing data manipulation. This reveals that the operations that are used in practice can be implemented by combining a rather limited set of basic operators. We then illustrate and test implementation choices aimed at supporting the provenance capture for those operations efficiently and with minimal effort for data scientists.},
	booktitle = {Procs. 3rd {DATAPLAT} workshop, co-located with {ICDE} 2024},
	publisher = {IEEE},
	author = {Gregori, Luca and Missier, Paolo and Stidolph, Matthew and Torlone, Riccardo and Wood, Alessandro},
	month = may,
	year = {2024},
	url = {https://www.dropbox.com/scl/fi/plz8egd5wdvb5bp5vra09/840300a285.pdf?rlkey=gitqo6jzveh915g9fhbsqpqyn&st=8pk9vluh&dl=0}
}
