PROLIT: Supporting the Transparency of Data Preparation Pipelines through Narratives over Data Provenance

PROLIT: Supporting the Transparency of Data Preparation Pipelines through Narratives over Data Provenance. Lazzaro, P. L., Lazzaro, M., Missier, P., & Torlone, R. In Procs. EDBT (Demo track), Barcelona, Spain, 2025. OpenProceedings.org.


Establishing trust in the models is a long-standing objective in Machine Learning and AI. Information on how data are manipulated before being used for training is instrumental in such understanding, and data provenance can be used to organize and navigate such information. The PROLIT system described in this demo paper is designed to collect, manage, and query the provenance of data as it flows through data preparation pipelines in support of data science analytics and machine learning modeling. PROLIT extends our prior work on transparently collecting data provenance in several directions. Most notably, it employs an LLM to: (i) automatically rewrite user-defined pipelines in a format suitable for this activity, (ii) segment the code to precisely associate provenance with each code snippet, and (iii) provide human-readable descriptions of each snippet that in turn can be used to generate provenance narratives. The demo will showcase these capabilities and offer the opportunity to interact with PROLIT on user-defined as well as pre-defined Python code where dataframes are used as the common data abstraction.

@inproceedings{lazzaro_prolit_2025,
	address = {Barcelona, Spain},
	title = {{PROLIT}: {Supporting} the {Transparency} of {Data} {Preparation} {Pipelines} through {Narratives} over {Data} {Provenance}},
	shorttitle = {{PROLIT}},
	url = {https://openproceedings.org/2025/conf/edbt/paper-336.pdf},
	doi = {10.48786/EDBT.2025.108},
	abstract = {Establishing trust in the models is a long-standing objective in Machine Learning and AI. Information on how data are manipulated before being used for training is instrumental in such understanding, and data provenance can be used to organize and navigate such information. The PROLIT system described in this demo paper is designed to collect, manage, and query the provenance of data as it flows through data preparation pipelines in support of data science analytics and machine learning modeling. PROLIT extends our prior work on transparently collecting data provenance in several directions. Most notably, it employs an LLM to: (i) automatically rewrite user-defined pipelines in a format suitable for this activity, (ii) segment the code to precisely associate provenance with each code snippet, and (iii) provide human-readable descriptions of each snippet that in turn can be used to generate provenance narratives. The demo will showcase these capabilities and offer the opportunity to interact with PROLIT on user-defined as well as pre-defined Python code where dataframes are used as the common data abstraction.},
	language = {en},
	urldate = {2025-04-16},
	booktitle = {Procs. {EDBT} ({Demo} track)},
	publisher = {OpenProceedings.org},
	author = {Lazzaro, Pasquale Leonardo and Lazzaro, Marialaura and Missier, Paolo and Torlone, Riccardo},
	year = {2025},
	keywords = {Database Technology},
}
