PROLIT: Supporting the Transparency of Data Preparation Pipelines through Narratives over Data Provenance. Lazzaro, P. L., Lazzaro, M., Missier, P., & Torlone, R. In Procs. EDBT (Demo track), Barcelona, Spain, 2025. OpenProceedings.org.
Establishing trust in models is a long-standing objective in Machine Learning and AI. Information on how data are manipulated before being used for training is instrumental in establishing such trust, and data provenance can be used to organize and navigate this information. The PROLIT system described in this demo paper is designed to collect, manage, and query the provenance of data as it flows through data preparation pipelines in support of data science analytics and machine learning modeling. PROLIT extends our prior work on transparently collecting data provenance in several directions. Most notably, it employs an LLM to: (i) automatically rewrite user-defined pipelines in a format suitable for provenance collection, (ii) segment the code to precisely associate provenance with each code snippet, and (iii) provide human-readable descriptions of each snippet, which can in turn be used to generate provenance narratives. The demo will showcase these capabilities and offer the opportunity to interact with PROLIT on user-defined as well as pre-defined Python code in which dataframes are used as the common data abstraction.
