FUnctionality Sharing In Open eNvironments
Heinz Nixdorf Chair for Distributed Information Systems

Tracking Provenance in Machine Learning Scripts (taken)

Title: Tracking Provenance in Machine Learning Scripts (taken)
Author(s): Dominik Kerzel
Supervisor(s): Dr. Sheeba Samuel, Prof. Dr. Birgitta König-Ries
School: Friedrich-Schiller-Universität Jena
Thesis Type: Bachelor
Date: 2021-04-01
Abstract: According to the Oxford Dictionary, provenance is “the source or origin of an object; its history or pedigree”. The provenance of a data product is its description together with an explanation of how and why it reached its current state. Machine Learning (ML) is an emerging tool that is being applied in areas including medicine, computer vision, security, and privacy. The tremendous growth in data and scripts creates a need for provenance tracking in ML scripts. The task in this thesis is to automatically identify the relationships between data and ML models from the scripts: which datasets and columns have been used to derive the features of an ML model? The developed solution will track the provenance of ML scripts and create a report on it. As a start, the solution will capture information from Python scripts. It will be evaluated on ML scripts available on GitHub.
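
To illustrate what identifying datasets and columns from a script could look like, here is a minimal sketch based on static analysis with Python's standard `ast` module. It is not the approach developed in the thesis, only one possible starting point. It assumes Python 3.9 or later, that datasets are loaded through a call whose attribute name is `read_csv` with a literal file path, and that columns are selected with string subscripts. The names `ProvenanceVisitor` and `track_provenance` and the file `houses.csv` are hypothetical.

```python
import ast


def _string_constants(node):
    """Return string literals from a subscript such as df["a"] or df[["a", "b"]]."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return [node.value]
    if isinstance(node, (ast.List, ast.Tuple)):
        return [
            elt.value
            for elt in node.elts
            if isinstance(elt, ast.Constant) and isinstance(elt.value, str)
        ]
    return []


class ProvenanceVisitor(ast.NodeVisitor):
    """Hypothetical visitor: records loaded datasets and the columns read from them."""

    def __init__(self):
        self.datasets = {}  # variable name -> file path passed to read_csv
        self.columns = {}   # variable name -> set of column names accessed

    def visit_Assign(self, node):
        call = node.value
        # Match assignments of the form `name = <anything>.read_csv("literal path", ...)`.
        if (
            isinstance(call, ast.Call)
            and isinstance(call.func, ast.Attribute)
            and call.func.attr == "read_csv"
            and call.args
            and isinstance(call.args[0], ast.Constant)
            and isinstance(call.args[0].value, str)
        ):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    self.datasets[target.id] = call.args[0].value
        self.generic_visit(node)

    def visit_Subscript(self, node):
        # Record string subscripts on plain variable names, e.g. df["price"].
        if isinstance(node.value, ast.Name):
            for column in _string_constants(node.slice):
                self.columns.setdefault(node.value.id, set()).add(column)
        self.generic_visit(node)


def track_provenance(source):
    """Parse a script (without running it) and report datasets and their used columns."""
    visitor = ProvenanceVisitor()
    visitor.visit(ast.parse(source))
    return {
        var: {"source": path, "columns": sorted(visitor.columns.get(var, set()))}
        for var, path in visitor.datasets.items()
    }


# Example ML script; it is only parsed, so pandas and scikit-learn need not be installed.
SCRIPT = '''
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("houses.csv")
X = df[["area", "rooms"]]
y = df["price"]
model = LinearRegression().fit(X, y)
'''

report = track_provenance(SCRIPT)
print(report)
# {'df': {'source': 'houses.csv', 'columns': ['area', 'price', 'rooms']}}
```

Static analysis of this kind never executes the script, which makes it safe to apply to arbitrary repositories, but it misses anything resolved only at run time, such as file paths built from variables or columns selected through derived data frames. Covering those cases would require data-flow analysis or run-time instrumentation.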