FUnctionality Sharing In Open eNvironments
Heinz Nixdorf Chair for Distributed Information Systems

Tracking Provenance in Machine Learning Scripts (Bachelor or Master)


According to the Oxford Dictionary, provenance is “the source or origin of an object; its history or pedigree”. The provenance of a data product is its description together with an explanation of how and why it reached its current state. Machine Learning (ML) is an emerging tool that is being applied in many areas, including medicine, computer vision, security, and privacy. The tremendous growth of data and scripts creates a need for provenance tracking in Machine Learning scripts. The task in this project is to automatically identify the relationships between data and ML models from the scripts, for example, which datasets and columns have been used to derive the features of an ML model. The developed solution will track the provenance of Machine Learning scripts and create a report on it. As a start, the solution will capture information from Python scripts. It will be evaluated with Machine Learning scripts available on GitHub.
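To give a rough idea of what "identifying datasets and columns from a script" could look like, here is a minimal sketch based on static analysis with Python's standard `ast` module (Python 3.9 or later is assumed). It is only one possible starting point, not the approach prescribed for the thesis. The example script, the file name `patients.csv`, the column names, and the helper names `ProvenanceVisitor` and `extract_provenance` are all hypothetical. The sketch only recognizes `read_csv(...)` calls with a literal file name and string-literal subscripts such as `df["age"]`; a real solution would have to handle many more patterns.

```python
import ast

# Hypothetical ML script to analyze. It is only parsed, never executed,
# so pandas and scikit-learn do not need to be installed.
SCRIPT = '''
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("patients.csv")
X = df[["age", "blood_pressure"]]
y = df["diagnosis"]
model = LogisticRegression().fit(X, y)
'''


def _is_str_constant(node):
    """Return True if the AST node is a string literal."""
    return isinstance(node, ast.Constant) and isinstance(node.value, str)


class ProvenanceVisitor(ast.NodeVisitor):
    """Collects dataset files and column names referenced in a script."""

    def __init__(self):
        self.datasets = []  # literal file names passed to read_csv(...)
        self.columns = []   # string literals used as subscripts

    def visit_Call(self, node):
        # Match any attribute call named read_csv, e.g. pd.read_csv("x.csv").
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr == "read_csv":
            if node.args and _is_str_constant(node.args[0]):
                self.datasets.append(node.args[0].value)
        self.generic_visit(node)

    def visit_Subscript(self, node):
        # Match df["col"] and df[["col_a", "col_b"]].
        index = node.slice
        if _is_str_constant(index):
            self.columns.append(index.value)
        elif isinstance(index, ast.List):
            for element in index.elts:
                if _is_str_constant(element):
                    self.columns.append(element.value)
        self.generic_visit(node)


def extract_provenance(source):
    """Parse the script source and return the datasets and columns found."""
    visitor = ProvenanceVisitor()
    visitor.visit(ast.parse(source))
    return {"datasets": visitor.datasets, "columns": visitor.columns}


if __name__ == "__main__":
    print(extract_provenance(SCRIPT))
```

Static analysis is used here because it needs no execution environment for the analyzed script. It cannot see values that are only known at run time (for example, a file name read from a configuration file), which is one of the trade-offs a thesis on this topic would need to examine.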

What are the tasks?

  • State of the art of provenance tracking from scripts
  • Concept and development of a provenance tracking solution and a visualization UI
  • Evaluation of the developed solution with Machine Learning scripts available on GitHub

What do we expect?

  • Programming knowledge in Python, JavaScript, HTML, and CSS
  • Foundations of Machine Learning

What do we offer?

  • A nice working environment
  • An interdisciplinary and international team

Start: May 2020 or later
Supervisors: Sheeba Samuel (sheeba.samuel@uni-jena.de), Birgitta Koenig-Ries (birgitta.koenig-ries@uni-jena.de)