FUnctionality Sharing In Open eNvironments
Heinz Nixdorf Chair for Distributed Information Systems

Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles.

Title: Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles.
Authors: Sheeba Samuel, Frank Löffler, Birgitta König-Ries
Source: Provenance Week 2021
Place: Provenance and Annotation of Data and Processes - 8th and 9th International Provenance and Annotation Workshop, IPAW 2020 + IPAW 2021, Virtual Event, July 19-22, 2021
Date: 2021-07-19
Type: Publication

Machine learning (ML) is an increasingly important scientific tool supporting decision making and knowledge generation in numerous fields. With this, it also becomes more and more important that the results of ML experiments are reproducible. Unfortunately, that often is not the case. Rather, ML, similar to many other disciplines, faces a reproducibility crisis. In this paper, we describe our goals and initial steps in supporting the end-to-end reproducibility of ML pipelines. We investigate which factors beyond the availability of source code and datasets influence reproducibility of ML experiments. We propose ways to apply FAIR data practices to ML workflows. We present our preliminary results on the role of our tool, ProvBook, in capturing and comparing provenance of ML experiments and their reproducibility using Jupyter Notebooks. We also present the ReproduceMeGit tool to analyze the reproducibility of ML pipelines described in Jupyter Notebooks.

URL: https://doi.org/10.1007/978-3-030-80960-7_17
    author="Samuel, Sheeba
    and L{\"o}ffler, Frank
    and K{\"o}nig-Ries, Birgitta",
    editor="Glavic, Boris
    and Braganholo, Vanessa
    and Koop, David",
    title="Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles",
    booktitle="Provenance and Annotation of Data and Processes",
    publisher="Springer International Publishing",