Helmholtz Metadata Collaboration | Conference 2023

Name: Helmholtz Metadata Collaboration | Conference 2023
Start: 2023-10-10T09:30:00+02:00
End: 2023-10-12T15:30:00+02:00
Location: virtual, details will be shared with you after registration

10–12 Oct 2023

Europe/Berlin timezone

Contact

event@helmholtz-metadaten.de

Enhancing Transparency and Reproducibility in AI Model Training through Provenance-Enabled Data Preprocessing and Workflow Documentation

11 Oct 2023, 11:10

20m

Room 2

Talk Parallel Track 2

Nils Hoffmann

As AI models grow in scale and complexity, so does their demand for vast amounts of data, including unlabelled and uncurated datasets. Automated data preprocessing, filtering, augmentation, and curation have therefore become increasingly important for ensuring optimal model performance. However, while AI models themselves are increasingly well documented, the preprocessing applied to their training data often remains incompletely documented, which impedes reproducibility.

This paper proposes a novel approach to improving transparency and reproducibility in training scenarios by capturing concrete provenance information throughout the data preprocessing workflow. By recording the sequence of transformations applied to the data, researchers and developers can recreate workflows and analyze the provenance of individual samples, which supports informed decision-making. The same mechanism offers a way to gather metadata for debugging, monitoring, and development purposes, giving developers valuable insights.

The proposed solution is realized as a lightweight data pipeline library designed for chainable stream operations. Because it focuses on this common use case, the library can systematically capture structured provenance information, reducing the overhead of subsequent analysis. Demonstrated on a compact computer vision use case, the methodology not only exposes useful training metrics but also comprehensively documents the workflow with negligible performance overhead.
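
To illustrate the idea of chainable stream operations that record provenance, here is a minimal Python sketch. It is not the library presented in the talk, whose API is not described in this abstract; the names `Sample`, `Stream`, `from_iterable`, `map`, `filter`, and `collect` are hypothetical and chosen only for illustration. Each sample carries an ordered tuple of the steps applied to it, and samples rejected by a filter are logged together with the step that rejected them.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """A data item plus the ordered steps applied to it (hypothetical type)."""
    value: Any
    provenance: Tuple[str, ...] = ()


class Stream:
    """Lazy, chainable stream that records each step per sample (illustrative only)."""

    def __init__(self, samples: Iterable[Sample], dropped: Optional[List[Sample]] = None):
        self._samples = samples
        # Shared across all chained stages so rejected samples stay traceable.
        self.dropped: List[Sample] = dropped if dropped is not None else []

    @classmethod
    def from_iterable(cls, items: Iterable[Any], source: str = "memory") -> "Stream":
        # Tag every sample with its origin and position in the source.
        return cls(Sample(x, (f"source:{source}[{i}]",)) for i, x in enumerate(items))

    def map(self, fn: Callable[[Any], Any], name: Optional[str] = None) -> "Stream":
        step = f"map:{name or fn.__name__}"

        def gen(samples=self._samples):
            for s in samples:
                yield Sample(fn(s.value), s.provenance + (step,))

        return Stream(gen(), self.dropped)

    def filter(self, pred: Callable[[Any], bool], name: Optional[str] = None) -> "Stream":
        step = f"filter:{name or pred.__name__}"

        def gen(samples=self._samples):
            for s in samples:
                if pred(s.value):
                    yield Sample(s.value, s.provenance + (step,))
                else:
                    # Keep a record of why and where the sample left the pipeline.
                    self.dropped.append(Sample(s.value, s.provenance + (step + ":rejected",)))

        return Stream(gen(), self.dropped)

    def collect(self) -> List[Sample]:
        # Nothing runs until the stream is consumed here.
        return list(self._samples)


def is_even(x: int) -> bool:
    return x % 2 == 0


def square(x: int) -> int:
    return x * x


stream = Stream.from_iterable(range(5), source="toy").filter(is_even).map(square)
kept = stream.collect()
for s in kept:
    print(s.value, s.provenance)
```

After `collect()`, every retained sample lists its source index and the steps it passed through, while `stream.dropped` holds the rejected samples with the rejecting step. In this sketch the per-sample cost is one tuple concatenation per step; the abstract's claim about negligible overhead refers to the author's own library, not to this example.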

Further work involves extending the approach to larger and more complex use cases. Additionally, integrating intermediate dataset versioning, facilitated by a dedicated DVC plugin, allows intermediate data to be versioned and traced, thereby further improving the reproducibility and transparency of AI model training workflows.

Keywords: Provenance, Metadata collection, transparent AI, ML Training workflows

Topic: Metadata annotation and management close to the research process
Stakeholder (presenting author): Researchers

Primary author: Nils Hoffmann

Presentation materials: HIFIS.pdf
