Learning Unit 2: Reproducibility

This unit explores reproducibility in data science. Students will learn to implement FAIR principles (Findable, Accessible, Interoperable, Reusable) for datasets and workflows through documentation frameworks like Datasheets for Datasets. We examine common reproducibility failures in real-world cases and industry settings. Practical guidance covers dependency management, literate programming, and licensing strategies. Through hands-on exercises, students will critically reflect on their workflow design choices and documentation practices.

Exercises

Task 1 - Reproducible Executability

In the lecture you learnt about reusability as one of the principles of reproducible data science workflows. One central element of reusability is executability. The aim is to overcome the "It works on my machine" problem: code can (and often does) behave differently across machines and over time, mostly because of missing packages, version mismatches and OS differences.

You also learnt that there are specific tools in Python that help with virtualization and dependency management. Your goal now is to take your Streamlit application code from last week's assignment and get it running in a virtual environment.
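
As a starting point, the sketch below shows one possible way to approach this using only Python's standard-library venv module. It is not the required solution: the helper names create_env and setup_commands are hypothetical, and the file names requirements.txt and app.py are assumptions about how your project is laid out. The second helper only builds the commands so that you can inspect them before running anything.

```python
import sys
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment in env_dir and return its interpreter path."""
    env_dir = Path(env_dir)
    # clear=True empties an existing directory so the environment starts clean.
    venv.EnvBuilder(with_pip=with_pip, clear=True).create(env_dir)
    # The interpreter lives in a platform-specific subdirectory.
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def setup_commands(python, requirements="requirements.txt", app="app.py"):
    """Build (but do not run) the commands that install dependencies and start the app.

    The default file names are assumptions; adjust them to your project.
    """
    python = str(python)
    return [
        # Install the dependencies listed in the requirements file into the environment.
        [python, "-m", "pip", "install", "-r", requirements],
        # Start the Streamlit app with the environment's own interpreter.
        [python, "-m", "streamlit", "run", app],
    ]
```

Calling create_env(".venv") and passing the returned path to setup_commands yields two commands that can be run one after the other, for example with subprocess.run. Using the environment's own interpreter for both steps keeps the installed packages separate from the system-wide Python installation.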

Task 2 - Git intro

Task 1 explored the executability aspect of reproducibility. To ensure accessibility, open-source code is commonly published on Git hosting platforms such as GitLab or GitHub.
Integrating Git into your workflow early serves another key purpose: version control. Sequential check-ins of project updates and work on separate branches make troubleshooting and collaboration easier.

The resource provided here introduces Git for version control, progress tracking and collaborative work.
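
To make the idea of check-ins and branches concrete, the following is a minimal sketch that drives the git command line from Python. It assumes git is installed and on your PATH. The helper names git and first_commit_and_branch, the branch name "feature", and the placeholder identity are all hypothetical choices for illustration. In day-to-day work you would normally type the same commands directly in a terminal.

```python
import subprocess
from pathlib import Path


def git(repo, *args):
    """Run one git command inside repo and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def first_commit_and_branch(repo, branch="feature"):
    """Initialise a repository, check in all files, and start a new branch.

    The directory must contain at least one file, otherwise the commit fails.
    """
    repo = Path(repo)
    git(repo, "init")
    # Stage every file in the directory for the first check-in.
    git(repo, "add", ".")
    # The identity is passed inline (placeholder values) so the sketch does not
    # depend on a global git configuration.
    git(
        repo,
        "-c", "user.name=Student",
        "-c", "user.email=student@example.com",
        "commit", "-m", "Initial commit",
    )
    # Create a separate branch and switch to it for further work.
    git(repo, "checkout", "-b", branch)
    # Return the one-line history, which now contains the first check-in.
    return git(repo, "log", "--oneline")
```

Each later check-in adds another line to the history returned by the last command, which is what makes it possible to trace when a change was introduced.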