Reproducible Machine Learning Workflows for Scientists with Pixi

Introduction

A critical component of research software sustainability is the reproducibility of the software and of the computing environments the software operates, and “lives”, in. Providing software as a “package” — a standardized distribution of all source or binary components of the software required for use, along with identifying metadata — goes a long way toward improving the reproducibility of software libraries. However, most researchers are consumers of libraries rather than developers of them, yet they still need reproducible computing environments for research software applications that may be run across multiple computing platforms — e.g. scientific analyses, visualization tools, data transformation pipelines, and artificial intelligence (AI) and machine learning (ML) applications on hardware accelerator platforms (e.g. GPUs). While workflow engines and Linux containers offer a gold standard for scientific computing reproducibility, they require additional layers of training and software engineering knowledge. Modern open source multi-platform environment management tools, e.g. Pixi (Arts et al., n.d.), provide automatic multi-platform hash-level lock file support for all dependencies — down to the compiler level — of software on public package indexes (e.g. PyPI (The Python Package Index (PyPI), n.d.) and conda-forge (conda-forge community, 2015)), while still providing a high-level interface well suited to researchers. Combined with the arrival of the full CUDA (Nickolls et al., 2008) stack on conda-forge, it is now possible to declaratively specify a fully CUDA-accelerated software environment. We are now at a point where well supported, robust technological solutions exist, even for applications with highly complex software environments.
What is currently lacking is education and training from the broader scientific software community in adopting these technologies and building community standards of practice around them, as well as an understanding of which features of computational reproducibility tools are the most actionably useful.

Reproducibility

“Reproducible” research is a term that can mean different things across fields. Some fields may view work as “reproducible” if the full process is documented, while others may require that all computations give the same numerical outputs barring entropy variations. As there are multiple levels of reproducibility, we restrict “reproducibility” here to software environment reproducibility. We define this as the ability to define and programmatically create a software environment composed of packages that specifies all software, and all of its dependencies, with exact URLs and binary digests (“hashes”). Reproducible environments must also be machine agnostic: for a computing platform specified in the environment, they must be installable without modification across multiple machine instances.
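To illustrate what hash-level pinning buys, the sketch below (the function name and toy payload are our own, not part of any tool's API) verifies downloaded bytes against a pinned SHA-256 digest — the check that a lock-file-driven installer performs for every artifact before installing it:

```python
import hashlib


def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Return True only if the bytes match the pinned digest."""
    return hashlib.sha256(data).hexdigest() == expected_sha256


# Toy payload standing in for a downloaded package archive.
payload = b"example package contents"
pinned_digest = hashlib.sha256(payload).hexdigest()

# The genuine artifact passes; any modified artifact fails.
assert verify_artifact(payload, pinned_digest)
assert not verify_artifact(b"tampered contents", pinned_digest)
```

Because the digest is recorded in the lock file alongside the exact URL, any instance of the environment resolves to byte-identical artifacts or fails loudly.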

Hardware accelerated environments

Software that involves hardware acceleration on computing resources like GPUs requires additional information for full computational reproducibility. In addition to the computing platform, information about the hardware acceleration device, its supported drivers, and compatible hardware accelerated versions of the software in the environment (GPU enabled builds) is required. Traditionally this has been very difficult to achieve, but multiple recent technological advancements in the scientific open source world (made possible by social agreements and collaborations) now provide solutions to these problems.
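As a sketch of what this looks like in practice, the manifest fragment below declares a CUDA-enabled environment with Pixi. The table and key names follow Pixi's manifest format (older releases use `[project]` in place of `[workspace]`), while the project name and package pins are illustrative assumptions rather than a prescribed setup:

```toml
# pixi.toml — illustrative manifest for a CUDA accelerated environment
[workspace]
name = "ml-analysis"       # hypothetical project name
channels = ["conda-forge"]
platforms = ["linux-64"]

# Declare the CUDA driver support the host machine must provide;
# the solver then selects compatible GPU enabled builds.
[system-requirements]
cuda = "12"

[dependencies]
python = "3.12.*"
pytorch-gpu = "*"          # GPU enabled build from conda-forge
```

Running `pixi install` against such a manifest resolves the environment and writes a lock file pinning every dependency, down to the compiler level, by exact URL and hash.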

References
  1. Arts, R., Zalmstra, B., Vollprecht, W., de Jager, T., Morcotilo, N., & Hofer, J. (n.d.). pixi. https://github.com/prefix-dev/pixi/releases/tag/v0.48.0
  2. The Python Package Index (PyPI). (n.d.). https://pypi.org/
  3. conda-forge community. (2015). The conda-forge Project: Community-based Software Distribution Built on the conda Package Format and Ecosystem. https://doi.org/10.5281/zenodo.4774216
  4. Nickolls, J., Buck, I., Garland, M., & Skadron, K. (2008). Scalable parallel programming with CUDA. ACM SIGGRAPH 2008 Classes. https://doi.org/10.1145/1401132.1401152