distribits #1

Manually merged
msz merged 6 commits from distribits into main 2025-10-26 17:21:39 +00:00
4 changed files with 406 additions and 0 deletions

@ -0,0 +1 @@
../../.git/annex/objects/Vm/GP/SHA256E-s367193--38caa10cc8ee13d55f9e6721600e55d90c2511befca574eec955afdbb55b2ea1.jpg/SHA256E-s367193--38caa10cc8ee13d55f9e6721600e55d90c2511befca574eec955afdbb55b2ea1.jpg


@ -0,0 +1 @@
../../.git/annex/objects/KF/GV/SHA256E-s111766--1f456cb2a78d0d8c35418deb81a256b7481a130fd87216203c60874904bfc277.png/SHA256E-s111766--1f456cb2a78d0d8c35418deb81a256b7481a130fd87216203c60874904bfc277.png


@ -0,0 +1 @@
../../.git/annex/objects/6q/7f/SHA256E-s1629740--60c77d1998ae4920616921eb3c86ee99cad93e3acf77ddfac9e06c832c5d2231.png/SHA256E-s1629740--60c77d1998ae4920616921eb3c86ee99cad93e3acf77ddfac9e06c832c5d2231.png

distribits-2025/index.qmd Normal file

@ -0,0 +1,403 @@
---
title: Compute on demand
subtitle: an fMRIPrep use case
author: "[Michał Szczepanik](https://mszczepanik.eu)"
institute: Forschungszentrum Jülich
date: 2025-10-24
format:
  revealjs:
    footer: "{{< meta title >}} - <https://distribits.live>"
    code-annotations: hover
---
# Introduction
## Special remotes
> Don't envision a special remote as merely a physical place or
> location -- a special-remote is a protocol that defines the
> underlying transport of your files to and/or from a specific
> location.
>
> --- DataLad Handbook, p. 194

To `get` files:

- download from S3, Nextcloud, web...
- extract from archive
- (re)create?!
## Independent implementations
In this talk:

- [git-annex compute](https://git-annex.branchable.com/special_remotes/compute/) (built-in) by JoeyH
- [DataLad remake](https://github.com/datalad/datalad-remake/) (extension / unreleased) by PsyInf

Prior art:

- [DataLad getexec](https://github.com/matrss/datalad-getexec) (extension / unreleased) by Matrss
## Credit
- git-annex & git-annex compute:
  - Joey Hess
- DataLad remake:
  - the Psychoinformatics Group (INM-7, FZ Jülich)
  - Christian Mönch, Gosia Wierzba, Michael Hanke
  - [eBRAIN-Health (HORIZON-INFRA-2021-TECH-01-01, grant no. 101058516)](https://cordis.europa.eu/project/id/101058516)
## Use cases
"Storage is cheap", right?
- provide data in alternative (file) formats (store CSV, provide XLSX on demand)
- render partial data for specific purposes (cut source video into clips)
- apply edits to a photo (RAW to JPEG)
- apply spatial transformations to fMRI images
## Example task (tutorial / comparison)
::: {.callout-note .incremental}
we'll do fMRI later
:::
![](figures/ddorf.jpg)
```
gmic input image.jpg map_clut kodak_kodachrome_64 output kodachromed.jpg
```
::: footer
Photo by [Nicolas Peyrol](https://unsplash.com/@nicolaspeyrol)
on [Unsplash](https://unsplash.com/photos/city-skyline-under-blue-sky-during-daytime-l2VmsBG8nPE)
:::
# git-annex compute
## Compute program
```{.python code-line-numbers="7-9|12,14|16,18|20,22-25" filename=~/.local/bin/git-annex-compute-clut}
#!/usr/bin/env python3
import argparse
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("infile", metavar="in")  # "in" is a Python keyword
parser.add_argument("out")
parser.add_argument("clut")
args = parser.parse_args()

sys.stdout.write(f"INPUT {args.infile}\n")  # request the input file
sys.stdout.flush()
input_file = sys.stdin.readline().rstrip()  # git-annex replies with its path

sys.stdout.write(f"OUTPUT {args.out}\n")  # declare the output file
sys.stdout.flush()
output_tempfile = sys.stdin.readline().rstrip()  # path to write the result to

subprocess.run(
    [
        "gmic",
        "input", input_file,
        "map_clut", args.clut,
        "output", output_tempfile,
    ],
    check=True,  # exit non-zero if gmic fails
)
```
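
The two exchanges in the program above share one pattern: write a request line, flush, read the reply. A minimal sketch of that pattern as a helper follows. `ask` is a hypothetical name, the streams are parameters only so the exchange can be tried without git-annex, and the reply paths are made up.

```python
import io
import sys


def ask(keyword, filename, reply_stream=sys.stdin, request_stream=sys.stdout):
    """Send one request line (e.g. 'INPUT foo.jpg') and return the reply line."""
    request_stream.write(f"{keyword} {filename}\n")
    request_stream.flush()  # without a flush the request may sit in a buffer
    return reply_stream.readline().rstrip("\n")


# Dry run: in-memory streams stand in for git-annex (hypothetical paths).
fake_replies = io.StringIO("/annex/objects/foo.jpg\n/tmp/compute/foo_k64.jpg\n")
requests = io.StringIO()
input_file = ask("INPUT", "foo.jpg", fake_replies, requests)
output_file = ask("OUTPUT", "foo_k64.jpg", fake_replies, requests)
```

In the real program the defaults (`sys.stdin`, `sys.stdout`) would be used, so the calls reduce to `ask("INPUT", args.infile)` and `ask("OUTPUT", args.out)`.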
## In action
prerequisites (to enable remote)
```
git config --global annex.security.allowed-compute-programs \
git-annex-compute-clut
```
usage
```
git annex initremote clut type=compute program=git-annex-compute-clut
git annex addcomputed \
  [--fast] \
  [--reproducible] \
  --to clut \
  foo.jpg foo_k64.jpg kodak_kodachrome_64
```
# DataLad remake
## Compute template
```{.toml code-line-numbers="1|2-6" filename=".datalad/make/methods/clut.toml"}
parameters = ["in", "out", "clut"]
command = [
    "gmic",
    "input", "{in}",
    "map_clut", "{clut}",
    "output", "{out}"
]
```
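
One way to picture how such a template becomes a command line: each `{name}` placeholder is replaced with the matching parameter value. The sketch below uses Python's `str.format` purely for illustration. It is an assumption about the general idea, not DataLad remake's actual implementation, and `render_command` is a hypothetical helper.

```python
# Values copied from the template above.
PARAMETERS = ["in", "out", "clut"]
COMMAND = ["gmic", "input", "{in}", "map_clut", "{clut}", "output", "{out}"]


def render_command(parameters, command, values):
    """Fill the placeholders in `command`, refusing incomplete parameter sets."""
    missing = set(parameters) - set(values)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return [part.format(**values) for part in command]


argv = render_command(
    PARAMETERS,
    COMMAND,
    {"in": "foo.jpg", "out": "foo_k64.jpg", "clut": "kodak_kodachrome_64"},
)
print(argv)
```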
## In action
prerequisites
```
git config --global --add datalad.make.trusted-keys <key-id>
```
make
```
datalad -c commit.gpgsign=true save -m "Add compute template"
datalad -c commit.gpgsign=true make \
  -i foo.jpg \
  -o foo_k64.jpg \
  -p in=foo.jpg \
  -p out=foo_k64.jpg \
  -p clut=kodak_kodachrome_64 \
  clut.toml
```
## What gets recorded?
```{.json filename=".datalad/make/specifications/06a6ca0708e839a5ecea95d6d1bed9a3"}
{
  "input": [
    "foo.jpg"
  ],
  "method": "clut.toml",
  "output": [
    "foo_k64.jpg"
  ],
  "parameter": {
    "clut": "kodak_kodachrome_64",
    "in": "foo.jpg",
    "out": "foo_k64.jpg"
  },
  "stdout": null
}
```
```
datalad-remake:///?label=clut.toml
&root_version=0dc52b9eeca5838144ad07c3766cdc4ef84c37cf
&specification=06a6ca0708e839a5ecea95d6d1bed9a3
&this=foo_k64.jpg
```
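
The URL above is an ordinary query string, so its fields can be pulled apart with the standard library. A small sketch, using only the field names visible in the URL (what each field means internally is not asserted here):

```python
from urllib.parse import parse_qs, urlsplit

# The URL from the slide, joined into one string.
url = (
    "datalad-remake:///?label=clut.toml"
    "&root_version=0dc52b9eeca5838144ad07c3766cdc4ef84c37cf"
    "&specification=06a6ca0708e839a5ecea95d6d1bed9a3"
    "&this=foo_k64.jpg"
)

parts = urlsplit(url)
# parse_qs returns a list per key; each key occurs once here.
fields = {key: values[0] for key, values in parse_qs(parts.query).items()}

for key, value in fields.items():
    print(f"{key:>13}: {value}")
```

The `specification` value matches the file name under `.datalad/make/specifications/` shown above.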
# Comparison
## compute & remake
| | git-annex compute | datalad remake |
|---------------|---------------------------------|------------------------------------|
| specification | program / protocol | template / config |
| branch | git-annex | main (git-annex URL) |
| provision | paths to (annex) objects | secondary git worktree (in `/tmp`) |
| trust | executable in PATH + git config | signed commit + git config |
| submodules | one repo | subdatasets |
| reproducible | option | only |
## what about `datalad run` / `rerun`?
:::: {.columns}
::: {.column width="50%"}
Run record:

- stored in commit message
- used by `rerun`
- may commit
- uses branches (default: current HEAD)
- provenance capture

:::
::: {.column width="50%"}
Make spec:

- stored in file
- used by `get`
- never commits, "slow download"
- always temporary worktree, past state
- storage reduction
- more flexible
:::
::::
# fMRIPrep
## Motivation: spatial normalization
::: {layout-ncol=2}
![](figures/individual-vs-template.png){width=300}
![](figures/templateflow_fig-templates.png){width=400}
:::
transforms: slow to compute ‧ small to store ‧ quick to apply
::: footer
left image: adapted from [fMRIPrep](https://fmriprep.org/en/stable/) docs;
right image: [TemplateFlow](https://www.templateflow.org/) docs
:::
## Enablers
:::: {.columns}
::: {.column width="50%"}
- BIDS
  - Brain Imaging Data Structure
  - standardized file names
  - sidecar metadata
:::
::: {.column width="50%"}
- fMRIPrep
  - state-of-the-art data preprocessing pipeline
  - made for BIDS
  - widely adopted
  - easy to select templates
  - modular (Nipype)
:::
::::
## Dataset
Output of `datalad run fmriprep ...`
``` {.txt code-line-numbers=false code-line-numbers="|2-3|5,7|8-10|8,11"}
[DS~0] /tmp/ds005479-remake-demo
├── inputs/
│   └── [DS~1] ds005479/
└── sub-01/
    ├── anat/
    │   ├── preproc_T1w.nii.gz
    │   └── from-T1w_to-MNI152NLin2009cAsym_xfm.h5               # 90 MB
    └── func/
        ├── from-boldref_to-T1w_desc-coreg_xfm.txt              # 369 B
        ├── from-orig_to-boldref_desc-hmc_xfm.txt               # 84 kB
        └── space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz  # 432 MB 🖜
```
<https://hub.datalad.org/mslw/ds005479-remake-demo>
## Code
```{.python filename="code/resample.py"}
from fmriprep.workflows.bold.apply import init_bold_volumetric_resample_wf
# Step 1: figure out data dependencies, parameters
# Step 2: set up workflow
# Step 3: connect inputs
# Step 4: run the pipeline
# Step 5: tweak file header
```
ca. 360 LOC
<https://hub.datalad.org/mslw/fmriprep-resampling>
[with thanks to Chris Markiewicz for suggestions]{style="font-size: 50%;"}
## Compute template
```{.toml filename=".datalad/make/methods/shortcut.toml"}
parameters = ["target_file"]
command = [
    "python",
    "code/resample.py",
    "{target_file}",
    "inputs/ds005479",
    "."
]
```
::: {.callout-tip}
use pinned (locked) requirements / run inside container
:::
## Data dependencies
```{.txt code-line-numbers=false filename=".datalad/make/inputs/sub-01_task-MID_space-MNI152NLin2009cAsym"}
inputs/ds005479/sub-01/func/sub-01_task-MID_bold.json
inputs/ds005479/sub-01/func/sub-01_task-MID_bold.nii.gz
sub-01/anat/sub-01_from-T1w_to-MNI152NLin2009cAsym_mode-image_xfm.h5
sub-01/fmap/sub-01_fmapid-auto00000_desc-coeff_fieldmap.nii.gz
sub-01/fmap/sub-01_fmapid-auto00000_desc-epi_fieldmap.nii.gz
sub-01/fmap/sub-01_fmapid-auto00000_desc-preproc_fieldmap.json
sub-01/func/sub-01_task-MID_desc-hmc_boldref.nii.gz
sub-01/func/sub-01_task-MID_from-boldref_to-T1w_mode-image_desc-coreg_xfm.txt
sub-01/func/sub-01_task-MID_from-boldref_to-auto00000_mode-image_xfm.txt
sub-01/func/sub-01_task-MID_from-orig_to-boldref_mode-image_desc-hmc_xfm.txt
sub-01/func/sub-01_task-MID_space-MNI152NLin2009cAsym_boldref.nii.gz
sub-01/func/sub-01_task-MID_space-MNI152NLin2009cAsym_desc-brain_mask.nii.gz
sub-01/func/sub-01_task-MID_space-MNI152NLin2009cAsym_desc-preproc_bold.json
```
::: {.callout-note}
this file is temporary and need not be committed
:::
## Prospective instruction
create:
``` {.bash code-line-numbers=false}
TARGET=sub-01_task-MID_space-MNI152NLin2009cAsym
datalad make \
--prospective-execution \
--input-list .datalad/make/inputs/${TARGET} \
--output sub-01/func/${TARGET}_desc-preproc_bold.nii.gz \
--parameter target_file=sub-01/func/${TARGET}_desc-preproc_bold.nii.gz \
shortcut.toml
```
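
To register more than one target, the same command can be assembled programmatically. A sketch that only builds the argument list, using the options shown above: `make_command` is a hypothetical helper, and the naming scheme is copied from this one example rather than derived from BIDS rules.

```python
import shlex


def make_command(subject, task, space, template="shortcut.toml"):
    """Build the `datalad make` argument list for one preprocessed BOLD file."""
    target = f"{subject}_task-{task}_space-{space}"
    output = f"{subject}/func/{target}_desc-preproc_bold.nii.gz"
    return [
        "datalad", "make",
        "--prospective-execution",
        "--input-list", f".datalad/make/inputs/{target}",
        "--output", output,
        "--parameter", f"target_file={output}",
        template,
    ]


argv = make_command("sub-01", "MID", "MNI152NLin2009cAsym")
print(shlex.join(argv))  # shell-quoted form, for inspection
```

Nothing is executed here; inside the dataset, the list could be passed to `subprocess.run(argv, check=True)`.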
then:
``` {.bash code-line-numbers=false}
datalad drop ...
datalad get -s datalad-remake-auto ... # a few minutes
```
## Worth it?
- kept 100 MB, dropped 430 MB --- roughly 300 MB net gain
- × 2--3 output spaces --- 0.5--1 GB gain
- × 2--3 runs --- 2--3+ GB gain
- × 50 subjects --- 0.1 TB for an average study
  - noticeable chunk of a project quota
  - even more for large projects
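
The estimate above, spelled out as a back-of-the-envelope calculation. The inputs are the rounded figures from this slide (about 300 MB saved per file, 2--3 output spaces, 2--3 runs, 50 subjects); real studies will differ.

```python
GAIN_PER_FILE_MB = 300  # rounded net gain per dropped BOLD file
SUBJECTS = 50

# Low and high estimates: 2 or 3 output spaces, 2 or 3 runs per subject.
low_mb = GAIN_PER_FILE_MB * 2 * 2 * SUBJECTS
high_mb = GAIN_PER_FILE_MB * 3 * 3 * SUBJECTS

print(f"per study: {low_mb / 1000:.0f} to {high_mb / 1000:.0f} GB")
```

The range brackets the 0.1 TB quoted for an average study.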
# Coda
## Limitations
- hard to debug (code runs inside special remote)
- very situational
- many-to-many not efficient
- not tested at scale
- the remake that never was:
  - CWL
  - metadata instead of Git repo
## Key messages
- DataLad and git-annex now provide compute-on-demand
- room for further development
- praise fMRIPrep for reproducibility and modularity
- have fun with your examples