distribits #1
4 changed files with 406 additions and 0 deletions
distribits-2025/figures/ddorf.jpg
Symbolic link
@@ -0,0 +1 @@
../../.git/annex/objects/Vm/GP/SHA256E-s367193--38caa10cc8ee13d55f9e6721600e55d90c2511befca574eec955afdbb55b2ea1.jpg/SHA256E-s367193--38caa10cc8ee13d55f9e6721600e55d90c2511befca574eec955afdbb55b2ea1.jpg

distribits-2025/figures/individual-vs-template.png
Symbolic link
@@ -0,0 +1 @@
../../.git/annex/objects/KF/GV/SHA256E-s111766--1f456cb2a78d0d8c35418deb81a256b7481a130fd87216203c60874904bfc277.png/SHA256E-s111766--1f456cb2a78d0d8c35418deb81a256b7481a130fd87216203c60874904bfc277.png

distribits-2025/figures/templateflow_fig-templates.png
Symbolic link
@@ -0,0 +1 @@
../../.git/annex/objects/6q/7f/SHA256E-s1629740--60c77d1998ae4920616921eb3c86ee99cad93e3acf77ddfac9e06c832c5d2231.png/SHA256E-s1629740--60c77d1998ae4920616921eb3c86ee99cad93e3acf77ddfac9e06c832c5d2231.png

distribits-2025/index.qmd
Normal file
@@ -0,0 +1,403 @@
---
title: Compute on demand
subtitle: an fMRIPrep use case
author: "[Michał Szczepanik](https://mszczepanik.eu)"
institute: Forschungszentrum Jülich
date: 2025-10-24
format:
  revealjs:
    footer: "{{< meta title >}} - <https://distribits.live>"
    code-annotations: hover
---

# Introduction

## Special remotes

> Don't envision a special remote as merely a physical place or
> location -- a special-remote is a protocol that defines the
> underlying transport of your files to and/or from a specific
> location.
>
> --- DataLad Handbook, p. 194

To `get` files:

- download from S3, Nextcloud, web...
- extract from archive
- (re)create?!

## Independent implementations

In this talk:

- [git-annex compute](https://git-annex.branchable.com/special_remotes/compute/) (built-in) by JoeyH
- [DataLad remake](https://github.com/datalad/datalad-remake/) (extension / unreleased) by PsyInf

Prior art:

- [DataLad getexec](https://github.com/matrss/datalad-getexec) (extension / unreleased) by Matrss

## Credit

- git-annex & git-annex compute:
  - Joey Hess
- DataLad remake:
  - the Psychoinformatics Group (INM-7, FZ Jülich)
  - Christian Mönch, Gosia Wierzba, Michael Hanke
  - [eBRAIN-Health (HORIZON-INFRA-2021-TECH-01-01, grant no. 101058516)](https://cordis.europa.eu/project/id/101058516)

## Use cases

"Storage is cheap", right?

- provide data in alternative (file) formats (store CSV, provide XLSX on demand)
- render partial data for specific purposes (cut source video into clips)
- apply edits to a photo (RAW to JPEG)
- apply spatial transformations to fMRI images

## Example task (tutorial / comparison)

::: {.callout-note .incremental}
we'll do fMRI later
:::



```
gmic input image.jpg map_clut kodak_kodachrome_64 output kodachromed.jpg
```

::: footer
Photo by [Nicolas Peyrol](https://unsplash.com/@nicolaspeyrol)
on [Unsplash](https://unsplash.com/photos/city-skyline-under-blue-sky-during-daytime-l2VmsBG8nPE)
:::

# git-annex compute

## Compute program

```{.python code-line-numbers="7-9|12,14|16,18|20,22-25" filename="~/.local/bin/git-annex-compute-clut"}
#!/usr/bin/env python3
import argparse
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("input")
parser.add_argument("output")
parser.add_argument("clut")
args = parser.parse_args()

sys.stdout.write(f"INPUT {args.input}\n")  # request an input file
sys.stdout.flush()
input_file = sys.stdin.readline().rstrip()  # path to its content

sys.stdout.write(f"OUTPUT {args.output}\n")  # announce an output file
sys.stdout.flush()
output_tempfile = sys.stdin.readline().rstrip()  # path to write it to

subprocess.run(
    [
        "gmic",
        "input", input_file,
        "map_clut", args.clut,
        "output", output_tempfile,
    ],
    check=True,  # a non-zero exit from gmic fails the computation
)
```
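
The two request/reply exchanges in the program above follow the same pattern, so they can be tried out without git-annex. This is a minimal sketch, not part of git-annex: `request_path` is a hypothetical helper name, the in-memory streams stand in for git-annex, and the reply paths are made up for illustration.

```python
import io


def request_path(keyword, name, stdin, stdout):
    """Send one 'KEYWORD name' line and return the path read back."""
    stdout.write(f"{keyword} {name}\n")
    stdout.flush()
    return stdin.readline().rstrip("\n")


# In-memory streams play the role of git-annex; the two reply paths are invented.
fake_stdin = io.StringIO("/annex/objects/foo.jpg\n/tmp/work/foo_k64.jpg\n")
fake_stdout = io.StringIO()

input_file = request_path("INPUT", "foo.jpg", fake_stdin, fake_stdout)
output_file = request_path("OUTPUT", "foo_k64.jpg", fake_stdin, fake_stdout)

print(fake_stdout.getvalue(), end="")  # the lines the program sent
print(input_file, output_file)
```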

## In action

prerequisites (to enable remote)

```
git config --global annex.security.allowed-compute-programs \
    git-annex-compute-clut
```

usage

```
git annex initremote clut type=compute program=git-annex-compute-clut
git annex addcomputed \
    [--fast] \
    [--reproducible] \
    --to clut \
    foo.jpg foo_k64.jpg kodak_kodachrome_64
```

# DataLad remake

## Compute template
```{.toml code-line-numbers="1|2-6" filename=".datalad/make/methods/clut.toml"}
parameters = ["in", "out", "clut"]
command = [
    "gmic",
    "input", "{in}",
    "map_clut", "{clut}",
    "output", "{out}"
]
```
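
One way to picture what the template does is to fill the `{...}` placeholders with the supplied parameter values. This is a rough sketch under stated assumptions: `render_command` is a hypothetical function, the template is written out as a Python dict rather than parsed from TOML, and Python's `str.format_map` is assumed as the substitution mechanism. It is not how DataLad remake is actually implemented.

```python
# The template from the slide, written as a dict for illustration.
template = {
    "parameters": ["in", "out", "clut"],
    "command": ["gmic", "input", "{in}", "map_clut", "{clut}", "output", "{out}"],
}


def render_command(template, values):
    """Return the command with every placeholder replaced (hypothetical helper)."""
    missing = [p for p in template["parameters"] if p not in values]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return [part.format_map(values) for part in template["command"]]


command = render_command(
    template,
    {"in": "foo.jpg", "out": "foo_k64.jpg", "clut": "kodak_kodachrome_64"},
)
print(command)
```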

## In action

prerequisites

```
git config --global --add datalad.make.trusted-keys <key-id>
```

make

```
datalad -c commit.gpgsign=true save -m "Add compute template"
datalad -c commit.gpgsign=true make \
    -i foo.jpg \
    -o foo_k64.jpg \
    -p in=foo.jpg \
    -p out=foo_k64.jpg \
    -p clut=kodak_kodachrome_64 \
    clut.toml
```

## What gets recorded?

```{.json filename=".datalad/make/specifications/06a6ca0708e839a5ecea95d6d1bed9a3"}
{
  "input": [
    "foo.jpg"
  ],
  "method": "clut.toml",
  "output": [
    "foo_k64.jpg"
  ],
  "parameter": {
    "clut": "kodak_kodachrome_64",
    "in": "foo.jpg",
    "out": "foo_k64.jpg"
  },
  "stdout": null
}
```

```
datalad-remake:///?label=clut.toml
&root_version=0dc52b9eeca5838144ad07c3766cdc4ef84c37cf
&specification=06a6ca0708e839a5ecea95d6d1bed9a3
&this=foo_k64.jpg
```
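
The URL above is wrapped for display. Joined into one string, its fields can be read with the Python standard library. This is only an illustrative sketch: the link between the `specification` field and the file under `.datalad/make/specifications/` is inferred from the matching identifier on this slide.

```python
from urllib.parse import parse_qs, urlsplit

# The URL from the slide, joined into a single string.
url = (
    "datalad-remake:///?label=clut.toml"
    "&root_version=0dc52b9eeca5838144ad07c3766cdc4ef84c37cf"
    "&specification=06a6ca0708e839a5ecea95d6d1bed9a3"
    "&this=foo_k64.jpg"
)

parts = urlsplit(url)
# parse_qs returns a list per key; each key occurs once here.
fields = {key: values[0] for key, values in parse_qs(parts.query).items()}

# Assumption: the specification ID names the recorded JSON file.
spec_path = f".datalad/make/specifications/{fields['specification']}"

print(parts.scheme, fields["label"], fields["this"])
print(spec_path)
```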

# Comparison

## compute & remake

|               | git-annex compute               | datalad remake                     |
|---------------|---------------------------------|------------------------------------|
| specification | program / protocol              | template / config                  |
| branch        | git-annex                       | main (git-annex URL)               |
| provision     | paths to (annex) objects        | secondary git worktree (in `/tmp`) |
| trust         | executable in PATH + git config | signed commit + git config         |
| submodules    | one repo                        | subdatasets                        |
| reproducible  | option                          | only                               |

## what about `datalad run` / `rerun`?

:::: {.columns}

::: {.column width="50%"}
Run record:

- stored in commit message
- used by `rerun`
- may commit
- uses branches (default: current HEAD)
- provenance capture
:::

::: {.column width="50%"}
Make spec:

- stored in file
- used by `get`
- never commits, "slow download"
- always temporary worktree, past state
- storage reduction
- more flexible
:::

::::

# fMRIPrep

## Motivation: spatial normalization

::: {layout-ncol=2}
{width=300}

{width=400}
:::

transforms: slow to compute ‧ small to store ‧ quick to apply

::: footer
left image: adapted from [fMRIPrep](https://fmriprep.org/en/stable/) docs;
right image: [TemplateFlow](https://www.templateflow.org/) docs
:::

## Enablers

:::: {.columns}

::: {.column width="50%"}

- BIDS
  - Brain Imaging Data Structure
  - standardized file names
  - sidecar metadata

:::

::: {.column width="50%"}

- fMRIPrep
  - state-of-the-art data preprocessing pipeline
  - made for BIDS
  - widely adopted
  - easy to select templates
  - modular (Nipype)

:::

::::

## Dataset

Output of `datalad run fmriprep ...`

``` {.txt code-line-numbers="|2-3|5,7|8-10|8,11"}
[DS~0] /tmp/ds005479-remake-demo
├── inputs/
│   └── [DS~1] ds005479/
└── sub-01/
    ├── anat/
    │   ├── preproc_T1w.nii.gz
    │   └── from-T1w_to-MNI152NLin2009cAsym_xfm.h5              # 90 MB
    └── func/
        ├── from-boldref_to-T1w_desc-coreg_xfm.txt              # 369 B
        ├── from-orig_to-boldref_desc-hmc_xfm.txt               # 84 kB
        └── space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz  # 432 MB 🖜
```

<https://hub.datalad.org/mslw/ds005479-remake-demo>

## Code

```{.python filename="code/resample.py"}
from fmriprep.workflows.bold.apply import init_bold_volumetric_resample_wf

# Step 1: figure out data dependencies, parameters

# Step 2: set up workflow

# Step 3: connect inputs

# Step 4: run the pipeline

# Step 5: tweak file header
```

ca. 360 LOC

<https://hub.datalad.org/mslw/fmriprep-resampling>

[with thanks to Chris Markiewicz for suggestions]{style="font-size: 50%;"}

## Compute template

```{.toml filename=".datalad/make/methods/shortcut.toml"}
parameters = ["target_file"]
command = [
    "python",
    "code/resample.py",
    "{target_file}",
    "inputs/ds005479",
    "."
]
```

::: {.callout-tip}
use pinned (locked) requirements / run inside container
:::

## Data dependencies

```{.txt code-line-numbers=false filename=".datalad/make/inputs/sub-01_task-MID_space-MNI152NLin2009cAsym"}
inputs/ds005479/sub-01/func/sub-01_task-MID_bold.json
inputs/ds005479/sub-01/func/sub-01_task-MID_bold.nii.gz
sub-01/anat/sub-01_from-T1w_to-MNI152NLin2009cAsym_mode-image_xfm.h5
sub-01/fmap/sub-01_fmapid-auto00000_desc-coeff_fieldmap.nii.gz
sub-01/fmap/sub-01_fmapid-auto00000_desc-epi_fieldmap.nii.gz
sub-01/fmap/sub-01_fmapid-auto00000_desc-preproc_fieldmap.json
sub-01/func/sub-01_task-MID_desc-hmc_boldref.nii.gz
sub-01/func/sub-01_task-MID_from-boldref_to-T1w_mode-image_desc-coreg_xfm.txt
sub-01/func/sub-01_task-MID_from-boldref_to-auto00000_mode-image_xfm.txt
sub-01/func/sub-01_task-MID_from-orig_to-boldref_mode-image_desc-hmc_xfm.txt
sub-01/func/sub-01_task-MID_space-MNI152NLin2009cAsym_boldref.nii.gz
sub-01/func/sub-01_task-MID_space-MNI152NLin2009cAsym_desc-brain_mask.nii.gz
sub-01/func/sub-01_task-MID_space-MNI152NLin2009cAsym_desc-preproc_bold.json
```

::: {.callout-note}
this file is temporary and need not be committed
:::

## Prospective instruction

create:

``` {.bash code-line-numbers=false}
TARGET=sub-01_task-MID_space-MNI152NLin2009cAsym

datalad make \
    --prospective-execution \
    --input-list .datalad/make/inputs/${TARGET} \
    --output sub-01/func/${TARGET}_desc-preproc_bold.nii.gz \
    --parameter target_file=sub-01/func/${TARGET}_desc-preproc_bold.nii.gz \
    shortcut.toml
```

then:

``` {.bash code-line-numbers=false}
datalad drop ...
datalad get -s datalad-remake-auto ... # a few minutes
```
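
To register the same instruction for several subjects or tasks, the `datalad make` call above could be assembled programmatically. The sketch below only builds and prints the command, using the options shown on this slide. `build_make_command` is a hypothetical helper, and the file naming pattern is assumed to generalise from the single `sub-01` / `MID` example.

```python
import shlex


def build_make_command(subject, task, space, template="shortcut.toml"):
    """Build the argument list for one prospective `datalad make` call."""
    target = f"sub-{subject}_task-{task}_space-{space}"
    output = f"sub-{subject}/func/{target}_desc-preproc_bold.nii.gz"
    return [
        "datalad", "make",
        "--prospective-execution",
        "--input-list", f".datalad/make/inputs/{target}",
        "--output", output,
        "--parameter", f"target_file={output}",
        template,
    ]


command = build_make_command("01", "MID", "MNI152NLin2009cAsym")
print(shlex.join(command))  # pass `command` to subprocess.run() to execute it
```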

## Worth it?

- kept 100 MB, dropped 430 MB --- 300 MB gain
- × 2--3 output spaces --- 0.5--1 GB gain
- × 2--3 runs --- 2--3+ GB gain
- × 50 subjects --- 0.1 TB for an average study
  - noticeable chunk of a project quota
  - even more for large projects
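
The estimate above can be reproduced as a back-of-the-envelope calculation. The sketch takes the rounded 300 MB per-file gain from the first bullet as given and multiplies it through the ranges on this slide. `total_gain_gb` is a name made up for illustration, and 1 GB is taken as 1000 MB.

```python
GAIN_PER_FILE_MB = 300  # rounded per-file gain, taken from the slide


def total_gain_gb(spaces, runs, subjects, gain_per_file_mb=GAIN_PER_FILE_MB):
    """Storage saved, in GB, if every resampled BOLD file is dropped."""
    return gain_per_file_mb * spaces * runs * subjects / 1000


low = total_gain_gb(spaces=2, runs=2, subjects=50)
high = total_gain_gb(spaces=3, runs=3, subjects=50)
print(f"{low:.0f}-{high:.0f} GB")  # prints "60-135 GB", i.e. on the order of 0.1 TB
```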

# Coda

## Limitations

- hard to debug (code runs inside special remote)
- very situational
- many-to-many not efficient
- not tested at scale
- the remake that never was:
  - CWL
  - metadata instead of Git repo

## Key messages

- DataLad and git-annex now provide compute-on-demand
- room for further development
- praise fMRIPrep for reproducibility and modularity
- have fun with your examples