datalad-handbook/docs/usecases/reproducible_neuroimaging_analysis.rst

.. index:: ! Usecase; Reproducible Neuroimaging
.. _usecase_reproduce_neuroimg:

An automatically and computationally reproducible neuroimaging analysis from scratch
------------------------------------------------------------------------------------

This use case sketches the basics of a portable analysis that can be
automatically computationally reproduced, starting from the
acquisition of a neuroimaging dataset with a magnetic resonance imaging (MRI)
scanner up to complete data analysis results:

#. Two extension packages, `datalad-container <https://github.com/datalad/datalad-container>`_
   and `datalad-neuroimaging <https://github.com/datalad/datalad-neuroimaging>`_
   extend DataLad's functionality with the ability to work with computational
   containers and neuroimaging data workflows.
#. The analysis is conducted in a way that leaves comprehensive provenance
   (including software environments) all the way from the raw data, and
   structures study components in a way that facilitates reuse.
#. It starts with preparing a raw data (dicom) dataset, and subsequently uses
   the prepared data for a general linear model (GLM) based analysis.
#. After completion, data and results are archived, and disk usage of the
   dataset is maximally reduced.

This use case is adapted from the
`ReproIn/DataLad tutorial <https://www.repronim.org/ohbm2018-training/03-01-reproin>`_
by Michael Hanke and Yaroslav Halchenko, given at the 2018 OHBM training course
ran by `ReproNim <https://www.repronim.org>`_.

The Challenge
^^^^^^^^^^^^^

Allan is an exemplary neuroscientist and researcher. He has spent countless
hours diligently learning not only the statistical methods for his research
questions and the software tools for his computations, but also taught
himself about version control and data standards in neuroimaging, such as
the brain imaging data structure (`BIDS <https://bids.neuroimaging.io>`_).
For his final PhD project, he patiently acquires functional MRI data of many
participants, and prepares it according to the BIDS standard afterwards.
It takes him a full week of time and two failed attempts, but he
eventually has a `BIDS-compliant <https://bids-standard.github.io/bids-validator>`_
dataset.

When he writes his analysis scripts he takes extra care to responsibly
version control every change. He happily notices how much cleaner his
directories are, and how he and others can transparently see how his code
evolved. Once everything is set up, he runs his analysis using large and
complex neuroscientific software packages that he installed on his computer a
few years back. Finally, he writes a paper and publishes his findings in a
prestigious peer-reviewed journal. His data and code can be accessed by
others easily, as he makes them publicly available. Colleagues and
supervisors admire him for his wonderful contribution to open science.

However, a few months after publication, Allan starts to get
emails by that report that his scripts do not produce the same
results as the ones reported in the publication. Startled and confused he
investigates what may be the issue. After many sleepless nights he realizes:
The software he used was fairly old! More recent versions of the same
software compute results slightly different, changed function's names, or
fixed discovered bugs in the underlying source code. Shocked, he realizes that
his scripts are even incompatible with the most recent release of the software
package he used and throw an error. Luckily, he can quickly fix this by
adding information about the required software versions to the ``README`` of his
project, and he is grateful for colleagues and other scientists that provide
adjusted versions of his code for more recent software releases. In the end,
his results prove to be robust regardless of software version. But while
Allen shared code and data, not including any information about his *software*
environment prevented his analysis from becoming *computationally*
reproducible.

The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

Even if an analysis workflow is fully captured and version-controlled, and
data and code are being linked, an analysis may not reproduce. Comprehensive
*computational* reproducibility requires that also the *software* involved
in an analysis and its precise versions need to be known.
DataLad can help with this. Using the ``datalad-containers`` extension,
complete software environments can be captured in computational containers,
added to (and thus shared together with) datasets, and linked with commands
and outputs they were used for.

Step-by-Step
^^^^^^^^^^^^

The first part of this Step-by-Step guide details how to computationally
reproducibly and automatically reproducibly perform data preparation from raw
`DICOM <https://www.dicomstandard.org>`_ files to BIDS-compliant
`NIfTI <https://nifti.nimh.nih.gov>`_ images. The actual analysis, a
first-level GLM for a localization task, is performed in the second part. A
final paragraph shows how to prepare the dataset for the afterlife.

For this use case, two DataLad extensions are required:

- `datalad-container <https://github.com/datalad/datalad-container>`_ and
- `datalad-neuroimaging <https://github.com/datalad/datalad-neuroimaging>`_

You can install them via ``pip`` like this::

   $ pip install datalad-neuroimaging datalad-container

Data Preparation
""""""""""""""""

We start by creating a home for the raw data:

.. runrecord:: _examples/repro2-101
   :language: console
   :workdir: usecases/repro2

   $ datalad create localizer_scans
   $ cd localizer_scans

For this example, we use a number of publicly available DICOM files. Luckily,
at the time of data acquisition, these DICOMs were already equipped with the
relevant metadata: Their headers contain all necessary information to
identify the purpose of individual scans and encode essential properties to
create a BIDS compliant dataset from them. The DICOMs are stored on Github
(as a Git repository [#f1]_), so they can be installed as a subdataset. As
they are the raw inputs of the analysis, we store them in a directory we call
``inputs/raw``.

.. runrecord:: _examples/repro2-102
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad clone --dataset . \
    https://github.com/datalad/example-dicom-functional.git  \
    inputs/rawdata

The :dlcmd:`subdatasets` reports the installed dataset to be indeed
a subdataset of the superdataset ``localizer_scans``:

.. runrecord:: _examples/repro2-103
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad subdatasets

Given that we have obtained *raw* data, this data is not yet ready for data
analysis. Prior to performing actual computations, the data needs to be
transformed into appropriate formats and standardized to an intuitive layout.
For neuroimaging, a useful transformation is a transformation from
DICOMs into the NIfTI format, a format specifically designed for scientific
analyses of brain images. An intuitive layout is the BIDS standard.
Performing these transformations and standardizations, however, requires
software. For the task at hand, `HeudiConv <https://heudiconv.readthedocs.io>`_,
a DICOM converter, is our software of choice. Beyond converting DICOMs, it
also provides assistance in converting a raw data set to the BIDS standard,
and it integrates with DataLad to place converted and original data under
Git/Git-annex version control, while automatically annotating files with
sensitive information (e.g., non-defaced anatomicals, etc).

To take extra care to know exactly what software is used both to be
able to go back to it at a later stage should we have the
need to investigate an issue, and to capture *full* provenance of the
transformation process, we are using a software container that contains the
relevant software setup.
A ready-made `singularity <https://singularity.lbl.gov>`_ container is
available from `singularity-hub <https://singularity-hub.org>`_ at
`shub://ReproNim/ohbm2018-training:heudiconvn <shub://ReproNim/ohbm2018-training:heudiconvn>`_.

Using the :dlcmd:`containers-add` command we can add this container
to the ``localizer_scans`` superdataset. We are giving it the name
``heudiconv``.

.. runrecord:: _examples/repro2-104
   :language: console
   :workdir: usecases/repro2/localizer_scans
   :realcommand: datalad containers-add heudiconv --call-fmt 'singularity exec -B {{pwd}} {img} {cmd}'  --url shub://ReproNim/ohbm2018-training:heudiconvn

   $ datalad containers-add heudiconv --url shub://ReproNim/ohbm2018-training:heudiconvn

The command :dlcmd:`containers-list` can verify that this worked:

.. runrecord:: _examples/repro2-105
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad containers-list

Great. The dataset now tracks all of the input data *and* the computational
environment for the DICOM conversion. Thus far, we have a complete record of
all components. Let's stay transparent, but also automatically reproducible
in the actual data conversion by wrapping the necessary ``heudiconv`` command
seen below::

   $ heudiconv -f reproin -s 02 -c dcm2niix -b -l "" --minmeta -a . \
    -o /tmp/heudiconv.sub-02 --files inputs/rawdata/dicoms

within a :dlcmd:`containers-run` command.
To save time, we will only transfer one subjects data (``sub-02``, hence the
subject identifier ``-s 02`` in the command). Note that the output below is
how it indeed should look like -- the software we are using in this example
produces very wordy output.

.. runrecord:: _examples/repro2-106
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad containers-run -m "Convert sub-02 DICOMs into BIDS" \
     --container-name heudiconv \
     'heudiconv -f reproin -s 02 -c dcm2niix -b -l "" --minmeta -a . -o /tmp/heudiconv.sub-02 --files inputs/rawdata/dicoms'

Find out what changed after this command by comparing the most recent commit
by DataLad (i.e., ``HEAD``) to the previous one (i.e., ``HEAD~1``) with
:dlcmd:`diff`:

.. runrecord:: _examples/repro2-107
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad diff -f HEAD~1

As expected, DICOM files of one subject were converted into NIfTI files,
**and** the outputs follow the BIDS standard's layout and naming conventions!
But what's even better is that this action and the relevant software
environment was fully recorded.

There is only one thing missing before the functional imaging data can be
analyzed: A stimulation protocol, so that we know what stimulation was done
at which point during the scan.
Thankfully, the data was collected using an implementation that exported
this information directly in the BIDS events.tsv format. The file came with
our DICOM dataset and can be found at ``inputs/rawdata/events.tsv``. All we need
to do is copy it to the right location under the BIDS-mandated name. To keep
track of where this file came from, we will also wrap the copying into a
:dlcmd:`run` command. The ``{inputs}`` and ``{outputs}``
placeholders can help to avoid duplication in the command call:

.. runrecord:: _examples/repro2-108
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ datalad run -m "Import stimulation events" \
     --input inputs/rawdata/events.tsv \
     --output sub-02/func/sub-02_task-oneback_run-01_events.tsv \
     cp {inputs} {outputs}

``git log`` shows what information DataLad captured about this command's
execution:

.. runrecord:: _examples/repro2-109
   :language: console
   :workdir: usecases/repro2/localizer_scans

   $ git log -n 1


Analysis execution
""""""""""""""""""

Since the raw data are reproducibly prepared in BIDS standard we can now go
further and conduct an analysis. For this example, we will implement a very
basic first-level GLM analysis using `FSL <https://fsl.fmrib.ox.ac.uk>`__
that takes only a few minutes to run. As before, we will capture all provenance:
inputs, computational environments, code, and outputs.

Following the YODA principles [#f2]_, the analysis is set up in a new
dataset, with the input dataset ``localizer_scans`` as a subdataset:

.. runrecord:: _examples/repro2-110
   :language: console
   :workdir: usecases/repro2/localizer_scans

   # get out of localizer_scans
   $ cd ../

   $ datalad create glm_analysis
   $ cd glm_analysis

We install ``localizer_scans`` by providing its path as a ``--source`` to
:dlcmd:`install`:

.. runrecord:: _examples/repro2-111
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad clone -d . \
     ../localizer_scans \
     inputs/rawdata

:dlcmd:`subdatasets` reports the number of installed subdatasets
again:

.. runrecord:: _examples/repro2-112
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad subdatasets

We almost forgot something really useful: Structuring the dataset with
the help of DataLad! Luckily, procedures such as ``yoda`` can not only be
applied upon creating of a dataset (as in :ref:`createDS`), but also with the
:dlcmd:`run-procedure` command (as in :ref:`procedures`)

.. runrecord:: _examples/repro2-113
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad run-procedure cfg_yoda


The analysis obviously needs custom code. For the simple GLM analysis with
FSL we use:

#. A small script to convert BIDS-formatted ``events.tsv`` files into the
   ``EV3`` format FSL understands, available at
   `https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh <https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh>`_

#. An FSL analysis configuration template script, available at
   `https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf <https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf>`_

These script should be stored and tracked inside the dataset within ``code/``.
The :dlcmd:`download-url` command downloads these scripts *and*
records where they were obtained from:

.. runrecord:: _examples/repro2-114
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad download-url  --path code/ \
     https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/events2ev3.sh \
     https://raw.githubusercontent.com/myyoda/ohbm2018-training/master/section23/scripts/ffa_design.fsf

The commit message that DataLad created shows the URL where each script has
been downloaded from:

.. runrecord:: _examples/repro2-115
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ git log -n 1

Prior to the actual analysis, we need to run the ``events2ev3.sh`` script to
transform inputs into the format that FSL expects. The :dlcmd:`run`
makes this maximally reproducible and easy, as the files given as
``--inputs`` and ``--outputs`` are automatically managed by DataLad.

.. runrecord:: _examples/repro2-116
   :workdir: usecases/repro2/glm_analysis
   :language: console

   $ datalad run -m 'Build FSL EV3 design files' \
     --input inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_events.tsv \
     --output 'sub-02/onsets' \
     bash code/events2ev3.sh sub-02 {inputs}

The dataset now contains and manages all of the required inputs, and we're
ready for FSL. Since FSL is not a simple program, we make sure to record the
precise software environment for the analysis with
:dlcmd:`containers-run`. First, we get a container with FSL in the
version we require:


.. runrecord:: _examples/repro2-117
   :workdir: usecases/repro2/glm_analysis
   :language: console
   :realcommand: datalad containers-add fsl --call-fmt 'singularity exec -B {{pwd}} {img} {cmd}'  --url shub://mih/ohbm2018-training:fsl

   $ datalad containers-add fsl --url shub://mih/ohbm2018-training:fsl

As the analysis setup is now complete, let's label this state of the dataset:

.. runrecord:: _examples/repro2-118
   :workdir: usecases/repro2/glm_analysis
   :language: console

   $ datalad save --version-tag ready4analysis

All we have left is to configure the desired first-level GLM analysis with
FSL. At this point, the template contains placeholders for the ``basepath``
and the subject ID, and they need to be replaced.
The following command uses the arcane, yet powerful :term:`sed` editor to do
this. We will again use :dlcmd:`run` to invoke our command so that
we store in the history how this template was generated (so that we may
audit, alter, or regenerate this file in the future — fearlessly).

.. runrecord:: _examples/repro2-119
   :workdir: usecases/repro2/glm_analysis
   :language: console

   $ datalad run \
    -m "FSL FEAT analysis config script" \
    --output sub-02/1stlvl_design.fsf \
    bash -c 'sed -e "s,##BASEPATH##,{pwd},g" -e "s,##SUB##,sub-02,g" \
    code/ffa_design.fsf > {outputs}'

To compute the analysis, a simple ``feat sub-02/1stlvl_design.fsf`` command
is wrapped into a ``datalad containers-run`` command with appropriate
``--input`` and ``--output`` specification:

.. runrecord:: _examples/repro2-120
   :language: console
   :workdir: usecases/repro2/glm_analysis
   :lines: 1-12, 356-

   $ datalad containers-run --container-name fsl -m "sub-02 1st-level GLM" \
     --input sub-02/1stlvl_design.fsf \
     --input sub-02/onsets \
     --input inputs/rawdata/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
     --output sub-02/1stlvl_glm.feat \
     feat {inputs[0]}

Once this command finishes, DataLad will have captured the entire FSL output,
and the dataset will contain a complete record all the way from the input BIDS
dataset to the GLM results. The BIDS subdataset in turn has a
complete record of all processing down from the raw DICOMs onwards.

.. importantnote:: Many files need more planning

   See how many files were created and added in this computation of a single
   participant? If your study has many participants, analyses like the one above
   could inflate your dataset. Please check out the chapter :ref:`chapter_gobig`.
   in particular the section :ref:`big_analysis` for tips and tricks on how to
   create analyses datasets that scale.


Archive data and results
""""""""""""""""""""""""

After study completion it is important to properly archive data and results,
for example, for future inquiries by reviewers or readers of the associated
publication. Thanks to the modularity of the study units, this tasks is easy
and avoids needless duplication.

The raw data is tracked in its own dataset (``localizer_scans``) that only
needs to be archived once, regardless of how many analysis are using it as
input. This means that we can “throw away” this subdataset copy within this
analysis dataset. DataLad can reobtain the correct version at any point in
the future, as long as the recorded location remains accessible.

To make sure we're not deleting information we are not aware of,
:dlcmd:`diff` and :gitcmd:`log` can help to verify that the
subdataset is in the same state as when it was initially added:

.. runrecord:: _examples/repro2-121
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad diff -- inputs

The command does not show any output, thus indicating that there is indeed no
difference. ``git log`` confirms that the only action that was performed on
``inputs/`` was the addition of it as a subdataset:

.. runrecord:: _examples/repro2-122
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ git log -- inputs

Since the state of the subdataset is exactly the state of the original
``localizer_scans`` dataset it is safe to uninstall it.

.. runrecord:: _examples/repro2-123
   :language: console
   :workdir: usecases/repro2/glm_analysis

   $ datalad uninstall --dataset . inputs --recursive

Prior to archiving the results, we can go one step further and verify their
computational reproducibility. DataLad's ``rerun`` command is
capable of “replaying” any recorded command. The following command
re-executes the FSL analysis by rerunning everything since the dataset was
tagged as ``ready4analysis``). It will record the recomputed results in a
separate Git branch named ``verify``. Afterwards, we can automatically
compare these new results to the original ones in the ``master`` branch. We
will see that all outputs can be reproduced in bit-identical form. The only
changes are observed in log files that contain volatile information, such as
time steps.

.. runrecord:: _examples/repro2-124
   :language: console
   :workdir: usecases/repro2/glm_analysis
   :lines: 1-17, 362-

   $ datalad rerun --branch verify --onto ready4analysis --since ready4analysis

.. runrecord:: _examples/repro2-125
   :language: console
   :workdir: usecases/repro2/glm_analysis

   # check that we are now on the new `verify` branch
   $ git branch

.. runrecord:: _examples/repro2-126
   :language: console
   :workdir: usecases/repro2/glm_analysis

   # compare which files have changes with respect to the original results
   $ git diff master --stat


.. runrecord:: _examples/repro2-127
   :language: console
   :workdir: usecases/repro2/glm_analysis

   # switch back to the master branch and remove the `verify` branch
   $ git checkout master
   $ git branch -D verify

The outcome of this use case can be found as a dataset on Github
`here <https://github.com/myyoda/demo-dataset-glmanalysis>`_.


.. rubric:: Footnotes

.. [#f1] "Why can such data exist as a Git repository, shouldn't large files
         be always stored outside of Git?" you may ask. The DICOMs exist in a
         Git-repository for a number of reasons: First, it makes them easily
         available for demonstrations and tutorials without involving DataLad
         at all. Second, the DICOMs are *comparatively* small: 21K per file.
         Importantly, the repository is not meant to version control those
         files *and* future states or derivatives and results obtained from
         them -- this would bring a Git repositories to its knees.

.. [#f2] To re-read everything about the YODA principles, checkout out section
         :ref:`yoda`.