.. index:: ! Usecase; Collaboration
.. _usecase_collab:

A typical collaborative data management workflow
------------------------------------------------

This use case sketches the basics of a common, collaborative
data management workflow for an analysis:

#. A 3rd party dataset is obtained to serve as input for an analysis.
#. Data processing is collaboratively performed by two colleagues.
#. Upon completion, the results are published alongside the original data
   for further consumption.

The data types and methods mentioned in this use case belong to the scientific
field of neuroimaging, but the basic workflow is domain-agnostic.

The Challenge
^^^^^^^^^^^^^

Bob is a new PhD student and about to work on his first analysis.
He wants to use an open dataset as the input for his analysis, so he asks
a friend who has worked with the same dataset for the data and gets it
on a hard drive.
Later, he's stuck with his analysis. Luckily, Alice, a senior grad
student in the same lab, offers to help him. He sends his script to
her via email and hopes she finds the solution to his problem. She
responds a week later with the fixed script, but in the meantime
Bob has already made several changes of his own to the script.
Identifying and integrating her fix into his slightly changed script
takes him half a day. When he finally finishes his analysis, he wants to
publish code and data online, but cannot find a way to share his data
together with his code.

The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

Bob creates his analysis project as a DataLad dataset. Complying with
the :ref:`YODA principles <yoda>`, he creates his scripts in a dedicated
``code/`` directory, and clones the open dataset as a standalone
DataLad subdataset within a dedicated subdirectory.
To collaborate with Alice, a senior grad student in his lab, he shares the
dataset on the lab's SSH server, and they can collaborate on the
version-controlled dataset almost in real time, with no need for Bob to
spend much time integrating the fix that Alice provides him with.
Afterwards, Bob can execute his scripts in a way that captures
all provenance for his results with a :dlcmd:`run` command.
Bob can share his whole project after completion by creating a sibling
on a web server, and pushing all of his dataset, including the input data,
to this sibling, for everyone to access and recompute.

Step-by-Step
^^^^^^^^^^^^

Bob creates a DataLad dataset for his analysis project to live in.
Because he knows about the YODA principles, he configures the dataset
to be a YODA dataset right at the time of creation:

.. runrecord:: _examples/collab-101
   :workdir: usecases/collab
   :language: console

   $ datalad create -c yoda --description "my 1st phd project on work computer" myanalysis

After creation, there already is a ``code/`` directory, and all of its
contents are version-controlled by :term:`Git` instead of :term:`git-annex`
thanks to the yoda procedure:

.. runrecord:: _examples/collab-102
   :workdir: usecases/collab
   :language: console

   $ cd myanalysis
   $ tree

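One way to see which files stay in Git is to ask Git for the ``annex.largefiles`` attribute of a path. The following is a minimal sketch that does not need DataLad: it builds a throwaway repository and writes the attribute rule by hand. The rule shown (``* annex.largefiles=nothing`` in ``code/.gitattributes``) is an assumption about what the yoda procedure configures, and the temporary directory is purely illustrative.

```shell
set -eu
demo=$(mktemp -d)
git init -q "$demo/myanalysis"
cd "$demo/myanalysis"
mkdir code

# Assumption: this mimics the rule the yoda procedure places in code/
echo '* annex.largefiles=nothing' > code/.gitattributes

# Ask Git which annex.largefiles value applies to the script path
attr=$(git check-attr annex.largefiles code/run_analysis.py)
echo "$attr"
```

In a real dataset only the ``git check-attr`` line would be needed. A value of ``nothing`` means that git-annex treats no file matching the rule as a large file, so such files are committed to Git directly.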
.. index::
   pair: clone; DataLad command

Bob knows that a DataLad dataset can contain other datasets. He also knows that
as any content of a dataset is tracked and its precise state is recorded,
this is a powerful method to specify and later resolve data dependencies,
and that including the dataset as a standalone data component will also
make it easier to keep his analysis organized and share it later.
The dataset that Bob wants to work with is structural brain imaging data from the
`studyforrest project <https://www.studyforrest.org>`_, a public
data resource that the original authors share as a DataLad dataset through
:term:`GitHub`. This means that Bob can simply clone the relevant dataset from this
service into his own dataset. To do that, he clones it as a subdataset
into a directory he calls ``src/``, as he wants to make it obvious which parts
of his analysis steps and code require 3rd party data:

.. runrecord:: _examples/collab-103
   :workdir: usecases/collab/myanalysis
   :language: console

   $ datalad clone -d . https://github.com/psychoinformatics-de/studyforrest-data-structural.git src/forrest_structural

Now that he has executed this command, Bob has access to the entire dataset
content, and the precise version of the dataset is linked to his top-level dataset
``myanalysis``. However, no data was actually downloaded (yet). Bob very much
appreciates that DataLad datasets primarily contain information on a dataset’s
content and where to obtain it: the clone above completed quickly, and
would remain relatively lean even for a dataset that contains
several hundred GBs of data. He knows that his script can obtain the
relevant data he needs on demand if he wraps it into a :dlcmd:`run`
command, and therefore does not need to care about getting the data yet. Instead,
he focuses on writing his script ``code/run_analysis.py``.
To save his progress, he runs frequent :dlcmd:`save` commands:

.. runrecord:: _examples/collab-104
   :workdir: usecases/collab/myanalysis
   :language: console
   :realcommand: echo "#! /usr/bin/env python" > code/run_analysis.py && datalad save -m "First steps: start analysis script" code/run_analysis.py

   $ datalad save -m "First steps: start analysis script" code/run_analysis.py

Once Bob's analysis is finished, he can wrap it into :dlcmd:`run`.
To ease execution, he first makes his script executable by adding a :term:`shebang`
that specifies Python as an interpreter at the start of his script, and giving it
executable :term:`permissions`:

.. runrecord:: _examples/collab-105
   :workdir: usecases/collab/myanalysis
   :language: console

   $ chmod +x code/run_analysis.py
   $ datalad save -m "make script executable"

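The interplay of shebang and executable bit can be tried in isolation. This is a small sketch under stated assumptions: the file ``run_analysis.sh`` and its content are hypothetical stand-ins, and ``/bin/sh`` is used as the interpreter only so that the example runs without Python. Bob's script would carry a Python shebang such as ``#! /usr/bin/env python`` instead.

```shell
set -eu
demo=$(mktemp -d)

# Hypothetical stand-in script; the first line is the shebang
cat > "$demo/run_analysis.sh" <<'EOF'
#!/bin/sh
echo "analysis placeholder"
EOF

# A freshly written file normally lacks the executable bit
[ -x "$demo/run_analysis.sh" ] || echo "not executable yet"

chmod +x "$demo/run_analysis.sh"

# Invoked by path, the file is handed to the interpreter named in the shebang
output=$("$demo/run_analysis.sh")
echo "$output"
```

With both pieces in place, the script can be called by its path alone, which is the form Bob passes to :dlcmd:`run`.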
Importantly, prior to a :dlcmd:`run`, he specifies the necessary
inputs such that DataLad can take care of the data retrieval for him:

.. runrecord:: _examples/collab-106
   :workdir: usecases/collab/myanalysis
   :language: console
   :realcommand: datalad run -m "run first part of analysis workflow" --input "src/forrest_structural/sub-01/anat/sub-01_T1w.nii.gz" --output results.txt "code/run_analysis.py"

   $ datalad run -m "run first part of analysis workflow" \
     --input "src/forrest_structural" \
     --output results.txt \
     "code/run_analysis.py"

This will take care of retrieving the data, running Bob's script, and
saving all outputs.

Some time later, Bob needs help with his analysis. He turns to Alice, a senior
grad student in his lab, for help. Alice and Bob both work on the same computing server.
Bob has told Alice in which directory he keeps his analysis dataset, and
the directory is configured to have :term:`permissions` that allow
read access for all lab members, so Alice can obtain Bob’s work directly
from his home directory:

.. runrecord:: _examples/collab-107
   :workdir: usecases/collab
   :language: console
   :realcommand: echo "$ datalad clone "$BOBS_HOME/myanalysis" bobs_analysis" && datalad clone "myanalysis" bobs_analysis

.. runrecord:: _examples/collab-108
   :workdir: usecases/collab
   :language: console
   :realcommand: cd bobs_analysis && echo "some contribution" >> code/run_analysis.py && datalad save

   $ cd bobs_analysis
   # ... make contributions, and save them
   $ [...]
   $ datalad save -m "you're welcome, bob"

Alice can get the studyforrest data Bob used as an input as well as the
result file, and she can also rerun his analysis by using :dlcmd:`rerun`.
She goes ahead and fixes Bob's script, and saves the changes. To integrate her
changes into his dataset, Bob registers Alice's dataset as a sibling:

.. runrecord:: _examples/collab-109
   :workdir: usecases/collab/myanalysis
   :language: console
   :realcommand: echo "$ datalad siblings add -s alice --url '$ALICES_HOME/bobs_analysis'" && datalad siblings add -s alice --url '../bobs_analysis'

   # in Bob's home directory

Afterwards, he can get her changes with a :dlcmd:`update --merge`
command:

.. runrecord:: _examples/collab-110
   :workdir: usecases/collab/myanalysis
   :language: console

   $ datalad update -s alice --merge

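Because a DataLad dataset is also a Git repository, the sibling-and-merge step has a rough plain-Git analogue: add a remote, fetch from it, and merge its branch. The sketch below replays the exchange with two throwaway repositories and Git only. It is an illustration of the idea, not a description of everything the DataLad commands do, and the demo identity, the temporary directory, and the file ``run_analysis.sh`` are hypothetical.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so that commits work in a clean environment
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Bob's repository with a first version of the script
git init -q myanalysis
mkdir myanalysis/code
echo 'echo "step 1"' > myanalysis/code/run_analysis.sh
git -C myanalysis add code
git -C myanalysis commit -q -m "First steps: start analysis script"
branch=$(git -C myanalysis rev-parse --abbrev-ref HEAD)

# Alice clones Bob's repository, contributes a fix, and commits it
git clone -q myanalysis bobs_analysis
echo 'echo "fix from alice"' >> bobs_analysis/code/run_analysis.sh
git -C bobs_analysis commit -q -am "you're welcome, bob"

# Bob registers Alice's clone as a remote, fetches, and merges her branch
git -C myanalysis remote add alice "$work/bobs_analysis"
git -C myanalysis fetch -q alice
git -C myanalysis merge -q "alice/$branch"

cat myanalysis/code/run_analysis.sh
```

Since Bob made no commits of his own after Alice cloned, the merge here is a simple fast-forward. Had both changed the script, Git would combine the two histories and only ask for manual work where the edits touch the same lines.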
.. index::
   pair: create-sibling; DataLad command

Finally, when Bob is ready to share his results with the world or a remote
collaborator, he makes his dataset available by uploading it to a web server
via SSH. Bob does so by creating a sibling for the dataset on the server, to
which the dataset can be published and later also updated.

.. code-block:: bash

   # this generates a sibling for the dataset and all subdatasets
   $ datalad create-sibling --recursive -s public "$SERVER_URL"

Once the remote sibling is created and registered under the name “public”,
Bob can publish his version to it.

.. code-block:: bash

   $ datalad push -r --to public .

This workflow allowed Bob to obtain data, collaborate with Alice, and publish
or share his dataset with others easily -- he cannot wait for his next project,
given that this workflow made his life so simple.