.. _usecase_neuroimaging_datasets:

**********************
A Neuroimaging Dataset
**********************

.. todo::

   Currently, this is a left over. Later, we can rework this into something,
   but it's unclear yet what ;-)

This section is a concise demonstration of what a DataLad dataset is,
showcased on a dataset from the field of neuroimaging.

A DataLad dataset is the core data type of DataLad. We will explore its concepts
with one public example dataset, the studyforrest phase 2 data (studyforrest.org).
Note that this is just one type and use of a DataLad dataset, and that you will
encounter many more flavors of using DataLad datasets throughout the basics and
in upcoming use cases.

Please follow along and run the commands below in your own terminal for
a hands-on experience.

.. runrecord:: _examples/dataset
   :language: console
   :workdir: usecases/studyforrest

   $ datalad install https://github.com/psychoinformatics-de/studyforrest-data-phase2.git

Once installed, a DataLad dataset looks like any other directory on your file system:

.. runrecord:: _examples/dataset2
   :language: console
   :workdir: usecases/studyforrest
   :lines: 1-2, 8-18

   $ cd studyforrest-data-phase2
   $ ls # output below is only an excerpt from ls

However, all files and directories within the DataLad dataset can be
tracked (should you want them to be tracked), regardless of their size.
Large content is tracked in an *annex* that is automatically
created and handled by DataLad. Whether text files or larger files change,
all of these changes can be written to your DataLad dataset's history.

.. gitusernote:: Large-file tracking

   A DataLad dataset is a Git repository. Large file content in the
   dataset's annex is tracked with git-annex. An ``ls -a``
   reveals that Git is secretly working in the background:

   .. runrecord:: _examples/dataset3
      :language: console
      :lines: 1, 5-11, 15-25
      :emphasize-lines: 3, 5-6, 8
      :workdir: usecases/studyforrest/studyforrest-data-phase2

      $ ls -a # show also hidden files (excerpt)

Users can *create* new DataLad datasets from scratch, or install existing
DataLad datasets from paths, URLs, or open-data collections. This makes
sharing and accessing data fast and easy. Moreover, when sharing or installing
a DataLad dataset, all copies also include the dataset's history. An installed DataLad
dataset knows the dataset it was installed from, and if the original
DataLad dataset changes, the installed dataset can simply be updated.
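
As a minimal sketch of such an update (an addition to the recorded examples
in this section), the hypothetical helper ``update_from_origin`` below wraps
:dlcmd:`update`. It assumes a recent DataLad version that provides the
``--how`` option; older releases spelled this ``--merge``.

.. code-block:: bash

   # Hypothetical helper (the name is an assumption, not a DataLad command):
   # fetch changes from the sibling the dataset was installed from and
   # merge them into the local dataset.
   update_from_origin() {
       local dataset="${1:-.}"   # defaults to the current directory
       datalad update -d "$dataset" --how merge
   }

For example, ``update_from_origin studyforrest-data-phase2`` would bring the
installed dataset up to date with the dataset it was installed from.
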

You can view the DataLad dataset's history with tools of your choice.
The code block below illustrates the history and is an excerpt
from :gitcmd:`log`.

.. runrecord:: _examples/dataset4
   :language: console
   :lines: 1-10
   :workdir: usecases/studyforrest/studyforrest-data-phase2

   $ git log --oneline --graph --decorate

Dataset content identity and availability information
=====================================================
Upon installation of a DataLad dataset, DataLad retrieves only (small) metadata
information about the dataset. This exposes the dataset's file hierarchy
for exploration, and speeds up the installation of a DataLad dataset
of many TB in size to a few seconds. Just after installation, the dataset is
small in size:

.. runrecord:: _examples/dataset5
   :language: console
   :workdir: usecases/studyforrest/studyforrest-data-phase2

   $ du -sh

This is because only small files are present locally. To see this for yourself, try
opening both a small ``.tsv`` file in the root of the dataset
and a larger compressed ``nifti`` (``nii.gz``) file in one of its subdirectories.
The small ``.tsv`` file (1.9K) exists and can be opened locally,
but the content of the large, compressed ``nifti`` file
is not yet present. In this state, you cannot open or work with the nifti file, but you can
explore which files exist without the potentially large download.

.. runrecord:: _examples/dataset6
   :language: console
   :emphasize-lines: 3
   :workdir: usecases/studyforrest/studyforrest-data-phase2

   $ ls participants.tsv sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz

The retrieval of the actual, potentially large
file content can happen at any later time for the full dataset or subsets
of files. Let's get the nifti file:

.. runrecord:: _examples/dataset7
   :language: console
   :workdir: usecases/studyforrest/studyforrest-data-phase2

   $ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz

Wasn't this easy?
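
Retrieved content can also be released again to free up disk space. The
sketch below is an addition to the recorded examples: the hypothetical helper
``with_content`` retrieves a file with :dlcmd:`get`, runs a command of your
choice on it, and afterwards removes the local copy with :dlcmd:`drop`.
The helper's name and the retrieve-use-release pattern are assumptions made
for illustration, not a workflow prescribed by DataLad.

.. code-block:: bash

   # Hypothetical helper: make file content available, run a command on
   # the file, then drop the local copy again.
   with_content() {
       local file="$1"; shift
       datalad get "$file" || return 1   # retrieve the file content
       "$@" "$file"                      # run the given command on the file
       local rc=$?                       # remember the command's exit status
       datalad drop "$file"              # free the disk space again
       return "$rc"
   }

Inside the dataset, a call such as
``with_content sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz ls -lL``
would list the file while its content is present. Passing a directory such as
``sub-01/ses-movie/func/`` instead of a single file should retrieve (and
later drop) that whole subset of files.
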

Dataset nesting
===============
Within DataLad datasets, one can *nest* other DataLad
datasets arbitrarily deep. This may not seem particularly spectacular;
after all, any directory on a file system can have other directories inside it.
The possibility of nesting datasets, however, is one of the many advantages
DataLad datasets have:

Any lower-level DataLad dataset (the *subdataset*) has a stand-alone
history. The top-level DataLad dataset (the *superdataset*) only stores
*which version* of the subdataset is currently used.

By taking advantage of dataset nesting, one can take a dataset such as the
studyforrest phase-2 data and install it as a subdataset within a
superdataset containing analysis code and results computed from the
studyforrest data. Should the studyforrest data get extended or changed,
the subdataset can easily be updated to include the changes. More
detailed examples of this can be found in the use cases in the last
section (for example in :ref:`usecase_reproducible_paper`).
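
A minimal sketch of such a setup, assuming the hypothetical helper name
``new_analysis`` and the hypothetical target path ``inputs/studyforrest``
(neither is taken from the recorded examples): it creates a superdataset with
:dlcmd:`create` and registers the input data inside it with
``datalad clone -d .``, where ``-d .`` names the superdataset that should
record the clone as its subdataset.

.. code-block:: bash

   # Hypothetical helper: create a superdataset and register input data
   # as a subdataset inside it.
   new_analysis() {
       local name="$1" data_url="$2"
       datalad create "$name" || return 1
       # Run the clone from inside the superdataset; "-d ." makes the
       # superdataset record the clone as a subdataset.
       ( cd "$name" && datalad clone -d . "$data_url" inputs/studyforrest )
   }

For example,
``new_analysis myanalysis https://github.com/psychoinformatics-de/studyforrest-data-phase2.git``
would create ``myanalysis`` with the studyforrest data nested below it.
Running ``datalad subdatasets`` inside ``myanalysis`` afterwards should list
the registered subdataset.
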
The figure below illustrates dataset nesting in a neuroimaging context
schematically:

.. figure:: ../artwork/src/img/virtual_dirtree.svg
   :alt: Virtual directory tree of a nested DataLad dataset


Creating your own dataset
=========================
Anyone can create, populate, and optionally share a *new* DataLad dataset.
A new DataLad dataset is always created empty, even if the target
directory already contains additional files or directories. After creation,
arbitrarily large amounts of data can be added. Once files are added and
saved to the dataset, any changes made to these data files can be saved
to the history.
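
The create-add-save cycle described above can be sketched as follows. The
helper name ``create_and_save`` and the file content are assumptions made for
illustration. The sketch also assumes the ``text2git`` configuration
procedure, which keeps text files in Git rather than in the annex so that
they remain directly editable after saving.

.. code-block:: bash

   # Hypothetical helper: create a dataset, add a file, and record two
   # states of that file in the dataset's history.
   create_and_save() {
       local path="$1"
       # text2git: text files are tracked by Git and stay editable
       datalad create -c text2git "$path" || return 1
       (
           cd "$path" || exit 1
           echo "Notes on this dataset" > README.md
           datalad save -m "Add a README"        # first recorded state
           echo "A later addition" >> README.md
           datalad save -m "Extend the README"   # the change is saved as well
       )
   }

After ``create_and_save my-dataset``, a ``git log --oneline`` inside
``my-dataset`` should show both save messages on top of the commits made
during dataset creation.
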

.. gitusernote:: Create internals

   Creation of datasets relies on the :gitcmd:`init` and :gitannexcmd:`init` commands.

As already shown, an existing DataLad dataset can simply be installed
from a URL or path, or from the DataLad open-data collection.

.. gitusernote:: Install internals

   :dlcmd:`install` uses the :gitcmd:`clone` command.