.. _riastore:

Remote indexed archives for dataset storage and backup
------------------------------------------------------

If DataLad datasets should be backed up, made available for collaboration
with others, or stored or managed in a central location,
:term:`remote indexed archive (RIA) store`\s may be a suitable solution.
RIA stores are dataset storage locations that allow for access to and
collaboration on DataLad datasets. They are flat, flexible, file-system based
repository representations of any number of datasets, and they can exist on all
standard computing infrastructure, be it personal computers, servers, compute
clusters, or even supercomputing infrastructure -- even on machines that do not
have DataLad installed.

Technical details
^^^^^^^^^^^^^^^^^

RIA stores can be created or extended with a single command from within any
dataset. DataLad datasets can subsequently be published into the datastore as a
means of backing up a dataset or creating a dataset sibling to collaborate on
with others. Alternatively, datasets can be cloned and updated from a RIA store
just as from any other dataset location.

The subsection :ref:`riaworkflows` below demonstrates RIA-store
related functionality. But prior to introducing the user-facing commands, this
section starts by explaining the layout and general concept of a RIA store.

Layout
""""""

RIA stores store DataLad datasets. Both the layout of the RIA store and the
layout of the datasets in the RIA store differ from typical dataset layouts,
though. If one were to take a look inside of a RIA store as it is set up by
default, one would see a directory that contains a flat subdirectory tree with
datasets represented as :term:`bare Git repositories` and an annex. Usually,
looking inside of RIA stores is not necessary for RIA-related workflows, but it
can help to grasp the concept of these stores.

The first level of subdirectories in this RIA store tree consists of the first
three characters of the :term:`dataset ID`\s of the datasets that lie in the
store, and the second level of subdirectories contains the remaining characters
of the dataset IDs.
Thus, the first two levels of subdirectories in the tree are split
dataset IDs of the datasets that are stored in them [#f1]_. The code block below
illustrates what a single DataLad dataset looks like in a RIA store, and the
dataset ID of the dataset (``946e8cac-432b-11ea-aac8-f0d5bf7b5561``) is
highlighted:

.. code-block::
   :emphasize-lines: 2-3, 18-41

   /path/to/my_riastore
   ├── 946
   │   └── e8cac-432b-11ea-aac8-f0d5bf7b5561
   │       ├── annex
   │       │   └── objects
   │       │       ├── 6q
   │       │       │   └── mZ
   │       │       │       └── MD5E-s93567133--7c93fc5d0b5f197ae8a02e5a89954bc8.nii.gz
   │       │       │           └── MD5E-s93567133--7c93fc5d0b5f197ae8a02e5a89954bc8.nii.gz
   │       │       ├── 6v
   │       │       │   └── zK
   │       │       │       └── MD5E-s2043924480--47718be3b53037499a325cf1d402b2be.nii.gz
   │       │       │           └── MD5E-s2043924480--47718be3b53037499a325cf1d402b2be.nii.gz
   │       │       ├── [...]
   │       │       └── [...]
   │       ├── archives
   │       │   └── archive.7z
   │       ├── branches
   │       ├── config
   │       ├── description
   │       ├── HEAD
   │       ├── hooks
   │       │   ├── applypatch-msg.sample
   │       │   ├── [...]
   │       │   └── update.sample
   │       ├── info
   │       │   └── exclude
   │       ├── objects
   │       │   ├── 05
   │       │   │   └── 3d25959223e8173497fa7f747442b72c31671c
   │       │   ├── 0b
   │       │   │   └── 8d0edbf8b042998dfeb185fa2236d25dd80cf9
   │       │   ├── [...]
   │       │   │   └── [...]
   │       │   ├── info
   │       │   └── pack
   │       ├── refs
   │       │   ├── heads
   │       │   │   ├── git-annex
   │       │   │   └── main
   │       │   └── tags
   │       ├── ria-layout-version
   │       └── ria-remote-ebce196a-b057-4c96-81dc-7656ea876234
   │           └── transfer
   ├── error_logs
   └── ria-layout-version

If a second dataset gets published to the RIA store, it will be represented in a
similar tree structure underneath its individual dataset ID.
If *subdatasets* of a dataset are published into a RIA store, they are not
represented *underneath* their superdataset, but are stored on the same hierarchy
level as any other dataset. Thus, the dataset representation in a RIA store is
completely flat [#f2]_.
With this hierarchy-free setup, the location of a particular dataset in the RIA
store depends only on its :term:`dataset ID`. As the dataset ID is universally
unique, gets assigned to a dataset at the time of creation, and does not change
across the lifetime of a dataset, no two datasets can have the same location
in a RIA store.
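The mapping from a dataset ID to its location in the store is therefore a plain
string split. A minimal shell sketch (the store path is illustrative; the split
uses Bash substring expansion):

```shell
# Split a dataset ID into the two directory levels used by a RIA store:
# the first three characters, then the remainder.
id="946e8cac-432b-11ea-aac8-f0d5bf7b5561"
store="/path/to/my_riastore"
dataset_dir="$store/${id:0:3}/${id:3}"
echo "$dataset_dir"
# -> /path/to/my_riastore/946/e8cac-432b-11ea-aac8-f0d5bf7b5561
```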
The directory underneath the two dataset-ID-based subdirectories contains a
*bare git repository* (highlighted above as well) that is a :term:`clone` of the
dataset.

.. find-out-more:: What is a bare Git repository?

   A bare Git repository is a repository that contains the contents of the
   ``.git`` directory of regular DataLad datasets or Git repositories, but no
   worktree or checkout. This has advantages: The repository is leaner, it is
   easier for administrators to perform garbage collections, and, unlike a
   repository with a checked-out branch, it can be pushed to at all times. You
   can find out more on what bare repositories are and how to use them
   `here <https://git-scm.com/book/en/v2/Git-on-the-Server-Getting-Git-on-a-Server>`__.

   Note that bare Git repositories can be cloned, and the clone of a bare Git
   repository will have a checkout and a worktree, thus resuming the shape that
   you are familiar with.

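The bare-versus-non-bare distinction can be seen with two Git commands; a quick
sketch (assuming a local Git installation, paths created in a temporary
directory):

```shell
# Create a bare repository and a regular clone of it, then compare the two:
# the bare repo has no worktree, the clone does.
tmp=$(mktemp -d)
cd "$tmp"
git init --bare store.git >/dev/null
git -C store.git rev-parse --is-bare-repository   # -> true
git clone store.git work 2>/dev/null
git -C work rev-parse --is-bare-repository        # -> false
ls store.git   # HEAD, config, objects, refs, ... (the usual ".git" contents)
```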
Inside of the bare :term:`Git` repository, the ``annex`` directory -- just as in
any standard dataset or repository -- contains the dataset's keystore (object
tree) under ``annex/objects`` [#f3]_. In conjunction, keystore and bare Git
repository constitute the original dataset -- just differently represented, with
no *working tree*, i.e., the directory hierarchy that exists in the original
dataset, and without the name it was created under, but stored under its dataset
ID instead.

If necessary, the keystore (annex) can be put into (compressed)
`7zipped <https://7-zip.org>`_ archives (``archives/``), either for compression
gains, or for use on HPC systems with
`inode <https://en.wikipedia.org/wiki/Inode>`_ limitations [#f4]_.
Despite being 7zipped, those archives can be indexed and support
relatively fast random read access. Thus, the entire keystore can be put into an
archive, reusing the exact same directory structure, and remains fully
accessible while only using a handful of inodes, regardless of file
number and size. If the dataset contains only annexed files, a complete dataset
can be represented in about 25 inodes.
A detailed example and utility script can be found at
`knowledge-base.psychoinformatics.de/kbi/0024 <https://knowledge-base.psychoinformatics.de/kbi/0024>`_.
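Since every file and directory consumes one inode, the accounting is easy to
reason about. A toy sketch with fabricated store contents (a single packed
archive instead of thousands of annex keys):

```shell
# A mock store entry whose annex has been packed into a single 7z archive:
# the whole keystore then costs one inode instead of one per key.
store=$(mktemp -d)
mkdir -p "$store/946/e8cac-432b-11ea-aac8-f0d5bf7b5561/archives"
touch "$store/946/e8cac-432b-11ea-aac8-f0d5bf7b5561/archives/archive.7z"
# One inode per listed path: root + two ID levels + archives/ + archive.7z = 5
inodes=$(find "$store" | wc -l | tr -d ' ')
echo "$inodes inodes"
```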
Taking all of the above information together, on an infrastructural level,
a RIA store is fully self-contained, and is a plain file system storage, not a
database. Everything inside of a RIA store is either a file, a directory, or
a zipped archive. It can thus be set up on any infrastructure that has a file
system with directory and file representation, and it has barely any additional
software requirements (see below). Access to datasets in the store can be managed
by using file system :term:`permissions`.
With these attributes, a RIA store is a suitable solution for a number of
use cases (back-up, single- or multi-user dataset storage, central point for
collaborative workflows, ...), be that on private workstations, web servers,
compute clusters, or other IT infrastructure.

.. find-out-more:: Software Requirements

   On the RIA store hosting infrastructure, only 7z needs to be installed if the
   archive feature is desired. Specifically, no :term:`Git`, no :term:`git-annex`,
   and no otherwise running daemons are necessary.
   If the RIA store is set up remotely, the server needs to be SSH-accessible.

   On the client side, you need DataLad.

git-annex ORA-remote special remotes
""""""""""""""""""""""""""""""""""""

On a technical level, beyond being a directory tree of datasets, a RIA store
is by default a :term:`git-annex` ORA-remote (optional remote access) special
remote of a dataset. This makes it possible to store not only the history of a
dataset, but also all annexed contents.

.. find-out-more:: What is a special remote?

   A `special-remote <https://git-annex.branchable.com/special_remotes>`_ is an
   extension to Git's concept of remotes, and can enable git-annex to transfer
   data to and from places that are not Git repositories (e.g., cloud services
   or external machines such as an HPC system). Don't envision a special-remote
   as a physical place or location -- a special-remote is just a protocol that
   defines the underlying *transport* of your files *to* and *from* a specific
   location.

The git-annex ora-remote special remote is referred to as a "storage sibling" of
the original dataset. It is similar to git-annex's built-in
`directory <https://git-annex.branchable.com/special_remotes/directory>`_
special remote (but works remotely and uses the ``hashdir_mixed`` [#f2]_ keystore
layout). Thanks to the git-annex ora-remote, RIA stores can provide regular
git-annex key storage, and retrieval of keys from (compressed) 7z archives in
the RIA store works as well. Put simply, annexed contents of datasets can only
be pushed into RIA stores if they have a git-annex ora-remote.

Certain applications will not require special remote features. The use case
:ref:`usecase_HCP_dataset`
shows an example where git-annex key storage is explicitly not wanted.
Other applications may require *only* the special remote, such as cases where
Git is not installed on the RIA store hosting infrastructure.
For most storage or back-up scenarios, however, special remote capabilities are
useful, and thus the default.

.. index::
   pair: create-sibling-ria; DataLad command

The command :dlcmd:`create-sibling-ria` can create both datasets in RIA stores
and the RIA stores themselves.
However, :dlcmd:`create-sibling-ria` sets up a new RIA store if it does not find
one under the provided URL **only** if the parameter ``--new-store-ok`` is
passed.
By default, the command will automatically create a dataset representation in a
RIA store and configure a sibling to allow publishing to the RIA store and
updating from it.
With special remote capabilities enabled, the command will automatically create
the special remote as a ``storage-sibling`` and link it to the RIA sibling.
With the sibling and special remote set up, upon an invocation of
:dlcmd:`push --to <sibling>`, the complete dataset contents, including
annexed contents, will be published to the RIA store, with no further setup or
configuration required [#f6]_.

To disable the storage sibling completely, invoke :dlcmd:`create-sibling-ria`
with the argument ``--storage-sibling=off``.
To create a RIA store with *only* special remote storage, you can invoke
:dlcmd:`create-sibling-ria` with the argument ``--storage-sibling=only``.

Advantages of RIA stores
""""""""""""""""""""""""

Storing datasets in RIA stores has a number of advantages that align well with
the demands of central dataset management on shared compute infrastructure, but
it is also well suited for most back-up and storage applications.
In a RIA store layout, the first two levels of subdirectories can host any
number of keystores and bare repositories. As datasets are identified via ID and
stored *next to each other* underneath the top-level RIA store directory, the
store is completely flexible and extendable, and regardless of the number or
nature of datasets inside of the store, a RIA store keeps a homogeneous directory
structure. This aids the handling of large numbers of repositories, because
unique locations are derived from *dataset/repository properties* (their ID)
rather than a dataset name or a location in a complex dataset hierarchy.
Because the dataset representation in the RIA store is a bare repository,
"house-keeping" as well as query tasks can be automated or performed by data
management personnel with no domain-specific knowledge about dataset contents.
Short maintenance scripts can be used to automate basically any task that is
of interest and possible in a dataset, but across the full RIA store.
A few examples are:

- Copy or move annex objects into a 7z archive.
- Find dataset dependencies across all stored datasets by returning the dataset
  IDs of subdatasets recorded in each dataset.
- Automatically return the number of commits in each repository.
- Automatically return the author and time of the last dataset update.
- Find all datasets associated with specific authors.
- Clean up unnecessary files and minimize one (or all) repositories with
  :term:`Git`'s `garbage collection (gc) <https://git-scm.com/docs/git-gc>`_
  command.
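Because the store is just a predictable directory tree, such scripts need very
little machinery. A sketch that reassembles the dataset IDs of all stored
datasets from the two-level layout (the store contents, including the second ID,
are fabricated here for illustration):

```shell
# Fabricate a store with two dataset entries, then recover their IDs
# from the <3-char>/<rest> directory split.
store=$(mktemp -d)
mkdir -p "$store/946/e8cac-432b-11ea-aac8-f0d5bf7b5561"
mkdir -p "$store/a23/0f8b2-0000-4444-8888-123456789abc"
ids=$(for repo in "$store"/???/*/; do
    # first three ID characters come from the parent, the rest from the leaf
    echo "$(basename "$(dirname "$repo")")$(basename "$repo")"
done)
echo "$ids"
```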
The use case :ref:`usecase_datastore` demonstrates the advantages of this in a
large scientific institute with central data management.
Due to the git-annex ora-remote special remote, datasets can also be exported
and stored as archives to save disk space.

.. todo::

   link to ukb chapter as example

.. _riaworkflows:

RIA store workflows
^^^^^^^^^^^^^^^^^^^

The user-facing commands for interactions with a RIA store are barely different
from standard DataLad workflows. The paragraphs below detail how to create and
populate a RIA store, how to clone datasets and retrieve data from it, and also
how to handle permissions or hide technicalities.

.. index::
   pair: create-sibling-ria; DataLad command

Creating or publishing to RIA stores
""""""""""""""""""""""""""""""""""""

A dataset can be added into an existing or not yet existing RIA store by
running the :dlcmd:`create-sibling-ria` command, and subsequently published into
the store using :dlcmd:`push`.
Just like for the :dlcmd:`siblings add` command, an arbitrary sibling name
(with the ``-s/--name`` option) and a URL to the location of the store (as a
positional argument) need to be specified for :dlcmd:`create-sibling-ria`.
In the case of RIA stores, the URL takes the form of a ``ria+`` URL, and the
exact form of this URL depends on where the RIA store exists (or should exist),
or rather, which file transfer protocol (``SSH`` or ``file``) is used:

- A URL to an :term:`SSH`\-accessible server has a ``ria+ssh://`` prefix, followed
  by user and hostname specification and an **absolute** path:
  ``ria+ssh://[user@]hostname/absolute/path/to/ria-store``
- A URL to a store on a local file system has a ``ria+file://`` prefix,
  followed by an **absolute** path: ``ria+file:///absolute/path/to/ria-store``

.. find-out-more:: RIA stores with HTTP access

   Setting up a RIA store with access via HTTP requires additional server-side
   configurations for Git.
   `Git's http-backend documentation <https://git-scm.com/docs/git-http-backend>`_
   can point you to the relevant configurations for your web server and use case.

Note that it is always required to specify an :term:`absolute path` in the URL!

In addition, as a convenience for cloning, you can supply an ``--alias``
parameter with a name under which the dataset can later be cloned from the
store.

.. importantnote:: If you code along, make sure to check the next find-out-more!

   The upcoming demonstration of RIA stores uses the ``DataLad-101`` dataset
   that was created throughout the Basics of this handbook.
   If you want to execute these code snippets on a ``DataLad-101``
   dataset you created, the modification described in the find-out-more below
   needs to be done first.

.. find-out-more:: If necessary, adjust the submodule path!

   Back in :ref:`subdspublishing`, in order to appropriately reference and link
   subdatasets on hosting sites such as :term:`GitHub`, we adjusted the
   submodule path of the subdataset in ``.gitmodules`` to point to a published
   subdataset on GitHub:

   .. runrecord:: _examples/DL-101-147-101
      :language: console
      :workdir: dl-101/DataLad-101
      :emphasize-lines: 9

      # in DataLad-101
      $ cat .gitmodules

   Later in this demonstration we would like to publish the subdataset to a
   RIA store and retrieve it automatically from this store -- retrieval is only
   attempted from a store, however, if no other working source is known.
   Therefore, we will remove the reference to the published dataset prior to
   this demonstration and replace it with the path it was originally referenced
   under.

   .. runrecord:: _examples/DL-101-147-102
      :language: console
      :workdir: dl-101/DataLad-101

      # in DataLad-101
      $ datalad subdatasets --contains midterm_project --set-property url ./midterm_project

To demonstrate the basic process, we will create a RIA store on a local file
system and publish the ``DataLad-101`` dataset from the handbook's "Basics"
section to it. In the example below, the RIA sibling gets the name
``ria-backup``.
The URL uses the ``file`` protocol and points with an absolute path to the not
yet existing directory ``myriastore``.
Make sure that the ``--new-store-ok`` parameter is set to allow the creation of
a new store.

.. runrecord:: _examples/DL-101-147-103
   :language: console
   :workdir: dl-101/DataLad-101

   # inside of the dataset DataLad-101
   $ datalad create-sibling-ria -s ria-backup --alias dl-101 --new-store-ok "ria+file://$HOME/myriastore"

Afterwards, the dataset has two additional siblings: ``ria-backup`` and
``ria-backup-storage``.

.. runrecord:: _examples/DL-101-147-104
   :language: console
   :workdir: dl-101/DataLad-101

   $ datalad siblings

The storage sibling is the git-annex ora-remote and is set up automatically --
unless :dlcmd:`create-sibling-ria` is run with ``--storage-sibling=off``.
By default, it has the name of the RIA sibling, suffixed with ``-storage``,
but alternative names can be supplied with the ``--storage-name`` option.

.. find-out-more:: Take a look into the store

   Right after running this command, a RIA store has been created in the
   specified location:

   .. runrecord:: _examples/DL-101-147-105
      :language: console
      :workdir: dl-101/DataLad-101

      $ tree $HOME/myriastore

   Note that there is one dataset represented in the RIA store. The
   two-directory structure it is represented under corresponds to the dataset
   ID of ``DataLad-101``:

   .. runrecord:: _examples/DL-101-147-106
      :language: console
      :workdir: dl-101/DataLad-101

      # The dataset ID is stored in .datalad/config
      $ cat .datalad/config

In order to publish the dataset's history and all its contents into the RIA
store, a single :dlcmd:`push` to the RIA sibling suffices:

.. runrecord:: _examples/DL-101-147-107
   :language: console
   :workdir: dl-101/DataLad-101

   $ datalad push --to ria-backup

.. find-out-more:: Take another look into the store

   Now that dataset contents have been pushed to the RIA store, the bare
   repository contains them, although their representation is not
   human-readable. But worry not -- this representation only exists in the RIA
   store. When cloning this dataset from the RIA store, the clone will be in
   its standard human-readable format.

   .. runrecord:: _examples/DL-101-147-108
      :language: console
      :workdir: dl-101/DataLad-101
      :lines: 1-25, 38-

      $ tree $HOME/myriastore

A second dataset can be added and published to the store in the very same way.
As a demonstration, we'll do it for the ``midterm_project`` subdataset:

.. runrecord:: _examples/DL-101-147-109
   :language: console
   :workdir: dl-101/DataLad-101

   $ cd midterm_project
   $ datalad create-sibling-ria -s ria-backup ria+file://$HOME/myriastore

.. runrecord:: _examples/DL-101-147-110
   :language: console
   :workdir: dl-101/DataLad-101/midterm_project

   $ datalad push --to ria-backup

.. find-out-more:: Take a look into the RIA store after a second dataset has been added

   With creating a RIA sibling to the RIA store and publishing the contents of
   the ``midterm_project`` subdataset to the store, a second dataset has been
   added to the datastore. Note how it is represented on the same hierarchy
   level as the previous dataset, underneath its dataset ID (note that the
   output is cut off for readability):

   .. runrecord:: _examples/DL-101-147-111
      :language: console
      :workdir: dl-101/DataLad-101/midterm_project

      $ cat .datalad/config

   .. runrecord:: _examples/DL-101-147-112
      :language: console
      :workdir: dl-101/DataLad-101
      :lines: 1-25, 38-58

      $ tree $HOME/myriastore

Thus, in order to create and populate RIA stores, only the commands
:dlcmd:`create-sibling-ria` and :dlcmd:`push` are required.

.. index::
   pair: clone; DataLad command

Cloning and updating from RIA stores
""""""""""""""""""""""""""""""""""""

Cloning from RIA stores is done via :dlcmd:`clone` from a ``ria+`` URL,
suffixed with a dataset identifier.
Depending on the protocol being used, the URLs are composed similarly to those
used during sibling creation:

- A URL to a RIA store on an :term:`SSH`\-accessible server takes the
  same format as before:
  ``ria+ssh://[user@]hostname/absolute/path/to/ria-store``
- A URL to a RIA store on a local file system also looks like during sibling
  creation: ``ria+file:///absolute/path/to/ria-store``
- A URL for read (without annex) access to a store via :term:`http` (e.g., to a
  RIA store like `store.datalad.org <https://store.datalad.org>`_, through which
  the :ref:`HCP dataset is published <usecase_HCP_dataset>`) looks like this:
  ``ria+https://store.datalad.org/absolute/path/to/ria-store``

The appropriate ``ria+`` URL needs to be suffixed with a ``#`` sign and a
dataset identifier. One way this can be done is via the dataset ID.
Here is how to clone the ``DataLad-101`` dataset from the RIA store using its
dataset ID:

.. runrecord:: _examples/DL-101-147-120
   :language: console
   :workdir: beyond_basics
   :realcommand: echo "$ datalad clone ria+file://$HOME/myriastore#$(datalad -C $HOME/dl-101/DataLad-101 -f'{infos[dataset][id]}' wtf) myclone" && datalad clone ria+file://$HOME/myriastore#$(datalad -C $HOME/dl-101/DataLad-101 -f'{infos[dataset][id]}' wtf) myclone
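The ``ria+`` URL used here is assembled from the store path and the dataset's
ID, which every dataset records in ``.datalad/config``. A plain-shell sketch of
that assembly (the config file is fabricated here to mirror the ``DataLad-101``
ID shown earlier):

```shell
# Read the dataset ID from .datalad/config and build a RIA clone URL from it.
ds=$(mktemp -d)
mkdir -p "$ds/.datalad"
printf '[datalad "dataset"]\n\tid = 946e8cac-432b-11ea-aac8-f0d5bf7b5561\n' \
    > "$ds/.datalad/config"
id=$(sed -n 's/^[[:space:]]*id = //p' "$ds/.datalad/config")
echo "ria+file://$HOME/myriastore#$id"
```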
There are two downsides to this method: For one, it is hard to type and
remember the dataset ID of a desired dataset. Secondly, if no additional path
is given to :dlcmd:`clone`, the resulting dataset clone would be named after
its ID.
An alternative, therefore, is to use an *alias* for the dataset. This is an
alternative dataset identifier that a dataset in a RIA store can be configured
with -- either with a parameter at the time of running
``datalad create-sibling-ria`` as done above, or manually afterwards. For
example, given that the dataset also has the alias ``dl-101``, the above call
would simplify to

.. code-block:: bash

   $ datalad clone ria+file://$HOME/myriastore#~dl-101

.. find-out-more:: Configure an alias for a dataset manually

   In order to define an alias for an individual dataset in a store, one needs
   to create an ``alias/`` directory in the root of the datastore and place
   a :term:`symlink` of the desired name to the dataset inside of it. Here is
   how it is done for the midterm project dataset:

   First, create an ``alias/`` directory in the store, if it doesn't yet exist:

   .. runrecord:: _examples/DL-101-147-121
      :language: console
      :workdir: beyond_basics
      :realcommand: echo "$ mkdir $HOME/myriastore/alias"

   Afterwards, place a :term:`symlink` with a name of your choice to the
   dataset inside of it. Here, we create a symlink called ``midterm_project``:

   .. runrecord:: _examples/DL-101-147-122
      :language: console
      :workdir: beyond_basics
      :realcommand: echo "$ ln -s $HOME/myriastore/$(datalad -C $HOME/dl-101/DataLad-101/midterm_project -f'{infos[dataset][id]}' wtf | sed 's/^\(...\)\(.*\)/\1\/\2/') $HOME/myriastore/alias/midterm_project" && ln -s $HOME/myriastore/$(datalad -C $HOME/dl-101/DataLad-101/midterm_project -f'{infos[dataset][id]}' wtf | sed 's/^\(...\)\(.*\)/\1\/\2/') $HOME/myriastore/alias/midterm_project

   Here is how it looks inside of this directory. You can see both the
   automatically created alias as well as the manually created one:

   .. runrecord:: _examples/DL-101-147-123
      :language: console
      :workdir: beyond_basics

      $ tree $HOME/myriastore/alias

   Afterwards, the alias name, prefixed with a ``~``, can be used as a dataset
   identifier:

   .. runrecord:: _examples/DL-101-147-124
      :language: console
      :workdir: beyond_basics

      $ datalad clone ria+file://$HOME/myriastore#~midterm_project

   This makes it easier for others to clone the dataset and will provide a
   sensible default name for the clone if no additional path is provided in
   the command.

   Note that it is even possible to create "aliases of aliases" -- symlinking
   an existing alias-symlink (in the example above, ``midterm_project``) under
   another name in the ``alias/`` directory is no problem.
   This could be useful if the same dataset needs to be accessible via several
   aliases, or to safeguard against common spelling errors in alias names.

The dataset clone is just like any other dataset clone. Contents stored in
:term:`Git` are present right after cloning, while the contents of annexed
files are not yet retrieved from the store and can be obtained with a
:dlcmd:`get`.

.. runrecord:: _examples/DL-101-147-125
   :language: console
   :workdir: beyond_basics

   $ cd myclone
   $ tree

To demonstrate file retrieval from the store, let's get an annexed file:

.. runrecord:: _examples/DL-101-147-126
   :language: console
   :workdir: beyond_basics/myclone

   $ datalad get books/progit.pdf

.. find-out-more:: What about creating RIA stores and cloning from RIA stores with different protocols?

   Consider setting up and populating a RIA store on a server via the ``file``
   protocol, but cloning a dataset from that store to a local computer via the
   ``SSH`` protocol. Will this be a problem for file content retrieval?
   No, in all standard situations, DataLad will adapt to this. Upon cloning
   the dataset with a different URL than it was created under,
   enabling the special remote will initially fail, but DataLad will adaptively
   try out other URLs (including changes in hostname, path, or protocol) to
   enable the ora-remote and retrieve file contents.

Just as expected, the subdatasets are not pre-installed. How will subdataset
installation work for datasets that exist in a RIA store as well, like
``midterm_project``?
Just as with any other subdataset! DataLad cleverly handles subdataset
installations from RIA stores in the background: The location of the
subdataset in the RIA store is discovered and used automatically:

.. runrecord:: _examples/DL-101-147-127
   :language: console
   :workdir: beyond_basics/myclone

   $ datalad get -n midterm_project

More technical insights into the automatic ``ria+`` URL generation are outlined
in the find-out-more below:

.. find-out-more:: On cloning datasets with subdatasets from RIA stores

   The use case :ref:`usecase_HCP_dataset`
   details a RIA-store based publication of a large dataset, split into a
   nested dataset hierarchy with about 4500 subdatasets in total. But how can
   links to subdatasets work, if datasets in a RIA store are stored in a flat
   hierarchy, with no nesting?

   The key to this lies in flexibly regenerating the subdatasets' URLs based on
   their ID and a path to the RIA store. The :dlcmd:`get` command is
   capable of generating RIA URLs to subdatasets on its own, if the higher
   level dataset contains a ``datalad get`` configuration on
   ``subdataset-source-candidate-origin`` that points to the RIA store the
   subdataset is published in. Here is what the ``.datalad/config``
   configuration looks like for the top-level dataset of the
   `HCP dataset <https://github.com/datalad-datasets/human-connectome-project-openaccess>`_::

      [datalad "get"]
          subdataset-source-candidate-origin = "ria+https://store.datalad.org#{id}"

   With this configuration, a :dlcmd:`get` can use the URL and insert
   the dataset ID in question into the ``{id}`` placeholder to clone directly
   from the RIA store.

   This configuration is automatically added to a dataset that is cloned from a
   RIA store, but it can also be done by hand with a :gitcmd:`config`
   command [#f7]_.

Beyond straightforward access to datasets, RIA stores also allow very
fine-grained cloning operations: Datasets in RIA stores can be cloned in
specific versions.

.. find-out-more:: Cloning specific dataset versions

   Optionally, datasets can be cloned in a specific version, such as a
   :term:`tag` or :term:`branch`, by appending ``@<version-identifier>`` after
   the dataset ID or the dataset alias.
   Here is how to clone the `BIDS <https://bids.neuroimaging.io>`_ version of
   the `structural preprocessed subset of the HCP dataset <https://github.com/datalad-datasets/hcp-structural-preprocessed>`_
   that exists on the branch ``bids`` of this dataset:

   .. code-block:: bash

      $ datalad clone ria+https://store.datalad.org#~hcp-structural-preprocessed@bids

   If you are interested in finding out how this dataset came into existence,
   check out the use case :ref:`usecase_HCP_dataset`.

Updating datasets works with the :dlcmd:`update` and :dlcmd:`update --merge`
commands introduced in chapter :ref:`chapter_collaboration`. And because a
RIA store hosts :term:`bare Git repositories`, collaborating becomes
easy. Anyone with access can clone the dataset from the store, add changes, and
push them back -- this is the same workflow as for datasets hosted on sites
such as :term:`GitHub`, :term:`GitLab`, or :term:`Gin`.

Permission management
"""""""""""""""""""""

In order to limit access or give access to datasets in datastores, permissions can be set
|
|
at the time of RIA sibling creation with the ``--shared`` option.
|
|
If it is given, this option configures the permissions in the RIA store for
|
|
multi-users access. Possible values for this option are identical to those of
|
|
``git init --shared`` and are described in its
|
|
`documentation <https://git-scm.com/docs/git-init#Documentation/git-init.txt---sharedfalsetrueumaskgroupallworldeverybodyltpermgt>`__.
|
|
In order for the dataset to be accessible to everyone, for example, ``--shared all``
|
|
could be specified. If access should be limited to a particular Unix
|
|
`group <https://en.wikipedia.org/wiki/File-system_permissions#Notation_of_traditional_Unix_permissions>`_
|
|
(``--shared group``), the group name needs to be specified with the
|
|
``--group`` option.
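
For example, to create a RIA sibling that members of a particular Unix group can
write to, the group-based mode and the group name could be given at sibling
creation. The store URL, sibling name, and group name below are made up:

.. code-block:: bash

   # Hypothetical example: restrict store access to the Unix group "datalad"
   $ datalad create-sibling-ria -s datastore --shared group --group datalad \
     ria+ssh://example.com/data/store
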

Configurations and tricks to hide technical layers
""""""""""""""""""""""""""""""""""""""""""""""""""

In setups with central, DataLad-centric data management, custom configurations
can be distributed via DataLad's run procedures to simplify workflows further
and hide the technical layers of the RIA setup, sparing users the need to know
about RIA stores. For example, custom procedures provided
at dataset creation could automatically perform a sibling setup in a RIA store,
and also create an associated GitLab repository with a publication dependency on
the RIA store to ease publishing data or cloning the dataset.
The use case :ref:`usecase_datastore` details the setup of RIA stores in a
scientific institute and demonstrates this example.

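
As an illustration, such a procedure (here with the made-up name
``cfg_institute``) could be applied right at dataset creation, or afterwards on
an existing dataset:

.. code-block:: bash

   # Hypothetical procedure "cfg_institute" that sets up RIA and GitLab siblings
   $ datalad create -c institute mydataset   # apply it at creation time
   $ datalad run-procedure cfg_institute     # or apply it to an existing dataset
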
To simplify repository access beyond using aliases, the datasets stored in a RIA
store can be installed under human-readable names in a single superdataset.
Cloning the superdataset exposes the underlying datasets under a non-dataset-ID name.
Users can thus get data from datasets hosted in a datastore without any
knowledge of the dataset IDs or the need to construct ``ria+`` URLs, just as
was done in the use cases :ref:`usecase_HCP_dataset` and :ref:`usecase_datastore`.
From a user's perspective, the RIA store thus stays completely hidden.

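
In a sketch of such a setup (the superdataset URL and dataset names below are
hypothetical), a user would only ever see human-readable names:

.. code-block:: bash

   # Hypothetical superdataset that references datasets in a hidden RIA store
   $ datalad clone https://example.com/datasets.git datasets
   $ cd datasets
   $ datalad get -n human-readable-name        # installs the subdataset
   $ datalad get human-readable-name/data.txt  # retrieves file content
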
Standard maintenance tasks can be performed easily, or even in an automated
fashion, by data stewards with knowledge about RIA stores and access to them.
The use case :ref:`usecase_datastore` showcases some examples of those operations.

Summary
^^^^^^^

RIA stores are useful, lean, and undemanding storage locations for DataLad datasets.
Their properties make them suitable solutions for backup, central data management,
or collaboration use cases. They can be set up with minimal effort, and the few
technical details a user may face, such as cloning from :term:`dataset ID`\s, can
be hidden with minimal configuration of the store, such as aliases or custom
procedures.


.. rubric:: Footnotes

.. [#f1] The two-level structure (3 ID characters as one subdirectory, the
   remaining ID characters as the next subdirectory) exists to avoid exhausting
   file system limits on the number of files/folders within a directory.

.. [#f2] Beyond datasets, the RIA store only contains the directory ``error_logs``
   for error logging and the file ``ria-layout-version`` for a specification of the
   dataset tree layout in the store (last two lines in the code block above).
   The ``ria-layout-version`` is important because it identifies whether
   the keystore uses git-annex's ``hashdirlower`` (git-annex's default for
   bare repositories) or ``hashdirmixed`` layout (which is necessary to
   allow symlinked annexes, relevant for :term:`ephemeral clone`\s). To read
   more about hashing in the keystore, take a look at
   `the docs <https://git-annex.branchable.com/internals/hashing>`_.

.. [#f3] To re-read about how git-annex's object tree works, check out section
   :ref:`symlink`, and pay close attention to the :ref:`Find-out-more on the object tree <objecttree>`.
   Additionally, you can find a lot of background information in git-annex's
   `documentation <https://git-annex.branchable.com/internals>`_.

.. [#f4] The use case

   .. todo::

      Link UKBiobank on supercomputer use case once ready

   shows how this feature can come in handy.

.. [#f6] To re-read about publication dependencies and why this is relevant to
   annexed contents in the dataset, check out section :ref:`sharethirdparty`.

.. [#f7] To re-read on configuring datasets with the :gitcmd:`config` command,
   go back to sections :ref:`config` and :ref:`config2`.