252 lines
12 KiB
ReStructuredText
252 lines
12 KiB
ReStructuredText
.. index:: ! Usecase; Using Globus as data store
|
||
.. _usecase_using_globus_as_datastore:
|
||
|
||
Using Globus as a data store for the Canadian Open Neuroscience Portal
|
||
----------------------------------------------------------------------
|
||
|
||
This use case shows how the `Canadian Open Neuroscience Portal (CONP) <https://conp.ca>`_
|
||
disseminates data as DataLad datasets using the `Globus <https://www.globus.org>`_
|
||
network with :term:`git-annex`, a custom git-annex :term:`special remote`, and
|
||
Datalad. It demonstrates
|
||
|
||
#. How to enable the git-annex `Globus special remote <https://github.com/CONP-PCNO/git-annex-remote-globus>`_
|
||
to access files content from `Globus.org <https://www.globus.org>`_,
|
||
#. The workflows used to access datasets via the
|
||
`Canadian Open Neuroscience Portal (CONP) <https://conp.ca>`_,
|
||
#. An example of disk-space aware computing with large datasets distributed
|
||
across systems that avoids unnecessary replication, eased by DataLad and
|
||
:term:`git-annex`.
|
||
|
||
The Challenge
|
||
^^^^^^^^^^^^^
|
||
|
||
Every day, researchers from different fields strive to advance present
|
||
state-of-the-art scientific knowledge by generating and publishing novel
|
||
results. Crucially, they must share such results with the scientific
|
||
community to enable other researchers to further build on existing data
|
||
and avoid duplicating work.
|
||
|
||
The `Canadian Open Neuroscience Portal (CONP) <https://conp.ca>`_ is a publicly
|
||
available platform that aims to remove the technical barriers to practicing open science
|
||
and improve the accessibility and reusability of neuroscience research to accelerate
|
||
the pace of discovery. To this end, the platform will provide a unified interface
|
||
that -- among other things -- enables sharing and open dissemination of both neuroscience
|
||
data and methods to the global community.
|
||
Managing the scientific data ecosystem is extremely challenging given
|
||
the amount of new data generated every day, however.
|
||
CONP must take a strategic solution to allow researchers to
|
||
|
||
- dynamically work on present data,
|
||
- upload new versions of the data, and
|
||
- generate additional scientific work.
|
||
|
||
An underlying data management system to achieve this must be flexible, dynamic
|
||
and light-weight. It would need to have the ability to easily distribute datasets
|
||
across multiple locations to reduce the need of re-collecting or replicating
|
||
data that is similar to already existing datasets.
|
||
|
||
The Datalad Approach
|
||
^^^^^^^^^^^^^^^^^^^^
|
||
|
||
CONP makes use of Datalad as a data management tool to enable efficient analysis
|
||
and work on datasets: Datalad minimizes the computational cost of holding full storage of
|
||
datasets versions, it allows files in a dataset to be distributed across
|
||
multiple download sources, and to be retrieved on demand only to save disk space.
|
||
Therefore, it is common practice for researchers to both download and
|
||
publish research content in a dataset format via a CONP, which provides them
|
||
with a vast dataset repository.
|
||
|
||
.. find-out-more:: Basic principles of DataLad for new readers
|
||
|
||
If you are new to DataLad, the introduction of the handbook and the chapter
|
||
:ref:`chapter_datasets` can give you a good idea of what DataLad and its
|
||
underlying tools can to, as well as a hands-on demonstration. This findoutmore,
|
||
in the meantime, sketches a high-level overview of the principles behind DataLad's
|
||
data sharing capacities.
|
||
|
||
Datalad is built on top of `Git <https://git-scm.com>`_ and
|
||
`git-annex <https://git-annex.branchable.com>`_, and enables data version
|
||
control. A one-page overview can be found in section :ref:`executive_summary`.
|
||
|
||
:term:`git-annex` is a useful tool that extends Git with the ability to manage
|
||
repositories in a lightweight fashion even if they contain large amounts of
|
||
data. One main principle of git-annex lies storing data that should not be
|
||
stored in Git (e.g., due to size limits) in an :term:`annex`. In its place, it
|
||
generates symbolic links (:term:`symlink`\s) to these *annexed* files that encode
|
||
their file content. Only the symlinks are committed into :term:`Git` while
|
||
:term:`git-annex` handles data management in the annex. A detailed explanation
|
||
of this process can be found in the section :ref:`symlink`, but the outcome
|
||
of it is a light-weight Git repository that can be cloned fast and yet contains
|
||
access to arbitrarily large data managed by :term:`git-annex`.
|
||
|
||
In the case of data sharing procedures, annexed data can be stored in various
|
||
third party hosting services configured as
|
||
`special remotes <https://git-annex.branchable.com/special_remotes>`_.
|
||
When retrieving data, :term:`git-annex` requests access to the primary data
|
||
source storing those files to retrieve actual files content when the user
|
||
needs it.
|
||
|
||
The workflows for users to get data are straightforward:
|
||
Users log into the CONP portal and install Datalad datasets with
|
||
``datalad install -r <dataset>``. This gives them access to the annexed files
|
||
(as mentioned in the findoutmore above, large files replaced by their symlinks).
|
||
To request the content of the annexed files, they simply download those files
|
||
locally in their file system using ``datalad get path/to/file``. So simple!
|
||
|
||
On a technical level, under the hood, :term:`git-annex` needs to have a connection
|
||
established with the primary data source, the :term:`special remote`, that hosts
|
||
and provides the requested files' contents.
|
||
In some cases, annexed files are stored in `Globus.org <https://www.globus.org>`__.
|
||
Globus is an efficient transfer files system suitable for researchers to share
|
||
and transfer files between so called *endpoints*, locations in Globus.org where
|
||
files get uploaded by their owners or get transferred to, that can be either
|
||
private or public. Annexed file contents are stored in such
|
||
`Globus endpoints <https://docs.globus.org/faq/globus-connect-endpoints/#what_is_an_endpoint>`_.
|
||
Therefore, when users download annexed files, Globus communicates with git-annex
|
||
to provide access to files content. Given this functionality, we can say that
|
||
Globus works as a data store for git-annex, or in technical terms, that Globus is
|
||
configured to work as a :term:`special remote` for git-annex. This is
|
||
possible via the git-annex backend interface implementation for Globus
|
||
called `git-annex-globus-remote <https://github.com/CONP-PCNO/git-annex-remote-globus>`_
|
||
developed by CONP.
|
||
In conjunction, CONP and the git-annex-globus-remote constitute the building
|
||
blocks that enable access to datasets and its data: CONP hosts small-sized
|
||
datasets, and Globus.org is the data store that (large) file content can be
|
||
retrieved from.
|
||
|
||
To sum up, CONP makes a variety of datasets available and provides them to researchers
|
||
as Datalad datasets that have the regular, advantageous Datalad functionality.
|
||
All of this exists thanks to the ability of git-annex and Datalad to interface with
|
||
special remote locations across the web such as `Globus.org <https://www.globus.org>`__
|
||
to request access to data.
|
||
In this way, researchers have access to a wide research data ecosystem and can use
|
||
and reuse existing data, thus reducing the need of data replication.
|
||
|
||
|
||
|
||
Step-by-Step
|
||
^^^^^^^^^^^^
|
||
|
||
Globus as git-annex data store
|
||
""""""""""""""""""""""""""""""
|
||
A remote data store exists thanks to git-annex (which DataLad builds upon):
|
||
git-annex uses a key-value pair to reference files. In the git-annex object tree,
|
||
large files in datasets are stored as values while the key is generated from their
|
||
contents and is checked into Git. The key is used to reference the location of the value
|
||
in the object tree [#f1]_. The :term:`object-tree` (or keystore) with the data contents can
|
||
be located anywhere – its location only needs to be encoded using a special remote.
|
||
Therefore, thanks to the `git-annex-globus-remote <https://github.com/CONP-PCNO/git-annex-remote-globus>`_
|
||
interface, Globus.org provides git-annex with location information to retrieve
|
||
values and access files content with the corresponding keys.
|
||
To ultimately enable end users’ access to data,
|
||
git-annex registers Globus locations by assigning them to Globus-specific URLs,
|
||
such as ``globus://dataset_id/path/to/file``. Each Globus URL is associated
|
||
with a the key corresponding to the given file. The use of a Globus URL protocol
|
||
is a fictitious mean to assign each file of the dataset a unique location and
|
||
source and therefore, it is a wrapper for additional validation that is performed
|
||
by the git-annex-globus-remote to check on the actual presence of the file within
|
||
the Globus transfer file ecosystem. In other words, the ‘Globus URL’ is simply an
|
||
alias of an existing file located on the web and specifically available in Globus.org.
|
||
Registration of Globus URLs in git-annex is among the configuration procedures
|
||
carried out on an administrative, system-wide level, and users will only deal
|
||
with direct easy access of desired files.
|
||
|
||
With this, Globus is configured to receive data access requests from git-annex
|
||
and to respond back if data is available. Currently, the git-annex-globus-remote
|
||
only supports data *download* operations. In the future, it could be useful for
|
||
additional functionality as well.
|
||
When the globus special remote gets initialized for the first time, the user
|
||
has to authenticate to Globus.org using `ORCID <https://orcid.org>`_ ,
|
||
`Gmail <https://mail.google.com>`_ or a specific Globus account.
|
||
This step will enable git-annex to then initialize the globus special remote and
|
||
establish the communication process. Instructions to use the globus special remote
|
||
are available at `github.com/CONP-PCNO/git-annex-remote-globus <https://github.com/CONP-PCNO/git-annex-remote-globus>`_.
|
||
Guidelines specifying the standard communication protocol to implement a custom
|
||
special remote can be found at
|
||
`git-annex.branchable.com/design/external_special_remote_protocol <https://git-annex.branchable.com/design/external_special_remote_protocol>`_.
|
||
|
||
|
||
An example using Globus from a user perspective
|
||
"""""""""""""""""""""""""""""""""""""""""""""""
|
||
It always starts with a dataset, installed with either :dlcmd:`install`
|
||
or :dlcmd:`clone`.
|
||
|
||
.. code-block:: bash
|
||
|
||
$ datalad install -r <dataset>
|
||
$ cd <dataset>
|
||
|
||
In order to get access to annexed data stored on Globus.org, users need to
|
||
install the globus-special-remote. If it is the first time using
|
||
Globus, users will need to authenticate to Globus.org by running the
|
||
``git-annex-remote-globus setup`` command:
|
||
|
||
.. code-block:: bash
|
||
|
||
$ pip install git-annex-remote-globus
|
||
# if first time
|
||
$ git-annex-remote-globus setup
|
||
|
||
After the installation of a dataset, we can see that most of the files in the
|
||
dataset are annexed: Listing a file with ``ls -l`` will reveal a :term:`symlink`
|
||
to the dataset's annex.
|
||
|
||
.. code-block:: bash
|
||
|
||
$ ls -l NeuroMap_data/cortex/mask/mask.mat
|
||
cortex/mask/mask.mat -> ../../../.git/annex/objects/object.mat
|
||
|
||
However, without having any content downloaded yet, the symlink currently points
|
||
into a void, and tools will not be able to open the file as its contents
|
||
are not yet locally available.
|
||
|
||
.. code-block:: bash
|
||
|
||
$ cat NeuroMap_data/cortex/mask/mask.mat
|
||
NeuroMap_data/cortex/mask/mask.mat: No such file or directory
|
||
|
||
However, data retrieval is easy. At first, users have to enable the globus remote.
|
||
|
||
.. code-block:: bash
|
||
|
||
$ git annex enableremote globus
|
||
enableremote globus ok
|
||
(recording state in git...)
|
||
|
||
After that, they can download any file, directory, or complete dataset using
|
||
:dlcmd:`get`:
|
||
|
||
.. code-block:: bash
|
||
|
||
$ datalad get NeuroMap_data/cortex/mask/mask.mat
|
||
get(ok): NeuroMap_data/cortex/mask/mask.mat (file) [from globus...]
|
||
|
||
$ ls -l NeuroMap_data/cortex/mask/mask.mat
|
||
cortex/mask/mask.mat -> ../../../.git/annex/objects/object.mat
|
||
|
||
$ cat NeuroMap_data/cortex/mask/mask.mat
|
||
# you can now access the file !
|
||
|
||
|
||
Downloaded! Researchers could now use this dataset to replicate previous analyses
|
||
and further build on present data to bring scientific knowledge forward.
|
||
CONP thus makes a variety of datasets flexibly available and helps to disseminate
|
||
data. The on-demand availability of files in datasets can help scientists to
|
||
save disk space. For this, they could get only those data files that they need
|
||
instead of obtaining complete copies of the dataset, or they could locally
|
||
:dlcmd:`drop` data that is hosted and thus easily re-available on Globus.org
|
||
after their analyses are done.
|
||
|
||
|
||
Resources
|
||
^^^^^^^^^
|
||
|
||
The ``README`` at `github.com/CONP-PCNO/git-annex-remote-globus <https://github.com/CONP-PCNO/git-annex-remote-globus>`_
|
||
provides an excellent and in-depth overview of how to install and use
|
||
the git-annex special remote for Globus.org.
|
||
|
||
|
||
.. rubric:: Footnotes
|
||
|
||
.. [#f1] More details on how :term:`git-annex` handles data underneath the hood and
|
||
how the :term:`object-tree` works can be found in section :ref:`symlink`.
|