datalad-handbook/docs/usecases/datastorage_for_institutions.rst

.. index:: ! 3-001
.. index:: ! Usecase; Remote Indexed Archive (RIA) store
.. _3-001:
.. _usecase_datastore:

Building a scalable data storage for scientific computing
---------------------------------------------------------

Research can require enormous amounts of data. Such data needs to be accessed by
multiple people at the same time, and is used across a diverse range of
computations or research questions.
The size of the dataset, the need for simultaneous access and transformation
of this data by multiple people, and the subsequent storing of multiple copies
or derivatives of the data constitutes a challenge for computational clusters
and requires state-of-the-art data management solutions.
This use case details a model implementation for a scalable data storage
solution, suitable to serve the computational and logistic demands of data
science in big (scientific) institutions, while keeping workflows for users
as simple as possible. It elaborates on

#. How to implement a scalable :term:`Remote Indexed Archive (RIA) store` to flexibly
   store large amounts of DataLad datasets, potentially remote to lower storage
   strains on computing infrastructure,
#. How disk-space aware computing can be eased by DataLad based workflows and
   enforced by infrastructural incentives and limitations, and
#. How to reduce technical complexities for users and encourage reproducible,
   version-controlled, and scalable scientific workflows.

.. importantnote:: Use case target audience

   This use case is technical in nature and aimed at IT/data management
   personnel seeking insights into the technical implementation and
   configuration of a RIA store or into its workflows. In particular, it
   describes the RIA data storage and workflow implementation as done in INM-7,
   Research Centre Juelich, Germany.


The Challenge
^^^^^^^^^^^^^

The data science institute XYZ consists of dozens of people: Principal
investigators, PhD students, general research staff, system administration,
and IT support. It does research on important global issues, and prides
itself on ground-breaking insights obtained from elaborate and complex
computations run on a large scientific computing cluster.
The datasets used in the institute are big both in size and number of files,
and expensive to collect.
Therefore, datasets are used for various different research questions, by
multiple researchers. Every member of the institute has an account on an expensive
and large compute cluster, and all of the data exists in dedicated directories
on this server. However, researchers struggle with the technical overhead of
data management *and* data science.
In order to work on their research questions without modifying
original data, every user creates their own copies of the full data in their
user account on the cluster -- even if it contains many files that are not
necessary for their analysis. In addition, as version control is not a standard
skill, they add all computed derivatives and outputs, even old versions, out of
fear of losing work that may become relevant again. Thus, an excess of (unorganized)
data copies and derivatives exists in addition to the already substantial
amount of original data. At the same time, the compute cluster is both the
data storage and the analysis playground for the institute. With data
directories of several TB in size, *and* computationally heavy analyses, the
compute cluster is quickly brought to its knees: Insufficient memory and
IOPS starvation make computations painstakingly slow, and hinder scientific
progress. Despite the elaborate and expensive cluster setup, exciting datasets
cannot be stored or processed, as there just doesn't seem to be enough disk
space.

Therefore, the challenge is two-fold: On an infrastructural level, institute XYZ
needs a scalable, flexible, and maintainable data storage solution for their
growing collection of large datasets.
On the level of human behavior, researchers not formally trained in data
management need to apply and adhere to advanced data management principles.

The DataLad approach
^^^^^^^^^^^^^^^^^^^^

The compute cluster is refurbished to a state-of-the-art data management
system.
For a scalable and flexible dataset storage, the data store is a
:term:`Remote Indexed Archive (RIA) store` -- an extendable, file-system based
storage solution for DataLad datasets that aligns well with the requirements of
scientific computing (infrastructure).
The RIA store is configured as a git-annex ORA-remote ("optional remote archive")
special remote for access to annexed keys in the store and so that full
datasets can be (compressed) 7-zip archives.
The latter is especially useful in case of file system inode
limitations, such as on HPC storage systems: Regardless of a dataset's number of
files and size, (compressed) 7zipped datasets use only few inodes, but retain the
ability to query available files.
Unlike traditional solutions, both because of the size of the large
amounts of data, and for more efficient use of compute power for
calculations instead of data storage, the RIA store is set up *remote*: Data is
stored on a different machine than the one the scientific analyses are computed
on. While unconventional, it is convenient, and perfectly possible with DataLad.

The infrastructural changes are accompanied by changes in the mindset and
workflows of the researchers that perform analyses on the cluster.
By using a RIA store, the institute's work routines are adjusted around
DataLad datasets. Simple configurations, distributed system-wide with DataLad's
run-procedures, or basic data management principles improve the efficiency and
reproducibility of research projects:
Analyses are set-up inside of DataLad datasets, and for every
analysis, an associated ``project`` is created under the namespace of the
institute on the institute's :term:`GitLab` instance automatically. This
not only leads to vastly simplified version control workflows, but also to
simplified access to projects and research logs for collaborators and supervisors.
Input data gets installed as subdatasets from the RIA store. This automatically
links analysis projects to data sets, and allows for fine-grained access of up
to individual file level. With only precisely needed data, analysis datasets are
already much leaner than with previous complete dataset copies, but as data can
be reobtained on-demand from the store, original input files or files that are
easily recomputed can safely be dropped to save even more disk-space.
Beyond this, upon creation of an analysis project, the associated GitLab project
is automatically configured as a remote with a publication dependency on the
data store, thus enabling vastly simplified data publication routines and
backups of pristine results: After computing their results, a
:dlcmd:`push` is all it takes to backup and share one's scientific
insights. Thus, even with a complex setup of data store, compute infrastructure,
and repository hosting, configurations adjusted to the compute infrastructure
can be distributed and used to mitigate any potential remaining technical overhead.
Finally, with all datasets stored in a RIA store and in a single place, any remaining
maintenance and query tasks in the datasets can be performed by data management
personnel without requiring domain knowledge about dataset contents.


Step-by-step
^^^^^^^^^^^^

The following section will elaborate on the details of the technical
implementation of a RIA store, and the workflow requirements and incentives for
researchers. Both of them are aimed at making scientific analyses on a
compute cluster scale and can be viewed as complementary but independent.

.. importantnote:: Note on the generality of the described setup

   Some hardware-specific implementation details are unique to the real-world
   example this use case is based on, and are not a requirement. In this particular
   case of application, for example, a *remote* setup for a RIA store made sense:
   Parts of an old compute cluster and of the super computer at the Juelich
   Supercomputing Centre (JSC) instead of the institute's compute cluster are used
   to host the data store. This may be an unconventional storage location,
   but it is convenient: The data does not strain the compute cluster, and with
   DataLad, it is irrelevant where the RIA store is located. The next subsection
   introduces the general layout of the compute infrastructure and some
   DataLad-unrelated incentives and restrictions.

Incentives and imperatives for disk-space aware computing
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""

On a high level, the layout and relationships of the relevant computational
infrastructure in this use case are as follows:
Every researcher has a workstation that they can access the compute cluster with.
On the compute cluster's head node, every user account has their own
home directory. These are the private spaces of researchers and are referred to
as ``$HOME`` in :numref:`fig_store`.
Analyses should be conducted on the cluster's compute nodes (``$COMPUTE``).
``$HOME`` and ``$COMPUTE`` are not managed or trusted by data management personnel,
and are seen as *ephemeral* (short-lived).
The RIA store (``$DATA``) can be accessed both from ``$HOME`` and ``$COMPUTE``,
in both directions: Researchers can pull datasets from the store, push new
datasets to it, or update (certain) existing datasets. ``$DATA`` is the one location
in which experienced data management personnel ensures back-up and archival, performs
house-keeping, and handles :term:`permissions`, and is thus where pristine raw
data is stored or analysis code or results from ``$COMPUTE`` and ``$HOME`` should
end up. This aids organization, and allows a central management of back-ups
and archival, potentially by data stewards or similar data management personnel
with no domain knowledge about data contents.

.. _fig_store:

.. figure:: ../artwork/src/ephemeral_infra.svg
   :alt: A simple, local version control workflow with datalad.
   :figwidth: 80%

   Trinity of research data handling: The data store (``$DATA``) is managed and
   backed-up. The compute cluster (``$COMPUTE``) has an analysis-appropriate structure
   with adequate resources, but just as users workstations/laptops (``$HOME``),
   it is not concerned with data hosting.

One aspect of the problem are disk-space unaware computing workflows. Researchers
make and keep numerous copies of data in their home directory and perform
computationally expensive analyses on the headnode of a compute cluster because
they do not know better, and/or want to do it in the easiest way possible.
A general change for the better can be achieved by imposing sensible limitations
and restrictions on what can be done at which scale:
Data from the RIA store (``$DATA``) is accessible to researchers for exploration
and computation, but the scale of the operations they want to perform can require
different approaches.
In their ``$HOME``, researchers are free to do whatever they want as long as it
is within the limits of their machines or their user accounts (100GB). Thus,
researchers can explore data, test and develop code, or visualize results,
but they cannot create complete dataset copies or afford to keep an excess of
unused data around.
Only ``$COMPUTE`` has the necessary hardware requirements for expensive computations.
Thus, within ``$HOME``, researchers are free to explore data
as they wish, but scaling requires them to use ``$COMPUTE``. By using a job
scheduler, compute jobs of multiple researchers are distributed fairly across
the available compute infrastructure. Version controlled (and potentially
reproducible) research logs and the results of the analyses can be pushed from
``COMPUTE`` to ``$DATA`` for back-up and archival, and hence anything that is
relevant for a research project is tracked, backed-up, and stored, all without
straining available disk-space on the cluster afterwards. While the imposed
limitations are independent of DataLad, DataLad can make sure that the necessary
workflows are simple enough for researchers of any seniority, background, or
skill level.

Remote indexed archive (RIA) stores
"""""""""""""""""""""""""""""""""""

A RIA store is a storage solution for DataLad datasets that can be flexibly
extended with new datasets, independent of static file names or directory
hierarchies, and that can be (automatically) maintained or queried without
requiring expert or domain knowledge about the data. At its core, it is a flat,
file-system based repository representation of any number of datasets, limited
only by disk-space constrains of the machine it lies on.

.. index::
   pair: create-sibling-ria;  DataLad command

Put simply, a RIA store is a dataset storage location that allows for access to
and collaboration on DataLad datasets.
The high-level workflow overview is as follows: Create a dataset,
use the :dlcmd:`create-sibling-ria` command to establish a connection
to an either pre-existing or not-yet-existing RIA store, publish dataset contents
with :dlcmd:`push`, (let others) clone the dataset from the
RIA store, and (let others) publish and pull updates. In the
case of large, institute-wide datasets, a RIA store (or multiple RIA stores)
can serve as a central storage location that enables fine-grained data access to
everyone who needs it, and as a storage and back-up location for all analysis datasets.
Beyond constituting central storage locations, RIA stores also ease dataset
maintenance and queries:
If all datasets of an institute are kept in a single RIA store, questions such
as "Which projects use this data as their input?", "In which projects was the
student with this Git identity involved?", "Give me a complete research log
of what was done for this publication", or "Which datasets weren't used in the
last 5 years?" can be answered automatically with Git tools, without requiring
expert knowledge about the contents of any of the datasets, or access to the
original creators of the dataset.
To find out more about RIA stores, check out section :ref:`riastore`.

.. todo::

   Add a paragraph on the setup in INM-7 once it exists (bulk nodes, project-wise
   RIA stores, stores in home directories, etc.

RIA store workflows
"""""""""""""""""""

.. todo::

   Sketch a RIA store workflow from a user's perspective

**Configurations can hide the technical layers**

Setting up a RIA store and appropriate siblings is fairly easy -- it requires
only the :dlcmd:`create-sibling-ria` command.
However, in the institute this use case describes, in order to spare users
knowing about RIA stores, custom configurations are distributed via DataLad's
run-procedures to simplify workflows further and hide the technical layers of
the RIA setup:

A `custom procedure <https://jugit.fz-juelich.de/inm7/infrastructure/inm7-datalad/blob/master/inm7_datalad/resources/procedures/cfg_inm7.py>`_
performs the relevant sibling setup with a fully configured link to the RIA store,
and, on top of it, also creates an associated repository with a publication
dependency on the RIA store to an institute's GitLab instance [#f1]_.
With a procedure like this in place system-wide, an individual researcher only
needs to call the procedure right at the time of dataset creation, and has a
fully configured and set up analysis dataset afterwards:

.. code-block:: bash

   $ datalad create -c inm7 <PATH>

Working in this dataset will require only :dlcmd:`save` and
:dlcmd:`push` commands, and configurations ensure that the projects
history and results are published where they need to be: The RIA store, for storing
and archiving the project including data, and GitLab, for exposing the projects
progress to the outside and easing collaboration or supervision. Users do not need
to know the location of the store, its layout, or how it works -- they can go
about doing their science, while DataLad handles publications routines.

In order to get input data from datasets hosted in the datastore without requiring
users to know about dataset IDs or construct ``ria+`` URLs, superdatasets
get a :term:`sibling` on :term:`GitLab` or :term:`GitHub` with a human readable
name. Users can clone the superdatasets from the web hosting service, and obtain data
via :dlcmd:`get`. A concrete example for this is described in
the use case :ref:`usecase_HCP_dataset`. While :dlcmd:`get` will retrieve file
or subdataset contents from the RIA store, users will not need to bother where
the data actually comes from.

Summary
"""""""

The infrastructural and workflow changes around DataLad datasets in RIA stores
improve the efficiency of the institute:

With easy local version control workflows and DataLad-based data management routines,
researchers are able to focus on science and face barely any technical overhead for
data management. As file content for analyses is obtained *on demand*
via :dlcmd:`get`, researchers selectively obtain only those data they
need instead of having complete copies of datasets as before, and thus save disk
space. Upon :dlcmd:`push`, computed results and project histories
can be pushed to the data store and the institute's GitLab instance, and be thus
backed-up and accessible for collaborators or supervisors. Easy-to-reobtain input
data can safely be dropped to free disk space on the compute cluster. Sensible
incentives for computing and limitations on disk space prevent unmanaged clutter.
With a RIA store full of bare git repositories, it is easily maintainable by data
stewards or system administrators. Common compression or cleaning operations of
Git and git-annex are performed without requiring knowledge about the data
inside of the store, as are queries on interesting aspects of datasets, potentially
across all of the datasets of the institute.
With a remote data store setup, the compute cluster is efficiently used for
computations instead of data storage. Researchers can not only compute their
analyses faster and on larger datasets than before, but with DataLad's version
control capabilities their work also becomes more transparent, open, and
reproducible.


.. rubric:: Footnotes

.. [#f1] To re-read about DataLad's run-procedures, check out section
         :ref:`procedures`. You can find the source code of the procedure
         `on GitLab <https://jugit.fz-juelich.de/inm7/infrastructure/inm7-datalad/blob/master/inm7_datalad/resources/procedures/cfg_inm7.py>`_.