datalad-handbook/docs/basics/101-124-procedures.rst

.. index:: ! procedures, run-procedures
.. _procedures:

Configurations to go
--------------------

The past two sections should have given you a comprehensive
overview on the different configuration options the tools
Git, git-annex, and DataLad provide. They not only
showed you a way to configure everything you may need to
configure, but also gave explanations about what the
configuration options actually mean.

But figuring out which configurations are useful and how
to apply them are also not the easiest tasks. Therefore,
some clever people decided to assist with
these tasks, and created pre-configured *procedures*
that process datasets in a particular way.
These procedures can be shipped within DataLad or its extensions,
lie on a system, or can be shared together with datasets.

One of such procedures is the ``text2git`` configuration.
In order to learn about procedures in general, let's demystify
what the ``text2git`` procedure exactly is: It is
nothing more than a simple script that

- writes the relevant ``annex_largefiles`` configuration,  i.e., "Do not put anything that is a text file in the annex") to the ``.gitattributes`` file of a dataset, and
- saves this modification with the commit message "Instruct annex to add text files to Git".

This particular procedure lives in a script called
``cfg_text2git`` in the sourcecode of DataLad. The amount of code
in this script is not large, and the relevant lines of code
are highlighted:

.. code-block:: python
   :emphasize-lines: 12, 16-17

    import sys
    import os.path as op

    from datalad.distribution.dataset import require_dataset

    ds = require_dataset(
        sys.argv[1],
        check_installed=True,
        purpose='configuration')

    # the relevant configuration:
    annex_largefiles = '((mimeencoding=binary)and(largerthan=0))'
    attrs = ds.repo.get_gitattributes('*')
    if not attrs.get('*', {}).get(
            'annex.largefiles', None) == annex_largefiles:
        ds.repo.set_gitattributes([
            ('*', {'annex.largefiles': annex_largefiles})])

    git_attributes_file = op.join(ds.path, '.gitattributes')
    ds.save(
        git_attributes_file,
        message="Instruct annex to add text files to Git",
    )

Just like ``cfg_text2git``, all DataLad procedures are
executables (such as a script, or compiled code).
In principle, they can be written in any language, and perform
any task inside of a dataset.
The ``text2git`` configuration, for example, applies a configuration for how
git-annex treats different file types. Other procedures do not
only modify ``.gitattributes``, but can also populate a dataset
with particular content, or automate routine tasks such as
synchronizing dataset content with certain siblings.
What makes them a particularly versatile and flexible tool is
that anyone can write their own procedures.
If a workflow is a standard in a team and needs to be applied often, turning it into
a script can save time and effort.
To learn how to do this, read the :ref:`tutorial on writing own procedures <fom-procedures>`.
By pointing DataLad to the location the procedures reside in they can be applied, and by
including them in a dataset they can even be shared.
And even if the script is simple, it is very handy to have preconfigured
procedures that can be run in a single command line call. In the
case of ``text2git``, all text files in a dataset will be stored
in Git -- this is a useful configuration that is applicable to a
wide range of datasets. It is a shortcut that
spares naive users the necessity to learn about the ``.gitattributes``
file when setting up a dataset.


.. index::
   pair: run-procedure; DataLad command
   pair: discover dataset procedures; with DataLad
   pair: discover; dataset procedure

To find out available procedures, the command
:dlcmd:`run-procedure --discover` is helpful.
This command will make DataLad search the default location for
procedures in a dataset, the source code of DataLad or
installed DataLad extensions, and the default locations for
procedures on the system for available procedures:

.. runrecord:: _examples/DL-101-124-101
   :workdir: dl-101/DataLad-101
   :language: console

   $ datalad run-procedure --discover

The output shows that four procedures available in this particular dataset and the system it exists on:
``cfg_metadatatypes``, ``cfg_text2git``, ``cfg_yoda``, and ``cfg_noannex``.
It also lists where they are stored -- in this case,
they are all part of the source code of DataLad [#f1]_.

- ``cfg_noannex`` configures a dataset to not have an annex at all.
- ``cfg_yoda`` configures a dataset according to the yoda
  principles -- the section :ref:`yoda` talks about this in detail.
- ``cfg_text2git`` configures text files to be stored in Git.
- ``cfg_metadatatypes`` lets users configure additional metadata
  types.

.. index::
   pair: run dataset procedure; with DataLad
   pair: run; dataset procedure

Applying procedures
^^^^^^^^^^^^^^^^^^^

:dlcmd:`run-procedure` not only *discovers*
but also *executes* procedures. If given the name of
a procedure, this command will apply the procedure to
the current dataset, or the dataset that is specified
with the ``-d/--dataset`` flag:

.. code-block:: bash

   datalad run-procedure [-d <PATH>] cfg_text2git

.. index::
   pair: run dataset procedure on dataset creation; with DataLad
   pair: run on dataset creation; dataset procedure

The typical workflow is to create a dataset and apply
a procedure afterwards.
However, some procedures shipped with DataLad or its extensions with a
``cfg_`` prefix can also be applied right at the creation of a dataset
with the ``-c/--cfg-proc <name>`` option in a :dlcmd:`create`
command. This is a peculiarity of these procedures because, by convention,
all of these procedures are written to not require arguments.
The command structure looks like this:

.. code-block:: console

   $ datalad create -c text2git DataLad-101

Note that the ``cfg_`` prefix of the procedures is omitted in these
calls to keep it extra simple and short. The
available procedures in this example (``cfg_yoda``, ``cfg_text2git``)
could thus be applied within a :dlcmd:`create` as

- ``datalad create -c yoda <DSname>``
- ``datalad create -c text2git <DSname>``

.. index:: dataset procedure; apply more than one configuration
.. find-out-more:: Applying multiple procedures

   If you want to apply several configurations at once, feel free to do so,
   for example like this:

   .. code-block:: console

      $ datalad create -c yoda -c text2git

.. index:: dataset procedure; apply to subdatasets
.. find-out-more:: Applying procedures in subdatasets

   Procedures can be applied in datasets on any level in the dataset hierarchy, i.e.,
   also in subdatasets. Note, though, that a subdataset will show up as being
   ``modified`` in :dlcmd:`status` *in the superdataset*
   after applying a procedure.
   This is expected, and it would also be the case with any other modification
   (saved or not) in the subdataset, as the version of the subdataset that is tracked
   in the superdataset simply changed. A :dlcmd:`save` in the superdataset
   will make sure that the version of the subdataset gets updated in the superdataset.
   The section :ref:`nesting2` will elaborate on this general principle later in the
   handbook.

As a general note, it can be useful to apply procedures
early in the life of a dataset. Procedures such
as ``cfg_yoda``, explained in detail in section :ref:`yoda`,
create files, change ``.gitattributes``, or apply other configurations.
If many other (possibly complex) configurations are
already in place, or if files of the same name as the ones created by
a procedure are already in existence, this can lead to unexpected
problems or failures, especially for naive users. Applying ``cfg_text2git``
to a default dataset in which one has saved many text files already
(as per default added to the annex) will not place the existing, saved
files into Git -- only those text files created *after* the configuration
was applied.

.. index::
   single: configuration item; datalad.locations.system-procedures
   single: configuration item; datalad.locations.user-procedures
   single: configuration item; datalad.locations.dataset-procedures
   single: configuration item; datalad.procedures.<name>.call-format
   single: configuration item; datalad.procedures.<name>.help
   single: datasets procedures; write your own
.. find-out-more:: Write your own procedures
   :name: fom-procedures
   :float:

   Procedures can come with DataLad or its extensions, but anyone can
   write their own ones in addition, and deploy them on individual machines,
   or ship them within DataLad datasets. This allows to
   automate routine configurations or tasks in a dataset, or share configurations that would otherwise not "stick" to the dataset.
   Here are some general rules for creating a custom procedure:

   - A procedure can be any executable. Executables must have the
     appropriate permissions and, in the case of a script,
     must contain an appropriate :term:`shebang`.

       - If a procedure is not executable, but its filename ends with
         ``.sh``, it is automatically executed via :term:`bash`.

   - Procedures can implement any argument handling, but must be capable
     of taking at least one positional argument (the absolute path to the
     dataset they shall operate on).

   - Custom procedures rely heavily on configurations in ``.datalad/config``
     (or the associated environment variables). Within ``.datalad/config``,
     each procedure should get an individual entry that contains at least
     a short "help" description on what the procedure does. Below is a minimal
     ``.datalad/config`` entry for a custom procedure:

     .. code-block:: ini

        [datalad "procedures.<NAME>"]
           help = This is a string to describe what the procedure does

   - By default, on GNU/Linux systems, DataLad will search for system-wide procedures
     (i.e., procedures on the *system* level) in ``/etc/xdg/datalad/procedures``,
     for user procedures (i.e., procedures on the *global* level) in ``~/.config/datalad/procedures``,
     and for dataset procedures (i.e., the *local* level [#f2]_) in ``.datalad/procedures``
     relative to a dataset root.
     Note that ``.datalad/procedures`` does not exist by default, and the ``procedures``
     directory needs to be created first.

       - Alternatively to the default locations, DataLad can be pointed to the
         location of a procedure with a configuration in ``.datalad/config``
         (or with the help of the associated :term:`environment variable`\s).
         The appropriate configuration keys for ``.datalad/config`` are either
         ``datalad.locations.system-procedures`` (for changing the *system* default),
         ``datalad.locations.user-procedures`` (for changing the *global* default),
         or ``datalad.locations.dataset-procedures`` (for changing the *local* default).
         An example ``.datalad/config`` entry for the local scope is shown below.

         .. code-block:: ini

            [datalad "locations"]
                dataset-procedures = relative/path/from/dataset-root

    - By default, DataLad will call a procedure with a standard template
      defined by a format string:

      .. code-block::

         interpreter {script} {ds} {arguments}

      where arguments can be any additional command line arguments a script
      (procedure) takes or requires. This default format string can be
      customized within ``.datalad/config`` in ``datalad.procedures.<NAME>.call-format``.
      An example ``.datalad/config`` entry with a changed call format string
      is shown below.

      .. code-block:: ini

         [datalad "procedures.<NAME>"]
            help = This is a string to describe what the procedure does
            call-format = python {script} {ds} {somearg1} {somearg2}

    - By convention, procedures should leave a dataset in a clean state.

   Therefore, in order to create a custom procedure, an executable script
   in the appropriate location is fine. Placing a script ``myprocedure``
   into ``.datalad/procedures`` will allow running
   ``datalad run-procedure myprocedure`` in your dataset, and because
   it is part of the dataset it will also allow distributing the procedure.
   Below is a toy-example for a custom procedure:

   .. runrecord:: _examples/DL-101-124-103
      :language: console
      :workdir: procs

      $ datalad create somedataset; cd somedataset

   .. runrecord:: _examples/DL-101-124-104
      :language: console
      :workdir: procs/somedataset

      $ mkdir .datalad/procedures
      $ cat << EOT > .datalad/procedures/example.py
      """A simple procedure to create a file 'example' and store
      it in Git, and a file 'example2' and annex it. The contents
      of 'example' must be defined with a positional argument."""

      import sys
      import os.path as op
      from datalad.distribution.dataset import require_dataset
      from datalad.utils import create_tree

      ds = require_dataset(
          sys.argv[1],
          check_installed=True,
          purpose='showcase an example procedure')

      # this is the content for file "example"
      content = """\
      This file was created by a custom procedure! Neat, huh?
      """

      # create a directory structure template. Write
      tmpl = {
          'somedir': {
              'example': content,
          },
          'example2': sys.argv[2] if sys.argv[2] else "got no input"
      }

      # actually create the structure in the dataset
      create_tree(ds.path, tmpl)

      # rule to store 'example' Git
      ds.repo.set_gitattributes([('example', {'annex.largefiles': 'nothing'})])

      # save the dataset modifications
      ds.save(message="Apply custom procedure")

      EOT

   .. runrecord:: _examples/DL-101-124-105
      :language: console
      :workdir: procs/somedataset

      $ datalad save -m "add custom procedure"

   At this point, the dataset contains the custom procedure ``example``.
   This is how it can be executed and what it does:

   .. runrecord:: _examples/DL-101-124-106
      :language: console
      :workdir: procs/somedataset

      $ datalad run-procedure example "this text will be in the file 'example2'"

   .. runrecord:: _examples/DL-101-124-107
      :language: console
      :workdir: procs/somedataset

      $ # the directory structure has been created
      $ tree

   .. runrecord:: _examples/DL-101-124-108
      :workdir: procs/somedataset
      :language: console

      $ # lets check out the contents in the files
      $ cat example2  && echo '' && cat somedir/example

   .. runrecord:: _examples/DL-101-124-109
      :workdir: procs/somedataset
      :language: console

      $ git config -f .datalad/config datalad.procedures.example.help "A toy example"
      $ datalad save -m "add help description"

   To find out more about a given procedure, you can ask for help:

   .. runrecord:: _examples/DL-101-124-110
      :workdir: procs/somedataset
      :language: console

      $ datalad run-procedure --help-proc example


Summing up, DataLad's :dlcmd:`run-procedure` command is a handy tool
with useful existing procedures but much flexibility for your own
DIY procedure scripts. With the information of the last three sections
you should be able to write and understand necessary configurations,
but you can also rely on existing, preconfigured templates in the
form of procedures, and even write and distribute your own.

Therefore, envision procedures as
helper-tools that can minimize technical complexities
in a dataset -- users can concentrate on the actual task while
the dataset is set-up, structured, processed, or configured automatically
with the help of a procedure.
Especially in the case of trainees and new users, applying procedures
instead of doing relevant routines "by hand" can help to ease
working with the dataset. Other than by users, procedures can also be triggered to automatically
run after any command execution if a command results matches a specific
requirement.

Finally, make a note about running procedures inside of ``notes.txt``:

.. runrecord:: _examples/DL-101-124-111
   :language: console
   :workdir: dl-101/DataLad-101

   $ cat << EOT >> notes.txt
   It can be useful to use pre-configured procedures that can apply
   configurations, create files or file hierarchies, or perform arbitrary
   tasks in datasets. They can be shipped with DataLad, its extensions,
   or datasets, and you can even write your own procedures and distribute
   them.
   The "datalad run-procedure" command is used to apply such a procedure
   to a dataset. Procedures shipped with DataLad or its extensions
   starting with a "cfg" prefix can also be applied at the creation of a
   dataset with "datalad create -c <PROC-NAME> <PATH>" (omitting the
   "cfg" prefix).

   EOT

.. runrecord:: _examples/DL-101-124-112
   :workdir: dl-101/DataLad-101
   :language: console

   $ datalad save -m "add note on DataLad's procedures"


.. only:: adminmode

    Add a tag at the section end.

      .. runrecord:: _examples/DL-101-124-112
         :language: console
         :workdir: dl-101/DataLad-101

         $ git branch sct_configurations_to_go


.. rubric:: Footnotes

.. [#f1] In theory, because procedures can exist on different levels, and
         because anyone can create (and thus name) their own procedures, there
         can be name conflicts. The order of precedence in such cases is:
         user-level, system-level, dataset, DataLad extension, DataLad, i.e.,
         local procedures take precedence over those coming from "outside" via
         datasets or DataLad extensions.
         If procedures in a higher-level dataset and a subdataset have the same
         name, the procedure closer to the dataset ``run-procedure`` is
         operating on takes precedence.

.. [#f2] Note that we simplify the level of procedures that exist within a dataset
         by calling them *local*. Even though they apply to a dataset just as *local*
         Git configurations, unlike Git's *local* configurations in ``.git/config``,
         the procedures and procedure configurations in ``.datalad/config`` are committed
         and can be shared together with a dataset. The procedure level *local* therefore
         does not exactly corresponds to the *local* scope in the sense that Git uses it.