
.. index::
   pair: dataset nesting; DataLad concept
.. _nesting2:

More on dataset nesting
^^^^^^^^^^^^^^^^^^^^^^^

You may have noticed that working in the subdataset felt like
working in an independent dataset -- there was no information or influence at
all from the top-level ``DataLad-101`` superdataset, and you built up a
completely stand-alone history:

.. runrecord:: _examples/DL-101-132-101
   :language: console
   :workdir: dl-101/DataLad-101/midterm_project

   $ git log --oneline

In principle, this is not news to you. From section :ref:`nesting` and the
YODA principles you already know that nesting allows for a modular reuse of
any other DataLad dataset, and that this reuse is possible and simple
precisely because all of the information is kept within a (sub)dataset.

What is new now, however, is that you applied changes to the dataset. While
you already explored the look and feel of the ``longnow`` subdataset in
previous sections, you now *modified* the contents of the ``midterm_project``
subdataset.
How does this influence the superdataset, and what does this look like in the
superdataset's history? You know from section :ref:`nesting` that the
superdataset only stores the *state* of the subdataset. Upon creation of the
dataset, the very first, initial state of the subdataset was thus recorded in
the superdataset. But now, after you finished your project, your subdataset
has evolved. Let's ask the superdataset what it thinks about this.

.. runrecord:: _examples/DL-101-132-102
   :language: console
   :workdir: dl-101/DataLad-101/midterm_project

   $ # move into the superdataset
   $ cd ../
   $ datalad status

From the superdataset's perspective, the subdataset appears as being
"modified". Note how it is not individual files that show up as "modified", but
indeed the complete subdataset as a single entity.

What this shows you is that the modifications you performed in the subdataset are not
automatically recorded in the superdataset. This makes sense: after all, it
should be up to you to decide whether you want to record something or not.
But it is worth repeating: If you modify a subdataset, you will need to save
this *in the superdataset* in order to have a clean superdataset status.

Let's save the modification of the subdataset into the history of the
superdataset. For this, to avoid confusion, you can specify explicitly to
which dataset you want to save a modification. ``-d .`` specifies the current
dataset, i.e., ``DataLad-101``, as the dataset to save to:

.. runrecord:: _examples/DL-101-132-103
   :language: console
   :workdir: dl-101/DataLad-101/

   $ datalad save -d . -m "finished my midterm project" midterm_project

.. index::
   pair: save modification in nested dataset; with DataLad
.. find-out-more:: More on how 'datalad save' can operate on nested datasets

   In a superdataset with subdatasets, :dlcmd:`save` by default
   tries to figure out on its own which of the available datasets' histories
   a change should be written to. However, being very explicit in the command
   call and telling DataLad where to save which modifications can reduce
   confusion and enable specific operations.

   If you want to save the current state of the subdataset into the superdataset
   (as necessary here), start a ``save`` from the superdataset and have the
   ``-d/--dataset`` option point to its root:

   .. code-block:: console

      $ # in the root of the superds
      $ datalad save -d . -m "update subdataset"

   If you are in the superdataset and want to save an unsaved modification
   in a subdataset to the *subdataset's* history, let ``-d/--dataset`` point to
   the subdataset:

   .. code-block:: console

      $ # in the superds
      $ datalad save -d path/to/subds -m "modified XY"

   The recursive option allows you to save any content underneath the specified
   directory, and recurse into any potential subdatasets:

   .. code-block:: console

      $ datalad save . --recursive

Let's check which subproject commit is now recorded in the superdataset:

.. runrecord:: _examples/DL-101-132-104
   :language: console
   :workdir: dl-101/DataLad-101/
   :emphasize-lines: 14

   $ git log -p -n 1

As you can see in the log entry, the subproject commit changed from the
first commit hash in the subdataset history to the most recent one. With this
change, therefore, your superdataset tracks the most recent version of
the ``midterm_project`` dataset, and your dataset's status is clean again.
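
The recorded "subproject commit" can also be inspected with plain Git. The
following is a minimal sketch rather than part of the DataLad-101 workflow: it
assumes only that Git is installed, and it builds two throwaway repositories
with the hypothetical names ``super`` and ``sub`` in a temporary directory
instead of touching your dataset. It illustrates that the record in the outer
repository only moves when you explicitly save there:

.. code-block:: bash

   #!/usr/bin/env bash
   # Throwaway demonstration; "super" and "sub" are hypothetical names.
   set -eu
   workdir=$(mktemp -d)
   cd "$workdir"
   # identity for the demo commits, independent of any global Git configuration
   export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
   export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

   git init -q super
   git init -q super/sub
   git -C super/sub commit -q --allow-empty -m "initial state of sub"

   # record the nested repository's current commit in the outer repository
   # (Git prints a hint about embedded repositories on stderr, silenced here)
   git -C super add sub 2>/dev/null
   git -C super commit -q -m "record initial state of sub"
   first_recorded=$(git -C super rev-parse HEAD:sub)

   # the nested repository evolves; the outer record stays where it was
   git -C super/sub commit -q --allow-empty -m "more work in sub"
   sub_head=$(git -C super/sub rev-parse HEAD)
   stale_recorded=$(git -C super rev-parse HEAD:sub)

   # saving in the outer repository moves the record to the new commit
   git -C super add sub
   git -C super commit -q -m "update recorded state of sub"
   recorded=$(git -C super rev-parse HEAD:sub)

   echo "recorded at first:      $first_recorded"
   echo "recorded before update: $stale_recorded"
   echo "recorded after update:  $recorded"
   echo "HEAD of sub:            $sub_head"

In ``DataLad-101``, the corresponding check would be to compare the output of
``git rev-parse HEAD:midterm_project`` in the superdataset with the output of
``git -C midterm_project rev-parse HEAD``. After the ``datalad save`` above,
both should print the same commit hash.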


This point in DataLad-101 is a convenient moment to dive a bit deeper
into the functions of the :dlcmd:`status` command. If you are
interested in this, check out the :ref:`dedicated Findoutmore <fom-status>`.

.. index::
   pair: status; DataLad command
   pair: check dataset for modification; with DataLad
.. find-out-more:: More on 'datalad status'
   :name: fom-status
   :float:

   First of all, let's start with a quick overview of the different content *types*
   and content *states* that the various :dlcmd:`status` commands have reported
   so far in DataLad-101.
   You have seen the following *content types*:

   - ``file``, e.g., ``notes.txt``: any file (or symlink that is a placeholder to an annexed file)
   - ``directory``, e.g., ``books``: any directory that does not qualify for the ``dataset`` type
   - ``symlink``, e.g., the ``.jpg`` that was manually unlocked in section :ref:`run3`:
     any symlink that is not used as a placeholder for an annexed file
   - ``dataset``, e.g., the ``midterm_project``: any top-level dataset, or any subdataset
     that is properly registered in the superdataset

   And you have seen the following *content states*: ``modified`` and ``untracked``.
   The section :ref:`file system` will show you many instances of the ``deleted``
   content state as well.

   But beyond understanding the report of :dlcmd:`status`, there is also
   additional functionality:
   :dlcmd:`status` can handle status reports for a whole hierarchy
   of datasets, and it can report on a subset of the content across any number of
   datasets in this hierarchy by providing selected paths. This is useful as soon
   as datasets become more complex and contain subdatasets with changing contents.

   When performed without any arguments, :dlcmd:`status` will report
   the state of the current dataset. However, you can specify a path to any
   sub- or superdataset with the ``--dataset`` option.
   In order to demonstrate this a bit better, we will make sure that not only the
   state of the subdataset *within* the superdataset is modified, but also that the
   subdataset contains a modification. For this, let's add an empty text file into
   the ``midterm_project`` subdataset:

   .. runrecord:: _examples/DL-101-132-105
      :language: console
      :workdir: dl-101/DataLad-101

      $ touch midterm_project/an_empty_file

   If you are in the root of ``DataLad-101``, but interested in the status
   *within* the subdataset, simply provide a path (relative to your current location)
   to the command:

   .. runrecord:: _examples/DL-101-132-106
      :language: console
      :workdir: dl-101/DataLad-101

      $ datalad status midterm_project

   Alternatively, to achieve the same, specify the superdataset as the ``--dataset``
   and provide a path to the subdataset *with a trailing path separator* like
   this:

   .. runrecord:: _examples/DL-101-132-107
      :language: console
      :workdir: dl-101/DataLad-101

      $ datalad status -d . midterm_project/

   Note that both of these commands return only the ``untracked`` file and
   not the ``modified`` subdataset because we're explicitly querying only the
   subdataset for its status.
   If, however, you want to know about the subdataset record in the superdataset
   (as done outside of this Find-out-more) without causing a status query for
   the state *within* the subdataset itself, you can also provide an explicit
   path to the dataset (without a trailing path separator). This can be used
   to select a specific subdataset in a dataset with many subdatasets:

   .. runrecord:: _examples/DL-101-132-108
      :language: console
      :workdir: dl-101/DataLad-101

      $ datalad status -d . midterm_project


   But if you are interested in both the state within the subdataset, and
   the state of the subdataset within the superdataset, you can combine the
   two paths:

   .. runrecord:: _examples/DL-101-132-109
      :language: console
      :workdir: dl-101/DataLad-101

      $ datalad status -d . midterm_project midterm_project/

   Finally, if these subtle differences in the paths are not easy to memorize,
   the ``-r/--recursive`` option will also report both status aspects:

   .. runrecord:: _examples/DL-101-132-110
      :language: console
      :workdir: dl-101/DataLad-101

      $ datalad status --recursive

   Importantly, the regular output of a :dlcmd:`status` command on the command line is "condensed" to the most important information by a tailored result renderer.
   You can, however, also get the unfiltered, full output of ``status`` by switching the ``-f``/``--output-format`` from ``tailored`` (the default) to ``json`` or, for the same information as ``json`` but better readability, ``json_pp``:

   .. runrecord:: _examples/DL-101-132-111
      :language: console
      :workdir: dl-101/DataLad-101

      $ datalad -f json_pp status -d . midterm_project

   This is still not all of the available functionality of the
   :dlcmd:`status` command. You could, for example, adjust whether and
   how untracked dataset content should be reported with the ``--untracked``
   option, or get additional information from annexed content with the ``--annex``
   option (especially powerful when combined with ``-f json_pp``). To get a
   complete overview of what you could do, check out the technical
   documentation of :dlcmd:`status` `here <https://docs.datalad.org/en/latest/generated/man/datalad-status.html>`_.
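
   As a rough sketch of how these two options might be used: the calls below
   assume the option values ``no`` for ``--untracked`` and ``basic`` for
   ``--annex``, as listed in the technical documentation of your installed
   DataLad version. Their output is not shown here because it depends on the
   state of your dataset:

   .. code-block:: console

      $ # do not report untracked content, such as the empty file added above
      $ datalad status --untracked no midterm_project/
      $ # request additional annex information for annexed files in the subdataset
      $ datalad -f json_pp status --annex basic midterm_project/

   If one of these values is not accepted by your version, ``datalad status --help``
   lists the available choices.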

   Before we leave this Find-out-more, let's undo the modification of the subdataset
   by removing the untracked file:

   .. runrecord:: _examples/DL-101-132-112
      :language: console
      :workdir: dl-101/DataLad-101

      $ rm midterm_project/an_empty_file
      $ datalad status --recursive

.. only:: adminmode

    Add a branch as a marker at the section end.

      .. runrecord:: _examples/DL-101-132-113
         :language: console
         :workdir: dl-101/DataLad-101

         $ git branch sct_more_on_dataset_nesting