
.. _copyfile:

Subsample datasets using datalad copy-file
------------------------------------------

If you need a dataset that contains only a subset of the files of one or more other datasets, it can be helpful to create subsampled, special-purpose datasets with the :dlcmd:`copy-file` command.
This command is capable of transferring files from different datasets or locations outside of a dataset into a new dataset, unlocking them if necessary, and preserving and copying their availability information.
As such, the command is a superior, albeit more technical alternative to :ref:`copying dereferenced files out of datasets <copydata>`.

This section demonstrates the command based on a published dataset, a subset of the Human Connectome Project dataset that is subsampled for structural connectivity analysis.
This dataset can be found on GitHub at `github.com/datalad-datasets/hcp-structural-connectivity <https://github.com/datalad-datasets/hcp-structural-connectivity>`_.

Copy-file in action with the HCP dataset
""""""""""""""""""""""""""""""""""""""""

Consider a real-life example: A large number of scientists use the `human connectome project (HCP) dataset <https://github.com/datalad-datasets/human-connectome-project-openaccess>`_ for `structural connectivity analyses <https://en.wikipedia.org/wiki/Brain_connectivity_estimators>`_.
This dataset contains data from more than 1000 subjects, and exceeds 80 million files.
As such, as explained in more detail in the chapter :ref:`chapter_gobig`, it is split up into a hierarchy of roughly 4500 subdatasets [#f1]_.
The installation of all subdatasets takes around 90 minutes, if parallelized, and a complete night if performed serially.
However, for a structural connectivity analysis, only eleven files per subject are relevant::

  - <sub>/T1w/Diffusion/nodif_brain_mask.nii.gz
  - <sub>/T1w/Diffusion/bvecs
  - <sub>/T1w/Diffusion/bvals
  - <sub>/T1w/Diffusion/data.nii.gz
  - <sub>/T1w/Diffusion/grad_dev.nii.gz
  - <sub>/unprocessed/3T/T1w_MPR1/*_3T_BIAS_32CH.nii.gz
  - <sub>/unprocessed/3T/T1w_MPR1/*_3T_AFI.nii.gz
  - <sub>/unprocessed/3T/T1w_MPR1/*_3T_BIAS_BC.nii.gz
  - <sub>/unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Magnitude.nii.gz
  - <sub>/unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Phase.nii.gz
  - <sub>/unprocessed/3T/T1w_MPR1/*_3T_T1w_MPR1.nii.gz

In order to spare others the time and effort to install thousands of subdatasets, a one-time effort can create and publish a subsampled, single dataset of those files using the :dlcmd:`copy-file` command.

.. index::
   pair: copy-file; DataLad command

:dlcmd:`copy-file` is able to copy files with their availability metadata into other datasets.
The content of the files does not need to be retrieved in order to do this.
Because the subset of relevant files is small, all structural connectivity related files can be copied into a single dataset.
This speeds up the installation time significantly, and reduces the confusion that the concept of subdatasets can bring to DataLad novices.
The result is a dataset with a subset of files (following the original directory structure of the HCP dataset), created reproducibly with complete provenance capture.
Access to the files inside of the subsampled dataset works via valid AWS credentials just as it does for the full dataset [#f1]_.

The Basics of copy-file
^^^^^^^^^^^^^^^^^^^^^^^

This short demonstration gives an overview of the functionality of :dlcmd:`copy-file`. Feel free to follow along by copy-pasting the commands into your terminal.
Let's start by cloning a dataset to work with:

.. runrecord:: _examples/DL-101-149-01
   :language: console
   :workdir: beyond_basics/HPC

   $ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess.git hcp

In order to use :dlcmd:`copy-file`, we need to install a few subdatasets, and thus install 9 subject subdatasets recursively.
Note that we don't retrieve any data, using ``-n``/``--no-data``.
(The output of this command is omitted -- it is quite lengthy as 36 subdatasets are being installed.)

.. runrecord:: _examples/DL-101-149-02
   :language: console
   :workdir: beyond_basics/HPC
   :lines: 1-3

   $ cd hcp
   $ datalad get -n -r HCP1200/130*

Afterwards, we can create a new dataset to copy any files into.
This dataset will later hold the relevant subset of the data in the HCP dataset.

.. runrecord:: _examples/DL-101-149-03
   :language: console
   :workdir: beyond_basics/HPC/hcp

   $ cd ..
   $ datalad create dataset-to-copy-to

With the prerequisites set up, we can start to copy files.
The command :dlcmd:`copy-file` works as follows:
By providing a path to a file to be copied (which can be annex'ed, not annex'ed, or not version-controlled at all) and either a second path (the destination path), a target directory inside of a dataset, or a dataset specification, :dlcmd:`copy-file` copies the file and all of its availability metadata into the specified dataset.
Let's copy a single file (``hcp/HCP1200/130013/T1w/Diffusion/bvals``) from the ``hcp`` dataset into ``dataset-to-copy-to``:

.. runrecord:: _examples/DL-101-149-04
   :language: console
   :workdir: beyond_basics/HPC

   $ datalad copy-file \
      hcp/HCP1200/130013/T1w/Diffusion/bvals  \
      -d dataset-to-copy-to

When the ``-d/--dataset`` argument is provided instead of a target directory or a destination path, the copied file will be `saved` in the new dataset.
If a target directory or a destination path is given for a file, however, the copied file will not be saved:

.. runrecord:: _examples/DL-101-149-05
   :language: console
   :workdir: beyond_basics/HPC

   $ datalad copy-file \
      hcp/HCP1200/130013/T1w/Diffusion/bvecs \
      -t dataset-to-copy-to

Note how, instead of as a dataset, we specify ``dataset-to-copy-to`` as a target directory, and how the file is added, but not saved afterwards:

.. runrecord:: _examples/DL-101-149-06
   :language: console
   :workdir: beyond_basics/HPC

   $ cd dataset-to-copy-to
   $ datalad status

Providing a second path as a `destination` path allows one to copy the file under a different name, but it will also not save the new file in the destination dataset unless ``-d/--dataset`` is specified as well:

.. runrecord:: _examples/DL-101-149-07
   :language: console
   :workdir: beyond_basics/HPC

   $ datalad copy-file \
      hcp/HCP1200/130013/T1w/Diffusion/bvecs \
      dataset-to-copy-to/anothercopyofbvecs

.. runrecord:: _examples/DL-101-149-08
   :language: console
   :workdir: beyond_basics/HPC

   $ cd dataset-to-copy-to
   $ datalad status

Those were the minimal basics of the command syntax - the original location, a specification where the file should be copied to, and an indication if the file should be saved or not.
Let's save those two unsaved files:

.. runrecord:: _examples/DL-101-149-09
   :language: console
   :workdir: beyond_basics/HPC/dataset-to-copy-to

   $ datalad save

With the ``-r/--recursive`` flag enabled, the command can copy complete *subdirectory* (not subdataset!) hierarchies. Let's copy a complete directory and save it in its target dataset:

.. runrecord:: _examples/DL-101-149-10
   :language: console
   :workdir: beyond_basics/HPC/hcp

   $ cd ..
   $ datalad copy-file hcp/HCP1200/130114/T1w/Diffusion/* \
    -r \
    -d dataset-to-copy-to \
    -t dataset-to-copy-to/130114/T1w/Diffusion

Here is what the dataset that we copied files into looks like at the moment:

.. runrecord:: _examples/DL-101-149-11
   :language: console
   :workdir: beyond_basics/HPC

   $ tree dataset-to-copy-to

Importantly, none of the copied files had their contents retrieved yet.
The copy-file process, however, also copied the files' availability metadata to their new location.
Retrieving file contents works just as it would in the full HCP dataset via :dlcmd:`get` (the authentication step is omitted in the output below):

.. runrecord:: _examples/DL-101-149-12
   :language: console
   :workdir: beyond_basics/HPC

   $ cd dataset-to-copy-to
   $ datalad get bvals anothercopyofbvecs 130114/T1w/Diffusion/eddylogs/eddy_unwarped_images.eddy_parameters

What's especially helpful for automation of this operation is that :dlcmd:`copy-file` can take source and (optionally) destination paths from a file or from :term:`stdin` with the option ``--specs-from <source>``.
In the case of specifications from a file, ``<source>`` is a path to this file.
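
As a minimal sketch of the file-based variant, the snippet below collects source paths with ``find`` and writes them line by line into a specification file.
The directory tree, the file names, and the name ``specs.txt`` are made up for illustration so that the snippet runs without the HCP dataset.
The final :dlcmd:`copy-file` call is only shown as a comment, because it assumes a target dataset at ``../dataset-to-copy-to``.

.. code-block:: bash

   # Hypothetical stand-in for a small part of the HCP directory tree
   workdir=$(mktemp -d)
   cd "$workdir" || exit 1
   mkdir -p HCP1200/130013/T1w
   touch HCP1200/130013/T1w/T1w_acpc_dc.nii.gz \
         HCP1200/130013/T1w/T1w_acpc_dc_restore.nii.gz \
         HCP1200/130013/T1w/BiasField_acpc_dc.nii.gz

   # One source path per line; quoting the pattern keeps the shell
   # from expanding it before find sees it
   find HCP1200/130013/T1w -maxdepth 1 -name 'T1w*.nii.gz' | sort > specs.txt
   cat specs.txt

   # The file could then be handed to copy-file (not run here):
   # datalad copy-file -d ../dataset-to-copy-to --specs-from specs.txt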

In order to use ``stdin`` for specification, such as the output of a ``find`` command that is piped into :dlcmd:`copy-file` with a `Unix pipe (|) <https://en.wikipedia.org/wiki/Pipeline_(Unix)>`_, ``<source>`` needs to be a dash (``-``). Below is an example ``find`` command:

.. runrecord:: _examples/DL-101-149-13
   :language: console
   :workdir: beyond_basics/HPC

   $ cd hcp
   $ find HCP1200/130013/T1w/ -maxdepth 1 -name 'T1w*.nii.gz'

This uses ``find`` to get a list of all files matching the specified pattern in the specified directory.
And here is how the outputted paths can be given as source paths to :dlcmd:`copy-file`, copying all of the found files into a new dataset:

.. runrecord:: _examples/DL-101-149-14
   :language: console
   :workdir: beyond_basics/HPC/hcp

   # inside of hcp
   $ find HCP1200/130013/T1w/ -maxdepth 1 -name 'T1w*.nii.gz' \
     | datalad copy-file -d ../dataset-to-copy-to --specs-from -

To preserve the directory structure, a target directory (``-t ../dataset-to-copy-to/130013/T1w/``) or a destination path could be given, because the above command copied all files into the root of ``dataset-to-copy-to``:

.. runrecord:: _examples/DL-101-149-15
   :language: console
   :workdir: beyond_basics/HPC/hcp

   $ ls ../dataset-to-copy-to

With this trick, you can use simple search commands to assemble a list of files as a ``<source>`` for :dlcmd:`copy-file`: simply create a file or a command like ``find`` that specifies the relevant files or directories line-wise.
``--specs-from`` can take information on both ``<source>`` and ``<destination>``, though.


Specify files with source AND destination paths for --specs-from
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Specifying source *and* destination paths comes with a twist: Source and destination paths need to go into the same line, but need to be separated by a `nullbyte <https://en.wikipedia.org/wiki/Null_character>`_.
This is not a straightforward concept, but trying it out and seeing it in action will help.

One way it can be done is by using the stream editor :term:`sed`.
Here is how to pipe source AND destination paths into :dlcmd:`copy-file`:

.. code-block:: bash

   $ find HCP1200/130518/T1w/ -maxdepth 1 -name 'T1w*.nii.gz' \
     | sed -e 's#\(HCP1200\)\(.*\)#\1\2\x0../dataset-to-copy-to\2#' \
     | datalad copy-file -d ../dataset-to-copy-to -r --specs-from -

As always, the regular expressions used for sed are a bit hard to grasp upon first sight.
Here is what this command does:

- In general, :term:`sed`\'s :command:`s` (substitute) command will take a string specified between the first and second ``#`` (``\(HCP1200\)\(.*\)``) and replace it with what is between the second and third ``#`` (``\1\2\x0../dataset-to-copy-to\2``).
- The first part splits the paths ``find`` returns (such as ``HCP1200/130518/T1w/T1w_acpc_dc.nii.gz``) into two groups:

   - The start of the path (``HCP1200``), and
   - the remaining path (``/130518/T1w/T1w_acpc_dc.nii.gz``).

- The second part then prints the first and the second group (``\1\2``, the source path), a nullbyte (``\x0``), and a relative path to the destination dataset together with the second group only (``../dataset-to-copy-to\2``, the destination path).
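
The effect of the two groups can be checked on a single path without touching any dataset.
The sketch below feeds the example path from the list above to the same expression, but writes a visible ``|`` in place of the nullbyte, purely so that the result can be read on screen.
This separator is an illustration choice only; it is not a separator that :dlcmd:`copy-file` would understand.

.. code-block:: bash

   # Example path as returned by find
   src='HCP1200/130518/T1w/T1w_acpc_dc.nii.gz'

   # Same groups as above; "|" stands in for the nullbyte to make it visible
   spec=$(printf '%s\n' "$src" \
     | sed -e 's#\(HCP1200\)\(.*\)#\1\2|../dataset-to-copy-to\2#')

   # Split the line at the marker into its two halves
   printf 'source:      %s\n' "${spec%%|*}"
   printf 'destination: %s\n' "${spec#*|}"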

Here is what the output of ``find`` piped into ``sed`` looks like:

.. runrecord:: _examples/DL-101-149-16
   :language: console
   :workdir: beyond_basics/HPC/hcp

   $ find HCP1200/130518/T1w -maxdepth 1 -name 'T1w*.nii.gz' \
     | sed -e 's#\(HCP1200\)\(.*\)#\1\2\x0../dataset-to-copy-to\2#'

Note how the nullbyte is not visible to the naked eye in the output.
To visualize it, you could redirect this output into a file and open it with an editor like :term:`vim`.
Let's now see a :dlcmd:`copy-file` from :term:`stdin` in action:

.. runrecord:: _examples/DL-101-149-17
   :language: console
   :workdir: beyond_basics/HPC/hcp

   $ find HCP1200/130518/T1w -maxdepth 1 -name 'T1w*.nii.gz' \
    | sed -e 's#\(HCP1200\)\(.*\)#\1\2\x0../dataset-to-copy-to\2#' \
    | datalad copy-file -d ../dataset-to-copy-to -r --specs-from -

Done!
A complex-looking command with regular expressions and Unix pipes, but it does powerful things in only a single line.
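
Because the nullbyte is invisible in a terminal, it can help to inspect a specification stream before it is passed on.
The following sketch builds the same kind of lines with ``printf`` instead of the ``\x0`` escape in ``sed``, which may not be available in every ``sed`` implementation, and then uses ``tr`` to show each nullbyte as a tab.
The helper name ``make_specs`` and the two example paths are made up for illustration.

.. code-block:: bash

   # Hypothetical helper: read source paths from stdin and emit
   # "<source>\0<destination>" lines as expected by --specs-from
   make_specs () {
     while IFS= read -r src; do
       # strip the leading "HCP1200" and prepend the destination dataset
       printf '%s\0%s\n' "$src" "../dataset-to-copy-to${src#HCP1200}"
     done
   }

   paths='HCP1200/130518/T1w/T1w_acpc_dc.nii.gz
   HCP1200/130518/T1w/T1w_acpc_dc_restore.nii.gz'

   # Replace each nullbyte with a tab to make the separator visible
   visible=$(printf '%s\n' "$paths" | sed -e 's/^ *//' | make_specs | tr '\0' '\t')
   printf '%s\n' "$visible"

   # Count the nullbytes in the raw stream: there should be one per line
   nulls=$(printf '%s\n' "$paths" | sed -e 's/^ *//' | make_specs \
     | tr -cd '\0' | wc -c | tr -d ' ')
   echo "nullbytes: $nulls"

In a real pipeline, the output of such a helper would be piped into ``datalad copy-file`` with ``--specs-from -`` in place of the ``sed`` step.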

Copying reproducibly
^^^^^^^^^^^^^^^^^^^^

To capture the provenance of subsampled dataset creation, the :dlcmd:`copy-file` command can be wrapped into a :dlcmd:`run` call.
Here is a sketch of how it was done for the structural connectivity dataset:

**Step 1:** Create a dataset

.. code-block:: bash

   $ datalad create hcp-structural-connectivity

**Step 2:** Install the full dataset as a subdataset

.. code-block:: bash

   $ datalad clone -d . \
     https://github.com/datalad-datasets/human-connectome-project-openaccess.git \
     .hcp

**Step 3:** Install all subdatasets of the full dataset with ``datalad get -n -r``

**Step 4:** Inside of the new dataset, draft a ``find`` command that returns all 11 desired files, and a subsequent ``sed`` substitution command that returns nullbyte-separated source and destination paths.
For this subsampled dataset, this one would work::

   $ find .hcp/HCP1200  -maxdepth 5 -path '*/unprocessed/3T/T1w_MPR1/*' -name '*' \
    -o -path '*/T1w/Diffusion/*' -name 'b*' \
    -o -path '*/T1w/Diffusion/*' -name '*.nii.gz' \
    | sed -e 's#\(\.hcp/HCP1200\)\(.*\)#\1\2\x00.\2#'
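
Before running such an expression on thousands of subdatasets, it can be tried on a small mock directory tree.
The sketch below uses an invented subject ID and invented file names that are only loosely modeled on the HCP layout, and lists which paths the ``find`` expression selects.
It assumes a ``find`` implementation that supports ``-maxdepth``.

.. code-block:: bash

   # Hypothetical miniature of the HCP layout for one made-up subject
   root=$(mktemp -d)
   cd "$root" || exit 1
   sub=.hcp/HCP1200/100000
   mkdir -p "$sub/T1w/Diffusion" \
            "$sub/unprocessed/3T/T1w_MPR1" \
            "$sub/unprocessed/3T/other"
   touch "$sub/T1w/T1w_acpc_dc.nii.gz" \
         "$sub/T1w/Diffusion/bvals" \
         "$sub/T1w/Diffusion/bvecs" \
         "$sub/T1w/Diffusion/data.nii.gz" \
         "$sub/unprocessed/3T/T1w_MPR1/100000_3T_AFI.nii.gz" \
         "$sub/unprocessed/3T/other/100000_3T_other.nii.gz"

   # Same expression as above; sort only makes the listing stable
   selected=$(find .hcp/HCP1200 -maxdepth 5 \
       -path '*/unprocessed/3T/T1w_MPR1/*' -name '*' \
       -o -path '*/T1w/Diffusion/*' -name 'b*' \
       -o -path '*/T1w/Diffusion/*' -name '*.nii.gz' | sort)
   printf '%s\n' "$selected"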

**Step 5:** Pipe the results into :dlcmd:`copy-file`, and wrap everything into a :dlcmd:`run`.
Note that ``-d/--dataset`` is not specified for :dlcmd:`copy-file` -- this way, :dlcmd:`run` will save everything in one go at the end.

.. code-block:: bash

   $ datalad run \
     -m "Assemble HCP dataset subset for structural connectivity data. \

        Specifically, these are the files:

        - T1w/Diffusion/nodif_brain_mask.nii.gz
        - T1w/Diffusion/bvecs
        - T1w/Diffusion/bvals
        - T1w/Diffusion/data.nii.gz
        - T1w/Diffusion/grad_dev.nii.gz
        - unprocessed/3T/T1w_MPR1/*_3T_BIAS_32CH.nii.gz
        - unprocessed/3T/T1w_MPR1/*_3T_AFI.nii.gz
        - unprocessed/3T/T1w_MPR1/*_3T_BIAS_BC.nii.gz
        - unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Magnitude.nii.gz
        - unprocessed/3T/T1w_MPR1/*_3T_FieldMap_Phase.nii.gz
        - unprocessed/3T/T1w_MPR1/*_3T_T1w_MPR1.nii.gz

        for each participant. The structure of the directory tree and file names
        are kept identical to the full HCP dataset." \
     "find .hcp/HCP1200  -maxdepth 5 -path '*/unprocessed/3T/T1w_MPR1/*' -name '*' \
       -o -path '*/T1w/Diffusion/*' -name 'b*' \
       -o -path '*/T1w/Diffusion/*' -name '*.nii.gz' \
     | sed -e 's#\(\.hcp/HCP1200\)\(.*\)#\1\2\x00.\2#' \
     | datalad copy-file -r --specs-from -"

**Step 6:** Publish the dataset to :term:`GitHub` or similar hosting services to allow others to clone it easily and get fast access to a relevant subset of files.

Afterwards, the slimmed-down structural connectivity dataset can be installed completely within seconds.
Because of the reduced number of files it contains, it is easier to transform the data into BIDS format.
Such a conversion can be done on a different :term:`branch` of the dataset.
If you have published your subsampled dataset into a RIA store, as was done with this specific subset, a single command can clone a BIDS-ified, slimmed-down HCP dataset for structural connectivity analyses, because RIA stores allow cloning datasets in specific versions (identified, for example, by a branch or tag)::

   $ datalad clone ria+https://store.datalad.org#~hcp-structural-connectivity@bids

Summary
"""""""

:dlcmd:`copy-file` is a useful command to create datasets from content of other datasets.
Although it requires some Unix-y command line magic, it can be automated for larger tasks, and, when combined with a :dlcmd:`run`, produce suitable provenance records of where files have been copied from.


.. rubric:: Footnotes

.. [#f1] You can read about the human connectome dataset in the use case :ref:`usecase_HCP_dataset`.