This fixes the usage of contraction it's (it is / it has) and possessive its, as far as I could grep.
277 lines
10 KiB
ReStructuredText
277 lines
10 KiB
ReStructuredText
.. _gists:
|
|
|
|
Gists
|
|
=====
|
|
|
|
|
|
The more complex and larger your DataLad project, the more difficult it is to do
|
|
efficient housekeeping.
|
|
This section is a selection of code snippets tuned to perform specific,
|
|
non-trivial tasks in datasets. Often, they are not limited to single commands of
|
|
the version control tools you know, but combine helpful other command line
|
|
tools and general Unix command line magic. Just like
|
|
`GitHub gists <https://gist.github.com>`_, it's a collection of lightweight
|
|
and easily accessible tips and tricks. For a more basic command overview,
|
|
take a look at the :ref:`cheat`. The
|
|
`tips collection of git-annex <https://git-annex.branchable.com/tips>`_ is also
|
|
a very valuable resource.
|
|
|
|
.. image:: ../artwork/src/gists.svg
|
|
:width: 50%
|
|
:align: center
|
|
|
|
|
|
.. _parallelize:
|
|
|
|
Parallelize subdataset processing
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
DataLad cannot yet parallelize processes that are performed
|
|
independently over a large number of subdatasets. Pushing across a dataset
|
|
hierarchy, for example, is performed one after the other.
|
|
Unix however, has a few tools such as `xargs <https://en.wikipedia.org/wiki/Xargs>`_
|
|
or the ``parallel`` tool of `moreutils <https://joeyh.name/code/moreutils>`_
|
|
that can assist.
|
|
|
|
Here is an example of pushing all subdatasets (and their respective subdatasets)
|
|
recursively to their (identically named) siblings:
|
|
|
|
.. code-block:: console
|
|
|
|
$ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad push -r --to <sibling-name> -d
|
|
|
|
``datalad -f '{path}' subdatasets`` discovers the paths of all subdatasets,
|
|
and ``xargs`` hands them individually (``-n 1``) to a (recursive) :dlcmd:`push`,
|
|
but performs 10 of these operations in parallel (``-P 10``), thus achieving
|
|
parallelization.
|
|
|
|
Here is an example of cross-dataset download parallelization:
|
|
|
|
.. code-block:: console
|
|
|
|
$ datalad -f '{path}' subdatasets | xargs -n 1 -P 10 datalad get -d
|
|
|
|
Operations like this can safely be attempted for all commands that are independent
|
|
across subdatasets.
|
|
|
|
Check whether all file content is present locally
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
In order to check if all the files in a dataset have their file contents locally
|
|
available, you can ask git-annex:
|
|
|
|
.. code-block:: console
|
|
|
|
$ git annex find --not --in=here
|
|
|
|
Any file that does not have its contents locally available will be listed.
|
|
If there are subdatasets you want to recurse into, use the following command
|
|
|
|
.. code-block:: console
|
|
|
|
$ git submodule foreach --quiet --recursive \
|
|
'git annex find --not --in=here --format=$displaypath/$\\{file\\}\\n'
|
|
|
|
Alternatively, to get very comprehensive output, you can use
|
|
|
|
.. code-block:: console
|
|
|
|
$ datalad -f json status --recursive --annex availability
|
|
|
|
The output will be returned as json, and the key ``has_content`` indicates local
|
|
content availability (``true`` or ``false``). To filter through it, the command
|
|
line tool `jq <https://stedolan.github.io/jq>`_ works well:
|
|
|
|
.. code-block:: console
|
|
|
|
$ datalad -f json status --recursive --annex all | jq '. | select(.has_content == true).path'
|
|
|
|
|
|
Drop annexed files from all past commits
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
If there is annexed file content that is not used anymore (i.e., data in the
|
|
annex that no files in any branch point to anymore such as corrupt files),
|
|
you can find out about it and remove this file content out of your dataset
|
|
(i.e., completely and irrecoverably delete it) with git-annex's commands
|
|
:gitannexcmd:`unused` and :gitannexcmd:`dropunused``.
|
|
|
|
Find out which file contents are unused (not referenced by any current branch):
|
|
|
|
.. code-block:: console
|
|
|
|
$ git annex unused
|
|
unused . (checking for unused data...)
|
|
Some annexed data is no longer used by any files in the repository.
|
|
NUMBER KEY
|
|
1 SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
|
|
2 SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
|
|
(To see where data was previously used, try: git log --stat -S'KEY')
|
|
(To remove unwanted data: git-annex dropunused NUMBER)
|
|
ok
|
|
|
|
Remove a single unused file by specifying its number in the listing above:
|
|
|
|
.. code-block:: console
|
|
|
|
$ git annex dropunused 1
|
|
dropunused 1 ok
|
|
|
|
Or a range of unused data with
|
|
|
|
.. code-block:: console
|
|
|
|
$ git annex dropunused 1-1000
|
|
|
|
Or all
|
|
|
|
.. code-block:: console
|
|
|
|
$ git annex dropunused all
|
|
|
|
|
|
Getting single file sizes prior to downloading from the Python API and the CLI
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
For a single file, :dlcmd:`status --annex -- myfile` will report on
|
|
the size of the file prior to a :dlcmd:`get`.
|
|
|
|
If you want to do it in Python, try this approach:
|
|
|
|
.. code-block:: python
|
|
|
|
import datalad.api as dl
|
|
|
|
ds = dl.Dataset("/path/to/some/dataset")
|
|
results = ds.status(path=<path or list of paths>, annex="basic", result_renderer=None)
|
|
|
|
|
|
Check whether a dataset contains an annex
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Datasets can be either GitRepos (i.e., sole Git repositories; this happens when
|
|
they are created with the ``--no-annex`` flag, for example), or AnnexRepos
|
|
(i.e., datasets that contain an annex). Information on what kind of repository it
|
|
is is stored in the dataset report of :dlcmd:`wtf` under the key ``repo``.
|
|
Here is a one-liner to get this info:
|
|
|
|
.. code-block:: console
|
|
|
|
$ datalad -f'{infos[dataset][repo]}' wtf
|
|
|
|
|
|
.. index::
|
|
pair: create-sibling; DataLad command
|
|
|
|
Backing-up datasets
|
|
^^^^^^^^^^^^^^^^^^^
|
|
|
|
In order to back-up datasets you can publish them to a
|
|
:term:`Remote Indexed Archive (RIA) store` or to a sibling dataset. The former
|
|
solution does not require Git, git-annex, or DataLad to be installed on the
|
|
machine that the back-up is pushed to, the latter does require them.
|
|
|
|
To find out more about RIA stores, checkout the online-handbook.
|
|
A sketch of how to implement a sibling for backups is below:
|
|
|
|
.. code-block:: console
|
|
|
|
$ # create a back up sibling
|
|
$ datalad create-sibling --annex-wanted anything -r myserver:/path/to/backup
|
|
$ # publish a full backup of the current branch
|
|
$ datalad publish --to=myserver -r
|
|
$ # subsequently, publish updates to be backed up with
|
|
$ datalad publish --to=myserver -r --since= --missing=inherit
|
|
|
|
In order to push not only the current branch, but refs, add the option
|
|
``--publish-by-default "refs/*"`` to the :dlcmd:`create-sibling` call.
|
|
Should you want to back up all annexed data, even past versions of files, use
|
|
:gitannexcmd:`sync` to push to the sibling:
|
|
|
|
.. code-block:: console
|
|
|
|
$ git annex sync --all --content <sibling-name>
|
|
|
|
For an in-depth explanation and example take a look at the
|
|
`GitHub issue that raised this question <https://github.com/datalad/datalad/issues/4369>`_.
|
|
|
|
.. _retrieveHCP:
|
|
|
|
Retrieve partial content from a hierarchy of (uninstalled) datasets
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
In order to :dlcmd:`get` dataset content across a range of subdatasets, a bit
|
|
of UNIX command line foo can increase the efficiency of your command.
|
|
|
|
Example: consider retrieving all ``ribbon.nii.gz`` files for all subjects in the
|
|
`HCP open access dataset <https://github.com/datalad-datasets/human-connectome-project-openaccess>`_
|
|
(a dataset with about 4500 subdatasets -- read on more about it in
|
|
:ref:`usecase_HCP_dataset`).
|
|
If all subject-subdatasets are installed (e.g., with ``datalad get -n -r`` for
|
|
a recursive installation without file retrieval), :term:`globbing` with the
|
|
shell works fine:
|
|
|
|
.. code-block:: console
|
|
|
|
$ datalad get HCP1200/*/T1W/ribbon.nii.gz
|
|
|
|
The Gist :ref:`parallelize` can show you how to parallelize this.
|
|
If the subdatasets are not yet installed, globbing will not work, because the
|
|
shell can't expand non-existent paths. As an alternative, you can pipe the output
|
|
of an (arbitrarily complex) :dlcmd:`search` command into
|
|
:dlcmd:`get`:
|
|
|
|
.. code-block:: console
|
|
|
|
$ datalad -f '{path}' -c datalad.search.index-egrep-documenttype=all search 'path:.*T1w.*\.nii.gz' | xargs -n 100 datalad get
|
|
|
|
However, if you know the file locations within the dataset hierarchy and they
|
|
are predictably named and consistent, you can create a file containing all paths to
|
|
be retrieved and pipe that into :dlcmd:`get` as well:
|
|
|
|
.. code-block:: console
|
|
|
|
$ # create file with all file paths
|
|
$ for sub in HCP1200/*; do echo ${sub}/T1w/ribbons.nii.gz; done > toget.txt
|
|
$ # pipe it into datalad get
|
|
$ cat toget.txt | xargs -n 100 datalad get
|
|
|
|
.. _speedystatus:
|
|
|
|
Speed up status reports in large datasets
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
In datasets with deep dataset hierarchies or large numbers of files,
|
|
:dlcmd:`status` calls can be expensive. Handily,
|
|
the command provides options that can boost performance by limiting what is being
|
|
tested and reported. In order to speed up subdataset state state evaluation,
|
|
``-e/--eval-subdataset-state`` can be set ``commit`` or ``no``. Instead of checking
|
|
recursively for uncommitted modifications in subdatasets, this would lead ``status``
|
|
to only compare the most recent commit :term:`shasum` in the subdataset against
|
|
the recorded subdataset state in the superdataset (``commit``), or skip subdataset
|
|
state evaluation completely (``no``). In order to speed up file type evaluation,
|
|
the option ``-t/--report-filetype`` can be set to ``raw``. This skips an evaluation
|
|
on whether symlinks are pointers to annexed file (upon which, if true, the symlink
|
|
would be reported as type "file"). Instead, all symlinks will be reported as
|
|
being of type "symlink".
|
|
|
|
Squashing git-annex history
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
A large number of commits in the :term:`git-annex branch` (think: thousands
|
|
rather than hundreds) can inflate your repository and increase the size of the
|
|
``.git`` directory, which can lead to slower cloning operations.
|
|
There are, however, ways to shrink the commit history in the annex branch.
|
|
|
|
In order to :term:`squash` the entire git-annex history into a single commit, run
|
|
|
|
.. code-block:: console
|
|
|
|
$ git annex forget --drop-dead --force
|
|
|
|
Afterwards, if your dataset has a sibling, the branch needs to be
|
|
:term:`force-push`\ed. If you attempt an operation to shrink your git-annex
|
|
history, also checkout
|
|
`this thread <https://git-annex.branchable.com/forum/safely_dropping_git-annex_history>`_
|
|
for more information on shrinking git-annex's history and helpful safeguards and
|
|
potential caveats.
|