It looks like Apple favours the macOS spelling, and according to https://en.wikipedia.org/wiki/MacOS, "in 2016, with the release of macOS 10.12 Sierra, the name was changed from OS X to macOS to align it with the branding of Apple's other primary operating systems". Note for grep users: "MacOSX" in installation is left intentionally (part of miniconda installer naming pattern).
107 lines
5.2 KiB
ReStructuredText
107 lines
5.2 KiB
ReStructuredText
.. _cleanup:
|
|
|
|
Fixing up too-large datasets
|
|
----------------------------
|
|
|
|
The previous section highlighted problems of too large monorepos and advised
|
|
strategies to them prevent them.
|
|
This section introduces some strategies to clean and fix up datasets that got out
|
|
of hand size-wise. If there are use cases you would want to see discussed here
|
|
or propose solutions for, please
|
|
`get in touch <https://github.com/datalad-handbook/book/issues/new>`_.
|
|
|
|
Getting contents out of Git
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
Let's say you did a :dlcmd:`run` with an analysis that put too
|
|
many files under version control by Git, and you want to see them gone.
|
|
Sticking to the FSL FEAT analysis example from earlier, you may, for example,
|
|
want to get rid of every ``tsplot`` directory, as it contains results that are
|
|
irrelevant for you.
|
|
|
|
Note that there is no way to ``drop`` the files as they are in Git instead of
|
|
git-annex. Removing
|
|
the files with plain file system (``rm``, ``git rm``) operation also does not
|
|
shrink your dataset. The files are snapshot and even though they don't exist in
|
|
the current state of your dataset anymore, they still exist -- and thus clutter
|
|
-- your datasets history. In order to *really* get committed files out of Git,
|
|
you need to rewrite history. And for this you need heavy machinery:
|
|
`git-filter-repo <https://github.com/newren/git-filter-repo>`_ [#f1]_.
|
|
It is a powerful and potentially dangerous tool to rewrite Git history.
|
|
Treat this tool like a chainsaw. Very helpful for heavy duty tasks, but also
|
|
life-threatening. The command
|
|
``git-filter-repo <path-specification> --force`` will "filter-out", i.e., remove
|
|
all files **but the ones specified** in ``<path-specification>`` from the datasets
|
|
history. Before you use it, please make sure to read its help page thoroughly.
|
|
|
|
.. find-out-more:: Installing git-filter-repo
|
|
|
|
``git-filter-repo`` is not part of Git but needs to be installed separately.
|
|
Its `GitHub repository <https://github.com/newren/git-filter-repo>`_ contains
|
|
more and more detailed instructions, but it is possible to install via :term:`pip`
|
|
(``pip install git-filter-repo``), and available via standard package managers
|
|
for macOS and some Linux distributions (mostly rpm-based ones).
|
|
|
|
The general procedure you should follow is the following:
|
|
|
|
1. :dlcmd:`clone` the repository. This is a safeguard to protect your
|
|
dataset should something go wrong. The clone you are creating will be your
|
|
new, cleaned up dataset.
|
|
2. :dlcmd:`get` all the dataset contents by running ``datalad get .``
|
|
in the clone.
|
|
3. ``git-filter-repo`` what you don't want anymore (see below)
|
|
4. Run ``git annex unused`` and a subsequent ``git annex dropunused all`` to remove
|
|
stale file contents that are not referenced anymore.
|
|
5. Finally, do some aggressive `garbage collection <https://git-scm.com/docs/git-gc>`_
|
|
with ``git gc --aggressive``
|
|
|
|
In order to get a hang on the ``git-filter-repo`` step, consider a directory
|
|
structure similar to this exemplary run-wise FEAT analysis output structure:
|
|
|
|
.. code-block:: bash
|
|
|
|
$ tree
|
|
sub-*/run-*_<task>-<level>.feat
|
|
├── custom_timing_files
|
|
├── logs
|
|
├── reg
|
|
├── reg_standard
|
|
│ ├── reg
|
|
│ └── stats
|
|
├── stats
|
|
└── tsplot
|
|
|
|
Each of such ``sub-*`` directories contains about 3000 files, and the majority of
|
|
them are irrelevant text files in ``tsplot/``.
|
|
In order to remove them for all subjects and runs from the dataset history,
|
|
the following command can be used::
|
|
|
|
$ git-filter-repo --path-regex '^sub-[0-9]{2}/run-[0-9]{1}*.feat/tsplot/.*$' --invert-paths --force
|
|
|
|
The ``--path-regex`` and the regex expression ``'^sub-[0-9]{2}/run-[0-9]{1}*.feat/tsplot/.*$'`` [#f2]_
|
|
match all file paths inside of the ``tsplot/`` directories of all subjects and
|
|
runs.
|
|
The option ``--invert-paths`` then *inverts* this path specification, and leads
|
|
to only the files in ``tsplot/`` to be filtered out. Note that there are also
|
|
non-regex based path specifications possible, for example with the option
|
|
``--path-match`` or ``path-glob``, or with a specification placed in a file.
|
|
Please see the manual of ``git-filter-repo`` for more information.
|
|
|
|
|
|
.. rubric:: Footnotes
|
|
|
|
.. [#f1] Wait, what about ``git filter-branch``? Beyond better performance of
|
|
``git-filter-repo``, Git also discourages the use of ``filter-branch``
|
|
for safety reasons and points to ``git-filter-repo`` as an alternative.
|
|
For more background info, see this
|
|
`thread <https://lore.kernel.org/git/CABPp-BEr8LVM+yWTbi76hAq7Moe1hyp2xqxXfgVV4_teh_9skA@mail.gmail.com>`_.
|
|
|
|
.. [#f2] Regular expressions can be a pain to comprehend if you're not used to
|
|
reading them. This one matches paths that start with (``^``) ``sub-``
|
|
followed by exactly two (``{2}``) numbers that can be between 0 and 9
|
|
(``[0-9]``), followed by ``/run-`` with exactly one (``{1}``) digit
|
|
between 0 and 9 (``[0-9]``), followed by zero or more other characters
|
|
(``*``) until ``.feat/tsplot/``, and ending (``$``) with any amount of
|
|
any character (``.*``). Not exactly easy, but effective.
|
|
One way to practice reading regular expressions, if you're interested
|
|
in that, is by playing `regex crossword <https://regexcrossword.com>`_.
|