datalad-handbook/docs/basics/101-141-push.rst


.. index::
pair: push; DataLad command
.. _push:
The datalad push command
------------------------
Previous sections on publishing DataLad datasets have each
shown you a crucial aspect of dataset publishing with
:dlcmd:`push`. This section brings them all together.
The general overview
^^^^^^^^^^^^^^^^^^^^
:dlcmd:`push` is the command to turn to when you want to publish datasets.
It is capable of publishing all dataset content, i.e., files stored in :term:`Git`,
and files stored with :term:`git-annex`, to a known dataset :term:`sibling`.
.. index::
pair: push; DataLad concept
.. gitusernote:: Push internals
:dlcmd:`push` uses ``git push`` and ``git annex copy`` under
the hood. Publication targets need to be either configured remote Git repositories
or git-annex special remotes (if they support data upload).
In order to publish a dataset, the dataset needs to have a sibling to push to.
This, for instance, can be a :term:`GitHub`, :term:`GitLab`, or :term:`GIN`
repository, but it can also be a Remote Indexed Archive (RIA) store for backup
or storage of datasets [#f1]_, or a regular clone.
.. index::
pair: create-sibling-github; DataLad command
pair: create-sibling-gitlab; DataLad command
pair: create-sibling-ria; DataLad command
pair: GitHub; dataset hosting
pair: GitLab; dataset hosting
pair: RIA; dataset hosting
pair: create sibling; with DataLad
.. find-out-more:: all of the ways to configure siblings
- Add an existing repository as a sibling with the :dlcmd:`siblings`
command. Here are common examples:
.. code-block:: console

   $ # to a remote repository
   $ datalad siblings add --name github-repo --url <url.to.github>
   $ # to a local path
   $ datalad siblings add --name local-sibling --url /path/to/sibling/ds
   $ # to a clone on an SSH-accessible machine
   $ datalad siblings add --name server-sibling --url [user@]hostname:/path/to/sibling/ds

- Create a sibling on an external hosting service from scratch, right from
within your repository:
This can be done with the commands :dlcmd:`create-sibling-github` (for GitHub)
or :dlcmd:`create-sibling-gitlab` (for GitLab), or
:dlcmd:`create-sibling-ria` (for a remote indexed archive dataset store).
Note that :dlcmd:`create-sibling-ria` can add an existing store as a sibling
or create a new one from scratch.
- Create a sibling on a local or SSH accessible Unix machine with
:dlcmd:`create-sibling`.
In order to publish dataset content, DataLad needs to know to which sibling
content shall be pushed. This can be specified with the ``--to`` option directly
from the command line:
.. code-block:: console

   $ datalad push --to <sibling>

If you have more than one :term:`branch` in your dataset, note that a
:dlcmd:`push` command will by default update only the current branch.
If updating multiple branches is relevant for your workflow, please check out
the :ref:`find-out-more about this <fom-push-branch>`.
By default, :dlcmd:`push` will make the last saved state of the dataset
available. Consequently, if the sibling is in the same state as the dataset,
no push is attempted.
Additionally, :dlcmd:`push` will attempt to automatically decide what type
of dataset contents are going to be published. With a sibling that has a
:term:`special remote` configured as a :term:`publication dependency`,
or a sibling that contains an annex (such as a GIN repository or a
:term:`Remote Indexed Archive (RIA) store`), both the contents
stored in Git (i.e., a dataset's history) as well as file contents stored in
git-annex will be published unless dataset configurations overrule this.
Alternatively, one can enforce particular operations or push a subset of dataset
contents. For one, when specifying a path in the :dlcmd:`push` command,
only data or changes for those paths are considered for a push.
Additionally, one can select a particular mode of operation with the ``--data`` option.
Several different modes are possible:
- ``nothing``: With this option, annexed contents are not published. This
means that the sibling will have information on the annexed files' names, but
file contents will not be available, and thus ``datalad get`` calls in the
sibling would fail.
- ``anything``: Transfer all annexed contents.
- ``auto``: With this option, the decision about which data is transferred is based on configurations that can define rules on a per-file and per-sibling level.
On a technical level, the ``git annex copy`` call that publishes file contents is invoked with its ``--auto`` option.
With this option, only data that satisfies specific git-annex configurations gets transferred.
Those configurations could be ``numcopies`` settings (the number of copies available at different remotes), or ``wanted`` settings (preferred contents for a specific remote), and need to be created by a user [#f2]_ with git-annex commands. If you have files that you want to keep private or do not need to publish, these configurations are very useful.
- ``auto-if-wanted`` (default): Unless a ``wanted`` or ``numcopies`` configuration exists in the dataset, all content is published. Should a ``wanted`` or ``numcopies`` configuration exist, the command enables ``--auto`` in the underlying ``git annex copy`` call.
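
As a rough illustration of the modes above, the following calls sketch how ``--data`` could be used, optionally together with a path. The sibling name and the path are placeholders, and the ``wanted`` expression is only an assumed example of a git-annex preferred content rule, not a recommendation:

.. code-block:: console

   $ # publish only the Git history, no annexed file contents
   $ datalad push --to <sibling> --data nothing
   $ # consider only one (hypothetical) path, and transfer all of its annexed contents
   $ datalad push --to <sibling> --data anything path/to/directory
   $ # assumed example rule: the sibling only wants PDF files ...
   $ git annex wanted <sibling> "include=*.pdf"
   $ # ... so that a push in auto mode transfers only matching contents
   $ datalad push --to <sibling> --data auto
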
Beyond different modes of transferring data, the ``-f/--force`` option lets you force specific publishing operations with three different modes.
Be careful when using it, as its modes can overrule safety protections or optimizations:
- ``checkdatapresent``: With this option, the underlying ``git annex copy`` call to
publish file contents is invoked without a ``--fast`` option. Usually, the
``--fast`` option increases the speed of the operation, as it disables a check
whether the sibling already has content. This, however, might skip copying content
in some cases. Therefore, ``--force checkdatapresent`` is a slower, but more fail-safe
option to publish annexed file contents.
- ``gitpush``: This option triggers a ``git push --force``. Be very careful using
this option! If the changes on the dataset conflict with the changes that exist
in the sibling, the changes in the sibling will be overwritten.
- ``all``: The final mode, ``all``, combines all force modes -- thus attempting to really get your dataset contents published by any means.
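
For orientation, the three modes would be invoked as follows. ``<sibling>`` is a placeholder for the name of a configured sibling:

.. code-block:: console

   $ # run without --fast, so the check whether the sibling already has content is kept
   $ datalad push --to <sibling> --force checkdatapresent
   $ # force the Git push, overwriting conflicting changes in the sibling
   $ datalad push --to <sibling> --force gitpush
   $ # apply both force modes at once
   $ datalad push --to <sibling> --force all
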
:dlcmd:`push` can publish available subdatasets recursively if the
``-r/--recursive`` flag is specified. Note that this requires that all subdatasets
that should be published have sibling names identical to the sibling specified in
the top-level :dlcmd:`push` command, or that appropriate default publication
targets are configured throughout the dataset hierarchy.
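
A recursive push could look like the following sketch, which assumes that every subdataset has a sibling with the same (placeholder) name. Listing the siblings across the hierarchy first is an optional check:

.. code-block:: console

   $ # list the siblings known throughout the dataset hierarchy
   $ datalad siblings -r
   $ # push the superdataset and all available subdatasets
   $ datalad push --to <sibling> -r
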
.. index::
pair: configure which branches to push; with Git
.. find-out-more:: Pushing more than the current branch
:name: fom-push-branch
:float:
If you have more than one :term:`branch` in your
dataset, a :dlcmd:`push --to <sibling>` will by default only push
the current :term:`branch`, *unless* you provide configurations that alter
this default. Here are two ways in which this can be achieved:
**Option 1:** Setting the ``push.default`` configuration variable from
``simple`` (the default) to ``matching`` will configure the dataset such that
:dlcmd:`push` pushes *all* branches to the sibling.
A concrete example: On a dataset level, this can be done using
.. code-block:: console

   $ git config --local push.default matching

**Option 2:**
`Tweaking the default push refspec <https://git-scm.com/book/en/v2/Git-Internals-The-Refspec>`_ for the dataset lets you
select a range of branches that should be pushed. The link above gives a
thorough introduction to refspecs. For a hands-on example, consider how it is done for
`the published DataLad-101 dataset <https://github.com/datalad-handbook/DataLad-101>`_:
The published version of the handbook is known to the local handbook dataset
as a :term:`remote` called ``public``, and each section of the book is identified
with a custom branch name that corresponds to the section name. Whenever an
update to the public dataset is pushed, not only the ``main``
branch but also all branches starting with the section identifier ``sct`` are pushed
automatically. This configuration was achieved by specifying these branches
(using :term:`globbing` with ``*``) in the ``push`` specification of this :term:`remote`:
.. code-block:: console

   $ git config --local remote.public.push 'refs/heads/sct*'

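
To check what such a configuration does before relying on it, one could inspect the stored refspecs and preview the push with plain Git. This is only a sketch that reuses the remote name ``public`` from the example above:

.. code-block:: console

   $ # list all push refspecs configured for the remote
   $ git config --get-all remote.public.push
   $ # preview which branches Git would push, without actually sending the updates
   $ git push --dry-run public
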
Pushing errors
^^^^^^^^^^^^^^
If you are unfamiliar with Git, please be aware that cloning a dataset to a different place and subsequently pushing to it can lead to Git error messages if changes are pushed to a currently checked-out :term:`branch` of the sibling (in technical Git terms: when pushing to a checked-out branch of a non-bare remote repository).
As an example, consider what happens if we attempt a :dlcmd:`push` to the sibling ``roommate`` that we created in the chapter :ref:`chapter_collaboration`:
.. runrecord:: _examples/DL-101-141-101
   :language: console
   :exitcode: 1
   :workdir: dl-101/DataLad-101

   $ datalad push --to roommate

Publishing fails with the error message ``[remote rejected] (branch is currently checked out)``.
This can be prevented with `configuration settings <https://github.blog/2015-02-06-git-2-3-has-been-released>`_ in Git versions 2.3 or higher, or by pushing to a branch of the sibling that is currently not checked out.
For more information on this and other error messages during push, please check out the section :ref:`help`.
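
The configuration-based workaround could be sketched as follows. It assumes that you are allowed to change the configuration of the sibling's repository, and ``/path/to/sibling/ds`` is a placeholder for its location. With ``updateInstead``, Git accepts a push to the checked-out branch and updates the sibling's working tree, but it still refuses to do so if the sibling has uncommitted changes:

.. code-block:: console

   $ # in the sibling (the receiving repository), allow pushes to the checked-out branch
   $ git -C /path/to/sibling/ds config --local receive.denyCurrentBranch updateInstead
   $ # back in the original dataset, retry the push
   $ datalad push --to roommate
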
.. rubric:: Footnotes
.. [#f1] RIA siblings are file system based, scalable storage solutions for
DataLad datasets. You can find out more about them in the online-handbook.
.. [#f2] For information on the ``numcopies`` and ``wanted`` settings of git-annex see its documentation at `git-annex.branchable.com/git-annex-wanted/ <https://git-annex.branchable.com/git-annex-wanted>`_ and `git-annex.branchable.com/git-annex-numcopies/ <https://git-annex.branchable.com/git-annex-numcopies>`_.