62 lines
No EOL
6.1 KiB
ReStructuredText
62 lines
No EOL
6.1 KiB
ReStructuredText
.. _annexprs:
|
|
|
|
Pull requests with annexed file contents
|
|
----------------------------------------
|
|
|
|
:term:`Git` has revolutionized collaborative processes on text files.
|
|
The concept is simple: In order to integrate new changes into a main development line, the changes are :term:`merge`\d in from a different branch.
|
|
Individual contributors became able to clone repositories, make changes in new branches, and email these changes as patches or requests to pull from their branches to maintainers.
|
|
Hosting platforms like :term:`GitHub` or :term:`GitLab` made this even simpler by introducing :term:`pull request`\s (GitHub) or :term:`merge request`\s (GitLab) that contributors can initiate when they push new branches to their :term:`fork`\s.
|
|
|
|
|
|
.. figure:: ../artwork/src/branching/collab_4_annex_1.svg
|
|
:scale: 150%
|
|
|
|
A typical collaborative setup. A Git repository (``origin``) is the central development place, living on a hosting site such as :term:`GitHub`. Individual contributors :term:`fork` this repository. Rather than adding commits to the blue main development line, they make changes on a new branch - here denoted with a different color. By requesting that the central repository "pulls in" or "merges" from their branch, e.g., the ``fix-paths`` branch, the main line in the central repository grows collaboratively.
|
|
|
|
This process is not as straight-forward when both **forks** and **annexed file contents** are involved. For example if a collaborator adds a commit in their fork with a new annexed file.
|
|
In this scenario, a regular Git branch learns about the file identity (its hash and name), but not about its availability information, for example from which :term:`special remote` the file content can be obtained from - this information is on the :term:`git-annex` branch [#1]_. The content of the annexed file, meanwhile, likely exists on infrastructure accessible to the contributor, but not necessarily the owner of the origin dataset.
|
|
|
|
The problem is thus multi-faceted: For one, as crucial availability information is tracked on the :term:`git-annex branch`, a "single branch PR" is not able to propagate all relevant information.
|
|
After merging only the Git branch, without the availability information on the annex-branch, collaborators know that the file *exists*, but have no way of *getting* or even *querying* it: the git-annex branch on the main development repository would not know about the file.
|
|
Propagating information in the ``git-annex`` branch requires, for example, a ``git annex sync`` or ``datalad push`` (which syncs the annex branch internally) to the ``origin`` repository.
|
|
|
|
.. figure:: ../artwork/src/branching/collab_4_annex.svg
|
|
:scale: 150%
|
|
|
|
A similar scenario as before, but with annexed file contents. If a collaborator adds annexed files to a branch in their fork (e.g., in the commit "first results computed!"), merging only the ``testrun`` branch is insufficient. While the existence of the file will be known in the origin dataset's Git history, its annex will have no information on it, making any git-annex operation on this file fail.
|
|
|
|
.. gitusernote:: Don't PR the annex branch
|
|
|
|
Importantly, the issue wouldn't be solved if one would PR the :term:`git-annex branch` from their fork. It is not a branch that one can create a pull request with - in fact, it shouldn't even really be checked out manually. It is automatically managed by git-annex, and the necessary synchronization to update availability information across dataset siblings usually requires a `git annex sync <https://git-annex.branchable.com/sync/>`_ (which a ``datalad push`` runs internally).
|
|
|
|
|
|
The second issue concerns the annexed *contents*.
|
|
A contributor needs to ensure that file contents are accessible for whomever they contribute to.
|
|
This is typically a manual process.
|
|
Even when annexed contents can be pushed to their fork in annex-aware hosting like :term:`Gin` or :term:`forgejo-aneksajo`,the annexed content is not copied between a fork and the main repository when a pull request is merged.
|
|
This would mean that the repo from which the PR is made would effectively be the only source of the content until it is explicitly transferred.
|
|
This is different to git-only PR workflows, where the fork is essentially throwaway.
|
|
If the annexed contents are hosted on third party services or multi-user compute infrastructure, contributors need to ensure that their intended receiver has appropriate access permissions.
|
|
|
|
What are possible solutions?
|
|
****************************
|
|
|
|
Especially with the availability of annex-aware hosting like :term:`Gin` or :term:`forgejo-aneksajo`, it becomes more and more relevant to know about the issues with annex PRs.
|
|
But what are the solutions?
|
|
At the moment, there are mostly work-arounds.
|
|
|
|
The easiest is not using :term:`fork`\s, whenever permission management allows that, and instead working within the main development dataset: PRs from a branch within the dataset instead of from a fork are not an issue.
|
|
With write access, another possibility would be to keep using forks, but merge and transfer contents "manually", i.e., ``git switch main``; ``git merge feature``; ``[datalad | git-annex] push``.
|
|
Whether having a fork adds anything of value in this situation is questionable, though.
|
|
|
|
Often however, write access for the contributor is not possible, for example because an organization doesn't want to make every possible contributor from the outside a member with the necessary elevated permissions.
|
|
In these cases, 3rd party services can act as a transfer solution for annexed contents.
|
|
The templateflow project came up with an :term:`Open Science Framework` based approach for external contributors: `www.templateflow.org/contributing/submission <https://www.templateflow.org/contributing/submission>`_.
|
|
With git-annex-aware hosting like :term:`forgejo-aneksajo`, permission management could also work "the other way around":
|
|
The maintainer with write access to the origin dataset that receives a PR gets access *to the repo from which the PR is made* (read access should suffice).
|
|
The maintainer can merge locally, get the files, and push them to their repository.
|
|
|
|
.. rubric:: Footnotes
|
|
|
|
.. [#1] If this does not sound familiar to you, please re-read chapter :ref:`chapter_gitannex`. |