.. _share_hostingservice:
Publishing datasets to Git repository hosting
---------------------------------------------
Because DataLad datasets are :term:`Git` repositories, it is possible to
:dlcmd:`push` datasets to any Git repository hosting service, such as
:term:`GitHub`, :term:`GitLab`, :term:`GIN`, :term:`Bitbucket`, `Gogs <https://gogs.io>`_, or Gitea_.
These published datasets are ordinary :term:`sibling`\s of your dataset, and among other advantages, they can constitute a backup, an entry point for others or yourself to retrieve your dataset, the backbone for collaboration on datasets, or the means to enhance the visibility, findability, and citability of your work [#f1]_.
This section contains a brief overview of how to publish your dataset to different services.
Git repository hosting and annexed data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As outlined in a number of previous sections, Git repository hosting sites typically do not support dataset annexes, although some, such as :term:`GIN`, do.
Depending on whether or not an annex is supported, you can push either only your Git history to the sibling, or the complete dataset including annexed file contents.
You can find out whether a sibling on a remote hosting service carries an annex or not by running the :dlcmd:`siblings` command.
A ``+``, ``-``, or ``?`` sign in parentheses indicates whether the sibling carries an annex, does not carry an annex, or whether this information isn't yet known.
In the example below you can see that the public GitHub repository `github.com/psychoinformatics-de/studyforrest-data-phase2 <https://github.com/psychoinformatics-de/studyforrest-data-phase2>`_ does not carry an annex on GitHub (the sibling ``origin``), but that the annexed data are served from an additional sibling ``mddatasrc`` (a :term:`special remote` with annex support).
Even though the dataset sibling on GitHub does not serve the data, it constitutes a simple, findable access point to retrieve the dataset, and can be used to provide updates and fixes via :term:`pull request`\s, issues, etc.
.. code-block:: console
$ # a clone of github.com/psychoinformatics-de/studyforrest-data-phase2 has the following siblings:
$ datalad siblings
.: here(+) [git]
.: mddatasrc(+) [https://datapub.fz-juelich.de/studyforrest/studyforrest/phase2/.git (git)]
.: origin(-) [git@github.com:psychoinformatics-de/studyforrest-data-phase2.git (git)]
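If you want to check the annex status of many siblings at once, the markers can also be read programmatically. The following is a minimal sketch that assumes the line format shown in the output above; the function name ``annex_status`` and the regular expression are illustrative and not part of DataLad. Parsing human-readable output is brittle, so treat this as an illustration of the ``+``/``-``/``?`` convention rather than a robust tool.

```python
import re

# Matches result lines such as ".: origin(-) [git@github.com:org/repo.git (git)]"
SIBLING_LINE = re.compile(r"^(?P<path>\S+): (?P<name>[^\s(]+)\((?P<annex>[+?-])\)")

# Meaning of the marker in parentheses, as described in the text above
ANNEX_STATUS = {"+": "annex", "-": "no annex", "?": "unknown"}


def annex_status(siblings_output):
    """Map each sibling name found in the output to its annex status."""
    status = {}
    for line in siblings_output.splitlines():
        match = SIBLING_LINE.match(line.strip())
        if match:
            status[match.group("name")] = ANNEX_STATUS[match.group("annex")]
    return status


example = """\
.: here(+) [git]
.: mddatasrc(+) [https://datapub.fz-juelich.de/studyforrest/studyforrest/phase2/.git (git)]
.: origin(-) [git@github.com:psychoinformatics-de/studyforrest-data-phase2.git (git)]
"""

for sibling, state in annex_status(example).items():
    print(f"{sibling}: {state}")
```

For the example dataset, this reports that ``here`` and ``mddatasrc`` carry an annex, while ``origin`` does not.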
There are multiple ways to create a dataset sibling on a repository hosting site to push your dataset to.
How to add a sibling on a Git repository hosting site: The manual way
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Create a new repository via the web interface of the hosting service of your choice. The screenshots in :numref:`fig-newrepogin` and :numref:`fig-newrepogithub` show examples of this.
The new repository does not need to have the same name as your local dataset, but it helps to associate local dataset and remote siblings.
#. Afterwards, copy the :term:`SSH` or :term:`HTTPS` URL of the repository. Usually, repository hosting services will provide you with a convenient way to copy it to your clipboard. An SSH URL takes the form ``git@<hosting-service>:<user>/<repo-name>.git`` and an HTTPS URL takes the form ``https://<hosting-service>/<user>/<repo-name>.git``. The type of URL you choose determines whether and how you will be able to ``push`` to your repository. Note that many services require the SSH URL for :dlcmd:`push` operations, so make sure to take the :term:`SSH` and not the :term:`HTTPS` URL if this is the case.
#. If you pick the :term:`SSH` URL, make sure to have an :term:`SSH key` set up. This usually requires generating an SSH key pair if you do not have one yet, and uploading the public key to the repository hosting service. The :find-out-more:`on SSH keys <fom-sshkey>` points to a useful tutorial for this.
#. Use the URL to add the repository as a sibling. There are two commands that allow you to do that; both require that you give the sibling a name of your choice (common choices are ``upstream``, or an abbreviation of your user name or the hosting platform, but the decision is entirely yours):
#. ``git remote add <name> <url>``
#. ``datalad siblings add --dataset . --name <name> --url <url>``
#. Push your dataset to the new sibling: ``datalad push --to <name>``
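The steps above can be sketched in a few lines of Python. This is only an illustration: the helper names ``repo_urls`` and ``manual_sibling_commands`` are hypothetical, and the SSH URL follows the form visible in the ``datalad siblings`` output earlier in this section. The functions merely assemble URLs and command lines; nothing is executed.

```python
def repo_urls(host, user, repo):
    """Return the SSH and HTTPS URL forms for a hosted repository."""
    return {
        "ssh": f"git@{host}:{user}/{repo}.git",
        "https": f"https://{host}/{user}/{repo}.git",
    }


def manual_sibling_commands(name, url, use_datalad=True):
    """Assemble the commands to register a sibling and push to it."""
    if use_datalad:
        add = ["datalad", "siblings", "add", "--dataset", ".",
               "--name", name, "--url", url]
    else:
        add = ["git", "remote", "add", name, url]
    push = ["datalad", "push", "--to", name]
    return [add, push]


urls = repo_urls("github.com", "psychoinformatics-de", "studyforrest-data-phase2")
for command in manual_sibling_commands("upstream", urls["ssh"]):
    print(" ".join(command))
```

Each returned list could be handed to ``subprocess.run`` from within the dataset, but typing the two commands in a terminal works just as well.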
.. _fig-newrepogin:
.. figure:: ../artwork/src/GIN_newrepo.png
:width: 80%
Web interface of :term:`GIN` during the creation of a new repository.
.. _fig-newrepogithub:
.. figure:: ../artwork/src/newrepo-github.png
:width: 80%
Web interface of :term:`GitHub` during the creation of a new repository.
.. index:: concepts; SSH key, SSH; key
.. _sshkey:
.. find-out-more:: What is an SSH key and how can I create one?
:name: fom-sshkey
An SSH key is an access credential in the :term:`SSH` protocol that can be used
to log in from one system to remote servers and services, such as from your private
computer to an :term:`SSH server`. For repository hosting services such as :term:`GIN`,
:term:`GitHub`, or :term:`GitLab`, it can be used to connect and authenticate
without supplying your username or password for each action.
A tutorial by GitHub at `docs.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh <https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent>`_
has a detailed step-by-step instruction to generate and use SSH keys for authentication.
You will also learn how to add your public SSH key to your hosting service account
so that you can install or clone datasets or Git repositories via ``SSH`` (in addition
to the ``http`` protocol).
Don't be intimidated if you have never done this before -- it is fast and easy:
First, you need to create a private and a public key (an SSH key pair).
All this takes is a single command in the terminal. The resulting files are
text files that look like someone spilled alphabet soup in them, but constitute
a secure password procedure.
You keep the private key on your own machine (the system you are connecting from,
and that **only you have access to**),
and copy the public key to the system or service you are connecting to.
On the remote system or service, you make the public key an *authorized key* to
allow authentication via the SSH key pair instead of your password. This
either takes a single command in the terminal, or a few clicks in a web interface
to achieve.
You should protect your SSH keys on your machine with a passphrase to prevent
others -- e.g., in case of theft -- from logging in to servers or services with
SSH authentication [#f2]_, and configure an ``ssh agent``
to handle this passphrase for you with a single command. How to do all of this
is detailed in the tutorial.
How to add a sibling on a Git repository hosting site: The automated way
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
DataLad provides ``create-sibling-*`` commands to automatically create datasets on certain hosting sites.
You can automatically create new repositories from the command line for :term:`GitHub`, :term:`GitLab`, :term:`GIN`, `Gogs <https://gogs.io>`__, or Gitea_.
This is implemented with a set of commands called :dlcmd:`create-sibling-github`, :dlcmd:`create-sibling-gitlab`, :dlcmd:`create-sibling-gin`, :dlcmd:`create-sibling-gogs`, and :dlcmd:`create-sibling-gitea`.
Each command is slightly tuned towards the peculiarities of each particular platform, but the most important common parameters are streamlined across commands as follows:
- ``[REPONAME]`` (required): The name of the repository on the hosting site. It will be created under a user's namespace, unless this argument includes an organization name prefix. For example, ``datalad create-sibling-github my-awesome-repo`` will create a new repository under ``github.com/<user>/my-awesome-repo``, while ``datalad create-sibling-github <orgname>/my-awesome-repo`` will create a new repository of this name under the GitHub organization ``<orgname>`` (given appropriate permissions).
- ``-s/--name <name>`` (optional): The name under which the sibling is identified locally. By default, it is based on or similar to the name of the hosting site. For example, the sibling created with ``datalad create-sibling-github`` will be called ``github`` by default.
- ``--credential <name>`` (optional): Credentials used for authentication are stored internally by DataLad under specific names. These names allow you to have multiple credentials, and flexibly decide which one to use. When ``--credential <name>`` is the name of an existing credential, DataLad tries to authenticate with the specified credential; when it does not yet exist, DataLad will prompt interactively for a credential, such as an access token, and store it under the given ``<name>`` for future authentications. By default, DataLad will name a credential according to the hosting service URL it is used for, such as ``datalad-api.github.com`` as the default for credentials used to authenticate against GitHub.
- ``--access-protocol {https|ssh|https-ssh}`` (default ``https``): Whether to use :term:`SSH` or :term:`HTTPS` URLs, or a hybrid version in which HTTPS is used to *pull* and SSH is used to *push*. Using :term:`SSH` URLs requires an :term:`SSH key` setup, but is a very convenient authentication method, especially when pushing updates -- which would otherwise require entering user name and token manually with every ``push`` over HTTPS.
- ``--dry-run`` (optional): With this flag set, the command will not actually create the target repository, but only perform tests for name collisions and report repository name(s).
- ``--private`` (optional): A switch that, if set, makes sure that the created repository is private.
Other streamlined arguments, such as ``--recursive`` or ``--publish-depends`` allow you to perform more complex configurations, such as publication of dataset hierarchies or connections to :term:`special remote`\s. Upcoming walk-throughs will demonstrate them.
Self-hosted repository services, e.g., Gogs or Gitea instances, have an additional required argument, the ``--api`` flag.
It needs to point to the URL of the instance, for example
.. code-block:: console
$ datalad create-sibling-gogs my_repo_on_gogs --api "https://try.gogs.io"
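To see how the streamlined parameters combine, here is a small sketch that assembles a ``create-sibling-*`` command line from the options described above. The function ``create_sibling_command`` is a hypothetical helper and not part of DataLad; it only builds an argument list and uses no flags beyond those named in this section. Treating ``--api`` as mandatory for Gogs and Gitea follows the description above and is an assumption of this sketch. GitLab is left out because it is configured differently.

```python
SERVICES = ("github", "gin", "gogs", "gitea")
SELF_HOSTED = ("gogs", "gitea")  # need --api according to the text above
PROTOCOLS = ("https", "ssh", "https-ssh")


def create_sibling_command(service, reponame, name=None, credential=None,
                           access_protocol=None, dry_run=False,
                           private=False, api=None):
    """Assemble (but do not run) a ``datalad create-sibling-*`` call."""
    if service not in SERVICES:
        raise ValueError(f"unsupported service: {service}")
    if service in SELF_HOSTED and api is None:
        raise ValueError(f"{service} needs the URL of the instance (--api)")
    if access_protocol is not None and access_protocol not in PROTOCOLS:
        raise ValueError(f"unknown access protocol: {access_protocol}")

    # reponame may carry an organization prefix, e.g. "<orgname>/my-awesome-repo"
    argv = ["datalad", f"create-sibling-{service}", reponame]
    if name is not None:
        argv += ["--name", name]
    if credential is not None:
        argv += ["--credential", credential]
    if access_protocol is not None:
        argv += ["--access-protocol", access_protocol]
    if api is not None:
        argv += ["--api", api]
    if private:
        argv.append("--private")
    if dry_run:
        argv.append("--dry-run")
    return argv


print(" ".join(create_sibling_command(
    "github", "my-awesome-repo", access_protocol="https-ssh", dry_run=True)))
```

Starting with ``dry_run=True`` is a cheap way to check for name collisions before anything is created on the hosting site.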
:term:`GitLab`'s internal organization differs from that of the other hosting services, and as there are multiple different GitLab instances, ``create-sibling-gitlab`` requires slightly more configuration than the other commands.
Thus, a short walk-through is at the :ref:`end of this section <gitlab>`.
.. _token:
Authentication by token
^^^^^^^^^^^^^^^^^^^^^^^
To create or update repositories on remote hosting services you will need to set up appropriate authentication and permissions.
In most cases, this will be in the form of an authorization token with a specific permission scope.
What is a token?
""""""""""""""""
Personal access tokens are an alternative to authenticating via your password, and take the form of a long character string, associated with a human-readable name or description.
If you are prompted for ``username`` and ``password`` in the command line, you would enter your token in place of the ``password`` [#f3]_.
Note that you do not have to type your token at every authentication -- your token will be stored on your system the first time you have used it and automatically reused whenever relevant.
.. index:: credential; storage
.. find-out-more:: How does the authentication storage work?
Passwords, user names, tokens, and any other login information are stored in
your system's (encrypted) `keyring <https://en.wikipedia.org/wiki/GNOME_Keyring>`_.
It is a built-in credential store, used in all major operating systems, and
can store credentials securely.
You can have multiple tokens, and each of them can get a different scope of permissions, but it is important to treat your tokens like passwords and keep them secret.
Which permissions do they need?
"""""""""""""""""""""""""""""""
The most convenient way to generate tokens is typically via the web interface of the hosting service of your choice.
Often, you can specifically select which set of permissions a specific token has in a drop-down menu similar (but likely not identical) to the screenshot from GitHub in :numref:`fig-token`.
.. _fig-token:
.. figure:: ../artwork/src/github-token.png
:width: 80%
Web interface to generate an authentication token on GitHub. One typically has to set a name and
permission set, and potentially an expiration date.
For creating and updating repositories with DataLad commands it is usually sufficient to grant only repository-related permissions.
However, broader permission sets may also make sense.
Should you employ GitHub workflows, for example, a token without "workflow" scope could not push changes to workflow files, resulting in errors like this one:
.. code-block:: console
[remote rejected] (refusing to allow a Personal Access Token to create or update workflow `.github/workflows/benchmarks.yml` without `workflow` scope)]
.. _gitlab:
Creating a sibling on GitLab
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:term:`GitLab` is an open source Git repository hosting platform, and many institutions and companies deploy their own instance.
This short walk-through demonstrates the necessary steps to create a GitLab sibling, and the different options GitLab allows for when creating siblings recursively for a dataset hierarchy.
Step 1: Configure your site
"""""""""""""""""""""""""""
As a first step, users will need to create a configuration file following the format of `python-gitlab <https://python-gitlab.readthedocs.io/en/stable/cli-usage.html#configuration-file-format>`_.
This configuration file is typically called ``.python-gitlab.cfg`` and placed into a user's home directory.
It contains one section per GitLab instance, and a ``[global]`` section that defines the default instance to use.
Here is an example:
.. code-block:: console
$ cat ~/.python-gitlab.cfg
[global]
default = my-university-gitlab
ssl_verify = true
timeout = 5
[my-university-gitlab]
url = https://gitlab.my-university.com
private_token = <here-is-your-token>
api_version = 4
[gitlab-general]
url = https://gitlab.com
api_version = 4
private_token = <here-is-your-token>
Once this configuration is in place, ``create-sibling-gitlab``'s ``--site`` parameter can be supplied with the name of the instance you want to use (e.g., ``datalad create-sibling-gitlab --site gitlab-general``).
Ensure that the token for each instance has the permissions required to create new groups and projects under your user account via the GitLab API (see :numref:`fig-gitlabtoken`).
.. _fig-gitlabtoken:
.. figure:: ../artwork/src/gitlab-token.png
:width: 80%
Web interface to generate an authentication token on GitLab. One typically has to set a name and
permission set, and potentially an expiration date.
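Before running ``create-sibling-gitlab``, it can be handy to double-check which site names your configuration file defines, because those are the values that ``--site`` accepts. The sketch below assumes the INI-style layout of the example configuration above and reads it with Python's standard ``configparser`` module; the function name ``list_sites`` is made up for this illustration.

```python
import configparser

EXAMPLE_CFG = """\
[global]
default = my-university-gitlab
ssl_verify = true
timeout = 5

[my-university-gitlab]
url = https://gitlab.my-university.com
private_token = <here-is-your-token>
api_version = 4

[gitlab-general]
url = https://gitlab.com
api_version = 4
private_token = <here-is-your-token>
"""


def list_sites(cfg_text):
    """Return the default site name and a mapping of site name to URL."""
    # Disable interpolation so that "%" characters in tokens are read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    default = parser.get("global", "default", fallback=None)
    sites = {
        section: parser[section]["url"]
        for section in parser.sections()
        if section != "global"
    }
    return default, sites


default_site, sites = list_sites(EXAMPLE_CFG)
for site, url in sites.items():
    marker = " (default)" if site == default_site else ""
    print(f"{site}: {url}{marker}")
```

To inspect your own file instead of the example text, you could pass in the contents of ``~/.python-gitlab.cfg``. Take care not to print or share the ``private_token`` values.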
Step 2: Create or select a group
""""""""""""""""""""""""""""""""
GitLab's organization consists of *projects* and *groups*.
Projects are single repositories, and groups can be used to manage one or more projects at the same time.
In order to use ``create-sibling-gitlab``, a user **must** `create a group <https://docs.gitlab.com/ee/user/group/#create-a-group>`_ via the web interface, or specify a pre-existing group, because `GitLab does not allow root-level groups to be created via their API <https://docs.gitlab.com/ee/api/groups.html#new-group>`_.
Only when a "parent" group already exists can DataLad and other tools create sub-groups and projects automatically.
In the screenshots :numref:`fig-rootgroup-gitlab1` and :numref:`fig-rootgroup-gitlab2`, a new group ``my-datalad-root-level-group`` is created right underneath the user account.
The group name as shown in the URL bar is what DataLad needs in order to create sibling datasets.
.. _fig-rootgroup-gitlab1:
.. figure:: ../artwork/src/gitlab-rootgroup.png
:width: 80%
Web interface to create a root-level group on GitLab.
.. _fig-rootgroup-gitlab2:
.. figure:: ../artwork/src/gitlab-rootgroup2.png
:width: 80%
A created root-level group in GitLab's web interface.
Step 3: Select a layout
"""""""""""""""""""""""
Due to the distinction between groups and projects, GitLab allows two different layouts that DataLad can use to publish datasets or dataset hierarchies:
* **flat**:
All datasets become projects in the same, pre-existing group.
The name of a project is its relative path within the root dataset, with all path separator characters replaced by '-' [#f4]_.
* **collection**:
A new group is created for the dataset. The root dataset (the topmost superdataset) is placed in a "project" project inside this group, and all nested subdatasets are represented inside the group using a "flat" layout [#f4]_. This layout is the default.
Consider the ``DataLad-101`` dataset, a superdataset with several subdatasets in the following layout:
.. code-block:: bash
/home/me/dl-101/DataLad-101        # dataset
├── books/
│   └── [...]
├── code/
│   └── [...]
├── midterm_project/               # subdataset
│   ├── code/
│   ├── [...]
│   └── input/                     # sub-subdataset
├── recordings/
│   └── longnow/                   # subdataset
│       └── [...]
How the ``collection`` and ``flat`` layouts for this dataset look in practice is shown in :numref:`fig-gitlab-layout`.
.. _fig-gitlab-layout:
.. figure:: ../artwork/src/gitlab-layouts.png
:width: 50%
The ``collection`` layout has a group (``DataLad-101_collection``, defined by the user with a configuration) with four projects underneath. The ``project`` project contains the root-level dataset, and all contained subdatasets are named according to their location in the dataset. The ``flat`` layout consists of projects in the root-level group. The project name for the superdataset (``DataLad-101_flat``) is defined by the user with a configuration, and the names of the subdatasets extend this project name based on their location in the dataset hierarchy.
Publishing a single dataset
"""""""""""""""""""""""""""
When publishing a single dataset, users can set the project or group name with the ``--project`` argument.
Here are two command examples and their outcomes.
For a **flat** layout, the ``--project`` parameter determines the project name, shown in :numref:`fig-gitlab-flat`.
.. code-block:: console
$ datalad create-sibling-gitlab --site gitlab-general --layout flat --project my-datalad-root-level-group/this-will-be-the-project-name
create_sibling_gitlab(ok): . (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/this-will-be-the-project-name]
configure-sibling(ok): . (sibling)
action summary:
configure-sibling (ok: 1)
create_sibling_gitlab (ok: 1)
.. _fig-gitlab-flat:
.. figure:: ../artwork/src/gitlab-layout-flat.png
:width: 50%
An example dataset using GitLab's "flat" layout.
For a **collection** layout, the ``--project`` parameter determines the group name, shown in :numref:`fig-gitlab-collection`.
.. code-block:: console
$ datalad create-sibling-gitlab --site gitlab-general --layout collection --project my-datalad-root-level-group/this-will-be-the-group-name
create_sibling_gitlab(ok): . (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/this-will-be-the-group-name/project]
configure-sibling(ok): . (sibling)
action summary:
configure-sibling (ok: 1)
create_sibling_gitlab (ok: 1)
.. _fig-gitlab-collection:
.. figure:: ../artwork/src/gitlab-layout-collection.png
:width: 50%
An example dataset using GitLab's "collection" layout.
Publishing datasets recursively
"""""""""""""""""""""""""""""""
When publishing a series of datasets recursively, the ``--project`` argument can no longer be used; otherwise, all datasets in the hierarchy would attempt to create the same group or project over and over again.
Instead, one configures the root level dataset, and the names for underlying datasets will be derived from this configuration:
.. index::
single: configuration item; datalad.gitlab-<name>-project
.. code-block:: console
$ # do the configuration for the top-most dataset
$ # either configure with Git
$ git config --local --replace-all \
datalad.gitlab-<gitlab-site>-project \
'my-datalad-root-level-group/DataLad-101_flat'
$ # or configure with DataLad
$ datalad configuration set \
datalad.gitlab-<gitlab-site>-project='my-datalad-root-level-group/DataLad-101_flat'
Afterwards, publish dataset hierarchies with the ``--recursive`` flag:
.. code-block:: console
$ datalad create-sibling-gitlab --site gitlab-general --recursive --layout flat
create_sibling_gitlab(ok): . (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat]
configure-sibling(ok): . (sibling)
create_sibling_gitlab(ok): midterm_project (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat-midterm_project]
configure-sibling(ok): . (sibling)
create_sibling_gitlab(ok): midterm_project/input (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat-midterm_project-input]
configure-sibling(ok): . (sibling)
create_sibling_gitlab(ok): recordings/longnow (dataset) [sibling repository 'gitlab' created at https://gitlab.com/my-datalad-root-level-group/DataLad-101_flat-recordings-longnow]
configure-sibling(ok): . (sibling)
action summary:
configure-sibling (ok: 4)
create_sibling_gitlab (ok: 4)
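The naming scheme behind these results can be sketched in a few lines. The function ``gitlab_project_path`` below is a hypothetical helper, not DataLad's actual implementation. Its ``flat`` branch reproduces the project paths in the output above. Its ``collection`` branch follows the layout description and the single-dataset example earlier in this section, and the handling of subdatasets there is an assumption based on that description. The defaults ``-`` and ``project`` mirror the configurable defaults mentioned in the footnote.

```python
def gitlab_project_path(root, relpath, layout="collection",
                        sep="-", projectname="project"):
    """Derive the GitLab project path for a dataset in a hierarchy.

    root:    name configured for the top-most dataset,
             e.g. 'my-datalad-root-level-group/DataLad-101_flat'
    relpath: path of the dataset relative to the top-most dataset,
             '.' for the top-most dataset itself
    """
    is_root = relpath in ("", ".")
    # Replace all path separators in the relative path
    flattened = relpath.strip("/").replace("/", sep)
    if layout == "flat":
        # All datasets are projects in the same group; names extend the root name
        return root if is_root else f"{root}{sep}{flattened}"
    if layout == "collection":
        # root names a group; the top-most dataset lives in its "project" project
        return f"{root}/{projectname}" if is_root else f"{root}/{flattened}"
    raise ValueError(f"unknown layout: {layout}")


root = "my-datalad-root-level-group/DataLad-101_flat"
for relpath in (".", "midterm_project", "midterm_project/input", "recordings/longnow"):
    print(gitlab_project_path(root, relpath, layout="flat"))
```

Running this prints the same four project paths that appear in the URLs reported by ``create-sibling-gitlab`` above.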
Final step: Pushing to GitLab
"""""""""""""""""""""""""""""
Once you have set up your dataset sibling(s), you can push individual datasets with ``datalad push --to gitlab`` or push recursively across a hierarchy by adding the ``--recursive`` flag to the push command.
.. _gitea: https://about.gitea.com
.. rubric:: Footnotes
.. [#f1] Many repository hosting services have useful features to make your work citeable.
For example, :term:`GIN` is able to assign a :term:`DOI` to your dataset, and GitHub allows ``CITATION.cff`` files. At the same time, archival services such as `Zenodo <https://zenodo.org>`_ often integrate with published repositories, allowing you to preserve your dataset with them.
.. [#f2] Your private SSH key is incredibly valuable, and it is important to keep
it secret!
Anyone who gets your private key has access to anything that the public key
is protecting. If the private key does not have a passphrase, simply copying
this file grants a person access!
.. [#f3] GitHub `deprecated user-password authentication <https://developer.github.com/changes/2020-02-14-deprecating-password-auth>`_ in favor of authentication via personal access token. Supplying a password instead of a token will fail to authenticate.
.. index::
single: configuration item; datalad.gitlab-default-projectname
single: configuration item; datalad.gitlab-default-pathseparator
.. [#f4] The default project name ``project`` and path separator ``-`` are configurable using the dataset-level configurations ``datalad.gitlab-default-projectname`` and ``datalad.gitlab-default-pathseparator``.