.. _FAQ:

Frequently asked questions
--------------------------
This section answers frequently asked questions about high-level DataLad
concepts or commands. If you have a question you want to see answered in here,
`please create an issue <https://github.com/datalad-handbook/book/issues/new>`_
or a `pull request <https://handbook.datalad.org/contributing.html>`_.
For a series of specialized command snippets for various use cases, please see
section :ref:`gists`.

What is Git?
^^^^^^^^^^^^
Git is a free and open source distributed version control system. In a
directory that is initialized as a Git repository, it can track small-sized
files and the modifications done to them.
Git thinks of its data like a *series of snapshots* -- it basically takes a
picture of what all files look like whenever a modification in the repository
is saved. It is a powerful and yet small and fast tool with many features such
as *branching and merging* for independent development, *checksumming* of
contents for integrity, and *easy collaborative workflows* thanks to its
distributed nature.

DataLad uses Git underneath the hood. Every DataLad dataset is a Git
repository, and you can use any Git command within a DataLad dataset. Based
on the configurations in ``.gitattributes``, file content can be version
controlled by Git or managed by git-annex, based on path patterns, file types,
or file size. The section :ref:`config2` details how these configurations work.

`This chapter <https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3F>`_
gives a comprehensive overview on what Git is.

Where is Git's "staging area" in DataLad datasets?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned in :ref:`populate`, a local version control workflow with
DataLad "skips" the staging area (that is typical for Git workflows) from the
user's point of view.
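To make the contrast concrete, here is a minimal sketch of Git's two-step workflow next to its single-step DataLad counterpart (repository and file names are placeholders; the DataLad command is shown as a comment, since it requires a DataLad dataset):

.. code-block:: bash

   # Git's workflow records a change in two steps: stage, then commit
   git init demo && cd demo
   git config user.name "demo" && git config user.email "demo@example.com"
   echo "some content" > file.txt
   git add file.txt              # explicit staging step
   git commit -m "add file"      # commits what was staged
   # In a DataLad dataset, a single command records the change,
   # with no user-visible staging step:
   #   datalad save -m "add file"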

What is git-annex?
^^^^^^^^^^^^^^^^^^
git-annex (`https://git-annex.branchable.com/ <https://git-annex.branchable.com>`_)
is a distributed file synchronization system written by Joey Hess. It can
share and synchronize large files independently of a commercial service or a
central server. It does so by managing all file *content* in a separate
directory (the *annex*, *object tree*, or *key-value store* in ``.git/annex/objects/``),
and placing only file names and
metadata into version control by Git. Among many other features, git-annex
can ensure sufficient amounts of file copies to prevent accidental data loss and
enables a variety of data transfer mechanisms.

DataLad uses git-annex underneath the hood for file content tracking and
transport logistics. git-annex offers an astonishing range of functionality
that DataLad tries to expose in full. That being said, any DataLad dataset
(with the exception of datasets configured to be pure Git repositories) is
fully compatible with git-annex -- you can use any git-annex command inside a
DataLad dataset.

The chapter :ref:`chapter_gitannex` can give you more insights into how git-annex
takes care of your data. git-annex's `website <https://git-annex.branchable.com>`_
offers a complete walk-through and detailed technical background
information.

What does DataLad add to Git and git-annex?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
DataLad sits on top of Git and git-annex and tries to integrate and expose
their functionality fully. While DataLad thus is a "thin layer" on top of
these tools and tries to minimize the use of unique/idiosyncratic functionality,
it also tries to simplify working with repositories and adds a range of useful concepts
and functions:

- Both Git and git-annex are made to work with a single repository at a time.
  For example, while nesting pure Git repositories is possible via Git
  submodules (which DataLad also uses internally), *cleaning up* after
  placing a random file somewhere into this repository hierarchy can be very
  painful. A key advantage that DataLad brings to the table is that it makes
  the boundaries between repositories vanish from a user's point
  of view. Most core commands have a ``--recursive`` option that will discover
  and traverse any subdatasets and do the right thing.
  Whereas Git and git-annex would require the caller to first ``cd`` to the target
  repository, DataLad figures out which repository the given paths belong to and
  then works within that repository.
  :dlcmd:`save . --recursive` will solve the subdataset problem above,
  for example, no matter what was changed/added, no matter where in a tree
  of subdatasets.
- DataLad provides users with the ability to act on "virtual" file paths. If
  software needs data files that are carried in a subdataset (in Git terms:
  submodule) for a computation or test, a ``datalad get`` will discover if
  there are any subdatasets to install at a particular version to eventually
  provide the file content.
- DataLad adds metadata facilities for metadata extraction in various flavors,
  and can store extracted and aggregated metadata under ``.datalad/metadata``.

.. todo::

   more here.


What kind of data is compatible with DataLad datasets?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Any information that can be expressed in digital files.
This includes text files, tabular data, and images -- in any file format, of any size, and in any number.

What does a DataLad dataset contain? In what way is it “lightweight”?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Simply speaking, a DataLad dataset only contains metadata:
metadata on the identity and availability of data.
On a computer with the DataLad software installed, a DataLad dataset looks like any other folder with files.
However, file content is only obtained on access:
when file content is needed, it is downloaded from any of the known storage locations for that file.
This download only succeeds if the requesting user has the necessary authorization to access the file.

Does DataLad host my data?
^^^^^^^^^^^^^^^^^^^^^^^^^^
No, DataLad manages your data, but it does not host it. When publishing a
dataset with annexed data, you will need to find a place that the large file
content can be stored in -- this could be a web server, a cloud service such
as `Dropbox <https://www.dropbox.com>`_, an S3 bucket, or many other storage
solutions -- and set up a publication dependency on this location.
This gives you all the freedom to decide where your data lives, and who can
have access to it. Once this setup is complete, publishing and accessing a
published dataset and its data are as easy as if the data were stored on your own
machine.

You can find a typical workflow in the chapter :ref:`chapter_thirdparty`.

Is my data automatically "open" when I publish it with DataLad?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
How openly your data is published is your choice.
Combined with a fitting hosting provider, you can make all your data openly available, or available only partially, or to a specific audience, or only in the form of metadata.

Can I selectively publish only some data?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Yes. Or even just selected metadata.

How does GitHub relate to DataLad?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
DataLad can make good use of GitHub, if you have figured out storage for your
large files otherwise. You can make DataLad publish file content to one location
and afterwards automatically push an update to GitHub, such that
users can install directly from GitHub and seemingly also obtain large file
content from GitHub. GitHub is also capable of resolving submodule/subdataset
links to other GitHub repos, which makes for a nice UI.

Does DataLad scale to large dataset sizes?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In general, yes. The largest dataset managed by DataLad at this point is the `Human Connectome Project <http://www.humanconnectomeproject.org>`_ data, encompassing 80 Terabytes of data in 15 million files, and larger projects (up to 500TB) are currently actively worked on.
The chapter :ref:`chapter_gobig` is a guide to "beyond-household-quantity datasets".

What is the difference between a superdataset, a subdataset, and a dataset?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Conceptually and technically, there is no difference between a dataset, a
subdataset, or a superdataset. The only aspect that makes a dataset a sub- or
superdataset is whether it is *registered* in another dataset (by means of an entry in the
``.gitmodules``, automatically performed upon an appropriate ``datalad
clone -d`` or ``datalad create -d`` command) or contains registered datasets.
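Under the hood, this registration uses Git's submodule machinery. The mechanics can be sketched with plain Git alone (repository names are placeholders; DataLad's ``datalad create -d``/``datalad clone -d`` perform the equivalent bookkeeping, and additionally record a dataset ID in ``.gitmodules``):

.. code-block:: bash

   # a repository that will play the role of the subdataset
   git init sub
   git -C sub -c user.name=demo -c user.email=demo@example.com \
       commit --allow-empty -m "init subdataset"
   # a second repository becomes the superdataset by *registering* the first
   git init super && cd super
   git -c protocol.file.allow=always submodule add ../sub sub
   # the registration itself is a versioned entry in .gitmodules
   cat .gitmodules

The ``protocol.file.allow`` setting is only needed because recent Git versions restrict cloning from local paths.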

How can I convert/import/transform an existing Git or git-annex repository into a DataLad dataset?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can transform any existing Git or git-annex repository of yours into a
DataLad dataset by running:

.. code-block:: console

   $ datalad create -f

inside of it. Afterwards, you may want to tweak settings in ``.gitattributes``
according to your needs (see sections :ref:`config` and :ref:`config2` for
additional insights on this).
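For instance, a ``.gitattributes`` along the following lines (a sketch -- the patterns and the size threshold are placeholders to adapt to your data) would hand large files to git-annex while keeping text files and code in Git:

.. code-block:: text

   * annex.largefiles=(largerthan=100kb)
   *.txt annex.largefiles=nothing
   code/** annex.largefiles=nothing

With such rules in place, a subsequent :dlcmd:`save` decides between Git and git-annex based on these configurations.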
The chapter :ref:`chapter_retro` guides you through transitioning an existing project into DataLad.

How can I convert an existing DataLad dataset with annexed data back to a plain Git repository?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you decide to stop using git-annex or DataLad, or if you want to turn an annex repo back into a plain Git repo, you can do so with the ``git annex uninit`` command.
The section :ref:`uninit` contains more details.

What does DataLad cost?
^^^^^^^^^^^^^^^^^^^^^^^
DataLad is free and open source software. There are no fees, no running costs.
A necessary investment is to learn how to use this tool.

Who develops DataLad?
^^^^^^^^^^^^^^^^^^^^^
DataLad is an international academic open source project with more than a hundred contributors, spearheaded by a US-German collaboration between Dartmouth College and the Research Centre Jülich.

How can I cite DataLad?
^^^^^^^^^^^^^^^^^^^^^^^
Please cite the official paper on DataLad:

Halchenko et al. (2021). DataLad: distributed system for joint management of code, data, and their relationship. Journal of Open Source Software, 6(63), 3262, `https://doi.org/10.21105/joss.03262 <https://doi.org/10.21105/joss.03262>`_.

.. _dataset_textblock:


How can I help others get started with a shared dataset?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you want to share your dataset with users that are not already familiar with
DataLad, it is helpful to include some information on how to interact with
DataLad datasets in your dataset's ``README`` (or similar) file.
If you do not want to invent a description yourself, you can run
:dlcmd:`add-readme` in your dataset, and have one added automatically.

.. only:: html

   Below, we provide a standard text block that you can use (and adapt as you
   wish) for such purposes.

.. find-out-more:: Textblock in .rst format:

   .. code-block:: rst

      DataLad datasets and how to use them
      ------------------------------------

      This repository is a `DataLad <https://www.datalad.org>`__ dataset. It provides
      fine-grained data access down to the level of individual files, and allows for
      tracking future updates. In order to use this repository for data retrieval,
      `DataLad <https://www.datalad.org>`_ is required.
      It is a free and open source command line tool, available for all
      major operating systems, and builds up on Git and `git-annex
      <https://git-annex.branchable.com>`__ to allow sharing, synchronizing, and
      version controlling collections of large files. You can find information on
      how to install DataLad at `handbook.datalad.org/intro/installation.html
      <https://handbook.datalad.org/intro/installation.html>`_.

      Get the dataset
      ^^^^^^^^^^^^^^^

      A DataLad dataset can be ``cloned`` by running:

      .. code-block:: bash

         datalad clone <url>

      Once a dataset is cloned, it is a light-weight directory on your local machine.
      At this point, it contains only small metadata and information on the
      identity of the files in the dataset, but not actual *content* of the
      (sometimes large) data files.

      Retrieve dataset content
      ^^^^^^^^^^^^^^^^^^^^^^^^

      After cloning a dataset, you can retrieve file contents by running:

      .. code-block:: bash

         datalad get <path/to/directory/or/file>

      This command will trigger a download of the files, directories, or
      subdatasets you have specified.

      DataLad datasets can contain other datasets, so called *subdatasets*. If you
      clone the top-level dataset, subdatasets do not yet contain metadata and
      information on the identity of files, but appear to be empty directories. In
      order to retrieve file availability metadata in subdatasets, run:

      .. code-block:: bash

         datalad get -n <path/to/subdataset>

      Afterwards, you can browse the retrieved metadata to find out about
      subdataset contents, and retrieve individual files with ``datalad get``. If you
      use ``datalad get <path/to/subdataset>``, all contents of the subdataset will
      be downloaded at once.

      Stay up-to-date
      ^^^^^^^^^^^^^^^

      DataLad datasets can be updated. The command ``datalad update`` will *fetch*
      updates and store them on a different branch (by default
      ``remotes/origin/main``). Running

      .. code-block:: bash

         datalad update --merge

      will *pull* available updates and integrate them in one go.

      Find out what has been done
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^

      DataLad datasets contain their history in the ``git log``.
      By running ``git log`` (or a tool that displays Git history) in the dataset or on
      specific files, you can find out what has been done to the dataset or to individual files
      by whom, and when.

      More information
      ^^^^^^^^^^^^^^^^

      More information on DataLad and how to use it can be found in the DataLad Handbook at
      `handbook.datalad.org <https://handbook.datalad.org>`_. The
      chapter "DataLad datasets" can help you to familiarize yourself with the
      concept of a dataset.

.. find-out-more:: Textblock in markdown format

   .. code-block:: md

      [![made-with-datalad](https://www.datalad.org/badges/made_with.svg)](https://datalad.org)

      ## DataLad datasets and how to use them

      This repository is a [DataLad](https://www.datalad.org/) dataset. It provides
      fine-grained data access down to the level of individual files, and allows for
      tracking future updates. In order to use this repository for data retrieval,
      [DataLad](https://www.datalad.org/) is required. It is a free and
      open source command line tool, available for all major operating
      systems, and builds up on Git and [git-annex](https://git-annex.branchable.com/)
      to allow sharing, synchronizing, and version controlling collections of
      large files. You can find information on how to install DataLad at
      [handbook.datalad.org/intro/installation.html](https://handbook.datalad.org/intro/installation.html).

      ### Get the dataset

      A DataLad dataset can be `cloned` by running

      ```
      datalad clone <url>
      ```

      Once a dataset is cloned, it is a light-weight directory on your local machine.
      At this point, it contains only small metadata and information on the
      identity of the files in the dataset, but not actual *content* of the
      (sometimes large) data files.

      ### Retrieve dataset content

      After cloning a dataset, you can retrieve file contents by running

      ```
      datalad get <path/to/directory/or/file>
      ```

      This command will trigger a download of the files, directories, or
      subdatasets you have specified.

      DataLad datasets can contain other datasets, so called *subdatasets*.
      If you clone the top-level dataset, subdatasets do not yet contain
      metadata and information on the identity of files, but appear to be
      empty directories. In order to retrieve file availability metadata in
      subdatasets, run

      ```
      datalad get -n <path/to/subdataset>
      ```

      Afterwards, you can browse the retrieved metadata to find out about
      subdataset contents, and retrieve individual files with `datalad get`.
      If you use `datalad get <path/to/subdataset>`, all contents of the
      subdataset will be downloaded at once.

      ### Stay up-to-date

      DataLad datasets can be updated. The command `datalad update` will
      *fetch* updates and store them on a different branch (by default
      `remotes/origin/main`). Running

      ```
      datalad update --merge
      ```

      will *pull* available updates and integrate them in one go.

      ### Find out what has been done

      DataLad datasets contain their history in the `git log`.
      By running `git log` (or a tool that displays Git history) in the dataset or on
      specific files, you can find out what has been done to the dataset or to individual files
      by whom, and when.

      ### More information

      More information on DataLad and how to use it can be found in the DataLad Handbook at
      [handbook.datalad.org](https://handbook.datalad.org/index.html). The chapter
      "DataLad datasets" can help you to familiarize yourself with the concept of a dataset.

.. find-out-more:: Textblock without formatting

   .. code-block:: text

      DataLad datasets and how to use them

      This repository is a DataLad (https://www.datalad.org) dataset. It provides
      fine-grained data access down to the level of individual files, and allows
      for tracking future updates. In order to use this repository for data
      retrieval, DataLad is required. It is a free and open source command line
      tool, available for all major operating systems, and builds up on Git and
      git-annex (https://git-annex.branchable.com) to allow sharing,
      synchronizing, and version controlling collections of large files. You can
      find information on how to install DataLad at
      https://handbook.datalad.org/intro/installation.html.

      Get the dataset

      A DataLad dataset can be "cloned" by running 'datalad clone <url>'.
      Once a dataset is cloned, it is a light-weight directory on your local
      machine.
      At this point, it contains only small metadata and information on the
      identity of the files in the dataset, but not actual content of the
      (sometimes large) data files.

      Retrieve dataset content

      After cloning a dataset, you can retrieve file contents by running
      'datalad get <path/to/directory/or/file>'.
      This command will trigger a download of the files, directories, or
      subdatasets you have specified.

      DataLad datasets can contain other datasets, so called "subdatasets".
      If you clone the top-level dataset, subdatasets do not yet contain
      metadata and information on the identity of files, but appear to be
      empty directories. In order to retrieve file availability metadata in
      subdatasets, run 'datalad get -n <path/to/subdataset>'.
      Afterwards, you can browse the retrieved metadata to find out about
      subdataset contents, and retrieve individual files with 'datalad get'.
      If you use 'datalad get <path/to/subdataset>', all contents of the
      subdataset will be downloaded at once.

      Stay up-to-date

      DataLad datasets can be updated. The command 'datalad update' will
      "fetch" updates and store them on a different branch (by default
      'remotes/origin/main'). Running 'datalad update --merge' will "pull"
      available updates and integrate them in one go.

      Find out what has been done

      DataLad datasets contain their history in the Git log.
      By running 'git log' (or a tool that displays Git history) in the dataset or on
      specific files, you can find out what has been done to the dataset or to individual files
      by whom, and when.

      More information

      More information on DataLad and how to use it can be found in the DataLad Handbook at
      https://handbook.datalad.org/index.html. The chapter "DataLad datasets"
      can help you to familiarize yourself with the concept of a dataset.


What is the difference between DataLad, Git LFS, and Flywheel?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`Flywheel <https://flywheel.io>`_ is an informatics platform for biomedical
research and collaboration.
`Git Large File Storage <https://github.com/git-lfs/git-lfs>`_ (Git LFS) is a
command line tool that extends Git with the ability to manage large files. In
that respect, it appears similar to git-annex.

.. todo::

   this.

A more elaborate delineation from related solutions can be found in the DataLad
`developer documentation <https://docs.datalad.org/related.html>`_.

What is the difference between DataLad and DVC?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`DVC <https://dvc.org>`_ is a version control system for machine learning projects.
We have compared the two tools in a dedicated handbook section, :ref:`dvc`.

DataLad version-controls my large files -- great. But how much is saved in total?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. todo::

   this.

.. _copydata:


How can I copy data out of a DataLad dataset?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Moving or copying data out of a DataLad dataset is always possible and works in
many cases just like in any regular directory. The only
caveat exists in the case of annexed data: If file content is managed with
git-annex and stored in the :term:`object-tree`, what *appears* to be the
file in the dataset is merely a symlink (please read section :ref:`symlink`
for details). Moving or copying this symlink will not yield the
intended result -- instead you will have a broken symlink outside of your
dataset.

When using the terminal command ``cp`` [#f1]_, it is sufficient to use the
``-L``/``--dereference`` option. This will follow symbolic links, and make
sure that actual file content gets copied instead of symlinks.
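The effect of ``-L`` can be demonstrated with a plain symlink and standard tools alone (a sketch with placeholder names; in a real dataset the symlink points into ``.git/annex/objects``, so a non-dereferenced copy would dangle outside of the dataset):

.. code-block:: bash

   mkdir -p src
   echo "file content" > src/target.txt
   ln -s target.txt src/link.txt        # stands in for an annexed file
   # a recursive copy without -L copies the symlink itself
   cp -R src copy-with-links
   # with -L, the symlink is dereferenced and real content is copied
   cp -RL src copy-with-content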
Remember that if you are copying some annexed content out of a dataset without
unlocking it first, you will only have "read" :term:`permissions` on the files you have just
copied. Therefore you can:

- either unlock the files before copying them out,
- or copy them and then use the command ``chmod`` to be able to edit the file.


.. code-block:: console

   $ # this will give you 'write' permission on the file
   $ chmod +w filename

If you are not familiar with how ``chmod`` works (or if you forgot -- let's be honest, we
all google it sometimes), have a look at `this tutorial <https://bids.github.io/2015-06-04-berkeley/shell/07-perm.html>`_.

With tools other than ``cp`` (e.g., graphical file managers), make sure annexed
content is *unlocked* before you copy or move it:
after a :dlcmd:`unlock`, copying and moving contents will work fine.
A subsequent :dlcmd:`save` in the dataset will annex the content
again.

Is there Python 2 support for DataLad?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
No, Python 2 support was dropped in
`September 2019 <https://github.com/datalad/datalad/pull/3629>`_.

Is there a graphical user interface for DataLad?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Yes, a dedicated :term:`DataLad extension`, ``datalad-gooey``, provides a graphical user interface for DataLad.
You can read more about it in the section :ref:`gooey`.

How does DataLad interface with OpenNeuro?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
`OpenNeuro <https://openneuro.org>`_ is a free and open platform for sharing MRI,
MEG, EEG, iEEG, and ECoG data. It publishes hosted data as DataLad datasets on
:term:`GitHub`. The entire collection can be found at
`github.com/OpenNeuroDatasets <https://github.com/OpenNeuroDatasets>`_. You can
obtain the datasets just as any other DataLad datasets with :dlcmd:`clone`
or :dlcmd:`install`.
There is more info about this in the :ref:`OpenNeuro Quickstart Guide <openneuro>`.

How does DataLad process the data given to it?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
DataLad does not modify file content, nor does it enforce a particular data organization inside a dataset.
From a user's perspective, a DataLad dataset is a regular directory on a computer's file system.
This directory is populated with files that a user has placed into it.
DataLad manages identity information of these files over time (version-controlled content identifiers, typically based on checksums).
DataLad also assists with file transport (upload/download) to and from this directory, and tracks the associated file content availability metadata.
Users may also associate arbitrary additional metadata with any file content or dataset version.
Any and all metadata and DataLad-internal management information is kept separate from the managed content, but located inside the managed directory (in a ``.git`` subdirectory).
No information is transmitted to a location outside this local directory, unless a user explicitly performs such an action.

DataLad is agnostic with respect to the content of a file it manages within a dataset.
DataLad reads all file content in binary form for the sole purpose of computing a content identifier, which is typically based on a checksum (e.g., MD5 or SHA1).
This content identifier is used to associate file content availability and other metadata.

DataLad supports the execution of user-defined, user-provided metadata extractor algorithms.
These software components can process files of a particular format in order to derive metadata from them.
Volume, format, and terminology of such metadata are determined by the provider of an extractor implementation, and a user's parameterization.

DataLad also supports the execution of user-defined programs and scripts.
When executing them through DataLad, users can record inputs and parameters of such a process, and DataLad can capture the identity of any generated output files.
This enables metadata-based queries on the origin of files, and programmatic recomputation.

.. _bidsvalidator:


BIDS validator issues in datasets with missing file content
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As outlined in section :ref:`symlink`, all unretrieved files in datasets are broken symlinks.
This is desired, and not a problem per se, but some tools, among them the `BIDS validator <https://github.com/bids-standard/bids-validator>`_, can be confused by this.
Should you attempt to validate a dataset in which all or some file contents are missing, for example after cloning a dataset or after dropping file contents, the validator may fail to report on the validity of the complete dataset or the specific unretrieved files.

If you aim for a complete validation of your dataset, re-do the validation after retrieving all necessary file contents.
If you only aim to validate file names and structure, invoke the BIDS validator with the additional flags ``--ignoreNiftiHeaders`` and ``--ignoreSymlinks``.

.. _gitannexbranch:


What is the git-annex branch?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If your DataLad dataset contains an annex, there is also a ``git-annex`` :term:`branch`
that is created, used, and maintained solely by :term:`git-annex`. It is completely
unconnected to any other branches in your dataset, and contains different types
of log files.

The contents of this branch are used for git-annex internal tracking of the
dataset and its annexed contents. For example, git-annex stores information where
file content can be retrieved from in a ``.log`` file for each object, and if the object
was obtained from web-sources (e.g., with :dlcmd:`download-url`), a
``.log.web`` file stores the URL. Other files in this branch store information about
the known remotes of the dataset and their description, if they have one.

You can find out much more about the ``git-annex`` branch and its contents in the
`documentation <https://git-annex.branchable.com/internals>`_.
This branch, however, is managed by git-annex, and you should not tamper with it.

.. _gitannexdefault:


Help - Why does GitHub display my dataset with git-annex as the default branch?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If your dataset is represented on GitHub with cryptic directories instead of actual file names, GitHub probably declared the :term:`git-annex branch` to be your repository's "default branch".
Here is an example:

.. figure:: ../artwork/src/defaultgitannex_light.png

This is related to GitHub's decision to make ``main`` `the default branch for newly created repositories <https://github.blog/changelog/2020-10-01-the-default-branch-for-newly-created-repositories-is-now-main>`_ -- datasets that do not have a ``main`` branch (but, for example, a ``master`` branch) may end up with a different branch being displayed on GitHub than intended.

To fix this for present and/or future datasets, the default branch can be configured to a branch name of your choice on a repository or organizational level `via GitHub's web-interface <https://github.blog/changelog/2020-08-26-set-the-default-branch-for-newly-created-repositories>`_.
Alternatively, you can rename an existing ``master`` branch to ``main`` using ``git branch -m master main`` (but beware of unforeseen consequences -- your collaborators may try to ``update`` the ``master`` branch but fail, continuous integration workflows could still try to use ``master``, etc.).
Lastly, you can initialize new datasets with ``main`` instead of ``master`` -- either with a global Git configuration [#f2]_ for ``init.defaultBranch`` (``git config --global init.defaultBranch main``), or by appending ``git init``'s ``--initial-branch`` option to a ``datalad create`` call (``datalad create mydataset --initial-branch main``) [#f3]_.
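A quick way to verify that such a configuration has the intended effect (a sketch with a placeholder repository name, assuming Git 2.28 or later, which introduced ``--initial-branch``):

.. code-block:: bash

   # create a test repository with "main" as its initial branch
   git init --initial-branch main demo-main
   # confirm which branch HEAD points to; prints: main
   git -C demo-main symbolic-ref --short HEAD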

.. rubric:: Footnotes

.. [#f1] The absolutely amazing `Midnight Commander <https://github.com/MidnightCommander/mc>`_
   (``mc``) can also follow symlinks.

.. [#f2] See the section :ref:`config` for more info on configurations.

.. [#f3] ``--initial-branch`` is not one of ``datalad create``'s parameters, but a parameter of a ``git init`` call. You can specify any of ``git init``'s parameters as the last arguments of ``datalad create`` (after the ``PATH``), and they will be passed on to ``git init``.