datalad-handbook/docs/basics/101-114-txt2git.rst
2023-11-10 18:15:33 +01:00


.. _text2git:

Data safety
-----------

Later in the day, after seeing and solving so many DataLad error messages,
you fall into bed, exhausted. Just as you are about to fall asleep, a thought
crosses your mind:
"I now know that tracked content in a dataset is protected by :term:`git-annex`.
Whenever tracked contents are ``saved``, they get locked and should not be
modifiable. But... what about the notes that I have been taking since the first day?
Should I not need to unlock them before I can modify them? And also the script!
I was able to modify this despite giving it to DataLad to track, with
no permission denied errors whatsoever! How does that work?"
This night, though, your question stays unanswered and you fall into a restless
sleep filled with bad dreams about "permission denied" errors. The next day you are
the first student in your lecturer's office hours.
"Oh, you are really attentive. This is a great question!" our lecturer starts
to explain.

.. figure:: ../artwork/src/teacher.svg
   :width: 50%

.. index:: ! dataset procedure; text2git

Do you remember that we created the ``DataLad-101`` dataset with a
specific configuration template? It was the ``-c text2git`` option we
provided in the beginning of :ref:`createDS`. It is because of this configuration
that we can modify ``notes.txt`` without unlocking its content first.
The second commit message in our dataset's history summarizes this (outputs are shortened):

.. runrecord:: _examples/DL-101-114-101
   :language: console
   :workdir: dl-101
   :emphasize-lines: 3
   :lines: 1-10
   :realcommand: cd DataLad-101 && git log --reverse --oneline
   :notes: Confusing: Why could we modify the tsv file without unlocking? The reason is in the dataset configuration with text2git
   :cast: 03_git_annex_basics

   $ git log --reverse --oneline

Instead of giving text files such as your notes or your script
to git-annex, the dataset stores them in :term:`Git`.
But what does it mean if files are in Git instead of git-annex?
Well, procedurally it means that everything that is stored in git-annex is
content-locked, and everything that is stored in Git is not. You can modify
content stored in Git straight away, without unlocking it first.

.. _fig-gitvsannex:

.. figure:: ../artwork/src/git_vs_gitannex.svg
   :alt: A simplified illustration of content lock in files managed by git-annex.
   :width: 50%

   A simplified overview of the tools that manage data in your dataset.

That's easy enough, and illustrated in :numref:`fig-gitvsannex`.
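
The difference is also visible on the file system. The following is a minimal
sketch, not part of DataLad. It assumes a dataset that is *not* in adjusted mode,
where git-annex represents locked content as a symbolic link pointing into
``.git/annex/objects``. The helper name ``is_locked_annex_file``, the file names,
and the link target are made up for illustration, and the "dataset" is only a
mock-up in a temporary directory:

.. code-block:: python

   import os
   import tempfile
   from pathlib import Path


   def is_locked_annex_file(path):
       """Return True if ``path`` looks like a locked, annexed file.

       Assumption: the dataset is not in adjusted mode, so git-annex
       represents locked content as a symlink into .git/annex/objects.
       """
       path = Path(path)
       if not path.is_symlink():
           # A regular file: stored in Git, unlocked, or not tracked at all
           return False
       target = Path(os.readlink(path)).as_posix()
       return ".git/annex/objects" in target


   # Mock-up of a dataset layout (no DataLad needed, all names are hypothetical)
   with tempfile.TemporaryDirectory() as tmp:
       root = Path(tmp)
       notes = root / "notes.txt"
       notes.write_text("One can modify me right away.\n")
       book = root / "book.pdf"
       # A dangling symlink, shaped like the ones git-annex creates
       book.symlink_to(".git/annex/objects/xx/yy/HYPOTHETICAL-KEY/HYPOTHETICAL-KEY")
       print(notes.name, is_locked_annex_file(notes))  # notes.txt False
       print(book.name, is_locked_annex_file(book))    # book.pdf True

In a real dataset, the same check could be pointed at any tracked file. A regular
file does not prove that content is stored in Git, though: it may also be an
unlocked or untracked file.
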
"So, first of all: If we hadn't provided the ``-c text2git`` argument, text files
would get content-locked, too?" "Yes, indeed. However, there are also ways to
later change how file content is handled based on its type or size. This can be specified
in the ``.gitattributes`` file, using ``annex.largefiles`` options.
But there will be a lecture on that [#f1]_."
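
To give a rough idea of what such a rule looks like, here is a sketch of
``.gitattributes`` entries. The first rule resembles what the ``text2git``
configuration is understood to write, but the exact content may differ between
DataLad versions, so check the ``.gitattributes`` file in your own dataset. The
second rule is a purely hypothetical addition:

.. code-block:: none

   # Resembles the text2git rule (assumption, verify in your own dataset):
   # only non-empty files with binary encoding are given to git-annex
   * annex.largefiles=((mimeencoding=binary)and(largerthan=0))
   # Hypothetical extra rule: always keep CSV files in Git
   *.csv annex.largefiles=nothing

Everything that matches an ``annex.largefiles`` expression is handled by
git-annex. Everything else is stored in Git.
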
"Okay, well, second: Isn't it much easier to just not bother with locking and
unlocking, and have everything 'stored in Git'? Even if :dlcmd:`run` takes care
of unlocking content, I do not see the point of git-annex", you continue.
Here it gets tricky. To begin with the most important and most straightforward fact:
It is not feasible to store
large files in Git. This is because Git would very quickly run into severe performance
issues. Hosting sites for projects using Git, such as :term:`GitHub` or :term:`GitLab`,
also restrict file sizes. GitHub, for example, rejects files larger than 100 MB.
For now, we have solved the mystery of why text files can be modified
without unlocking, and this makes a small
dent in the vast number of questions that have piled up in our curious
minds. Essentially, git-annex protects your data from accidental modifications
and thus keeps it safe. :dlcmd:`run` commands take care of this technical
complexity entirely if ``-o/--output`` is specified properly, and
:dlcmd:`unlock` commands can be used to unlock content "by hand" if
modifications are performed outside of a :dlcmd:`run`.

.. index::
   pair: adjusted mode; git-annex concept

But there comes the second, tricky part: There are ways to get rid of locking and
unlocking within git-annex, using so-called :term:`adjusted branch`\es.
Whether this functionality is available and advisable depends on the git-annex version one has installed, the git-annex version of the repository, and a use-case dependent weighing of the pros and cons.
On Windows systems, this *adjusted mode* is even the *only* mode of operation.
In later sections we will see how to use this feature.
The next lecture, in any case, will guide us deeper into git-annex and improve our understanding a bit further.

.. rubric:: Footnotes

.. [#f1] If you cannot wait to read about ``.gitattributes`` and other
         configuration files, jump ahead to chapter :ref:`chapter_config`,
         starting with section :ref:`config`.