.. index::
pair: rerun; DataLad command
.. _run2:
DataLad, rerun!
----------------
So far, you created a ``.tsv`` file of all
speakers and talk titles in the ``longnow/`` podcasts subdataset.
Let's actually take a look into this file now:
.. runrecord:: _examples/DL-101-109-101
:language: console
:workdir: dl-101/DataLad-101
:lines: 1-3,5-7
:append: -✂--✂-
:notes: The script produced a simple list of podcast titles. Let's take a look into our output file. What's cool is that it was created in a way that the code and output are linked:
:cast: 02_reproducible_execution
$ less recordings/podcasts.tsv
Not too bad, and certainly good enough for the podcast night people.
What's been cool about creating this file is that it was created with
a script within a :dlcmd:`run` command. Thanks to :dlcmd:`run`,
the output file ``podcasts.tsv`` is associated with the script that
generated it.
Upon reviewing the list you realized that you made a mistake, though: you only
listed the talks in the SALT series (the
``Long_Now__Seminars_About_Long_term_Thinking/`` directory), but not
in the ``Long_Now__Conversations_at_The_Interval/`` directory.
Let's fix this in the script. Replace the contents of ``code/list_titles.sh``
with the following, fixed script:
.. windows-wit:: Here's a script adjustment for Windows users
.. include:: topic/globscript2-windows.rst
.. runrecord:: _examples/DL-101-109-102
:language: console
:workdir: dl-101/DataLad-101
:emphasize-lines: 2
:notes: Dang, we made a mistake in our script: we only listed a part of the podcasts! Let's fix the script:
:cast: 02_reproducible_execution
$ cat << EOT >| code/list_titles.sh
for i in recordings/longnow/Long_Now*/*.mp3; do
# get the filename
base=\$(basename "\$i");
# strip the extension
base=\${base%.mp3};
printf "\${base%%__*}\t" | tr '_' '-';
# name and title without underscores
printf "\${base#*__}\n" | tr '_' ' ';
done
EOT
Because the script is now modified, save the modifications to the dataset.
We can use the shorthand "BF" to denote "Bug fix" in the commit message.
.. runrecord:: _examples/DL-101-109-103
:language: console
:workdir: dl-101/DataLad-101
:cast: 02_reproducible_execution
$ datalad status
.. runrecord:: _examples/DL-101-109-104
:language: console
:workdir: dl-101/DataLad-101
:cast: 02_reproducible_execution
$ datalad save -m "BF: list both directories content" \
code/list_titles.sh
What we *could* do is run the same :dlcmd:`run` command as before to recreate
the file, but now with all of the contents:
.. code-block:: console
$ # do not execute this!
$ datalad run -m "create a list of podcast titles" \
"bash code/list_titles.sh > recordings/podcasts.tsv"
However, imagine a situation in which the command is much longer than this one,
or in which many months have passed since its first execution. It would not be easy to remember
the command, nor would it be very convenient to copy it from the ``run record``.
Luckily, a fellow student remembered the DataLad way of re-executing
a ``run`` command, and he's eager to show it to you.
"In order to re-execute a :dlcmd:`run` command,
find the commit and use its :term:`shasum` (or a :term:`tag`, or anything else that Git
understands) as an argument for the
:dlcmd:`rerun` command! That's it!",
he says happily.
So you go ahead and find the commit :term:`shasum` in your history:
.. runrecord:: _examples/DL-101-109-105
:language: console
:workdir: dl-101/DataLad-101
:lines: 1-12
:emphasize-lines: 8
:notes: We could execute the same command as before. However, we can also let DataLad take care of it, and use the datalad rerun command.
:cast: 02_reproducible_execution
$ git log -n 2
Take that shasum and paste it after :dlcmd:`rerun`
(the first 6-8 characters of the shasum would be sufficient,
but here we are using all of them).
.. runrecord:: _examples/DL-101-109-106
:language: console
:workdir: dl-101/DataLad-101
:realcommand: echo "$ datalad rerun $(git rev-parse HEAD~1)" && datalad rerun $(git rev-parse HEAD~1)
:notes: We'll find the shasum of the run commit and plug it into rerun
:cast: 02_reproducible_execution
Now DataLad has made use of the ``run record``, and
re-executed the original command based on the information in it.
Because we updated the script, the output ``podcasts.tsv``
has changed and now contains the podcast
titles of both subdirectories.
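
If the ``run`` commit is not among the most recent commits, scrolling through
the :gitcmd:`log` output becomes tedious. The following is a minimal sketch of a
helper for finding it. It relies on the ``[DATALAD RUNCMD]`` prefix that DataLad
puts in front of the commit message of ``run`` and ``rerun`` commits. The function
name ``list_run_commits`` is made up for this illustration and is not part of DataLad.

.. code-block:: bash

   # List abbreviated shasums and subject lines of all commits whose message
   # contains the "[DATALAD RUNCMD]" prefix.
   # -F makes Git treat the pattern as a fixed string instead of a regular
   # expression, so the square brackets are matched literally.
   # Additional arguments (e.g. "-n 5") are passed on to git log.
   list_run_commits () {
       git log --format='%h %s' -F --grep='[DATALAD RUNCMD]' "$@"
   }

Called inside ``DataLad-101`` as ``list_run_commits``, it prints one line per
matching commit, and any of the printed shasums can be handed to :dlcmd:`rerun`.
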
You've probably already guessed it, but the easiest way
to check whether a :dlcmd:`rerun`
has changed the desired output file is
to check whether the rerun command appears in the dataset's history.
If a :dlcmd:`rerun` does not add or change any content in the dataset,
it will not be recorded in the history:
.. runrecord:: _examples/DL-101-109-107
:language: console
:workdir: dl-101/DataLad-101
:notes: how does a rerun look in the history?
:cast: 02_reproducible_execution
$ git log -n 1
In the dataset's history,
we can see that a new :dlcmd:`run` was recorded. This action is
committed by DataLad under the original commit message of the ``run``
command, and looks just like the previous :dlcmd:`run` commit.
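
The check described above can also be scripted. The following is a small sketch,
not a DataLad feature: it compares the shasum of the most recent commit with a
shasum that was noted down before the :dlcmd:`rerun`. The function name
``history_changed_since`` is hypothetical.

.. code-block:: bash

   # Succeed (exit code 0) if the most recent commit differs from the shasum
   # given as the first argument, i.e., if something new was recorded.
   #
   # Usage sketch inside DataLad-101:
   #   before=$(git rev-parse HEAD)
   #   datalad rerun <shasum of the run commit>
   #   history_changed_since "$before" && echo "the rerun was recorded"
   history_changed_since () {
       [ "$(git rev-parse HEAD)" != "$1" ]
   }
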
.. index::
pair: diff; DataLad command
Two cool tools that go beyond the :gitcmd:`log`
are the :dlcmd:`diff` and :gitcmd:`diff` commands.
Both commands can report differences between two states of
a dataset. Thus, you can get an overview of what changed between two commits.
Both commands have a similar, but not identical structure: :dlcmd:`diff`
compares one state (a commit specified with ``-f``/``--from``,
by default the latest change)
and another state from the dataset's history (a commit specified with
``-t``/``--to``). Let's do a :dlcmd:`diff` between the current state
of the dataset and the previous commit (called "``HEAD~1``" in Git terminology [#f1]_):
.. index::
pair: show dataset modification; on Windows with DataLad
pair: diff; DataLad command
pair: corresponding branch; in adjusted mode
.. windows-wit:: please use 'datalad diff --from main --to HEAD~1'
.. include:: topic/adjustedmode-diff.rst
.. index::
pair: diff; DataLad command
pair: show dataset modification; with DataLad
.. runrecord:: _examples/DL-101-109-108
:language: console
:workdir: dl-101/DataLad-101
:notes: The datalad diff command can help us find out what changed between the last two commands:
:cast: 02_reproducible_execution
$ datalad diff --to HEAD~1
.. index::
pair: diff; Git command
pair: show dataset modification; with Git
This indeed shows the output file as "modified". However, we do not know
what exactly changed. This is a task for :gitcmd:`diff` (get out of the
diff view by pressing ``q``):
.. runrecord:: _examples/DL-101-109-109
:language: console
:workdir: dl-101/DataLad-101
:notes: The git diff command has even more insights:
:cast: 02_reproducible_execution
:lines: 1-20
$ git diff HEAD~1
This output shows the precise changes between the contents created
with the first version of the script and those created with the bug-fixed second version.
All of the entries that were added once the second directory
was queried as well are shown in the ``diff``, preceded by a ``+``.
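
When the full ``diff`` is too long to read, :gitcmd:`diff` can also summarize
or narrow down the comparison. The sketch below wraps two of its standard features,
the ``--stat`` option and a path restriction after ``--``, in a helper. Unlike the
call above, it compares the given commit with the latest commit rather than with
the current state of the files on disk. The name ``diff_summary`` is made up for
this example.

.. code-block:: bash

   # Summarize what changed between a given commit and the latest commit.
   # $1: the commit to compare against, e.g. HEAD~1
   # remaining arguments: optional paths to restrict the comparison to
   diff_summary () {
       local commit=$1
       shift
       git diff --stat "$commit" HEAD -- "$@"
   }

Inside ``DataLad-101``, ``diff_summary HEAD~1 recordings/podcasts.tsv`` would report
how many lines were added to or removed from the output file.
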
Quickly create a note about these two helpful commands in ``notes.txt``:
.. runrecord:: _examples/DL-101-109-110
:language: console
:workdir: dl-101/DataLad-101
:notes: Let's make a note about this.
:cast: 02_reproducible_execution
$ cat << EOT >> notes.txt
There are two useful functions to display changes between two
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
in the history.
EOT
Finally, save this note.
.. runrecord:: _examples/DL-101-109-111
:language: console
:workdir: dl-101/DataLad-101
:cast: 02_reproducible_execution
$ datalad save -m "add note datalad and git diff"
Note that :dlcmd:`rerun` can re-execute the run records of both a :dlcmd:`run`
and a :dlcmd:`rerun` command,
but not any other type of DataLad command in your history,
such as a :dlcmd:`save` on results or outputs after you executed a script.
Therefore, make it a
habit to record the execution of scripts by plugging it into :dlcmd:`run`.
This very basic example of a :dlcmd:`run` is as simple as it can get, but it
is already
convenient from a memory-load perspective: Now you do not need to
remember the commands or scripts involved in creating an output. DataLad kept track
of what you did, and you can instruct it to "``rerun``" it.
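
What DataLad kept track of is stored as a machine-readable record in the message
of the ``run`` commit. As a rough sketch, and assuming the two delimiter lines
that DataLad writes around this record (``=== Do not change lines below ===`` and
``^^^ Do not change lines above ^^^``), the record can be printed on its own.
The function name ``show_run_record`` is hypothetical.

.. code-block:: bash

   # Print the run record embedded in the message of a commit (default: HEAD).
   # %B is the raw commit message. The first sed call prints only the lines
   # from the opening delimiter to the closing delimiter (an assumption about
   # DataLad's commit message layout); the second one drops the two
   # delimiter lines themselves.
   show_run_record () {
       git log -1 --format=%B "${1:-HEAD}" |
           sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' |
           sed '1d;$d'
   }

For a commit without a run record, the function prints nothing.
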
Also, incidentally, we have generated :term:`provenance` information. It is
now recorded in the history of the dataset how the output ``podcasts.tsv`` came
into existence. And we can interact with and use this provenance information with
tools other than the machine-readable ``run record``.
For example, to find out who (or what) created or modified a file,
give the file path to :gitcmd:`log` (prefixed by ``--``):
.. index::
pair: show history for particular paths; on Windows with Git
pair: log; Git command
pair: corresponding branch; in adjusted mode
.. windows-wit:: use 'git log main -- recordings/podcasts.tsv'
.. include:: topic/adjustedmode-log-path.rst
.. index::
pair: show history for particular paths; with Git
.. runrecord:: _examples/DL-101-109-112
:language: console
:workdir: dl-101/DataLad-101
:notes: An amazing thing is that DataLad captured all of the provenance of the output file, and we can use Git tools to find out about it
:cast: 02_reproducible_execution
$ git log -- recordings/podcasts.tsv
Neat, isn't it?
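
The same query can be condensed to one line per commit. The following sketch uses
standard ``--format`` placeholders of :gitcmd:`log` for the abbreviated shasum
(``%h``), the author name (``%an``), the author date (``%ad``), and the subject
line (``%s``). The function name ``who_changed`` is made up for illustration.

.. code-block:: bash

   # Print one line per commit that touched the given path:
   # abbreviated shasum, author, date (YYYY-MM-DD), and subject line.
   who_changed () {
       git log --date=short --format='%h  %an  %ad  %s' -- "$1"
   }

Inside ``DataLad-101``, ``who_changed recordings/podcasts.tsv`` would list the
``run`` commit and its ``rerun`` with their authors and dates.
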
Still, this :dlcmd:`run` was very simple.
The next section will demonstrate how :dlcmd:`run` becomes handy in
more complex standard use cases: situations with *locked* contents.
But prior to that, make a note about :dlcmd:`run` and :dlcmd:`rerun` in your
``notes.txt`` file.
.. runrecord:: _examples/DL-101-109-113
:language: console
:workdir: dl-101/DataLad-101
:notes: Another final note on run and rerun
:cast: 02_reproducible_execution
$ cat << EOT >> notes.txt
The datalad run command can record the impact a script or command has
on a Dataset. In its simplest form, datalad run only takes a commit
message and the command that should be executed.
Any datalad run command can be re-executed by using its commit shasum
as an argument in datalad rerun CHECKSUM. DataLad will take
information from the run record of the original commit, and re-execute
it. If no changes happen with a rerun, the command will not be written
to history. Note: you can also rerun a datalad rerun command!
EOT
Finally, save this note.
.. runrecord:: _examples/DL-101-109-114
:language: console
:workdir: dl-101/DataLad-101
:notes: Another final note on run and rerun
:cast: 02_reproducible_execution
$ datalad save -m "add note on basic datalad run and datalad rerun"
.. only:: adminmode
Add a tag at the section end.
.. runrecord:: _examples/DL-101-109-115
:language: console
:workdir: dl-101/DataLad-101
$ git branch sct_datalad_rerun
.. rubric:: Footnotes
.. [#f1] The section :ref:`history` will elaborate more on common :term:`Git` commands
and terminology.