319 lines
11 KiB
ReStructuredText
319 lines
11 KiB
ReStructuredText
.. index::
|
|
pair: rerun; DataLad command
|
|
.. _run2:
|
|
|
|
DataLad, rerun!
|
|
----------------
|
|
|
|
So far, you created a ``.tsv`` file of all
|
|
speakers and talk titles in the ``longnow/`` podcasts subdataset.
|
|
Let's actually take a look into this file now:
|
|
|
|
.. runrecord:: _examples/DL-101-109-101
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:lines: 1-3,5-7
|
|
:append: -✂--✂-
|
|
:notes: The script produced a simple list of podcast titles. let's take a look into our output file. What's cool is that is was created in a way that the code and output are linked:
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ less recordings/podcasts.tsv
|
|
|
|
Not too bad, and certainly good enough for the podcast night people.
|
|
What's been cool about creating this file is that it was created with
|
|
a script within a :dlcmd:`run` command. Thanks to :dlcmd:`run`,
|
|
the output file ``podcasts.tsv`` is associated with the script it
|
|
generated.
|
|
|
|
Upon reviewing the list you realized that you made a mistake, though: you only
|
|
listed the talks in the SALT series (the
|
|
``Long_Now__Seminars_About_Long_term_Thinking/`` directory), but not
|
|
in the ``Long_Now__Conversations_at_The_Interval/`` directory.
|
|
Let's fix this in the script. Replace the contents in ``code/list_titles.sh``
|
|
with the following, fixed script:
|
|
|
|
.. windows-wit:: Here's a script adjustment for Windows users
|
|
|
|
.. include:: topic/globscript2-windows.rst
|
|
|
|
.. runrecord:: _examples/DL-101-109-102
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:emphasize-lines: 2
|
|
:notes: Dang, we made a mistake in our script: we only listed a part of the podcasts! Let's fix the script:
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ cat << EOT >| code/list_titles.sh
|
|
for i in recordings/longnow/Long_Now*/*.mp3; do
|
|
# get the filename
|
|
base=\$(basename "\$i");
|
|
# strip the extension
|
|
base=\${base%.mp3};
|
|
printf "\${base%%__*}\t" | tr '_' '-';
|
|
# name and title without underscores
|
|
printf "\${base#*__}\n" | tr '_' ' ';
|
|
|
|
done
|
|
EOT
|
|
|
|
Because the script is now modified, save the modifications to the dataset.
|
|
We can use the shorthand "BF" to denote "Bug fix" in the commit message.
|
|
|
|
.. runrecord:: _examples/DL-101-109-103
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ datalad status
|
|
|
|
.. runrecord:: _examples/DL-101-109-104
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ datalad save -m "BF: list both directories content" \
|
|
code/list_titles.sh
|
|
|
|
What we *could* do is run the same :dlcmd:`run` command as before to recreate
|
|
the file, but now with all of the contents:
|
|
|
|
.. code-block:: console
|
|
|
|
$ # do not execute this!
|
|
$ datalad run -m "create a list of podcast titles" \
|
|
"bash code/list_titles.sh > recordings/podcasts.tsv"
|
|
|
|
However, think about any situation where the command would be longer than this,
|
|
or that is many months past the first execution. It would not be easy to remember
|
|
the command, nor would it be very convenient to copy it from the ``run record``.
|
|
|
|
Luckily, a fellow student remembered the DataLad way of re-executing
|
|
a ``run`` command, and he's eager to show it to you.
|
|
|
|
"In order to re-execute a :dlcmd:`run` command,
|
|
find the commit and use its :term:`shasum` (or a :term:`tag`, or anything else that Git
|
|
understands) as an argument for the
|
|
:dlcmd:`rerun` command! That's it!",
|
|
he says happily.
|
|
|
|
So you go ahead and find the commit :term:`shasum` in your history:
|
|
|
|
.. runrecord:: _examples/DL-101-109-105
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:lines: 1-12
|
|
:emphasize-lines: 8
|
|
:notes: We could execute the same command as before. However, we can also let DataLad take care of it, and use the datalad rerun command.
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ git log -n 2
|
|
|
|
Take that shasum and paste it after :dlcmd:`rerun`
|
|
(the first 6-8 characters of the shasum would be sufficient,
|
|
here we are using all of them).
|
|
|
|
.. runrecord:: _examples/DL-101-109-106
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:realcommand: echo "$ datalad rerun $(git rev-parse HEAD~1)" && datalad rerun $(git rev-parse HEAD~1)
|
|
:notes: We'll find the shasum of the run commit and plug it into rerun
|
|
:cast: 02_reproducible_execution
|
|
|
|
Now DataLad has made use of the ``run record``, and
|
|
re-executed the original command based on the information in it.
|
|
Because we updated the script, the output ``podcasts.tsv``
|
|
has changed and now contains the podcast
|
|
titles of both subdirectories.
|
|
You've probably already guessed it, but the easiest way
|
|
to check whether a :dlcmd:`rerun`
|
|
has changed the desired output file is
|
|
to check whether the rerun command appears in the datasets history:
|
|
If a :dlcmd:`rerun` does not add or change any content in the dataset,
|
|
it will also not be recorded in the history.
|
|
|
|
.. runrecord:: _examples/DL-101-109-107
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:notes: how does a rerun look in the history?
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ git log -n 1
|
|
|
|
In the dataset's history,
|
|
we can see that a new :dlcmd:`run` was recorded. This action is
|
|
committed by DataLad under the original commit message of the ``run``
|
|
command, and looks just like the previous :dlcmd:`run` commit.
|
|
|
|
.. index::
|
|
pair: diff; DataLad command
|
|
|
|
Two cool tools that go beyond the :gitcmd:`log`
|
|
are the :dlcmd:`diff` and :gitcmd:`diff` commands.
|
|
Both commands can report differences between two states of
|
|
a dataset. Thus, you can get an overview of what changed between two commits.
|
|
Both commands have a similar, but not identical structure: :dlcmd:`diff`
|
|
compares one state (a commit specified with ``-f``/``--from``,
|
|
by default the latest change)
|
|
and another state from the dataset's history (a commit specified with
|
|
``-t``/``--to``). Let's do a :dlcmd:`diff` between the current state
|
|
of the dataset and the previous commit (called "``HEAD~1``" in Git terminology [#f1]_):
|
|
|
|
.. index::
|
|
pair: show dataset modification; on Windows with DataLad
|
|
pair: diff; DataLad command
|
|
pair: corresponding branch; in adjusted mode
|
|
.. windows-wit:: please use 'datalad diff --from main --to HEAD~1'
|
|
|
|
.. include:: topic/adjustedmode-diff.rst
|
|
|
|
.. index::
|
|
pair: diff; Git command
|
|
pair: show dataset modification; with DataLad
|
|
|
|
.. runrecord:: _examples/DL-101-109-108
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:notes: The datalad diff command can help us find out what changed between the last two commands:
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ datalad diff --to HEAD~1
|
|
|
|
.. index::
|
|
pair: diff; Git command
|
|
pair: show dataset modification; with Git
|
|
|
|
This indeed shows the output file as "modified". However, we do not know
|
|
what exactly changed. This is a task for :gitcmd:`diff` (get out of the
|
|
diff view by pressing ``q``):
|
|
|
|
.. runrecord:: _examples/DL-101-109-109
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:notes: The git diff command has even more insights:
|
|
:cast: 02_reproducible_execution
|
|
:lines: 1-20
|
|
|
|
$ git diff HEAD~1
|
|
|
|
This output actually shows the precise changes between the contents created
|
|
with the first version of the script and the second script with the bug fix.
|
|
All of the files that are added after the second directory
|
|
was queried as well are shown in the ``diff``, preceded by a ``+``.
|
|
|
|
Quickly create a note about these two helpful commands in ``notes.txt``:
|
|
|
|
.. runrecord:: _examples/DL-101-109-110
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:notes: Let's make a note about this.
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ cat << EOT >> notes.txt
|
|
There are two useful functions to display changes between two
|
|
states of a dataset: "datalad diff -f/--from COMMIT -t/--to COMMIT"
|
|
and "git diff COMMIT COMMIT", where COMMIT is a shasum of a commit
|
|
in the history.
|
|
|
|
EOT
|
|
|
|
Finally, save this note.
|
|
|
|
.. runrecord:: _examples/DL-101-109-111
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ datalad save -m "add note datalad and git diff"
|
|
|
|
Note that :dlcmd:`rerun` can re-execute the run records of both a :dlcmd:`run`
|
|
or a :dlcmd:`rerun` command,
|
|
but not with any other type of DataLad command in your history
|
|
such as a :dlcmd:`save` on results or outputs after you executed a script.
|
|
Therefore, make it a
|
|
habit to record the execution of scripts by plugging it into :dlcmd:`run`.
|
|
|
|
This very basic example of a :dlcmd:`run` is as simple as it can get, but it
|
|
is already
|
|
convenient from a memory-load perspective: Now you do not need to
|
|
remember the commands or scripts involved in creating an output. DataLad kept track
|
|
of what you did, and you can instruct it to "``rerun``" it.
|
|
Also, incidentally, we have generated :term:`provenance` information. It is
|
|
now recorded in the history of the dataset how the output ``podcasts.tsv`` came
|
|
into existence. And we can interact with and use this provenance information with
|
|
other tools than from the machine-readable ``run record``.
|
|
For example, to find out who (or what) created or modified a file,
|
|
give the file path to :gitcmd:`log` (prefixed by ``--``):
|
|
|
|
.. index::
|
|
pair: show history for particular paths; on Windows with Git
|
|
pair: log; Git command
|
|
pair: corresponding branch; in adjusted mode
|
|
.. windows-wit:: use 'git log main -- recordings/podcasts.tsv'
|
|
|
|
.. include:: topic/adjustedmode-log-path.rst
|
|
|
|
.. index::
|
|
pair: show history for particular paths; with Git
|
|
.. runrecord:: _examples/DL-101-109-112
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:notes: An amazing thing is that DataLad captured all of the provenance of the output file, and we get use git tools to find out about it
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ git log -- recordings/podcasts.tsv
|
|
|
|
|
|
Neat, isn't it?
|
|
|
|
Still, this :dlcmd:`run` was very simple.
|
|
The next section will demonstrate how :dlcmd:`run` becomes handy in
|
|
more complex standard use cases: situations with *locked* contents.
|
|
|
|
But prior to that, make a note about :dlcmd:`run` and :dlcmd:`rerun` in your
|
|
``notes.txt`` file.
|
|
|
|
.. runrecord:: _examples/DL-101-109-113
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:notes: Another final note on run and rerun
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ cat << EOT >> notes.txt
|
|
The datalad run command can record the impact a script or command has
|
|
on a Dataset. In its simplest form, datalad run only takes a commit
|
|
message and the command that should be executed.
|
|
|
|
Any datalad run command can be re-executed by using its commit shasum
|
|
as an argument in datalad rerun CHECKSUM. DataLad will take
|
|
information from the run record of the original commit, and re-execute
|
|
it. If no changes happen with a rerun, the command will not be written
|
|
to history. Note: you can also rerun a datalad rerun command!
|
|
|
|
EOT
|
|
|
|
Finally, save this note.
|
|
|
|
.. runrecord:: _examples/DL-101-109-114
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:notes: Another final note on run and rerun
|
|
:cast: 02_reproducible_execution
|
|
|
|
$ datalad save -m "add note on basic datalad run and datalad rerun"
|
|
|
|
|
|
.. only:: adminmode
|
|
|
|
Add a tag at the section end.
|
|
|
|
.. runrecord:: _examples/DL-101-109-115
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
|
|
$ git branch sct_datalad_rerun
|
|
|
|
|
|
.. rubric:: Footnotes
|
|
|
|
.. [#f1] The section :ref:`history` will elaborate more on common :term:`Git` commands
|
|
and terminology.
|