.. _run:

Keeping track
-------------

In previous examples, with the exception of :dlcmd:`download-url`, all
changes that happened to the dataset or the files it contains were
saved to the dataset's history by hand. We added larger and smaller
files and saved them, and we also modified smaller file contents and
saved these modifications.

Often, however, files get changed by shell commands
or by scripts.
Consider a data scientist.
She has data files with numeric data,
and code scripts in Python, R, Matlab or any other programming language
that will use the data to compute results or figures. Such output is
stored in new files, or modifies existing files.

But only a few weeks after these scripts were executed she finds it hard
to remember which script was modified for which reason or created which
output. How did this result come to be? Which script would she need
to run again on which data to produce this particular figure?

In this section we will experience how DataLad can help
to record the changes in a dataset after executing a script
from the shell. Just as :dlcmd:`download-url` was able to associate
a file with its origin and store this information, we want to be
able to associate a particular file with the commands, scripts, and inputs
it was produced from, and thus capture and store full :term:`provenance`.

Let's say, for example, that you enjoyed the longnow podcasts a lot,
and you start a podcast-night with friends to wind down from all of
the exciting DataLad lectures. They propose to make a
list of speakers and titles to cross out what they've already listened
to, and ask you to prepare such a list.

"Mhh... probably there is a DataLad way to do this... wasn't there also
a note about metadata extraction at some point?" But as we are not that
far into the lectures, you decide to write a short shell script
that generates a text file listing speaker names and talk titles
instead.

To do this, we are following a best practice that will reappear in the
later section on :ref:`YODA principles <yoda>`: Collecting all
additional scripts that work with content of a subdataset *outside*
of this subdataset, in a dedicated ``code/`` directory,
and collating the output of the execution of these scripts
*outside* of the subdataset as well -- and
therefore not modifying the subdataset.

The motivation behind this will become clear in later sections,
but for now we'll start with best-practice building.
Therefore, create a subdirectory ``code/`` in the ``DataLad-101``
superdataset:

.. runrecord:: _examples/DL-101-108-101
   :language: console
   :workdir: dl-101
   :notes: it's impossible to remember how a large data analysis produced which result for a human, but datalad can help to keep track. To see this in action, we'll do a data analysis. Start with yoda principles and structure ds with code directory.
   :cast: 02_reproducible_execution
   :realcommand: cd DataLad-101 && mkdir code && tree -d

   $ mkdir code
   $ tree -d

Inside of ``DataLad-101/code``, create a simple shell script ``list_titles.sh``.
This script will carry out a simple task:
It will loop through the file names of the ``.mp3`` files and
write out speaker names and talk titles in a very basic fashion.
The ``cat`` command will write the script content into ``code/list_titles.sh``.

.. windows-wit:: Here's a script for Windows users

   .. include:: topic/globscript1-windows.rst

.. index::
   pair: hidden file name extensions; on Windows
.. windows-wit:: Be mindful of hidden extensions when creating files!

   .. include:: topic/hidden-extensions.rst

.. runrecord:: _examples/DL-101-108-102
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: We will create a script to execute. Let's make one that summarizes the podcasts titles in the longnow dataset:
   :cast: 02_reproducible_execution

   $ cat << EOT > code/list_titles.sh
   for i in recordings/longnow/Long_Now__Seminars*/*.mp3; do
      # get the filename
      base=\$(basename "\$i");
      # strip the extension
      base=\${base%.mp3};
      # date as yyyy-mm-dd
      printf "\${base%%__*}\t" | tr '_' '-';
      # name and title without underscores
      printf "\${base#*__}\n" | tr '_' ' ';
   done
   EOT

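To see what the parameter expansions in the script do, the following minimal
sketch applies them to a single file name. The file name is only an assumed
example that follows the ``date__speaker__title`` pattern of the podcast files,
and the variable names ``podcast_date`` and ``podcast_title`` are introduced
here for illustration; they are not part of ``list_titles.sh``.

```shell
# Assumed example file name following the date__speaker__title pattern
f="recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3"

base=$(basename "$f")      # drop the directory part
base=${base%.mp3}          # drop the ".mp3" suffix
date_part=${base%%__*}     # remove the longest "__*" match from the end
title_part=${base#*__}     # remove the shortest "*__" match from the start

# Replace underscores: dashes for the date, spaces for speaker and title
podcast_date=$(printf '%s' "$date_part" | tr '_' '-')
podcast_title=$(printf '%s' "$title_part" | tr '_' ' ')

printf '%s\t%s\n' "$podcast_date" "$podcast_title"
```

Note that the double underscore between speaker and title turns into two
spaces, because ``tr`` replaces every single underscore.
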
Save this script to the dataset.

.. runrecord:: _examples/DL-101-108-103
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: We have to save the script first: status and save
   :cast: 02_reproducible_execution

   $ datalad status

.. runrecord:: _examples/DL-101-108-104
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: ... preferably with a helpful commit message
   :cast: 02_reproducible_execution

   $ datalad save -m "Add short script to write a list of podcast speakers and titles"

Once we run this script, it will simply print dates, names and titles to
your terminal. We can save its outputs to a new file
``recordings/podcasts.tsv`` in the superdataset by redirecting these
outputs with ``bash code/list_titles.sh > recordings/podcasts.tsv``.

Obviously, we could create this file, and subsequently save it to the superdataset.
However, just as in the example about the data scientist,
after a while we will forget how this file came into existence, or
that the script ``code/list_titles.sh`` is associated with this file and
can be used to update it later on.

.. index::
   pair: run; DataLad command
   pair: run command with provenance capture; with DataLad
   pair: run command with provenance capture; with DataLad run

The :dlcmd:`run` command
can help with this. Put simply, it records a command's impact on a dataset. Put
more technically, it will record a shell command, and :dlcmd:`save` all changes
this command triggered in the dataset -- be that new files or changes to existing
files.

Let's try the simplest way to use this command: :dlcmd:`run`,
followed by a commit message (``-m "a concise summary"``), and the
command that executes the script from the shell: ``bash code/list_titles.sh > recordings/podcasts.tsv``.
It is helpful to enclose the command in quotation marks: otherwise your shell
would interpret the redirection (``>``) itself, and it would not be part of
the command that :dlcmd:`run` records.

Note that we execute the command from the root of the superdataset.
It is recommended to use :dlcmd:`run` in the root of the dataset
you want to record the changes in, so make sure to run this
command from the root of ``DataLad-101``.

.. runrecord:: _examples/DL-101-108-105
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: The datalad run command records a command's impact on a dataset. We try it in the most simple way:
   :cast: 02_reproducible_execution

   $ datalad run -m "create a list of podcast titles" \
     "bash code/list_titles.sh > recordings/podcasts.tsv"

Let's take a look into the history:

.. runrecord:: _examples/DL-101-108-106
   :language: console
   :workdir: dl-101/DataLad-101
   :lines: 1-30
   :emphasize-lines: 6, 11, 25
   :notes: Let's now check what has been written into the history. (runrecord)
   :cast: 02_reproducible_execution

   $ git log -p -n 1   # On Windows, you may just want to type "git log".

The commit message we have supplied with ``-m`` directly after :dlcmd:`run` appears
in our history as a short summary.
Additionally, the output of the command, ``recordings/podcasts.tsv``,
was saved right away.

But there is more in this log entry, a section in between the markers

``=== Do not change lines below ===`` and

``^^^ Do not change lines above ^^^``.

This is the so-called ``run record`` -- a recording of all of the
information in the :dlcmd:`run` command, generated by DataLad.
In this case, it is a very simple summary. One informative
part is highlighted:
``"cmd": "bash code/list_titles.sh > recordings/podcasts.tsv"`` is the command that was run
in the terminal.
This information therefore maps the command, and with it the script,
to the output file, in one commit. Nice, isn't it?

Arguably, the :term:`run record` is not the most human-readable way to display information.
This representation, however, is meant less for the human user (who should
rely on their informative commit message) than for DataLad, in particular for the
:dlcmd:`rerun` command, which you will see in action shortly. This
``run record`` is machine-readable provenance that associates an output with
the command that produced it.

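Because the run record sits between two fixed marker lines, it can be pulled
out of a commit message with standard tools. The following is a minimal sketch
of that idea, not how DataLad itself reads the record: the function name
``extract_run_record`` is made up for this example, and the sample commit
message is shortened for illustration (a real run record contains more fields).

```shell
# Print only the run record (the JSON between the two marker lines)
# from a commit message supplied on standard input.
extract_run_record() {
  sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' |
    sed '1d;$d'   # drop the two marker lines themselves
}

# Shortened, illustrative commit message; a real run record holds more fields.
sample_message='[DATALAD RUNCMD] create a list of podcast titles

=== Do not change lines below ===
{
 "cmd": "bash code/list_titles.sh > recordings/podcasts.tsv",
 "exit": 0
}
^^^ Do not change lines above ^^^'

printf '%s\n' "$sample_message" | extract_run_record
```

In a dataset, the same function could be fed with the message of the most
recent commit via ``git log -1 --format=%B | extract_run_record``.
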
You have probably already guessed that every :dlcmd:`run` command
ends with a ``datalad save``. A logical consequence of this fact is that any
:dlcmd:`run` that does not result in any changes in a dataset (no modification
of existing content; no additional files) will not produce any record in the
dataset's history (just as a :dlcmd:`save` with no modifications present
will not create a history entry). Try to run the exact same
command as before, and check whether anything in your log changes:

.. runrecord:: _examples/DL-101-108-107
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: A run command that does not result in changes (no modifications, no additional files) will not produce a record in the dataset history. So what happens if we do the same again?
   :cast: 02_reproducible_execution

   $ datalad run -m "Try again to create a list of podcast titles" \
     "bash code/list_titles.sh > recordings/podcasts.tsv"

.. runrecord:: _examples/DL-101-108-108
   :language: console
   :workdir: dl-101/DataLad-101
   :lines: 1-5
   :emphasize-lines: 2
   :notes: as the result is byte-identical, there is no new commit
   :cast: 02_reproducible_execution

   $ git log --oneline

The most recent commit is still the :dlcmd:`run` command from before,
and there was no second :dlcmd:`run` commit created.

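Why nothing new is recorded can be illustrated with plain Git alone. The
sketch below is an analogy in a throwaway repository, not part of the
``DataLad-101`` dataset, and the file content and commit identity are made up
for the example: regenerating a file with byte-identical content leaves the
working tree clean, so there is nothing to save.

```shell
# Analogy with plain Git in a throwaway repository (not the DataLad-101 dataset)
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Create a file and commit it (the identity is set only for this one commit)
printf 'date\ttitle\n' > podcasts.tsv
git add podcasts.tsv
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "first version"

# Regenerate the file with exactly the same content
printf 'date\ttitle\n' > podcasts.tsv

# Empty output means Git sees no modification, so there is nothing to commit
changes=$(git status --porcelain)
echo "pending changes: '${changes}'"
```

If the second ``printf`` wrote different content, ``git status --porcelain``
would list ``podcasts.tsv`` as modified, and a save would create a new commit.
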
The :dlcmd:`run` command can therefore help you to keep track of what you are doing
in a dataset and capture provenance of your files: When, by whom, and how exactly
was a particular file created or modified?
The next sections will demonstrate how to make use of this information,
and also how to extend the command with additional arguments that will prove to
be helpful over the course of this chapter.

.. only:: adminmode

   Add a tag at the section end.

   .. runrecord:: _examples/DL-101-108-109
      :language: console
      :workdir: dl-101/DataLad-101

      $ git branch sct_keeping_track