datalad-handbook/docs/basics/101-110-run2.rst
2024-12-27 15:25:25 +01:00

.. _run3:

Input and output
----------------

In the previous two sections, you created a simple ``.tsv`` file of all
speakers and talk titles in the ``longnow/`` podcasts subdataset, and you have
re-executed a :dlcmd:`run` command after a bug-fix in your script.
But these previous :dlcmd:`run` and :dlcmd:`rerun` commands were very simple.
Maybe you noticed that some values in the ``run record`` were empty:
``inputs`` and ``outputs``, for example, did not have an entry. Let's experience
a few situations in which these two arguments can become necessary.

In our DataLad-101 course we were given a group assignment. Everyone should
give a small presentation about an open DataLad dataset they found. Conveniently,
you decided to settle for the longnow podcasts right away.
After all, you know the dataset quite well already,
and after listening to almost a third of the podcasts
and enjoying them a lot,
you also want to recommend them to the others.
Almost all of the slides are ready, but what's still missing is the logo of the
longnow podcasts. Good thing that this is part of the subdataset,
so you can simply retrieve it from there.
The logos (one for the SALT series, one for the Interval series -- the two
directories in the subdataset)
were originally extracted from the podcasts' metadata information by DataLad.
In a while, we will dive into the metadata aggregation capabilities of DataLad,
but for now, let's just use the logos instead of finding out where they
come from -- this will come later.
As part of the metadata of the dataset, the logos are
in the hidden paths
``.datalad/feed_metadata/logo_salt.jpg`` and
``.datalad/feed_metadata/logo_interval.jpg``:

.. runrecord:: _examples/DL-101-110-101
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: We saw a very simple datalad run. Now we are going to extend it with useful options. Narrative: prepare talk about dataset, add logo to slides. For this, we'll try to resize a logo in the meta data of the subdataset
   :cast: 02_reproducible_execution

   $ ls recordings/longnow/.datalad/feed_metadata/*jpg

For the slides you decide to prepare images of size 400x400 px, but
the logos' original size is much larger (both are 3000x3000 pixels). Therefore
let's try to resize the images -- currently, they are far too large to fit on a slide.
To resize an image from the command line we can use the Unix
command ``convert -resize`` from the `ImageMagick tool <https://imagemagick.org/index.php>`_.
The command takes a new size in pixels as an argument, a path to the file that should be
resized, and a filename and path under which a new,
resized image will be saved.
To resize one image to 400x400 px, the command would thus be
``convert -resize 400x400 path/to/file.jpg path/to/newfilename.jpg``.

.. index::
   pair: install ImageMagick; on Windows
   single: installation; ImageMagick
.. windows-wit:: Tool installation

   .. include:: topic/installation-imagemagick.rst

Remembering the last lecture on :dlcmd:`run`, you decide to plug this into
:dlcmd:`run`. Even though this is not a script, it is a command, and you can wrap
commands like this conveniently with :dlcmd:`run`.
Because they will be quite long, we line break the commands in the upcoming examples
for better readability -- in your terminal, you can always write the commands into
a single line.

.. index::
   pair: run command with provenance capture; with DataLad run
.. runrecord:: _examples/DL-101-110-102
   :language: console
   :workdir: dl-101/DataLad-101
   :emphasize-lines: 4
   :notes: This command resizes the logo to 400 by 400 px -- but it will fail!
   :cast: 02_reproducible_execution
   :exitcode: 1

   $ datalad run -m "Resize logo for slides" \
     "convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"

*Oh, crap!* Why didn't this work?
Let's take a look at the error message DataLad provides. In general, these error messages
might seem wordy, and maybe a bit intimidating as well, but usually they provide helpful
information to find out what is wrong. Whenever you encounter an error message,
make sure to read it, even if it feels like a mushroom cloud exploded in your terminal.
A :dlcmd:`run` error message has several parts. The first starts after
``[INFO ] == Command start (output follows) =====``.
This is displaying errors that the
terminal command threw: The ``convert`` tool complains that it cannot open
the file, because there is "No such file or directory".
The second part starts after
``[INFO ] == Command exit (modification check follows) =====``.
DataLad adds information about a "non-zero exit code". A non-zero exit code indicates
that something went wrong [#f1]_. In principle, you could go ahead and google what this
specific exit status indicates. However, the solution might have already occurred to you when
reading the first error report: The file is not present.
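
Exit codes are not specific to DataLad. As a minimal sketch outside of any dataset
(the path ``/no/such/file.jpg`` and the variable names are made up for this illustration),
the shell variable ``$?`` holds the exit code of the most recent command:

.. code-block:: bash

   # A command that succeeds exits with code 0.
   true
   ok_code=$?

   # Listing a file that does not exist fails with a non-zero code;
   # the exact number depends on the tool.
   fail_code=0
   ls /no/such/file.jpg 2>/dev/null || fail_code=$?

   echo "success: $ok_code, failure: $fail_code"

DataLad reports the same kind of code for the command it wrapped.
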
How can that be?
"Right!", you exclaim with a facepalm.
Just like the ``.mp3`` files, the ``.jpg`` file content is not present
locally after a :dlcmd:`clone`, and we did not :dlcmd:`get` it yet!

.. index::
   pair: declare command input; with DataLad run

This is where the ``-i``/``--input`` option for a ``datalad run`` becomes useful.
The content of everything that is specified as an ``input`` will be retrieved
prior to running the command.

.. runrecord:: _examples/DL-101-110-103
   :language: console
   :workdir: dl-101/DataLad-101
   :emphasize-lines: 8
   :realcommand: datalad run --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" "convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
   :notes: The problem is that the content (logo) is not yet retrieved. The --input option makes sure that all content is retrieved prior to command execution.
   :cast: 02_reproducible_execution

   $ datalad run -m "Resize logo for slides" \
     --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
     "convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
   $ # or shorter:
   $ datalad run -m "Resize logo for slides" \
     -i "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
     "convert -resize 400x400 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"

Cool! You can see in this output that prior to the command execution, DataLad did a :dlcmd:`get`.
This is useful for several reasons. For one, it saved us the work of manually
getting content. Moreover, it is useful for anyone with whom we might share the
dataset: with an installed dataset, one can very simply rerun :dlcmd:`run` commands
if they have the input argument appropriately specified. It is therefore good practice to
specify the inputs appropriately. Remember from section :ref:`installds`
that :dlcmd:`get` will only retrieve content if
it is not yet present. Input that is already downloaded will not be downloaded again -- so
specifying inputs even though they are already present will not do any harm.

.. index::
   pair: path globbing; with DataLad run
.. find-out-more:: What if there are several inputs?

   Often, a command needs several inputs. In principle, every input (which could be files, directories, or subdatasets) gets its own ``-i``/``--input``
   flag. However, you can make use of :term:`globbing`. For example,

   .. code-block:: console

      $ datalad run --input "*.jpg" "COMMAND"

   will retrieve all ``.jpg`` files prior to command execution.

If outputs already exist...
^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. index::
   pair: files are unlocked by default; on Windows
   pair: unlocked files; in adjusted mode
.. windows-wit:: Good news! Here is something that is easier on Windows

   .. include:: topic/adjustedmode-unlockedfiles.rst

Looking at the resulting image, you wonder whether 400x400 might be a tiny bit too small.
Maybe we should try to resize it to 450x450, and see whether that looks better?
Note that we cannot use a :dlcmd:`rerun` for this: if we want to change the dimension option
in the command, we have to define a new :dlcmd:`run` command.
To establish best practices, let's specify the input even though it is already present:

.. runrecord:: _examples/DL-101-110-104
   :language: console
   :workdir: dl-101/DataLad-101
   :emphasize-lines: 9
   :realcommand: datalad run --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
   :notes: Maybe 400x400 is too small. We should try 450x450. Can we use a datalad rerun for this? (no)
   :exitcode: 1
   :cast: 02_reproducible_execution

   $ datalad run -m "Resize logo for slides" \
     --input "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
     "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"
   $ # or shorter:
   $ datalad run -m "Resize logo for slides" \
     -i "recordings/longnow/.datalad/feed_metadata/logo_salt.jpg" \
     "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg"

**Oh wtf**... *What is it now?*
A quick glimpse into the error message shows a different error than before:
The tool complains that it is "unable to open" the image, because the "Permission [is] denied".
We have not seen anything like this before.
Confused about what we might have
done wrong, we raise our hand to ask the instructor for help.
Knowingly, she smiles, and tells us about how DataLad protects content given
to it:

"Content in your DataLad dataset is protected by :term:`git-annex` from
accidental changes", our instructor begins.

"Wait!" we interrupt. "First off, that wasn't accidental. And second, I was told this
course does not have ``git-annex-101`` as a prerequisite?"

"Yes, hear me out", she says. "I promise you two different solutions at
the end of this explanation, and the concept behind this is quite relevant".

DataLad usually gives content to :term:`git-annex` to store and track.
git-annex, let's just say, takes this task *really* seriously. One of its
features that you have just experienced is that it *locks* content.
If files are *locked down*, their content cannot be modified. In principle,
that's not a bad thing: It could be your late grandma's secret cherry-pie
recipe, and you do not want to *accidentally* change that.
Therefore, a file needs to be consciously *unlocked* to apply modifications.
In the attempt to resize the image to 450x450 you tried to overwrite
``recordings/salt_logo_small.jpg``, a file that was given to DataLad
and thus protected by git-annex.

.. index::
   pair: unlock; DataLad command
   pair: unlock file; with DataLad

There is a DataLad command that takes care of unlocking file content,
and thus making locked files modifiable again: :dlcmd:`unlock`.
Let us check out what it does:

.. index::
   pair: files are unlocked by default; on Windows
   single: adjusted branch; unlocked files
.. windows-wit:: What happens if I run this on Windows?

   .. include:: topic/adjustedmode-unlockedfiles2.rst

.. runrecord:: _examples/DL-101-111-101
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: The created output is protected from accidental modifications, we have to unlock it first:
   :cast: 02_reproducible_execution

   $ datalad unlock recordings/salt_logo_small.jpg

Well, ``unlock(ok)`` does not sound too bad for a start. As always, we
feel the urge to run a :dlcmd:`status` on this:

.. runrecord:: _examples/DL-101-111-102
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: What does the file look like after an unlock?
   :cast: 02_reproducible_execution

   $ datalad status

"Ah, do not mind that for now", our instructor says, and with a wink she
continues: "We'll talk about symlinks and object trees a while later".
You are not really sure whether that's a good thing, but you have a task to focus
on. Hastily, you run the command right from the terminal:

.. runrecord:: _examples/DL-101-111-103
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: In principle, you could rerun the command now, outside of any datalad run. The unlocked output can be overwritten
   :cast: 02_reproducible_execution

   $ convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_salt.jpg recordings/salt_logo_small.jpg

Hey, no permission denied error! You note that the instructor still stands
right next to you. "Sooo... now what do I do to *lock* the file again?" you ask.
"Well... what you just did there was quite suboptimal. Didn't you want to
use :dlcmd:`run`? But, anyway, in order to lock the file again, you would need to
run a :dlcmd:`save`."

.. runrecord:: _examples/DL-101-111-104
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: Afterwards you'd need to save, to lock everything again
   :cast: 02_reproducible_execution

   $ datalad save -m "resized picture by hand"

"So", you wonder aloud, "whenever I want to modify a file, I need to
:dlcmd:`unlock` it, do the modifications, and then :dlcmd:`save` it?"

"Well, this is certainly one way of doing it, and a completely valid workflow
if you do that outside of a :dlcmd:`run` command.
But within :dlcmd:`run` there is actually a much easier way of doing this.
Let's use the ``--output`` argument."

:dlcmd:`run` *retrieves* everything that is specified as ``--input`` prior to
command execution, and it *unlocks* everything specified as ``--output`` prior to
command execution. Therefore, whenever the output of a :dlcmd:`run` command already
exists and is tracked, it should be specified as an argument in
the ``-o``/``--output`` option.

.. index::
   pair: path globbing; with DataLad run
.. find-out-more:: But what if I have a lot of outputs?

   The use case here is simplistic -- a single file gets modified.
   But there are commands and tools that create full directories with
   many files as an output.
   The easiest way to specify this type of output
   is by supplying the directory name, or the directory name and a :term:`globbing` character, such as
   ``-o directory/*.dat``.
   This would unlock all files with a ``.dat`` extension inside of ``directory``.
   To glob for files in multiple levels of directories, use ``**`` (a so-called `globstar <https://www.linuxjournal.com/content/globstar-new-bash-globbing-option>`_) for a recursive glob through any number of directories.
   And, just as for ``-i``/``--input``, you could use multiple ``--output`` specifications.

.. index::
   pair: declare command output; with DataLad run

In order to execute :dlcmd:`run` with both the ``-i``/``--input`` and ``-o``/``--output``
flags and see their magic, let's resize the second logo, ``logo_interval.jpg``:

.. index::
   pair: files are unlocked by default; on Windows
   pair: run; DataLad command
   pair: unlocked files; in adjusted mode
.. windows-wit:: Wait, would I need to specify outputs, too?

   .. include:: topic/adjustedmode-unlockedfiles-output.rst

.. runrecord:: _examples/DL-101-111-105
   :language: console
   :workdir: dl-101/DataLad-101
   :emphasize-lines: 11
   :realcommand: datalad run --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" --output "recordings/interval_logo_small.jpg" "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
   :notes: but it is way easier to just use the --output option of datalad run: it takes care of unlocking if necessary
   :cast: 02_reproducible_execution

   $ datalad run -m "Resize logo for slides" \
     --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
     --output "recordings/interval_logo_small.jpg" \
     "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
   $ # or shorter:
   $ datalad run -m "Resize logo for slides" \
     -i "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
     -o "recordings/interval_logo_small.jpg" \
     "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"

This time, with both ``--input`` and ``--output``
options specified, DataLad informs about the :dlcmd:`get`
operations it performs prior to the command
execution, and :dlcmd:`run` executes the command successfully.
It does *not* inform about any :dlcmd:`unlock` operation,
because the output ``recordings/interval_logo_small.jpg`` does not
exist before the command is run. Should you rerun this command however,
the summary will include a statement about content unlocking. You will
see an example of this in the next section.
Note now how many individual commands a :dlcmd:`run` saves us:
:dlcmd:`get`, :dlcmd:`unlock`, and :dlcmd:`save`!
But even better: Beyond saving time *now*, running commands reproducibly and
recorded with :dlcmd:`run` saves us plenty of time in the future as soon
as we want to rerun a command, or find out how a file came into existence.
With this last code snippet, you have experienced a full :dlcmd:`run` command: commit message,
input and output definitions (the order in which you give those two options is irrelevant),
and the command to be executed. Whenever a command takes input or produces output you should specify
this with the appropriate option.
Make a note of this behavior in your ``notes.txt`` file.

.. runrecord:: _examples/DL-101-111-106
   :language: console
   :workdir: dl-101/DataLad-101
   :notes: Finally, let's add a note on this
   :cast: 02_reproducible_execution

   $ cat << EOT >> notes.txt
   You should specify all files that a command takes as input with an
   -i/--input flag. These files will be retrieved prior to the command
   execution. Any content that is modified or produced by the command
   should be specified with an -o/--output flag. Upon a run or rerun of
   the command, the contents of these files will get unlocked so that
   they can be modified.
   EOT

Save yourself the preparation time
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It's generally good practice to specify ``--input`` and ``--output`` even if your input files are already retrieved and your output files unlocked -- it makes sure that a recomputation can succeed, even if inputs are not yet retrieved, or if output needs to be unlocked.
However, the internal preparation steps of checking that inputs exist or that outputs are unlocked can take a bit of time, especially if it involves checking a large number of files.
If you want to avoid the expense of unnecessary preparation steps you can make use of the ``--assume-ready`` argument of :dlcmd:`run`.
Depending on whether your inputs are already retrieved, your outputs are already unlocked (or do not need to be unlocked), or both, specify ``--assume-ready`` with the argument ``inputs``, ``outputs``, or ``both`` and save yourself a few seconds, without sacrificing the ability to rerun your command under conditions in which the preparation would be necessary.

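
As a sketch, the resize call from above could skip the input check like this.
It assumes that the logo has already been retrieved, and it only does something when
started in the root of ``DataLad-101`` with DataLad installed; the guard and the
``outcome`` variable are additions for this illustration. Outputs are deliberately not
declared ready here, because ``recordings/interval_logo_small.jpg`` was saved by the
previous run and still needs to be unlocked.

.. code-block:: bash

   # Guard: do nothing outside of the DataLad-101 dataset.
   if command -v datalad >/dev/null 2>&1 && [ -d recordings/longnow ]; then
       # Skip the "is the input present?" check, keep the unlocking of outputs.
       datalad run -m "Resize logo for slides" \
           --assume-ready inputs \
           --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
           --output "recordings/interval_logo_small.jpg" \
           "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"
       outcome="ran"
   else
       outcome="skipped"
   fi
   echo "assume-ready example: $outcome"
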
Placeholders
^^^^^^^^^^^^
Just after writing the note, you had to relax your fingers a bit. "Man, this was
so much typing. Not only did I need to specify the inputs and outputs, I also had
to repeat all of these lengthy paths in the command line call..." you think.
There is a neat little trick to spare you half of this typing effort, though: *Placeholders*
for inputs and outputs. This is how it works:

Instead of running

.. code-block:: console

   $ datalad run -m "Resize logo for slides" \
     --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
     --output "recordings/interval_logo_small.jpg" \
     "convert -resize 450x450 recordings/longnow/.datalad/feed_metadata/logo_interval.jpg recordings/interval_logo_small.jpg"

you could shorten this to

.. code-block:: console
   :emphasize-lines: 4

   $ datalad run -m "Resize logo for slides" \
     --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
     --output "recordings/interval_logo_small.jpg" \
     "convert -resize 450x450 {inputs} {outputs}"

The placeholder ``{inputs}`` will expand to the path given as ``--input``, and
the placeholder ``{outputs}`` will expand to the path given as ``--output``.
This means that instead of writing the full paths in the command, you can simply reuse
the ``--input`` and ``--output`` specifications made before.

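
To get a feeling for the substitution, the following lines imitate it with plain bash
string replacement. This is only an emulation for illustration, not how DataLad
implements placeholders, and the variable names are made up:

.. code-block:: bash

   input="recordings/longnow/.datalad/feed_metadata/logo_interval.jpg"
   output="recordings/interval_logo_small.jpg"
   template='convert -resize 450x450 {inputs} {outputs}'

   # Replace each placeholder with the corresponding path.
   expanded=${template//\{inputs\}/$input}
   expanded=${expanded//\{outputs\}/$output}

   # Prints the command with both full paths filled in.
   echo "$expanded"

The printed line is the long ``convert`` call from the first variant above.
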

.. index::
   pair: multiple command inputs; with DataLad run
.. find-out-more:: What if I have multiple inputs or outputs?

   If multiple values are specified, e.g., as in

   .. code-block:: console

      $ datalad run -m "move a few files around" \
        --input "file1" --input "file2" --input "file3" \
        --output "directory_a/" \
        "mv {inputs} {outputs}"

   the values will be joined by a space like this:

   .. code-block:: console

      $ datalad run -m "move a few files around" \
        --input "file1" --input "file2" --input "file3" \
        --output "directory_a/" \
        "mv file1 file2 file3 directory_a/"

   The order of the values will match the order given on the command line.

   If you use globs for input specification, as in

   .. code-block:: console

      $ datalad run -m "move a few files around" \
        --input "file*" \
        --output "directory_a/" \
        "mv {inputs} {outputs}"

   the globs will be expanded in alphabetical order (like bash):

   .. code-block:: console

      $ datalad run -m "move a few files around" \
        --input "file1" --input "file2" --input "file3" \
        --output "directory_a/" \
        "mv file1 file2 file3 directory_a/"

   If the command only needs a subset of the inputs or outputs, individual values
   can be accessed with an integer index, e.g., ``{inputs[0]}`` for the very first
   input.

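
The joining and ordering rules can be tried out without DataLad. The sketch below uses a
temporary directory and bash arrays as stand-ins for the ``--input`` and ``--output``
values (all names are hypothetical); it mimics the behavior described above rather than
calling DataLad:

.. code-block:: bash

   workdir=$(mktemp -d)
   cd "$workdir"
   # Create the files in a scrambled order on purpose.
   touch file3 file1 file2

   # An unquoted glob is expanded by bash in sorted order.
   inputs=(file*)
   outputs=(directory_a/)

   # All values joined by spaces, like {inputs} and {outputs}.
   joined="mv ${inputs[*]} ${outputs[*]}"
   # A single value picked by index, like {inputs[0]}.
   first=${inputs[0]}

   echo "$joined"
   echo "$first"

   # Clean up the temporary directory again.
   cd - >/dev/null
   rm -r "$workdir"
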

.. index::
   pair: run command with curly brackets; with DataLad run
.. find-out-more:: ... wait, what if I need a curly bracket in my 'datalad run' call?

   If your command call involves a ``{`` or ``}`` character, you will need to escape
   this brace character by doubling it, i.e., ``{{`` or ``}}``.

.. index::
   pair: dry-run; with DataLad run
.. _dryrun:

Dry-running your run call
^^^^^^^^^^^^^^^^^^^^^^^^^
:dlcmd:`run` commands can become confusing and long, especially when you make heavy use of placeholders or wrap complex bash commands.
To better anticipate what you will be running, or to help debug a failed command, you can make use of the ``--dry-run`` flag of ``datalad run``.
This option needs a mode specification (``--dry-run=basic`` or ``--dry-run=command``), followed by the ``run`` command you want to execute, and it will decipher the command's elements:
The mode ``command`` will display the command that is about to be run.
The mode ``basic`` will report a few important details about the execution:
Apart from displaying the command that will be run, you will learn *where* the command runs, what its *inputs* are (helpful if your ``--input`` specification includes a :term:`globbing` term), and what its *outputs* are.

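
A dry run of the placeholder example from the previous section could look like the
following sketch. It only does something when started in the root of ``DataLad-101``
with DataLad installed; the guard and the ``outcome`` variable are additions for this
illustration. Swap ``basic`` for ``command`` to see only the expanded command.

.. code-block:: bash

   # Guard: do nothing outside of the DataLad-101 dataset.
   if command -v datalad >/dev/null 2>&1 && [ -d recordings/longnow ]; then
       # Report the planned execution instead of running the command.
       datalad run --dry-run=basic -m "Resize logo for slides" \
           --input "recordings/longnow/.datalad/feed_metadata/logo_interval.jpg" \
           --output "recordings/interval_logo_small.jpg" \
           "convert -resize 450x450 {inputs} {outputs}"
       outcome="reported"
   else
       outcome="skipped"
   fi
   echo "dry-run example: $outcome"
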

.. only:: adminmode

   Add a tag at the section end.

   .. runrecord:: _examples/DL-101-111-107
      :language: console
      :workdir: dl-101/DataLad-101

      $ git branch sct_input_and_output

.. [#f1] In shell programming, commands exit with a specific code that indicates
         whether they failed, and if so, how. Successful commands have the exit code zero. All failures
         have exit codes greater than zero.