datalad-handbook/docs/challenges/102-103-provenance.rst
2025-06-24 10:08:37 +02:00

208 lines
No EOL
5.6 KiB
ReStructuredText

.. _challengeProv:
Challenge: Provenance Capture
*****************************
.. importantnote:: You can always get help
In order to learn about available DataLad commands, use ``datalad --help``. In order to learn more about a specific command, use ``datalad <subcommand> --help``.
Challenge 1:
""""""""""""
Create a DataLad dataset called ``iyoda``, applying a specific post-creation routine called ``yoda`` (``-c yoda``).
.. find-out-more:: Okidoki!
Creating the dataset
.. runrecord:: _examples/cha-102-103-provenance-1
:language: console
:workdir: challenges/102-103-provenance
$ datalad create -c yoda iyoda
Run the command :term:`gitk`:
- What did the "YODA" setup actually do?
- How do we know that data module should go into ``inputs/``?
Add the following dataset as a subdataset called ``inputs``: https://github.com/datalad-handbook/iris_data.
Inspect its history with :term:`gitk`. When was it made, what does it contain?
.. find-out-more:: Show me the solution
.. runrecord:: _examples/cha-102-103-provenance-2
:language: console
:workdir: challenges/102-103-provenance/
$ cd iyoda
$ datalad clone -d . https://github.com/datalad-handbook/iris_data.git inputs
Hint: In order to investigate the subdatasets history, enter it first.
Inside of ``iyoda``, create a script ``code/extract.py`` with the following content::
from os.path import join as opj
import csv
with open(opj('inputs', 'iris.csv')) as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
if row['variety'] != 'Setosa':
continue
print(row['petal.length'])
.. runrecord:: _examples/cha-102-103-provenance-3
:language: console
:workdir: challenges/102-103-provenance/iyoda/
$ cat << EOT > code/extract.py
from os.path import join as opj
import csv
with open(opj('inputs', 'iris.csv')) as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
if row['variety'] != 'Setosa':
continue
print(row['petal.length'])
EOT
This Python script will print all rows matching the Setosa variety.
Be careful: With Python, consistent indentation with tabs OR spaces is necessary!
You can test your script with ``$ python code/extract.py``
.. windows-wit:: Beware of path semantics on Windows
On Windows, test the script with::
$ python code\extract.py
If there are no errors, save the script.
.. find-out-more:: Tell me how!
.. runrecord:: _examples/cha-102-103-provenance-4
:language: console
:workdir: challenges/102-103-provenance/iyoda/
$ datalad save -m "Save data extraction script"
Try to figure out why there was no output when running the script.
[ ... ]
*space for a dramatic pause*
[ ... ]
Retry running the script after getting content from the subdataset.
.. find-out-more:: Right, let's go!
To retrieve contents from the subdataset run:
.. runrecord:: _examples/cha-102-103-provenance-5
:language: console
:workdir: challenges/102-103-provenance/iyoda
$ datalad get inputs
.. runrecord:: _examples/cha-102-103-provenance-6
:language: console
:workdir: challenges/102-103-provenance/iyoda
$ python code/extract.py
Now that the script works as intended, run it and write its outputs into a file for further processing using the following code:
.. runrecord:: _examples/cha-102-103-provenance-7
:language: console
:workdir: challenges/102-103-provenance/iyoda/
$ python code/extract.py > outputs.dat
.. windows-wit:: Beware of path semantics on Windows
On Windows, test the script with::
$ python code\extract.py > outputs.dat
Check the dataset state and save the modification. Inspect the change record:
- What information is captured?
- Imagine yourself in a year. What information would you be missing?
.. find-out-more:: Let's take a look
.. runrecord:: _examples/cha-102-103-provenance-8
:language: console
:workdir: challenges/102-103-provenance/iyoda
$ datalad status
and save:
.. runrecord:: _examples/cha-102-103-provenance-9
:language: console
:workdir: challenges/102-103-provenance/iyoda
$ datalad save -m "Create the desired setosa variety petal length data file"
Run the script again, but through DataLad, and declare inputs and outputs.
This time, save the output file as ``plength.txt``.
Use :term:`gitk` to inspect the change record.
What is different now?
.. find-out-more:: Let's take a look
.. runrecord:: _examples/cha-102-103-provenance-10
:language: console
:workdir: challenges/102-103-provenance/iyoda
$ datalad run -i inputs/iris.csv -o plength.txt "python code/extract.py > {outputs}"
But beware on Windows!
.. windows-wit:: paths strike again!
.. code-block::
$ datalad run -i inputs\iris.csv -o plength.txt "python code\extract.py > {outputs}"
Finally, force DataLad to lose the file ``plength.txt``.
Re-execute the provenance record.
Afterwards, check what has changed.
.. find-out-more:: Final stretch now
First, drop recklessy:
.. runrecord:: _examples/cha-102-103-provenance-11
:language: console
:workdir: challenges/102-103-provenance/iyoda
$ datalad drop --reckless availability plength.txt
Then, rerun:
.. runrecord:: _examples/cha-102-103-provenance-12
:language: console
:workdir: challenges/102-103-provenance/iyoda
$ datalad rerun HEAD
What changed?
.. runrecord:: _examples/cha-102-103-provenance-13
:language: console
:workdir: challenges/102-103-provenance/iyoda
$ datalad status