datalad-handbook/docs/challenges/102-102-subdatasets.rst
Adina Wagner 91808575e2
Merge pull request #1266 from datalad-handbook/urls
adjust runrecord outputs to new urls
2025-06-24 14:25:10 +02:00

158 lines
No EOL
5.1 KiB
ReStructuredText

.. _challengeSubDS:
Challenge: DataLad Subdatasets
******************************
.. importantnote:: You can always get help
In order to learn about available DataLad commands, use ``datalad --help``. In order to learn more about a specific command, use ``datalad <subcommand> --help``.
Challenge 1
"""""""""""
Datasets can have subdatasets.
Let's build a nested dataset from scratch.
Start by creating a dataset called ``penguin-report``, and inside of it, create a subdataset called ``inputs``::
penguin-report
└── inputs
.. find-out-more:: Show me how to do it
Create a new dataset as the superdataset:
.. runrecord:: _examples/cha-102-102-subdatasets-1
:language: console
:workdir: challenges/102-102-subdataset
$ datalad create penguin-report
$ cd penguin-report
Next, a subdataset ``inputs`` is created and registered inside of the superdataset using ``-d``:
.. runrecord:: _examples/cha-102-102-subdatasets-2
:language: console
:workdir: challenges/102-102-subdataset/penguin-report
$ datalad create -d . inputs
Challenge 2
"""""""""""
Let's populate the subdataset with contents.
Download the following set of CSV files into the ``inputs`` dataset and save them:
- adelie.csv: https://pasta.lternet.edu/package/data/eml/knb-lter-pal/219/5/002f3893385f710df69eeebe893144ff
- gentoo.csv: https://pasta.lternet.edu/package/data/eml/knb-lter-pal/220/7/e03b43c924f226486f2f0ab6709d2381
- chinstrap.csv: https://pasta.lternet.edu/package/data/eml/knb-lter-pal/221/8/fe853aa8f7a59aa84cdd3197619ef462
.. find-out-more:: Downloading first:
There are several ways to accomplish this. The solution below uses ``download`` command from ``datalad-next`` and :dlcmd:`run` inside of the subdataset.
.. runrecord:: _examples/cha-102-102-subdatasets-3
:language: console
:workdir: challenges/102-102-subdataset/penguin-report
$ cd inputs
$ datalad run -m "Download penguin data" "datalad download 'https://pasta.lternet.edu/package/data/eml/knb-lter-pal/219/5/002f3893385f710df69eeebe893144ff adelie.tst' 'https://pasta.lternet.edu/package/data/eml/knb-lter-pal/220/7/e03b43c924f226486f2f0ab6709d2381 gentoo.tsv' 'https://pasta.lternet.edu/package/data/eml/knb-lter-pal/221/8/fe853aa8f7a59aa84cdd3197619ef462 chinstrap.csv'"
Afterwards, record the new subdataset state in the superdataset.
.. find-out-more:: Saving the updated subdataset state
``datalad status`` in the superdataset will show that the subdataset changed:
.. runrecord:: _examples/cha-102-102-subdatasets-4
:language: console
:workdir: challenges/102-102-subdataset/penguin-report/inputs
$ cd ..
$ datalad status
To save this most recent state use :dlcmd:`save` with ``-d``:
.. runrecord:: _examples/cha-102-102-subdatasets-5
:language: console
:workdir: challenges/102-102-subdataset/penguin-report/inputs
# navigate into penguin-report superdataset
$ cd ..
$ datalad save -d . -m "Save updated subdataset version"
Challenge 3
"""""""""""
Where can you find out about the subdataset version?
.. find-out-more:: Tell me!
The information is stored in commits about the subdataset - but only in the superdataset. Take a look at the so called "subproject commit":
.. runrecord:: _examples/cha-102-102-subdatasets-6
:language: console
:workdir: challenges/102-102-subdataset/penguin-report
$ git show inputs
Challenge 4
"""""""""""
Clone the following dataset: https://github.com/psychoinformatics-de/studyforrest-data.
Try to list the available subdatasets.
.. find-out-more:: I'm excited!
Start with cloning:
.. runrecord:: _examples/cha-102-102-subdatasets-7
:language: console
:workdir: challenges/102-102-subdataset
$ datalad clone https://github.com/psychoinformatics-de/studyforrest-data.git
Find out about subdatasets afterwards:
.. runrecord:: _examples/cha-102-102-subdatasets-8
:language: console
:workdir: challenges/102-102-subdataset
$ cd studyforrest-data
$ datalad subdatasets
Take a look at any of the subdatasets' directories. Why do they appear to be empty?
What do you need to do to retrieve availability information about a dataset, but not download its content? Try with the subdataset ``original/phase2``.
.. find-out-more:: Okidoki, I'm ready.
.. runrecord:: _examples/cha-102-102-subdatasets-9
:language: console
:workdir: challenges/102-102-subdataset/studyforrest-data
$ datalad get -n original/phase2
.. windows-wit:: Beware of Windows path semantics
On Windows, make sure to adjust the path to the subdataset::
$ datalad get -r original\phase2
Where can you find out about the origin location of a dataset's subdatasets?
.. find-out-more:: Let's see!
The information is stored in the superdatasets' ``.gitmodules`` file:
.. runrecord:: _examples/cha-102-102-subdatasets-11
:language: console
:workdir: challenges/102-102-subdataset/studyforrest-data
$ cat .gitmodules
Navigate into the newly installed subdataset ``original/phase2``.
Run :term:`gitk` and explore its files to find out what this dataset is all about.