datalad-handbook/docs/challenges/102-101-datasets.rst

.. index::
   pair: create; DataLad command
   pair: create dataset; with DataLad
.. _challengeDS:

Challenge: DataLad Datasets
***************************

.. importantnote:: You can always get help

   In order to learn about available DataLad commands, use ``datalad --help``. In order to learn more about a specific command, use ``datalad <subcommand> --help``.


Challenge 1
"""""""""""

Create a dataset called ``my-dataset`` on your computer.
Inside of the dataset, run the command :term:`gitk` and explore it.

Can you find:

- the dataset identifier?
- the version label?
- the dataset creator?
- the dataset creation date?

Afterwards, run the command ``gitk --all``. What is the difference from before?

.. find-out-more:: Show me how to do it

	To create a new dataset, run:

	.. runrecord:: _examples/cha-102-101-datasets-1
	   :language: console
	   :workdir: challenges/102-101-dataset

	   $ datalad create my-dataset

Finally, remove the dataset.

.. find-out-more:: How do I do that?

	To remove it, run :dlcmd:`drop`. Importantly, this command needs to run *outside* of the dataset.

	.. runrecord:: _examples/cha-102-101-datasets-2
	   :language: console
	   :workdir: challenges/102-101-dataset

	   $ datalad drop --what all -d my-dataset --reckless availability


Challenge 2
"""""""""""

Text files are digital files containing plain text.
Take a minute to think:
- Why is it often useful to keep textfiles out of git-annex?
On the other hand, what could be a reason to annex text files?

.. find-out-more:: Tell me!

   **Why is it useful to keep textfiles out of git-annex**?

   - To make editing easier (no need to unlock)
   - To have a nicer Git history (commits can show differences between file revisions)
   - To distribute the file automatically with every clone (unlike with annexed files, file content of files kept in Git is readily available in shared dataset clones)

   **What could be a reason to annex text files?**

   - To keep file contents private/secret (annexing files allows access control)
   - An unusually large text file (at least dozens of MB)

Create a DataLad dataset called ``text2gitdataset`` and configure it to never annex text files (there are several ways to do this!).

.. find-out-more:: Ok, show me the ways!

   **1. Right at dataset creation**

   .. runrecord:: _examples/cha-102-101-datasets-3
      :language: console
      :workdir: challenges/102-101-dataset

      $ datalad create -c text2git text2gitdataset

   **2. After dataset creation** with a :dlcmd:`run-procedure`

   .. runrecord:: _examples/cha-102-101-datasets-4
      :language: console
      :workdir: challenges/102-101-dataset

      $ datalad create text2gitdataset-2
      $ cd text2gitdataset-2
      $ datalad run-procedure cfg_text2git

   **3. By editing .gitattributes by hand**

   .. runrecord:: _examples/cha-102-101-datasets-5
      :language: console
      :workdir: challenges/102-101-dataset

      $ datalad create text2gitdataset-3
      $ cd text2gitdataset-3
      $ echo "* annex.largefiles=(mimeencoding=binary)and(largerthan=0))" >> .gitattributes
      $ datalad save -m "configure Dataset to keep text files in Git"

In the end, remove the datasets.

.. find-out-more:: Can you show me again?

   Clean-up:

   .. runrecord:: _examples/cha-102-101-datasets-6
      :language: console
      :workdir: challenges/102-101-dataset

      $ datalad drop -d text2gitdataset --what all --reckless availability
      $ datalad drop -d text2gitdataset-2 --what all --reckless availability
      $ datalad drop -d text2gitdataset-3 --what all --reckless availability

Challenge 3
"""""""""""

Version controlling a file means to record its changes over time, associate those changes with an author, date, and identifier, creating a lineage of file content, and being able to revert changes or restore previous file versions.
DataLad datasets can version control their contents, regardless of size.

Create a new dataset ``my-dataset`` that is configured to store text files in Git (see previous challenge) and add a ``README.md`` file with some content into it.
Make sure to save it with a helpful commit message, and inspect your datasets revision history.

.. find-out-more:: Let's go!

   Create the dataset and ``cd`` into it:

   .. runrecord:: _examples/cha-102-101-datasets-7
      :language: console
      :workdir: challenges/102-101-dataset

      $ datalad create -c text2git my-dataset
      $ cd my-dataset

   Create a text file and save it (you can also create a text file with an editor of your choice, e.g., :term:`vim`.)

   .. runrecord:: _examples/cha-102-101-datasets-8
      :language: console
      :workdir: challenges/102-101-dataset/my-dataset

      $ echo "# Example Dataset" > README.md
      $ datalad status

   .. runrecord:: _examples/cha-102-101-datasets-9
      :language: console
      :workdir: challenges/102-101-dataset/my-dataset

      $ datalad save -m "add a README to the dataset"

   Check the dataset's history:

   .. runrecord:: _examples/cha-102-101-datasets-10
      :language: console
      :workdir: challenges/102-101-dataset/my-dataset

      $ git log

Run :term:`gitk` again. Can you find the dataset modification date?

Finally, edit the README and save it again.

.. find-out-more:: Let's go!

   .. runrecord:: _examples/cha-102-101-datasets-11
      :language: console
      :workdir: challenges/102-101-dataset/my-dataset

      $ echo "This is my example dataset" >> README.md
      $ datalad save -m "Add redundant explanation"

Challenge 4
"""""""""""

Download and save the following set of penguin images available at the URLs below into a dataset:

- ``chinstrap_01.jpg``: https://hub.datalad.org/edu/penguins/media/branch/main/examples/adelie.jpg
- ``chinstrap_02.jpg``: https://hub.datalad.org/edu/penguins/media/branch/main/examples/chinstrap.jpg

You can reuse the dataset from the previous challenge, or create a new one.
Can you do the download while recording provenance?

.. find-out-more:: Give me a hint about provenance

   Try using :dlcmd:`download-url` or `datalad-next's  "download" command <https://docs.datalad.org/projects/next/en/stable/generated/man/datalad-download.html>`_ combined with :dlcmd:`run`.

.. find-out-more:: Show me the entire solution

   You can download a file and save it manually:

   .. runrecord:: _examples/cha-102-101-datasets-12
      :language: console
      :workdir: challenges/102-101-dataset/my-dataset

      $ wget -q -O chinstrap_01.jpg "https://hub.datalad.org/edu/penguins/media/branch/main/examples/adelie.jpg"
      $ datalad save -m "Add manually downloaded images"

   Or download it recording its origin as provenance:

   .. runrecord:: _examples/cha-102-101-datasets-13
      :language: console
      :workdir: challenges/102-101-dataset/my-dataset

      $ datalad run -m "Add image from the web" " datalad download 'https://hub.datalad.org/edu/penguins/media/branch/main/examples/chinstrap.jpg'"

Run :term:`gitk` in the dataset.
Can you find the file identifier of any of the newly downloaded files?

Challenge 5
"""""""""""

Other than creating datasets on your own, DataLad allows to clone existing datasets, too.
Clone and explore the dataset from the following publication:

> *Wittkuhn, L., Schuck, N.W. Dynamics of fMRI patterns reflect sub-second activation sequences and reveal replay in human visual cortex. Nat Commun 12, 1795 (2021). https://doi.org/10.1038/s41467-021-21970-2*

You can find it at https://github.com/lnnrtwttkhn/highspeed-analysis.


.. find-out-more:: Show me how to clone it

   .. runrecord:: _examples/cha-102-101-datasets-14
      :language: console
      :workdir: challenges/102-101-dataset/

      $ datalad clone https://github.com/lnnrtwttkhn/highspeed-analysis.git

Explore the dataset:

- When was it created?
- When was it last updated?
- How many contributors does it have?
- How much annexed file content does it contain?
- How many subdatasets are there?

.. find-out-more:: Let's compare explorations

   When was it created?

   .. runrecord:: _examples/cha-102-101-datasets-15
      :language: console
      :workdir: challenges/102-101-dataset/

      $ cd highspeed-analysis
      # first commit
      $ git log $(git rev-list --max-parents=0 HEAD)

   When was it last updated?

   .. runrecord:: _examples/cha-102-101-datasets-16
      :language: console
      :workdir: challenges/102-101-dataset/highspeed-analysis

      # most recent commit
      $ git show

   How many contributors does it have?

   .. runrecord:: _examples/cha-102-101-datasets-17
      :language: console
      :workdir: challenges/102-101-dataset/highspeed-analysis

      # contributions by contributor
      $ git shortlog -s

   How much annexed file content does it contain?

   .. runrecord:: _examples/cha-102-101-datasets-18
      :language: console
      :workdir: challenges/102-101-dataset/highspeed-analysis

      $ datalad status --annex all

   How many subdatasets are there?

   .. runrecord:: _examples/cha-102-101-datasets-19
      :language: console
      :workdir: challenges/102-101-dataset/highspeed-analysis

      $ datalad subdatasets

Finally, get the annexed file content and drop it afterwards.

.. find-out-more:: Yeah, data!

   Get it...

   .. runrecord:: _examples/cha-102-101-datasets-20
      :language: console
      :workdir: challenges/102-101-dataset/highspeed-analysis

      $ datalad get .

   Drop it!

   .. runrecord:: _examples/cha-102-101-datasets-21
      :language: console
      :workdir: challenges/102-101-dataset/highspeed-analysis

      $ datalad drop .