
.. index::
   pair: clone; DataLad command
   pair: clone dataset; with DataLad
.. _installds:

Install datasets
----------------

So far, we have created a ``DataLad-101`` course dataset. We saved some additional readings
into the dataset, and have carefully made and saved notes on the DataLad
commands we discovered. Up to this point, we therefore know the typical, *local*
workflow to create and populate a dataset from scratch.

But we've been told that with DataLad we could very easily get vast amounts of data to our
computer. Rumor has it that this would be only a single command in the terminal!
Therefore, everyone in the lecture excitedly awaits today's topic: installing datasets.

"With DataLad, users can install *clones* of existing DataLad datasets from paths, URLs, or
open-data collections" our lecturer begins.
"This makes accessing data fast and easy. Anyone can create a dataset that others can
install, without a need for additional software. Others can likewise install your own
datasets, should you want that. Therefore, not only accessing data becomes fast and easy,
but also *sharing*."
"That's so cool!", you think. "Exam preparation will be a piece of cake if all of us
can share our mid-term and final projects easily!"
"But today, let's only focus on how to install a dataset", she continues.
"Damn it! Can we not have longer lectures?", you think and set alarms for all of the
upcoming lecture dates in your calendar.
There is so much exciting stuff to come, you cannot miss a single one.

"Psst!" a student from the row behind reaches over. "There are
a bunch of audio recordings of a really cool podcast, and they have been shared in the form
of a DataLad dataset! Shall we try whether we can install that?"

"Perfect! What a great way to learn how to install a dataset. Doing it
now instead of looking at slides for hours is my preferred type of learning anyway",
you think as you fire up your terminal and navigate into your ``DataLad-101`` dataset.

In this demonstration, we are using one of the many openly available datasets that
DataLad provides in a public registry that anyone can access. One of these datasets is a
collection of audio recordings of a great podcast, the longnow seminar series [#f2]_.
It consists of audio recordings about long-term thinking, and while the DataLad-101
course is not a long-term thinking seminar, those recordings are nevertheless a
good addition to the large stash of yet-to-read textbooks we piled up.
Let's get this dataset into our existing ``DataLad-101`` dataset.

To keep the ``DataLad-101`` dataset neat and organized, we first create a new directory,
called ``recordings``.

.. runrecord:: _examples/DL-101-105-101
   :language: console
   :workdir: dl-101/DataLad-101
   :cast: 01_dataset_basics
   :notes: The next challenge is to clone an existing dataset from the web as a subdataset. First, we create a location for this

   $ # we are in the root of DataLad-101
   $ mkdir recordings


The command that can be used to obtain a dataset is :dlcmd:`clone`,
but we often refer to the process of cloning a dataset as *installing*.
Let's install the longnow podcasts in this new directory.

The :dlcmd:`clone` command takes the location of an existing dataset to clone. This *source*
can be a URL, a path to a local directory, or an SSH server [#f1]_. The dataset
to be installed lives on :term:`GitHub`, at
`https://github.com/datalad-datasets/longnow-podcasts.git <https://github.com/datalad-datasets/longnow-podcasts>`_,
and we can give its GitHub URL as the first positional argument.
Optionally, the command also takes a second positional argument: the *destination*,
a path to where we want to install the dataset. In this case, it is ``recordings/longnow``.
Because we are installing a dataset (the podcasts) into an existing dataset (the ``DataLad-101``
dataset), we also supply a ``-d/--dataset`` flag to the command.
This specifies the dataset to perform the operation on, and allows us to install
the podcasts as a *subdataset* of ``DataLad-101``. Because we are in the root
of the ``DataLad-101`` dataset, the pointer to the dataset is a ``.`` (which is Unix'
way of saying "current directory").

As before with long commands, we line break the code with a ``\``. You can
copy it as it is presented here into your terminal, but in your own work you
can write commands like this into a single line.

.. runrecord:: _examples/DL-101-105-102
   :language: console
   :workdir: dl-101/DataLad-101/
   :cast: 01_dataset_basics
   :notes: We need to clone the dataset as a subdataset. For this, we use the datalad clone command with a --dataset option and a path. Else the dataset would not be registered as a subdataset!

   $ datalad clone --dataset . \
    https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow

This command copied the repository found at the URL https://github.com/datalad-datasets/longnow-podcasts
into the existing ``DataLad-101`` dataset, into the directory ``recordings/longnow``.
The optional destination is helpful: If we had not specified the path
``recordings/longnow`` as a destination for the dataset clone, the command would
have installed the dataset into the root of the ``DataLad-101`` dataset, and instead
of ``longnow`` it would have used the name of the remote repository "``longnow-podcasts``".
But the coolest feature of :dlcmd:`clone` is not yet visible: This command
also recorded where this dataset came from, thus capturing its *origin* as
:term:`provenance`. Even though this is not obvious at this point in time, later
chapters in this handbook will demonstrate how useful this information can be.

.. index::
   pair: clone; DataLad concept
.. gitusernote:: Clone internals

   The :dlcmd:`clone` command uses :gitcmd:`clone`.
   A dataset that is installed from an existing source, e.g., a path or URL,
   is the DataLad equivalent of a *clone* in Git.
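
As a minimal sketch of what this means in practice, the following snippet uses plain Git
on a throwaway repository to show that a clone remembers where it came from.
The directory names ``source`` and ``clone`` and the commit identity are made up for
illustration and are not part of the ``DataLad-101`` walkthrough. Run inside
``recordings/longnow``, the same ``git remote get-url origin`` call should report the
GitHub URL instead.

.. code-block:: bash

   # create a throwaway repository with a single (empty) commit
   workdir=$(mktemp -d)
   git init --quiet "$workdir/source"
   git -C "$workdir/source" \
       -c user.name="Example User" -c user.email="user@example.com" \
       commit --quiet --allow-empty -m "Initial commit"

   # clone it, then ask the clone where it came from
   git clone --quiet "$workdir/source" "$workdir/clone"
   origin_url=$(git -C "$workdir/clone" remote get-url origin)
   echo "$origin_url"

The recorded location is what later allows a clone to be updated from its source.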

.. index::
   pair: clone into another dataset; with DataLad
.. find-out-more:: Do I have to install from the root of datasets?

   No. Instead of from the *root* of the ``DataLad-101`` dataset, you could have also
   installed the dataset from within the ``recordings`` or ``books`` directory.
   When installing datasets into existing datasets, however, you need
   to adjust the path that is given with the ``-d/--dataset`` option:
   ``-d`` needs to specify the path to the root of the dataset. This is
   important to keep in mind whenever you do not execute the :dlcmd:`clone` command
   from the root of this dataset. Luckily, there is a shortcut: ``-d^`` will always
   point to the root of the top-most dataset. For example, if you navigate into ``recordings``,
   the command would be:

   .. code-block:: console

     $ datalad clone -d^ https://github.com/datalad-datasets/longnow-podcasts.git longnow

.. find-out-more:: What if I do not install into an existing dataset?

   If you do not install into an existing dataset, you only need to omit the ``-d/--dataset``
   option. You can try:

   .. code-block:: console

     $ datalad clone https://github.com/datalad-datasets/longnow-podcasts.git

   anywhere outside of your ``DataLad-101`` dataset to install the podcast dataset into a new directory
   called ``longnow-podcasts``. You could even do this inside of an existing dataset.
   However, whenever you install datasets inside of other datasets, the ``-d/--dataset``
   option is necessary to not only install the dataset, but also *register* it
   automatically into the higher level *superdataset*. The upcoming section will
   elaborate on this.

Here is the repository structure:

.. index::
   pair: tree; terminal command
   pair: display directory tree; on Windows
.. windows-wit:: use tree

   .. include:: topic/tree-windows.rst

.. runrecord:: _examples/DL-101-105-103
   :language: console
   :workdir: dl-101/DataLad-101
   :cast: 01_dataset_basics
   :notes: Let's take a look at the directory structure after cloning

   $ tree -d   # we limit the output to directories

We can see that ``recordings`` has one subdirectory, our newly installed ``longnow``
dataset with two subdirectories.
If we navigate into one of them and list its content, we'll see many ``.mp3`` files (here is an excerpt).

.. runrecord:: _examples/DL-101-105-104
   :language: console
   :workdir: dl-101/DataLad-101/
   :lines: 1-15
   :cast: 01_dataset_basics
   :notes: And now let's look into these seminar series folders: There are hundreds of mp3 files, yet the download only took a few seconds! How can that be?

   $ cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
   $ ls


Dataset content identity and availability information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Surprised, you turn to your fellow student and wonder about
how fast the dataset was installed. Should
a download of that many ``.mp3`` files not take much more time?

Here you can see another important feature of DataLad datasets
and the :dlcmd:`clone` command:
Upon installation of a DataLad dataset, DataLad retrieves only small files
(for example, text files or markdown files) and (small) metadata
about the dataset. It does not, however, download any large files
(yet). The metadata exposes the dataset's file hierarchy
for exploration (note how you are able to list the dataset contents with ``ls``),
and downloading only this metadata speeds up the installation of a DataLad dataset
of many TB in size to a few seconds. Just now, after installing, the dataset is
small in size:

.. index::
   pair: show file size; in a terminal
.. runrecord:: _examples/DL-101-105-105
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
   :cast: 01_dataset_basics
   :notes: Upon cloning of a DataLad dataset, DataLad retrieves only small files and metadata. Therefore the dataset is tiny in size. The files are non-functional now atm (Try opening one)

   $ cd ../      # in longnow/
   $ du -sh      # Unix command to show size of contents

This is tiny indeed!

If you executed the previous ``ls`` command in your own terminal, you might have seen
the ``.mp3`` files highlighted in a different color than usual.
On your computer, try to open one of the ``.mp3`` files.
You will notice that you cannot open any of the audio files.
This is not your fault: *None of these files exist on your computer yet*.

Wait, what?

This sounds strange, but it has many advantages. Apart from a fast installation,
it allows you to retrieve precisely the content you need, instead of all the contents
of a dataset. Thus, even if you install a dataset that is many TB in size,
it takes up only a few MB of space after the install, and you can retrieve only those
components of the dataset that you need.
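
One way to picture this, assuming the dataset uses git-annex's default, symlink-based
mode that a later section covers: each file that has not been retrieved yet is a symbolic link whose
target is missing. The following sketch imitates that situation in a temporary directory
with a hand-made dangling link. The helper ``content_present`` and both file names are
hypothetical and only serve the illustration.

.. code-block:: bash

   # succeed only if the file's content is actually available;
   # a dangling symbolic link (name exists, target missing) fails the check
   content_present() {
       [ -e "$1" ]
   }

   demo_dir=$(mktemp -d)
   # stand-in for a file that has not been retrieved yet
   ln -s "$demo_dir/missing-object" "$demo_dir/placeholder.mp3"
   # stand-in for a file whose content is present
   printf 'audio' > "$demo_dir/retrieved.mp3"

   for f in "$demo_dir"/*.mp3; do
       if content_present "$f"; then
           echo "present: $(basename "$f")"
       else
           echo "missing: $(basename "$f")"
       fi
   done

The loop reports ``placeholder.mp3`` as missing and ``retrieved.mp3`` as present, even
though ``ls`` would list both names.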

Let's see how large the dataset would be in total if all of the files were present.
For this, we supply an additional option to :dlcmd:`status`. Make sure to be
(somewhere) inside of the ``longnow`` dataset to execute the following command:

.. runrecord:: _examples/DL-101-105-106
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow
   :cast: 01_dataset_basics
   :notes: But how large would the dataset be if we had all the content?

   $ datalad status --annex

Woah! More than 200 files, totaling more than 15 GB?
You begin to appreciate that DataLad did not
download all of this data right away! That would have taken hours given the crappy
internet connection in the lecture hall, and you are not even sure whether your
hard drive has much space left...


But you are nevertheless curious about how to actually listen to one of these ``.mp3``\s now.
So how does one actually "get" the files?

.. index::
   pair: get; DataLad command

The command to retrieve file content is :dlcmd:`get`.
You can specify one or more specific files, or ``get`` all of the dataset by
specifying :dlcmd:`get .` at the root directory of the dataset (with ``.`` denoting "current directory").

.. index::
   pair: get file content; with DataLad

First, we get one of the recordings in the dataset -- take any one of your choice
(here, it's the first).

.. runrecord:: _examples/DL-101-105-107
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow
   :cast: 01_dataset_basics
   :notes: Now let's finally get some content in this dataset. This is done with the datalad get command

   $ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3

Try to open it -- it will now work.

If you wanted to get the rest of the missing data, instead of specifying all files individually,
you could use ``.`` to refer to *all* of the dataset like this:

.. code-block:: console

   $ datalad get .

However, with a total size of more than 15 GB, this might take a while, so do not do that now.
If you did execute the command above, interrupt it by pressing ``CTRL`` + ``C`` -- Do not worry,
this will not break anything.

.. index::
   pair: show dataset size; with DataLad

Isn't that easy?
Let's see how much content is now present locally. For this, :dlcmd:`status --annex all`
has a nice summary:

.. runrecord:: _examples/DL-101-105-108
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow
   :cast: 01_dataset_basics
   :notes: DataLad status can also summarize how much of the content is already present locally:

   $ datalad status --annex all

This shows you how much of the total content is present locally. With one file,
it is only a fraction of the total size.

Let's ``get`` a few more recordings, just because it was so mesmerizing to watch
DataLad's fancy progress bars.

.. runrecord:: _examples/DL-101-105-109
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow
   :cast: 01_dataset_basics
   :notes: Let's get a few more files. Note how already obtained files are not downloaded again:

   $ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
   Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
   Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3

Note that any data that is already retrieved (the first file) is not downloaded again.
DataLad summarizes the outcome of the execution of ``get`` in the end and informs
that the download of one file was ``notneeded`` and the retrieval of the other files was ``ok``.
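
Because the file names are long, it can be convenient to let the shell build the list of
paths. The following is a sketch, assuming a Unix shell with glob expansion and that you
are in the root of the ``longnow`` dataset: the pattern matches every recording whose
file name starts with ``2003_``. Recordings that are already present should again be
reported as ``notneeded``.

.. code-block:: console

   $ # the shell expands the pattern into matching file names before DataLad runs
   $ datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_*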


.. index::
   pair: get; DataLad concept
.. gitusernote:: Get internals

   :dlcmd:`get` uses :gitannexcmd:`get` under the hood.

.. index::
   pair: drop file content; with DataLad

Keep whatever you like
^^^^^^^^^^^^^^^^^^^^^^

"Oh shit, oh shit, oh shit..." you hear from right behind you. Your fellow student
apparently downloaded the *full* dataset accidentally. "Is there a way to get rid
of file contents in a dataset, too?", they ask. "Yes", the lecturer responds,
"you can remove file contents by using :dlcmd:`drop`. This is
really helpful to save disk space for data you can easily reobtain, for example".

.. index::
   pair: drop; DataLad command

The :dlcmd:`drop` command will remove
file contents completely from your dataset.
You should only use this command to remove contents that you can :dlcmd:`get`
again, or generate again (for example, with next chapter's :dlcmd:`run`
command), or that you really do not need anymore.

Let's remove the content of one of the files that we have downloaded, and check
what this does to the total size of the dataset. Here is the current amount of
retrieved data in this dataset:

.. runrecord:: _examples/DL-101-105-110
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow

   $ datalad status --annex all

We drop a single recording's content that we previously downloaded with
:dlcmd:`get` ...

.. runrecord:: _examples/DL-101-105-111
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow

   $ datalad drop Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3

... and check the size of the dataset again:

.. runrecord:: _examples/DL-101-105-112
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow

   $ datalad status --annex all

Dropping the file content of one ``mp3`` file saved roughly 40 MB of disk space.
Whenever you need the recording again, it is easy to re-retrieve it:

.. runrecord:: _examples/DL-101-105-113
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow

   $ datalad get Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3

Reobtained!

This was only a quick digression into :dlcmd:`drop`. The main principles
of this command will become clear after chapter
:ref:`chapter_gitannex`, and its precise use is shown in the paragraph on
:ref:`removing file contents <remove>`.
At this point, however, you already know that datasets allow you to
:dlcmd:`drop` file contents flexibly. If you want to, you could have more
podcasts (or other data) on your computer than you have disk space available
by using DataLad datasets -- and that really is a cool feature to have.

Dataset archeology
^^^^^^^^^^^^^^^^^^

You have now experienced how easy it is to (re)obtain shared data with DataLad.
But beyond sharing only the *data* in the dataset, when sharing or installing
a DataLad dataset, all copies also include the dataset's *history*.

.. index::
   pair: log; Git command
   pair: show history (reverse); with Git

For example, we can find out who created the dataset in the first place
(the output shows an excerpt of ``git log --reverse``, which displays the
history from first to most recent commit):

.. runrecord:: _examples/DL-101-105-114
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow
   :emphasize-lines: 3
   :lines: 1-13
   :cast: 01_dataset_basics
   :notes: On Dataset nesting: You have seen the history of DataLad-101. But the subdataset has a standalone history as well! We can find out who created it!


   $ git log --reverse
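
To pull out only the very first commit, the reversed log can be piped through ``head``.
The following sketch demonstrates this on a throwaway repository with two empty commits.
The author name, email address, and commit messages are made up for illustration.

.. code-block:: bash

   repo=$(mktemp -d)
   git init --quiet "$repo"

   # helper that records an empty commit with a fixed, made-up identity
   commit() {
       git -C "$repo" -c user.name="Jane Doe" -c user.email="jane@example.com" \
           commit --quiet --allow-empty -m "$1"
   }
   commit "first commit"
   commit "second commit"

   # --reverse lists the oldest commit first; head keeps only that line.
   # Limiting with "git log -n 1 --reverse" would show the newest commit
   # instead, because the limit is applied before the order is reversed.
   first_commit=$(git -C "$repo" log --reverse --format='%an: %s' | head -n 1)
   echo "$first_commit"

In the ``longnow`` dataset, the same pipeline without the ``-C "$repo"`` part should
print the author and message of the commit that started the dataset.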

But that's not all. The seminar series is ongoing, and more recordings can get added
to the original repository shared on GitHub.
Because an installed dataset knows the dataset it was installed from,
your local dataset clone can be updated from its origin, and thus get the new recordings,
should there be some. Later in this handbook, we will see examples of this.

.. index::
   pair: update heredoc; in a terminal
   pair: save dataset modification; with DataLad

Now you can not only create datasets and work with them locally, you can also consume
existing datasets by installing them. Because that's cool, and because you will use this
command frequently, make a note of it into your ``notes.txt``, and :dlcmd:`save` the
modification.

.. runrecord:: _examples/DL-101-105-115
   :language: console
   :workdir: dl-101/DataLad-101/recordings/longnow
   :cast: 01_dataset_basics
   :notes: We can make a note about this:

   $ # in the root of DataLad-101:
   $ cd ../../
   $ cat << EOT >> notes.txt
   The command 'datalad clone URL/PATH [PATH]' installs a dataset from
   e.g., a URL or a path. If you install a dataset into an existing
   dataset (as a subdataset), remember to specify the root of the
   superdataset with the '-d' option.

   EOT
   $ datalad save -m "Add note on datalad clone"
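
The ``cat << EOT >> notes.txt`` construct used above is a shell *heredoc*: every line up
to the closing ``EOT`` is passed to ``cat``, and ``>>`` appends the result to the file
instead of overwriting it. A minimal sketch on a temporary file, with made-up contents:

.. code-block:: bash

   notes=$(mktemp)
   printf 'existing note\n' > "$notes"

   # ">>" appends to the file; a single ">" would replace its contents
   cat << EOT >> "$notes"
   a second note
   spanning two lines

   EOT

   cat "$notes"

The existing first line stays in place, and the new note plus its trailing blank line
follow it.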

.. index::
   pair: placeholder files; on Mac
.. importantnote:: Empty files can be confusing

   Listing files directly after the installation of a dataset will
   work if done in a terminal with ``ls``.
   However, certain file managers (such as macOS's Finder [#f3]_) may fail to
   display files that are not yet present locally (i.e., before a
   :dlcmd:`get` was run). Therefore, be mindful when exploring
   a dataset hierarchy with a file manager -- it might not show you
   the available but not yet retrieved files.
   Consider browsing datasets with the :term:`DataLad Gooey` to be on the safe side.
   Section :ref:`symlink` explains why this happens.


.. only:: adminmode

   Add a tag at the section end.

     .. runrecord:: _examples/DL-101-105-116
        :language: console
        :workdir: dl-101/DataLad-101

        $ git branch sct_install_datasets


.. rubric:: Footnotes

.. [#f1] Additionally, a source can also be a pointer to an open-data collection,
         for example :term:`the DataLad superdataset ///` -- more on what this is and how to
         use it later, though.

.. [#f2] The longnow podcasts are lectures and conversations on long-term thinking produced by
         the LongNow foundation and we can wholeheartedly recommend them for their worldly
         wisdoms and compelling, thoughtful ideas. Subscribe to the podcasts at https://longnow.org/seminars/podcast.
         Support the foundation by becoming a member: https://longnow.org/join.

.. [#f3] You can also upgrade your file manager to display file types in
         DataLad datasets (e.g., with the
         `git-annex-turtle extension <https://github.com/andrewringler/git-annex-turtle>`_
         for Finder).