datalad-handbook/docs/basics/101-101-create.rst
Michael Hanke 88be5943f8
Replace hard-coded /home/me with env var setting
This removes the need to have admin-level access to a machine for
running the handbook code. This should make testing in a much broader
range of environments possible (think HPC accounts, etc).

The contributing guide and appveyor setup are adjusted.
2025-06-11 17:02:55 +02:00

210 lines
8.2 KiB
ReStructuredText

.. index::
pair: create; DataLad command
pair: create dataset; with DataLad
.. _createDS:
Create a dataset
----------------
We are about to start the educational course ``DataLad-101``.
In order to follow along and organize course content, let us create
a directory on our computer to collate the materials, assignments, and
notes in.
Since this is ``DataLad-101``, let's do it as a :term:`DataLad dataset`.
You might associate the term "dataset" with a large spreadsheet containing
variables and data.
But for DataLad, a dataset is the core data type:
As noted in :ref:`philo`, a dataset is a collection of *files*
in folders, and a file is the smallest unit any dataset can contain.
Although this is a very simple concept, datasets come with many
useful features.
Because experiencing is more insightful than just reading, we will explore the
concepts of DataLad datasets together by creating one.
Find a nice place on your computer's file system to put a dataset for ``DataLad-101``,
and create a fresh, empty dataset with the :dlcmd:`create` command.
Note the command structure of :dlcmd:`create` (optional bits are enclosed in ``[ ]``):
.. code-block::
datalad create [--description "..."] [-c <config options>] PATH
.. _createdescription:
.. index::
pair: set description for dataset location; with DataLad
.. find-out-more:: What is the description option of 'datalad create'?
The optional ``--description`` flag allows you to provide a short description of
the *location* of your dataset, for example with
.. code-block:: console
$ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101
If you want, use the above command instead to provide a description. Its use will not be immediately clear now, but the chapter
:ref:`chapter_collaboration` shows where this description
ends up and how it may be useful.
Let's start:
.. index::
pair: create dataset; with DataLad
.. runrecord:: _examples/DL-101-101-101
:language: console
:workdir: dl-101
:env:
DATALAD_SEED=0
:realcommand: ( mkdir DataLad-101 && cd DataLad-101 && git init && git config annex.uuid 46b169aa-bb91-42d6-be06-355d957fb4f7 ) &> /dev/null && datalad create --force -c text2git DataLad-101
:cast: 01_dataset_basics
:notes: Datasets are datalads core data type. We will explore the concepts of datasets by creating one with datalad create. optional configuration template and a description
$ datalad create -c text2git DataLad-101
This will create a dataset called ``DataLad-101`` in the directory you are currently
in. For now, disregard ``-c text2git``. It applies a configuration template, but there
will be other parts of this book to explain this in detail.
Once created, a DataLad dataset looks like any other directory on your file system.
Currently, it seems empty.
.. runrecord:: _examples/DL-101-101-102
:language: console
:workdir: dl-101
:cast: 01_dataset_basics
:notes: DataLad informs about what it is doing during a command. At the end is a summary, in this case it is ok. What is inside of a newly created dataset? We list contents with ls.
$ cd DataLad-101
$ ls # ls does not show any output, because the dataset is empty.
However, all files and directories you store within the DataLad dataset
can be tracked (should you want them to be tracked).
*Tracking* in this context means that edits done to a file are automatically
associated with information about the change, the author of the edit,
and the time of this change. This is already important information on its own
-- the :term:`provenance` captured with this can, for example, be used to learn
about a file's lineage, and can establish trust in it.
But what is especially helpful is that previous states of files or directories
can be restored. Remember the last time you accidentally deleted content
in a file, but only realized *after* you saved it? With DataLad, no
mistakes are forever. We will see many examples of this later in the book,
and such information is stored in what we will refer
to as the *history* of a dataset.
.. index::
pair: log; Git command
pair: exit pager; in a terminal
pair: show history; with Git
This history is almost as small as it can be at the current state, but let's take
a look at it. For looking at the history, the code examples will use :gitcmd:`log`,
a built-in :term:`Git` command [#f1]_ that works right in your terminal. Your log
*might* be opened in a terminal :term:`pager`
that lets you scroll up and down with your arrow keys, but not enter any more commands.
If this happens, you can get out of ``git log`` by pressing ``q``.
.. runrecord:: _examples/DL-101-101-103
:language: console
:workdir: dl-101/DataLad-101
:emphasize-lines: 3-4, 6, 9-10, 12
:cast: 01_dataset_basics
:notes: GIT LOG, SHASUM, MESSAGE: A dataset is version controlled. This means, edits done to a file are associated with information about the change, the author, and the time + ability to restore previous states of the dataset. Let's take a look into the history, even if it is small atm
$ git log
We can see two :term:`commit`\s in the history of the repository.
Each of them is identified by a unique 40 character sequence, called a
:term:`shasum`.
.. index::
pair: log; Git command
pair: corresponding branch; in adjusted mode
pair: show history; on Windows
.. windows-wit:: Your Git log may be more extensive - use 'git log main' instead!
.. include:: topic/adjustedmode-log.rst
Highlighted in this output is information about the author and about
the time, as well as a :term:`commit message` that summarizes the
performed action concisely. In this case, both commit messages were written by
DataLad itself. The most recent change is on the top. The first commit
written to the history therefore states that a new dataset was created,
and the second commit is related to the ``-c text2git`` option (which
uses a configuration template to instruct DataLad to store text files
in Git, but more on this later).
While these commits were produced and described by DataLad,
in most other cases, you will have to create the commit and
an informative commit message yourself.
.. index::
pair: create dataset; DataLad concept
.. gitusernote:: Create internals
:dlcmd:`create` uses :gitcmd:`init` and :gitannexcmd:`init`. Therefore,
the DataLad dataset is a Git repository.
Large file content in the
dataset is tracked with git-annex. An ``ls -a``
reveals that Git has secretly done its work:
.. runrecord:: _examples/DL-101-101-104
:language: console
:workdir: dl-101/DataLad-101
:emphasize-lines: 4-6
:cast: 01_dataset_basics
:notes: DataLad, git-annex, and git create hidden files and directories in your dataset. Make sure to not delete them!
$ ls -a # show also hidden files
**For non-Git-Users: these hidden** *dot-directories* and *dot-files* **are necessary for all Git magic**
**to work. Please do not tamper with them, and, importantly,** *do not delete them.*
Congratulations, you just created your first DataLad dataset!
Let us now put some content inside.
.. only:: adminmode
Add a tag at the section end.
.. runrecord:: _examples/DL-101-101-105
:language: console
:workdir: dl-101/DataLad-101
$ git branch sct_create_a_dataset
.. rubric:: Footnotes
.. [#f1] A tool we can recommend as an alternative to :gitcmd:`log` is :term:`tig`.
Once installed, exchange any ``git log`` command you see here with the single word ``tig``.
.. ifconfig:: internal
create a script to help make push targets
.. runrecord:: _examples/DL-101-101-106
:language: console
:workdir: dl-101/DataLad-101
$ cat << EOT >| $HOME/makepushtarget.py
#!/usr/bin/python3
from datalad.core.distributed.tests.test_push import mk_push_target
from datalad.api import Dataset as ds
import sys
ds_path = sys.argv[1]
name = sys.argv[2]
path = sys.argv[3]
annex = sys.argv[4]
bare = sys.argv[5]
if __name__ == '__main__':
mk_push_target(ds=ds(ds_path),
name=name,
path=path,
annex=annex,
bare=bare)
EOT