This removes the need to have admin-level access to a machine for running the handbook code. This should make testing in a much broader range of environments possible (think HPC accounts, etc). The contributing guide and appveyor setup are adjusted.
210 lines
8.2 KiB
ReStructuredText
210 lines
8.2 KiB
ReStructuredText
.. index::
|
|
pair: create; DataLad command
|
|
pair: create dataset; with DataLad
|
|
.. _createDS:
|
|
|
|
Create a dataset
|
|
----------------
|
|
|
|
We are about to start the educational course ``DataLad-101``.
|
|
In order to follow along and organize course content, let us create
|
|
a directory on our computer to collate the materials, assignments, and
|
|
notes in.
|
|
|
|
Since this is ``DataLad-101``, let's do it as a :term:`DataLad dataset`.
|
|
You might associate the term "dataset" with a large spreadsheet containing
|
|
variables and data.
|
|
But for DataLad, a dataset is the core data type:
|
|
As noted in :ref:`philo`, a dataset is a collection of *files*
|
|
in folders, and a file is the smallest unit any dataset can contain.
|
|
Although this is a very simple concept, datasets come with many
|
|
useful features.
|
|
Because experiencing is more insightful than just reading, we will explore the
|
|
concepts of DataLad datasets together by creating one.
|
|
|
|
Find a nice place on your computer's file system to put a dataset for ``DataLad-101``,
|
|
and create a fresh, empty dataset with the :dlcmd:`create` command.
|
|
|
|
Note the command structure of :dlcmd:`create` (optional bits are enclosed in ``[ ]``):
|
|
|
|
.. code-block::
|
|
|
|
datalad create [--description "..."] [-c <config options>] PATH
|
|
|
|
.. _createdescription:
|
|
.. index::
|
|
pair: set description for dataset location; with DataLad
|
|
.. find-out-more:: What is the description option of 'datalad create'?
|
|
|
|
The optional ``--description`` flag allows you to provide a short description of
|
|
the *location* of your dataset, for example with
|
|
|
|
.. code-block:: console
|
|
|
|
$ datalad create --description "course on DataLad-101 on my private laptop" -c text2git DataLad-101
|
|
|
|
If you want, use the above command instead to provide a description. Its use will not be immediately clear now, but the chapter
|
|
:ref:`chapter_collaboration` shows where this description
|
|
ends up and how it may be useful.
|
|
|
|
Let's start:
|
|
|
|
.. index::
|
|
pair: create dataset; with DataLad
|
|
.. runrecord:: _examples/DL-101-101-101
|
|
:language: console
|
|
:workdir: dl-101
|
|
:env:
|
|
DATALAD_SEED=0
|
|
:realcommand: ( mkdir DataLad-101 && cd DataLad-101 && git init && git config annex.uuid 46b169aa-bb91-42d6-be06-355d957fb4f7 ) &> /dev/null && datalad create --force -c text2git DataLad-101
|
|
:cast: 01_dataset_basics
|
|
:notes: Datasets are datalads core data type. We will explore the concepts of datasets by creating one with datalad create. optional configuration template and a description
|
|
|
|
$ datalad create -c text2git DataLad-101
|
|
|
|
This will create a dataset called ``DataLad-101`` in the directory you are currently
|
|
in. For now, disregard ``-c text2git``. It applies a configuration template, but there
|
|
will be other parts of this book to explain this in detail.
|
|
|
|
Once created, a DataLad dataset looks like any other directory on your file system.
|
|
Currently, it seems empty.
|
|
|
|
.. runrecord:: _examples/DL-101-101-102
|
|
:language: console
|
|
:workdir: dl-101
|
|
:cast: 01_dataset_basics
|
|
:notes: DataLad informs about what it is doing during a command. At the end is a summary, in this case it is ok. What is inside of a newly created dataset? We list contents with ls.
|
|
|
|
$ cd DataLad-101
|
|
$ ls # ls does not show any output, because the dataset is empty.
|
|
|
|
However, all files and directories you store within the DataLad dataset
|
|
can be tracked (should you want them to be tracked).
|
|
*Tracking* in this context means that edits done to a file are automatically
|
|
associated with information about the change, the author of the edit,
|
|
and the time of this change. This is already important information on its own
|
|
-- the :term:`provenance` captured with this can, for example, be used to learn
|
|
about a file's lineage, and can establish trust in it.
|
|
But what is especially helpful is that previous states of files or directories
|
|
can be restored. Remember the last time you accidentally deleted content
|
|
in a file, but only realized *after* you saved it? With DataLad, no
|
|
mistakes are forever. We will see many examples of this later in the book,
|
|
and such information is stored in what we will refer
|
|
to as the *history* of a dataset.
|
|
|
|
.. index::
|
|
pair: log; Git command
|
|
pair: exit pager; in a terminal
|
|
pair: show history; with Git
|
|
|
|
This history is almost as small as it can be at the current state, but let's take
|
|
a look at it. For looking at the history, the code examples will use :gitcmd:`log`,
|
|
a built-in :term:`Git` command [#f1]_ that works right in your terminal. Your log
|
|
*might* be opened in a terminal :term:`pager`
|
|
that lets you scroll up and down with your arrow keys, but not enter any more commands.
|
|
If this happens, you can get out of ``git log`` by pressing ``q``.
|
|
|
|
.. runrecord:: _examples/DL-101-101-103
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:emphasize-lines: 3-4, 6, 9-10, 12
|
|
:cast: 01_dataset_basics
|
|
:notes: GIT LOG, SHASUM, MESSAGE: A dataset is version controlled. This means, edits done to a file are associated with information about the change, the author, and the time + ability to restore previous states of the dataset. Let's take a look into the history, even if it is small atm
|
|
|
|
$ git log
|
|
|
|
We can see two :term:`commit`\s in the history of the repository.
|
|
Each of them is identified by a unique 40 character sequence, called a
|
|
:term:`shasum`.
|
|
|
|
.. index::
|
|
pair: log; Git command
|
|
pair: corresponding branch; in adjusted mode
|
|
pair: show history; on Windows
|
|
.. windows-wit:: Your Git log may be more extensive - use 'git log main' instead!
|
|
|
|
.. include:: topic/adjustedmode-log.rst
|
|
|
|
Highlighted in this output is information about the author and about
|
|
the time, as well as a :term:`commit message` that summarizes the
|
|
performed action concisely. In this case, both commit messages were written by
|
|
DataLad itself. The most recent change is on the top. The first commit
|
|
written to the history therefore states that a new dataset was created,
|
|
and the second commit is related to the ``-c text2git`` option (which
|
|
uses a configuration template to instruct DataLad to store text files
|
|
in Git, but more on this later).
|
|
While these commits were produced and described by DataLad,
|
|
in most other cases, you will have to create the commit and
|
|
an informative commit message yourself.
|
|
|
|
.. index::
|
|
pair: create dataset; DataLad concept
|
|
.. gitusernote:: Create internals
|
|
|
|
:dlcmd:`create` uses :gitcmd:`init` and :gitannexcmd:`init`. Therefore,
|
|
the DataLad dataset is a Git repository.
|
|
Large file content in the
|
|
dataset is tracked with git-annex. An ``ls -a``
|
|
reveals that Git has secretly done its work:
|
|
|
|
.. runrecord:: _examples/DL-101-101-104
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
:emphasize-lines: 4-6
|
|
:cast: 01_dataset_basics
|
|
:notes: DataLad, git-annex, and git create hidden files and directories in your dataset. Make sure to not delete them!
|
|
|
|
$ ls -a # show also hidden files
|
|
|
|
**For non-Git-Users: these hidden** *dot-directories* and *dot-files* **are necessary for all Git magic**
|
|
**to work. Please do not tamper with them, and, importantly,** *do not delete them.*
|
|
|
|
Congratulations, you just created your first DataLad dataset!
|
|
Let us now put some content inside.
|
|
|
|
.. only:: adminmode
|
|
|
|
Add a tag at the section end.
|
|
|
|
.. runrecord:: _examples/DL-101-101-105
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
|
|
$ git branch sct_create_a_dataset
|
|
|
|
.. rubric:: Footnotes
|
|
|
|
.. [#f1] A tool we can recommend as an alternative to :gitcmd:`log` is :term:`tig`.
|
|
Once installed, exchange any ``git log`` command you see here with the single word ``tig``.
|
|
|
|
|
|
.. ifconfig:: internal
|
|
|
|
create a script to help make push targets
|
|
|
|
.. runrecord:: _examples/DL-101-101-106
|
|
:language: console
|
|
:workdir: dl-101/DataLad-101
|
|
|
|
$ cat << EOT >| $HOME/makepushtarget.py
|
|
|
|
#!/usr/bin/python3
|
|
|
|
from datalad.core.distributed.tests.test_push import mk_push_target
|
|
from datalad.api import Dataset as ds
|
|
import sys
|
|
|
|
ds_path = sys.argv[1]
|
|
name = sys.argv[2]
|
|
path = sys.argv[3]
|
|
annex = sys.argv[4]
|
|
bare = sys.argv[5]
|
|
|
|
if __name__ == '__main__':
|
|
mk_push_target(ds=ds(ds_path),
|
|
name=name,
|
|
path=path,
|
|
annex=annex,
|
|
bare=bare)
|
|
|
|
EOT
|