OHBM Brainhack TrainTrack: DataLad
----------------------------------

This code belongs to the 2020 OHBM Brainhack TrainTrack session on DataLad.
Copy-paste the snippets into your terminal to follow along.

Introduction & set-up
^^^^^^^^^^^^^^^^^^^^^

DataLad is a command line tool with a Python API. Whenever I use it, I
therefore operate it in a terminal using the command line, or I use it in
scripts such as shell scripts, Python scripts, Jupyter notebooks, and so
forth.
In the command line, this always starts with the general ``datalad``
command::

   datalad

For example, I can type ``datalad --help`` to find out more about
the available commands.

To use its Python interface, I import ``datalad.api`` as ``dl``::

   ipython
   import datalad.api as dl
   exit
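
The commands used throughout this session also have Python equivalents. As a
purely illustrative sketch (not part of the code-along), the dataset creation
and saving steps further below could be done from Python like this::

   ipython
   import datalad.api as dl
   dl.create(path='DataLad-101', cfg_proc='text2git')
   dl.save(dataset='DataLad-101', message='add books')
   exit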

You can find more details about how to install DataLad and its dependencies
on all operating systems in the DataLad handbook, in the section
:ref:`install`. It also details how to install DataLad on shared machines
that you don't have administrative privileges (sudo rights) on, such as high
performance compute clusters.
If you already have DataLad installed,
**make sure that it is a recent version, at least 0.12 or higher**::

   datalad --version

The very first thing to do, if you haven't done so yet, is to configure your
Git identity. Don't worry if you have never used Git. The identity you are
configuring consists of your name and email address, so that the changes you
make to a project can be associated with you as their author::

   git config --global --add user.name "Adina Wagner"
   git config --global --add user.email "adina.wagner@t-online.de"
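
If you want to double-check the identity configuration, Git can print the
values back to you::

   git config --get user.name
   git config --get user.email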

Creating a dataset is done with the ``datalad create`` command. This command
only needs a name; it will then create a new directory under this name and
instruct DataLad to manage it. Here, the command also gets an additional
option, ``-c text2git``. With the ``-c`` option, datasets can be configured
in a certain way at the time of creation. You can find out about the details
of the text2git configuration in the DataLad handbook in section
:ref:`procedures`; in general, it is a very useful standard configuration
for datasets::

   datalad create -c text2git DataLad-101

Right after dataset creation, there is a new directory on the computer
called ``DataLad-101``::

   cd DataLad-101
   ls # ls does not show any output, because the dataset is empty

Datasets have exciting features: they can record everything that is done
inside of them, they version control all content given to DataLad
regardless of its size, and they have a complete history that you can
interact with. This history is already present, although it is very short
at this point in time. Let's check it out nevertheless.

This history exists thanks to Git. Therefore, you can access the history of
a dataset with any tool that shows you Git history. We'll stay basic and
just use Git's built-in ``git log`` command, but you could also use tools
with graphical user interfaces, for example :term:`tig`::

   git log

Version control workflows
^^^^^^^^^^^^^^^^^^^^^^^^^

I'll start by creating a books directory with the mkdir command, and then I
will download two books from the internet. Here, I'm using the command line
tool wget in order to do everything from the command line. But you can also
just download the books manually and save them into the dataset with a file
manager if you are more comfortable doing it this way. Remember, a dataset
is just a directory on your computer::

   mkdir books
   cd books && wget -nv https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download -O TLCL.pdf && wget -nv https://edisciplinas.usp.br/pluginfile.php/3252353/mod_resource/content/1/b_Swaroop_Byte_of_python.pdf -O byte-of-python.pdf && cd ../

The tree command can visualize the directory hierarchy::

   tree

Use the ``datalad status`` command to find out what happened in the dataset.
This command is very helpful and reports on the current state of your
dataset. Any content that is new or changed will be highlighted. If nothing
has changed, ``datalad status`` will report what is called a clean dataset
state. In general, it is very useful to always have a clean dataset state::

   datalad status

Any content that we want DataLad to manage needs to be explicitly given to
DataLad; it is not enough to simply put it inside of the dataset. To give
new or changed content to DataLad, we need to save it using
``datalad save``. This is the first time that we need to specify a commit
message, and this is done with the ``-m`` option of the command::

   datalad save -m "add books on Python and Unix to read later"

With ``git log -p -n 1`` you can take a look at the most recent commit in
the history, including the changes it introduced::

   git log -p -n 1

``datalad save`` saved all untracked contents to the dataset. Sometimes this
is inconvenient. One great advantage of a dataset's history is that it
allows you to revert changes you are not happy with, but this is only easily
possible in the units of single commits. So if one save commits several
unrelated files or changes, they are hard to disentangle if you ever want to
revert some of those changes. But if you, for example, provide a path to the
file you want to save, you can specify more precisely what will be saved
together::

   cd books && wget -nv https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf && cd ../
   datalad status

Attach a path to the next ``datalad save`` command::

   datalad save -m "add reference book about git" books/progit.pdf

Let's take a look at files that are frequently modified, such as code or
text. To try this, I will create a file and modify it. I do this with a
`here doc <https://en.wikipedia.org/wiki/Here_document>`_, but you can also
write the note with an editor of your choice. If you execute this code
snippet, make sure you copy-paste everything, starting with ``cat`` and
ending with the second ``EOT``::

   cat << EOT > notes.txt
   One can create a new dataset with 'datalad create PATH'.
   The dataset is created empty.

   EOT

``datalad status`` will, as expected, say that there is a new untracked file
in the dataset::

   datalad status

We can save it with a ``datalad save`` command and a helpful commit message.
As it's the only change in the dataset, there is no need to provide a path::

   datalad save -m "Add notes on datalad create"

Let's now add another note to modify this file::

   cat << EOT >> notes.txt
   The command "datalad save [-m] PATH" saves the file
   (modifications) to history. Note to self:
   Always use informative, concise commit messages.

   EOT

``datalad status`` no longer reports the file as untracked; because it now
differs from the state it was last saved in, it is reported as modified::

   datalad status

Let's save this::

   datalad save -m "add note on datalad save"

If you take a look at the history of this file with ``git log``, it neatly
summarizes all of the changes that have been made::

   git log -p -n 2

Dataset consumption and nesting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First, create a new subdirectory to keep things organized::

   mkdir recordings

Afterwards, I'll install the dataset I am interested in, either from a path
or a URL. The dataset I want to install lives on GitHub, so in order to get
it, I will provide its URL to the ``datalad clone`` command. I'm also
attaching a path to this call, specifying where I want to have it installed.
Importantly, I am installing this dataset as a subdataset of DataLad-101;
in other words, I will nest the two datasets inside of each other. This is
done with the ``--dataset`` option::

   datalad clone --dataset . \
      https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow

There are new directories in my DataLad-101 dataset, and in these new
directories, there are hundreds of mp3 files::

   tree -d # we limit the output to directories
   cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
   ls

Here is the crucial and incredibly handy feature of DataLad datasets: at
this point, after cloning, the dataset has small files, for example the
README, but larger files in it don't have any file content yet. The clone
only retrieved what we, in a simplified way, call file availability
metadata, and shows that as the file hierarchy in the dataset. So while I
can read the file names and find out what the dataset contains, I don't
have the file contents yet. If I tried to play one of the recordings with
the vlc player, this would fail::

   vlc Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3
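
You can see why: in a simplified view, an annexed file in a freshly cloned
dataset is a symlink whose target only comes into existence once the file
content is retrieved. A quick, illustrative way to check this is::

   ls -l Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3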

This is a curious behavior, but it has many advantages. One is speed,
another is small disk usage. Here is the total size of this dataset::

   cd ../ # in longnow/
   du -sh # Unix command to show the size of contents

It's tiny! But we can also find out how large the dataset would be if we had
all of its contents, with ``datalad status`` and the ``--annex`` flag. In
total, there are more than 15GB of podcasts you now have access to::

   datalad status --annex

You can get individual files, groups of files, directories, or datasets
with the ``datalad get`` command. This command retrieves the content for
you::

   datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3

Content that is already present is not re-retrieved::

   datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
      Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
      Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3

If you don't need the data locally anymore, you can drop the content from
your dataset to save disk space::

   datalad drop Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3

Afterwards, as long as DataLad knows where a file came from, its content
can be retrieved again::

   datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3

Dataset nesting
^^^^^^^^^^^^^^^

Let's take a look into the history of the longnow subdataset:
we can see that it has been preserved completely. This means that the data
we retrieved kept all of its provenance::

   git log --reverse

How does this look in the top-level dataset? If we query DataLad-101's
history, there will be no commit about mp3 files or any of the commits we
have seen in the subdataset. Instead, we can see that the superdataset
recorded the ``recordings/longnow`` dataset as a subdataset. This means
that it recorded where this dataset came from and what version it is in::

   cd ../../
   git log -p -n 1

The subproject commit registered the most recent commit of the subdataset,
and thus the subdataset version::

   cd recordings/longnow
   git log --oneline
   cd ../../
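
Because subdatasets are registered as Git submodules, plain Git can report
the recorded subdataset commit as well, for example::

   git submodule status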

More on data versioning, nesting, and a glimpse into a reproducible paper
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We'll clone a repository for a paper that shares the manuscript, code, and
data::

   cd ../
   datalad clone git@github.com:psychoinformatics-de/paper-remodnav.git

The top-level dataset has many subdatasets. One of them, ``remodnav``, is a
dataset that contains the source code for a Python package called remodnav,
used in eye tracking analyses::

   cd paper-remodnav
   datalad subdatasets

After cloning a dataset, its subdatasets will be known, but just as content
is not yet retrieved for files in datasets, subdatasets of datasets are not
yet installed. If I navigate into an uninstalled subdataset, it will appear
like an empty directory::

   cd remodnav
   ls

In order to install a subdataset, I use ``datalad get``::

   datalad get --recursive --recursion-limit 2 -n .
   ls

This command not only retrieves file contents, it also installs
subdatasets. So if you want to be really lazy, just run
``datalad get --recursive -n`` in the root of a dataset to install all
subdatasets that are available. The ``-n`` option prevents get from
downloading any data, so that subdatasets are installed, but no file
content is fetched. Here, the depth of recursion is limited for two
reasons: for one, it would take a while to install all subdatasets;
moreover, the raw eye tracking dataset contains subject IDs that should not
be shared, and this subdataset is therefore not accessible. If you try to
install all subdatasets, the source eye tracking data will throw an error,
because it is not made publicly available.
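
Since all available subdatasets are already installed at this point, this
is purely illustrative: a single subdataset can also be installed by giving
its path to a ``-n`` get, for example::

   datalad get -n tests/data/anderson_etal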

Afterwards, you can see that the remodnav subdataset also contains further
subdatasets. In this case, these subdatasets contain data that is used for
testing and validating software performance::

   datalad subdatasets

One of the validation data subdatasets came from another lab that shared
their data. After I was almost finished with my paper, I found another
paper that reported a mistake in this data. The mistake was still present
in the data I was using, though. So by inspecting the history of this
dataset, you can see that at one point I contributed a fix that changed the
data::

   cd remodnav/tests/data/anderson_etal
   git log -n 3

But because I can link subdatasets in a precise version, I can consciously
decide, and openly record, which version of the data I am using. I can even
test how much my results change by resetting the subdataset to an earlier
state, or updating the dataset to a more recent version.
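
As a sketch (with a hypothetical commit placeholder, and not part of this
code-along), pinning the subdataset to a different version boils down to
checking out that version inside the subdataset, and saving the resulting
state in its superdataset::

   # inside the subdataset (sketch only):
   git checkout <earlier-commit>
   # afterwards, from the root of the remodnav superdataset:
   datalad save -m "record new subdataset state" tests/data/anderson_etal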

Reproducible analyses
^^^^^^^^^^^^^^^^^^^^^

Not only can I version control data and consume data with DataLad, I can
also create datasets with data analyses in a way that my future self and
others can easily and automatically recompute what was done::

   cd ../../../../ # get out of the paper repository

First, create a new dataset, in this case with the yoda configuration::

   datalad create -c yoda myanalysis

This sets up a helpful structure for my dataset, with a code directory and
some README files, and applies helpful configurations::

   cd myanalysis
   tree

Read up more about the YODA principles and the yoda configuration in the
section :ref:`yoda`.

Next, install input data as a subdataset. For this, I created a dataset
with the Iris data and published it on GitHub. Here, we're installing it
into a directory ``input``::

   datalad clone -d . git@github.com:datalad-handbook/iris_data.git input/

The last thing needed is code to run on the data and produce results. For
this, here is a k-nearest-neighbours classification analysis script written
in Python. You can find this analysis in more detail in the section
:ref:`yoda_project`::

   cat << EOT > code/script.py

   import pandas as pd
   import seaborn as sns
   import datalad.api as dl
   from sklearn import model_selection
   from sklearn.neighbors import KNeighborsClassifier
   from sklearn.metrics import classification_report

   data = "input/iris.csv"

   # make sure that the data are obtained (get will also install linked sub-ds!):
   dl.get(data)

   # prepare the data as a pandas dataframe
   df = pd.read_csv(data)
   attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
   df.columns = attributes

   # create a pairplot to plot pairwise relationships in the dataset
   plot = sns.pairplot(df, hue='class', palette='muted')
   plot.savefig('pairwise_relationships.png')

   # perform a K-nearest-neighbours classification with scikit-learn
   # Step 1: split data in test and training dataset (20:80)
   array = df.values
   X = array[:, 0:4]
   Y = array[:, 4]
   test_size = 0.20
   seed = 7
   X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y,
                                                                       test_size=test_size,
                                                                       random_state=seed)
   # Step 2: Fit the model and make predictions on the test dataset
   knn = KNeighborsClassifier()
   knn.fit(X_train, Y_train)
   predictions = knn.predict(X_test)

   # Step 3: Save the classification report
   report = classification_report(Y_test, predictions, output_dict=True)
   pd.DataFrame(report).transpose().to_csv('prediction_report.csv')

   EOT

So far, the script is untracked::

   datalad status

Let's save it with a ``datalad save`` command, and also attach an
identifier with the ``--version-tag`` flag::

   datalad save -m "add script for kNN classification and plotting" --version-tag ready4analysis code/script.py

The challenge DataLad helps me to accomplish is running this script in a
way that links the script to the results it produces and the data it was
computed from. I can do this with the ``datalad run`` command. In
principle, it is simple. You start with a clean dataset::

   datalad status

Then, give the command you would execute to ``datalad run``, in this case
``python3 code/script.py``. DataLad will take the command, run it, and save
all of the changes in the dataset that the command leads to, under the
commit message specified with the ``-m`` option. Thus, it associates the
script with the results. But it can be even more helpful: here, we also
specify the input data the command needs, and DataLad will get the data
beforehand. And we also specify the outputs of the command. To fully
understand what this does, please read the chapters :ref:`chapter_run` and
:ref:`chapter_gitannex`, but specifying the outputs will later allow me to
rerun the command and update outdated results::

   datalad run -m "analyze iris data with classification analysis" \
      --input "input/iris.csv" \
      --output "prediction_report.csv" \
      --output "pairwise_relationships.png" \
      "python3 code/script.py"

DataLad creates a commit in my history. This commit has my commit message
as a human-readable summary of what was done, it contains the produced
output, and it has a machine-readable record with information on the input
data, the results, and the command that was run to create this result::

   git log -n 1

This machine-readable record is particularly helpful, because I can now
instruct DataLad to rerun this command. I don't have to memorize what I had
done, and people I share my dataset with don't need to ask me how a result
was produced, but can simply let DataLad tell them.

This is done with the ``datalad rerun`` command. For this demonstration, I
have prepared this analysis dataset and published it to GitHub at
`github.com/adswa/myanalysis <https://github.com/adswa/myanalysis>`_::

   cd ../
   git clone git@github.com:adswa/myanalysis.git analysis_clone
   cd analysis_clone

I can clone this repository and give, for example, the shasum of the run
commit to the ``datalad rerun`` command. DataLad will read the
machine-readable record of what was done and recompute the exact same
thing::

   datalad rerun 71cb8c5
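
To find the shasum of a run commit in the first place, the regular Git
tools do the job, for example::

   git log --oneline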

This allows others to very easily rerun my computations, but it also spares
me the need to remember how I executed my script, and I can ask where
results came from::

   git log pairwise_relationships.png

Computational reproducibility
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you don't have the required Python packages available, running the
script and computing the results will fail. In order to be computationally
reproducible, I need to attach the software that is necessary for the
computation to this execution record::

   cd ../myanalysis

The way I can do this is with a DataLad extension called datalad-container.
You can install this extension with pip by running
``pip install datalad-container``.
This extension allows attaching software containers such as Singularity
images to my dataset, and executing my commands inside of these containers.
Thus, I can share data, code, code execution, and software.

Here is how this works: first, I attach a software container to my dataset
using ``datalad containers-add``, with a name for the container (here I
call it ``software``) and a URL or path where to find this container; here
it is Singularity Hub. This records the software in the dataset::

   datalad containers-add software --url shub://adswa/resources:2

Note: You need to have `singularity <https://docs.sylabs.io/guides/3.5/user-guide>`_
installed to run this!

Afterwards, rerun the analysis in the software container with the
``datalad containers-run`` command. This command works just like the ``run``
command before; I only need to specify the container name in addition. If
you were to rerun such an analysis, DataLad would not only retrieve the
input data, but also the software container::

   datalad containers-run -m "rerun analysis in container" \
      --container-name software \
      --input "input/iris.csv" \
      --output "prediction_report.csv" \
      --output "pairwise_relationships.png" \
      "python3 code/script.py"

Read more about this in the section :ref:`containersrun`.

**Done! Thanks for coding along!**