<!doctype html>
|
|
<html>
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
|
|
|
|
<!-- Edit me start! -->
|
|
<title>This is where your title goes</title>
|
|
<meta name="description" content=" This is where you put a short description ">
|
|
<meta name="author" content=" Your Name ">
|
|
<!-- Edit me end! -->
|
|
|
|
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
|
|
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
|
|
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
|
|
|
|
<!-- Theme used for syntax highlighted code -->
|
|
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
|
|
</head>
|
|
<body>
|
|
<div class="reveal">
|
|
<div class="slides">
|
|
|
|
<!--...Datalad Basics...-->
|
|
|
|
<section>
|
|
<section>
|
|
<script src="https://cdn.logwork.com/widget/countdown.js"></script>
|
|
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
|
|
data-style="columns" data-timezone="Europe/Berlin" data-date="2022-04-21 09:45">
|
|
"Motivation & Basics of version control" starts in </a>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section>
|
|
<h2>Participation modes </h2>
|
|
<iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
|
|
style="border: 0" width="800" height="800"></iframe>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Prerequisites: Installation and Configuration</h2>
|
|
<ul style="font-size:30px">
|
|
<li data-fragment-index="1" class="fragment fade-in">Your installed version of DataLad should be 0.16.1</li>
|
|
<pre class="fragment fade-in" data-fragment-index="1"><code data-fragment-index="1" class="fragment fade-in">datalad --version
|
|
0.16.1</code></pre>
|
|
|
|
<li data-fragment-index="2" class="fragment fade-in">DataLad relies on Git to create a revision history with detailed information on
|
|
what was changed, when, and how. Therefore, you should tell Git who you are and
|
|
configure a Git identity (name and email)</li>
|
|
<pre data-fragment-index="2" class="fragment fade-in"><code data-fragment-index="2" class="fragment fade-in bash">$ git config --list
|
|
user.name=Adina Wagner
|
|
user.email=adina.wagner@t-online.de
|
|
[...]
|
|
</code></pre>
|
|
<li data-fragment-index="3" class="fragment fade-in">Set a Git identity using
|
|
<pre data-fragment-index="3" class="fragment fade-in"><code data-fragment-index="3" class="fragment fade-in">$ git config --global user.name "Adina Wagner"
$ git config --global user.email "adina.wagner@t-online.de"</code></pre>
|
|
Find installation and configuration
|
|
instructions at <a href="http://handbook.datalad.org/en/latest/intro/installation.html" target="_blank">
|
|
handbook.datalad.org</a> </li></ul>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Using DataLad</h2>
|
|
<ul>
|
|
<div>
|
|
<li>DataLad can be used from the command line</li>
|
|
<pre><code>datalad create mydataset</code></pre></div>
|
|
<div>
|
|
<li>... or with its Python API</li>
|
|
<pre><code class="python">import datalad.api as dl
|
|
dl.create(path="mydataset")</code></pre></div>
|
|
<div class="fragment fade-in">
|
|
<li>... and other programming languages can use it via system call</li>
|
|
<pre><code class="r"># in R
|
|
> system("datalad create mydataset")
|
|
</code></pre></div>
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Using DataLad</h2>
|
|
<ul style="font-size:30px">
|
|
<li class="fragment fade-in">Every DataLad command consists of a main
|
|
command followed by a sub-command. The main and the sub-command can have options.
|
|
<img src="../pics/command-structure.png">
|
|
</li>
|
|
<li class="fragment fade-in"> Example (main command, subcommand, several subcommand options):
|
|
<pre><code>$ datalad save -m "Saving changes" --recursive </code></pre>
|
|
</li>
|
|
<li class="fragment fade-in">Use <em>--help</em> to find out more about any (sub)command
|
|
and its options, including detailed description and examples (<em>q</em> to close). Use <em>-h</em> to get a short
|
|
overview of all options
|
|
<pre><code>$ datalad save -h
|
|
Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
|
|
[-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
|
|
[--version]
|
|
[PATH ...]
|
|
|
|
Use '--help' to get more comprehensive information.
|
|
</code></pre></li>
|
|
</ul>
|
|
</section>
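<section data-markdown><script type="text/template">
### Sketch: the command structure as a list

A minimal sketch of the structure described on the previous slide: main command, then subcommand, then the subcommand's options. The helper name `build_command` is hypothetical and only illustrates the ordering; the `save` options are the ones from the example above.

```python
import shlex


def build_command(subcommand, options=(), paths=(), main_options=()):
    """Assemble argv as: main command, main options, subcommand, its options, paths."""
    return ["datalad", *main_options, subcommand, *options, *paths]


argv = build_command("save", options=["-m", "Saving changes", "--recursive"])
print(shlex.join(argv))  # shell-quoted form of the same command
```

Passing `argv` to `subprocess.run(argv, check=True)` would execute it, assuming DataLad is installed and the working directory is a dataset.
</script></section>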
|
|
|
|
|
|
<section>
|
|
<h2>DataLad Datasets</h2>
|
|
|
|
<ul>
|
|
<li>DataLad's core data structure</li>
|
|
<ul>
|
|
<li>Dataset = A directory managed by DataLad</li>
|
|
<li>Any directory of your computer can be managed by DataLad.</li>
|
|
<li class="fragment fade-in">Datasets can be <i>created</i> (from scratch) or <i>installed</i></li>
|
|
<li class="fragment fade-in">Datasets can be nested: <i>linked subdirectories</i></li>
|
|
</ul>
|
|
<li class="fragment fade-in">Let's start by creating a dataset:</li>
|
|
<div class="fragment fade-in"><pre><code>$ datalad create -c text2git my-dataset</code></pre></div>
|
|
</ul>
|
|
<a class="fragment fade-in" style="font-size:25px" href="https://psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#getting-started-create-an-empty-dataset" target="_blank">
|
|
Code: psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#getting-started-create-an-empty-dataset
|
|
</a>
|
|
<aside class="notes">
|
|
<li>anything can be managed: CV, website, music library, phd</li>
|
|
<li>show this on the manuscript repo: history, looks/feels</li>
|
|
</aside>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>DataLad Datasets</h2>
|
|
A DataLad dataset is a joint Git + git-annex repository
|
|
<img src="../pics/slides/pics/datalad_sandwhich_tuned/sandwhich03.svg">
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-transition="None">
|
|
<h3>What is version control?</h3>
|
|
<img height="400" src="../pics/turingway/VersionControl.svg">
|
|
<img height="400" src="../pics/turingway/ProjectHistory.svg">
|
|
<imgcredit>Illustration adapted from Scriberia and The Turing Way</imgcredit>
|
|
<ul>
|
|
<li class="fragment fade-in">keep things organized</li>
|
|
<li class="fragment fade-in">keep track of changes</li>
|
|
<li class="fragment fade-in">revert changes or go back to previous states</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Why version control?</h2>
|
|
<img src="../pics/final.png" style="box-shadow: 10px 10px 8px #888888; height: 600px" height="600"><br>
|
|
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Version Control</h2>
|
|
|
|
<ul>
|
|
<li>DataLad knows two things: Datasets and files</li>
|
|
<img class="fragment fade-in" data-fragment-index="1" style="box-shadow: 5px 5px 3px #888888" src="../pics/artwork/src/dataset.svg" height="330"> <img style="box-shadow: 5px 5px 3px #888888" height="330" class="fragment fade-in" data-fragment-index="2" src="../pics/artwork/src/local_wf.svg">
|
|
</ul><br>
|
|
<li class="fragment fade-in">
|
|
Every file you put into a dataset can be easily version-controlled,
|
|
regardless of size, with the same command: <em>datalad save</em> </li>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Local version control</h2>
|
|
|
|
<p>Procedurally, version control is easy with DataLad!</p>
|
|
<img class="fragment fade-in" src="../pics/local_wf.svg" height="500"> <!-- .element: class="fragment" -->
|
|
<br>
|
|
|
|
<b class="fragment fade-in">Advice:</b>
|
|
<ul>
|
|
<li class="fragment fade-in">Save <i>meaningful</i> units of change</li>
|
|
<li class="fragment fade-in">Attach helpful commit messages</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-markdown><script type="text/template" >
|
|
|
|
### This means: You can also version control data! <!-- .element: class="fragment" -->
|
|
|
|
<pre><code class="bash" style="max-height:none">$ datalad save \
|
|
-m "Adding raw data from neuroimaging study 1" \
|
|
sub-*
|
|
add(ok): sub-1/anat/T1w.json (file)
|
|
add(ok): sub-1/anat/T1w.nii.gz (file)
|
|
add(ok): sub-1/anat/T2w.json (file)
|
|
add(ok): sub-1/anat/T2w.nii.gz (file)
|
|
add(ok): sub-1/func/sub-1-run-1_bold.json (file)
|
|
add(ok): sub-1/func/sub-1-run-1_bold.nii.gz (file)
|
|
add(ok): sub-10/anat/T1w.json (file)
|
|
add(ok): sub-10/anat/T1w.nii.gz (file)
|
|
add(ok): sub-10/anat/T2w.json (file)
|
|
add(ok): sub-10/anat/T2w.nii.gz (file)
|
|
[110 similar messages have been suppressed]
|
|
save(ok): . (dataset)
|
|
action summary:
|
|
add (ok: 120)
|
|
save (ok: 1)
|
|
</code></pre> <!-- .element: class="fragment" -->
|
|
|
|
</script>
|
|
</section>
|
|
|
|
<section data-markdown><script type="text/template" >
|
|
## Version Control
|
|
* Your dataset can be a complete research log, capturing everything that was done, when, by whom, and how
|
|

|
|
* Interact with the history:
|
|
* reset your dataset (or subset of it) to a previous state,
|
|
* throw out changes or bring them back,
|
|
* find out what was done when, how, why, and by whom
|
|
* Identify precise versions: Use data in the most recent version, or the one from 2018, or...
|
|
* ...
|
|
</script>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Preview: Start to record provenance</h2>
|
|
<ul>
|
|
<li>
|
|
Have you ever saved a PDF to read later onto your computer, but forgot
|
|
where you got it from?
|
|
</li>
|
|
<li class="fragment fade-in">
|
|
Digital Provenance = <i>"The tools and processes used to create a
|
|
digital file, the responsible entity, and when and where the process
|
|
events occurred"</i>
|
|
</li>
|
|
<li class="fragment fade-in">
|
|
The history of a dataset already contains provenance, but there is more
|
|
to record - for example: Where does a file come from?
|
|
<code>datalad download-url</code> is helpful
|
|
</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h3>Summary - Local version control</h3>
|
|
|
|
<dl>
|
|
<dt class="fragment fade-in"><code>datalad create</code> creates an empty dataset.</dt> <dd class="fragment fade-in">Configurations (<b>-c yoda</b>, <b>-c text2git</b>) are useful (details soon).</dd>
|
|
<br>
|
|
<dt class="fragment fade-in">A dataset has a <i>history</i> to track files and their modifications. </dt><dd class="fragment fade-in">Explore it with Git (<b>git log</b>) or external tools (e.g., <b>tig</b>).</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>datalad save</code> records the dataset or file state to the history. </dt><dd class="fragment fade-in">Concise <b>commit messages</b> should summarize the change for future you and others.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>datalad download-url</code> obtains web content and records its origin. </dt><dd class="fragment fade-in">It even takes care of saving the change.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>datalad status</code> reports the current state of the dataset.</dt>
|
|
<dd class="fragment fade-in">A clean dataset status (no modifications, no untracked files) is good practice.</dd>
|
|
</dl>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Questions!</h2>
|
|
<small>Awkward silence can be bridged with awkward MC questions :) </small>
|
|
<iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
style="border: 0" width="930" height="900"></iframe>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section>
|
|
<h2>Teaser: Time-travelling</h2>
|
|
<small>Comprehensive walk-through: <a href="http://handbook.datalad.org/en/latest/basics/101-137-history.html" target="_blank">
|
|
handbook.datalad.org/basics/101-137-history.html
|
|
</a></small>
|
|
<ul style="font-size:30px">
|
|
<li>Mistakes are not forever anymore: Past changes can transparently be undone</li>
|
|
<li>Become a time-bender: Travel back in time or rewrite history</li>
|
|
<li class="fragment fade-in">Prerequisite: Understand Git IDs and "refs"</li>
|
|
<ul>
|
|
<li class="fragment fade-in">Commit hash/Commit SHA: A 40-character string identifying each commit</li>
|
|
<li class="fragment fade-in">Branch names, e.g., <em>main</em></li>
|
|
<li class="fragment fade-in">Tags, e.g., <em>v.0.1</em></li>
|
|
<li class="fragment fade-in">A pointer to the checked-out (current) commit on the current branch, <em>HEAD</em></li>
|
|
</ul>
|
|
</ul>
|
|
<img class="fragment fade-in" src="../pics/commit-ref.png"><br>
|
|
<a class="fragment fade-in" style="font-size:25px" href="https://psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#breaking-things-and-repairing-them" target="_blank">
|
|
Code: psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#breaking-things-and-repairing-them
|
|
</a>
|
|
</section>
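<section data-markdown><script type="text/template">
### Sketch: where the 40 characters come from

A small sketch of why an ID has 40 characters, assuming a repository with Git's default SHA-1 object format: an object ID is the hex digest of a short header plus the object's content. A blob is used here for brevity; commit IDs are derived the same way from a commit object. The function name `git_object_id` is hypothetical.

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 over '<kind> <size>' + NUL byte + body, as used for Git object IDs."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


empty_blob = git_object_id("blob", b"")
print(empty_blob, len(empty_blob))
```

Because the ID depends only on content, any change to a file or commit yields a different ID.
</script></section>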
|
|
|
|
<section>
|
|
<h2>Summary: Interacting with Git's history (teaser)</h2>
|
|
<dl>
|
|
<dt class="fragment fade-in">Interactions with Git's history require Git commands, but are immensely powerful</dt><dd class="fragment fade-in">More in <a href="http://handbook.datalad.org/en/latest/basics/101-137-history.html" target="_blank">
|
|
handbook.datalad.org/basics/101-137-history.html
|
|
</a></dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>git restore</code> is a dangerous (!), but sometimes useful command:</dt>
|
|
<dd class="fragment fade-in"> It removes unsaved modifications to restore files to a past, saved state. What it removes cannot be recovered!</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>git revert [hash]</code> transparently undoes a past commit</dt><dd class="fragment fade-in">It will create a new entry in the revision history about this.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in">Commands that will be introduced later:</dt>
|
|
<dd class="fragment fade-in"><code>git checkout</code> lets you time-travel.</dd>
|
|
<dt class="fragment fade-in">Commands that are out of scope but useful to know:</dt>
|
|
<dd class="fragment fade-in"><code>git rebase</code> changes and <code>git reset</code> rewinds history without creating a commit about it (see Handbook chapter for examples).</dd>
|
|
<dt class="fragment fade-in">A life-saver that is not well-known: <code>git reflog</code></dt><dd class="fragment fade-in">A time-limited log of past actions. It can help undo nearly every mistake, except changes discarded by <code>git restore</code> and <code>git clean</code>.</dd>
|
|
</dl>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Questions!</h2>
|
|
<small>Awkward silence can be bridged with awkward MC questions :) </small>
|
|
<iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
style="border: 0" width="930" height="900"></iframe>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section>
|
|
<h2>A look underneath the hood</h2>
|
|
<h4>(In-depth explanations of how and why things work, with plenty of teasers for additional features)</h4>
|
|
</section>
|
|
|
|
<section data-transition="None" style="vertical-align:top">
|
|
<h3>There are two version control tools at work - why?</h3>
|
|
<p class="fragment fade-in">Git does not handle large files well.
|
|
<div class="r-stack">
|
|
<img class="fragment" src="../pics/gitsnapshot.png">
|
|
</div>
|
|
</p>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h3>There are two version control tools at work - why?</h3>
|
|
<p>Git does not handle large files well.
|
|
<img src="../pics/gitsnapshot2.png">
|
|
</p>
|
|
<p class="fragment fade-in">
|
|
And repository hosting services refuse to handle large files:
|
|
<img src="../pics/pushing_large_files_to_Git.png"></p>
|
|
<p style="z-index: 100;position: fixed; font-size:35px;margin-top:-450px;margin-bottom:300px;margin-left:1000px">
|
|
<img class="fragment" src="../pics/horrofied.png" height="380px"></p>
|
|
<p class="fragment fade-in">git-annex to the rescue! Let's take a look at how it works.</p>
|
|
</section>
|
|
|
|
<section data-markdown><script type="text/template" >
|
|
## Consuming datasets
|
|
* A dataset can be created from scratch/existing directories:
|
|
<pre><code class="bash" style="max-height:none">$ datalad create mydataset
|
|
[INFO] Creating a new annex repo at /home/adina/mydataset
|
|
create(ok): /home/adina/mydataset (dataset)
|
|
</code></pre>
|
|
* but datasets can also be installed from paths or from URLs:
|
|
<pre><code class="bash" style="max-height:none">$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess HCP
|
|
install(ok): /tmp/HCP (dataset)
|
|
</code></pre>
|
|
<small>Hint: Did you know that you can get the <a href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank"> Human Connectome Project Open Access Data </a> as a Dataset?</small>
|
|
</script>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Consuming datasets</h2>
|
|
|
|
<ul>
|
|
<li class="fragment fade-in">Here's how to get a dataset:</li>
|
|
<img class="fragment fade-in" src="../pics/clonedata.gif" height="700">
|
|
|
|
</ul>
|
|
</section>
|
|
<section data-transition="None">
|
|
<h2>Consuming datasets</h2>
|
|
|
|
<ul>
|
|
<li>Here's how a dataset looks after installation:</li>
|
|
<img class="fragment fade-in" src="../pics/getdata.gif" height="700">
|
|
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Plenty of data, but little disk-usage</h2>
|
|
<ul>
|
|
<li class="fragment fade-in-then-semi-out">Cloned datasets are lean.
|
|
"Meta data" (file names, availability) are present, but <b>no file content</b>:</li>
|
|
<pre class="fragment fade-in"><code>$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
|
|
install(ok): /tmp/studyforrest-data-phase2 (dataset)
|
|
$ cd studyforrest-data-phase2 && du -sh
|
|
18M .</code></pre>
|
|
|
|
<li class="fragment fade-in-then-semi-out">File contents can be retrieved on demand:</li>
|
|
</ul>
|
|
<pre class="fragment fade-in"><code>$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
|
|
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]</code></pre>
|
|
|
|
<li class="fragment fade-in">Have access to more data on your computer than you have disk space for:</li>
|
|
<pre class="fragment fade-in"><code># eNKI dataset (1.5TB, 34k files):
|
|
$ du -sh
|
|
1.5G .
|
|
# HCP dataset (~200TB, >15 million files)
|
|
$ du -sh
|
|
48G . </code></pre>
|
|
</section>
|
|
|
|
<section data-markdown data-transition="None"> <script type="text/template">
|
|
## Plenty of data, but little disk-usage
|
|
|
|
Drop file content that is not needed:<!-- .element: class="fragment fade-in" -->
|
|
<pre class="fragment fade-in-then-semi-out"><code>$ datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
|
|
drop(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]</code></pre>
|
|
When files are dropped, only "meta data" stays behind, and they can be re-obtained on demand.<!-- .element: class="fragment fade-in" -->
|
|
<pre><code class="python">import datalad.api as dl
dl.get('input/sub-01')
# [really complex analysis]
dl.drop('input/sub-01')
</code></pre><!-- .element: class="fragment fade-in" -->
|
|
</script></section>
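<section data-markdown><script type="text/template">
### Sketch: get, analyze, drop as one unit

The get/analyze/drop pattern above can be wrapped so that content is dropped even when the analysis fails. This is a sketch, not part of DataLad: the helper `fetched` is hypothetical, and the two lambdas are stand-ins so the example runs without DataLad installed.

```python
from contextlib import contextmanager


@contextmanager
def fetched(path, get, drop):
    """Retrieve `path`, hand it to the caller, and drop it again afterwards."""
    get(path)
    try:
        yield path
    finally:
        drop(path)  # runs even if the analysis raises


# Stand-ins that only record what would happen.
calls = []
with fetched("input/sub-01",
             get=lambda p: calls.append(("get", p)),
             drop=lambda p: calls.append(("drop", p))):
    calls.append(("analyze", "input/sub-01"))
print(calls)
```

In a real dataset one would presumably pass `dl.get` and `dl.drop` from `datalad.api` instead of the stand-ins.
</script></section>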
|
|
|
|
<section>
|
|
<h2>Git versus Git-annex</h2>
|
|
<dl>
|
|
<dt>Data in datasets is either stored in Git or git-annex</dt>
|
|
<dd>By default, everything is <i>annexed</i>, i.e., stored in a dataset annex by git-annex</dd><br>
|
|
|
|
<br>
|
|
<small>
|
|
<table>
|
|
<tr>
|
|
<td><b>Git</b></td>
|
|
<td><b>git-annex</b></td>
|
|
</tr>
|
|
<tr>
|
|
<td>handles <b>small</b> files well (text, code)</td>
|
|
<td>handles <b>all</b> types and sizes of files well</td>
|
|
</tr>
|
|
<tr>
|
|
<td>file contents are in the Git history
|
|
and will be <b>shared</b> upon git/datalad push</td>
|
|
<td>file contents are in the annex. Not necessarily shared</td>
|
|
</tr>
|
|
<tr>
|
|
<td>Shared with every dataset clone</td>
|
|
<td><b>Can be kept private</b> on a per-file level when sharing the dataset</td>
|
|
</tr>
|
|
<tr>
|
|
<td>Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files </td>
|
|
<td>Useful: Large files, private files</td>
|
|
</tr>
|
|
</table>
|
|
</small>
|
|
<br><br>
|
|
</dl>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Git versus Git-annex</h2>
|
|
<small>Useful background information for demo later. Read
|
|
<a href="http://handbook.datalad.org/en/latest/basics/101-115-symlinks.html" target="_blank">
|
|
this handbook chapter</a> for details
|
|
</small><br>
|
|
Git and Git-annex handle files differently: annexed files are stored in an annex.
|
|
File content is hashed & only content-identity is committed to Git.
|
|
<ul>
|
|
<table>
|
|
<tr>
|
|
<td>
|
|
<li>Files stored in Git are modifiable, files stored in Git-annex are content-locked</li>
|
|
</td>
|
|
<td width="60%">
|
|
<img src="../pics/git_vs_gitannex.svg" height="500">
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<li>Annexed contents are not available right after cloning,
|
|
only content identity and availability information (as they are stored in Git).
|
|
Everything that is annexed needs to be retrieved with <code>datalad get</code> from wherever it is stored.
|
|
</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Git versus Git-annex</h2>
|
|
<img height="500" src="../pics/artwork/src/publishing/publishing_gitvsannex.svg">
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Git versus Git-annex</h2>
|
|
<ul>
|
|
When sharing datasets with someone without access to the same computational
|
|
infrastructure, annexed data is not necessarily stored together with the rest
|
|
of the dataset (more in the <b>session on publishing</b>).
|
|
</ul>
|
|
<img src="../pics/services_connected.png" height="500">
|
|
<ul>
|
|
Transport logistics exist to interface with all major storage providers.
|
|
If the one you use isn't supported, let us know!
|
|
</ul>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Git versus Git-annex</h2>
|
|
<ul>
|
|
Users can decide which files are annexed:
|
|
<br><br>
|
|
<li><b>Pre-made run-procedures</b>, provided by DataLad (e.g., <code>text2git</code>, <code>yoda</code>)
|
|
or created and shared by users
|
|
(<a href="http://handbook.datalad.org/en/latest/basics/101-124-procedures.html" target="_blank">Tutorial</a>) </li>
|
|
<li>Self-made configurations in <code>.gitattributes</code> (e.g., based on file type,
|
|
file/path name, size, ...; <a href="http://handbook.datalad.org/en/latest/basics/101-123-config2.html#gitattributes" target="_blank">
|
|
rules and examples
|
|
</a> )</li>
|
|
<li>Per-command basis (e.g., via <code>datalad save --to-git</code>)</li>
|
|
</ul>
|
|
</section>
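<section data-markdown><script type="text/template">
### Sketch: a "text to Git, binary to annex" rule

A toy model of the decision a configuration such as `text2git` encodes. It assumes the rule is "annex a file only if it looks binary and is not empty"; the NUL-byte check is a crude stand-in for real file-type detection, and `goes_to_annex` is a hypothetical name, not a DataLad or git-annex function.

```python
def goes_to_annex(content: bytes) -> bool:
    """Toy stand-in for a 'binary and non-empty' annexing rule."""
    looks_binary = b"\x00" in content  # crude heuristic, not real MIME detection
    return looks_binary and len(content) > 0


print(goes_to_annex(b"# README\nplain text\n"))  # text: stays in Git
print(goes_to_annex(b"\x89PNG\x00\x00"))         # binary: annexed
```

Rules in `.gitattributes` can express such conditions per path, type, or size; see the linked handbook chapter for the actual syntax.
</script></section>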
|
|
|
|
|
|
<section>
|
|
<h2><em>text2git</em>: Text versus binary files</h2>
|
|
<iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
style="border: 0" width="930" height="900"></iframe>
|
|
<small>An overview of text- versus binary files and implications for version control is in
|
|
<a href="https://psychoinformatics-de.github.io/rdm-course/02-structuring-data/index.html#file-types-text-vs-binary" target="_blank">
|
|
psychoinformatics-de.github.io/rdm-course/02-structuring-data/index.html#file-types-text-vs-binary
|
|
</a> </small>
|
|
</section>
|
|
|
|
|
|
<section data-transition="None">
|
|
<h2>Disk-space aware workflows</h2>
|
|
<ul>
|
|
<li class="fragment fade-in-then-semi-out"> Clone the input data:</li>
|
|
<pre class="fragment fade-in"><code>$ datalad clone git@github.com:datalad-datasets/machinelearning-books.git
|
|
install(ok): /tmp/machinelearning-books (dataset)
|
|
$ cd machinelearning-books && du -sh
|
|
348K .</code></pre>
|
|
<pre class="fragment fade-in"><code>$ ls
|
|
A.Shashua-Introduction_to_Machine_Learning.pdf
|
|
B.Efron_T.Hastie-Computer_Age_Statistical_Inference.pdf
|
|
C.E.Rasmussen_C.K.I.Williams-Gaussian_Processes_for_Machine_Learning.pdf
|
|
D.Barber-Bayesian_Reasoning_and_Machine_Learning.pdf
|
|
[...]</code></pre>
|
|
<li class="fragment fade-in-then-semi-out"> Retrieve an annexed file's content on demand:</li>
|
|
<pre class="fragment fade-in"><code>$ datalad get A.Shashua-Introduction_to_Machine_Learning.pdf
|
|
get(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [from web...]</code></pre>
|
|
<li class="fragment fade-in-then-semi-out"> Drop an annexed file's content when done:</li>
|
|
|
|
<pre class="fragment fade-in-then-semi-out"><code>$ datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf
|
|
drop(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]</code></pre>
|
|
</ul>
|
|
<aside class="notes">
|
|
Idea behind datalad: Enable a similar level of tooling and culture for the distribution and version control of data as it is present for open source software development
|
|
</aside>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Distributed availability</h2>
|
|
<ul style="font-size:30px">
|
|
<li class="fragment fade-in" data-fragment-index="1">git-annex conceptualizes file availability information as a decentralized network.
|
|
A file can exist in multiple different locations. <em>git annex whereis</em>
|
|
tells you which are known:</li>
|
|
<pre class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">$ git annex whereis inputs/images/chinstrap_02.jpg
|
|
whereis inputs/images/chinstrap_02.jpg (2 copies)
|
|
00000000-0000-0000-0000-000000000001 -- web
|
|
c1bfc615-8c2b-4921-ab33-2918c0cbfc18 -- adina@muninn:/tmp/my-dataset [here]
|
|
|
|
web: https://unsplash.com/photos/8PxCm4HsPX8/download?force=true
|
|
ok
|
|
</code></pre>
|
|
<li class="fragment fade-in" data-fragment-index="2">
|
|
If a file has no other known storage locations, <em>drop</em> will warn
|
|
</li>
|
|
<ul style="font-size:25px">
|
|
<li class="fragment fade-in" data-fragment-index="3">Here is a file with a registered remote location (the web)</li>
|
|
<pre class="fragment fade-in" data-fragment-index="3"><code class="fragment fade-in" data-fragment-index="3">$ datalad drop inputs/images/chinstrap_02.jpg
|
|
drop(ok): /home/my-dataset/inputs/images/chinstrap_02.jpg (file)
|
|
$ datalad get inputs/images/chinstrap_02.jpg
|
|
get(ok): inputs/images/chinstrap_02.jpg (file)
|
|
</code></pre>
|
|
<li class="fragment fade-in" data-fragment-index="3">Here is a file without a registered remote location
|
|
</li>
|
|
<pre class="fragment fade-in" data-fragment-index="3"><code class="fragment fade-in" data-fragment-index="3">$ datalad drop inputs/images/chinstrap_01.jpg
|
|
drop(error): inputs/images/chinstrap_01.jpg (file)
|
|
[unsafe; Could only verify the existence of 0 out of 1 necessary copy;
|
|
(Use --reckless availability to override this check, or adjust numcopies.)]</code></pre>
|
|
</ul>
|
|
<li class="fragment fade-in" data-fragment-index="4">Delineation and advantages of decentralized versus centralized RDM: <a href="https://doi.org/10.1515/nf-2020-0037" target="_blank">
In defense of decentralized research data management</a></li>
|
|
|
|
</ul>
|
|
</section>
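<section data-markdown><script type="text/template">
### Sketch: the safety check behind drop

A toy model of the check that produced the error on the previous slide ("0 out of 1 necessary copy"). It assumes a drop is allowed only when at least `numcopies` other copies could be verified. The function `can_drop` is hypothetical and much simpler than what git-annex actually does.

```python
def can_drop(verified_other_copies: int, numcopies: int = 1) -> bool:
    """Allow a drop only if enough other copies were verified to exist."""
    return verified_other_copies >= numcopies


print(can_drop(1))  # chinstrap_02.jpg: the web copy could be verified
print(can_drop(0))  # chinstrap_01.jpg: no other copy is known
```

Raising `numcopies` makes the check stricter; the override mentioned in the error message skips it.
</script></section>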
|
|
|
|
<section>
|
|
<h2>Data protection</h2>
|
|
Why are annexed contents write-protected? (part 1) <br><br>
|
|
<ul style="font-size:30px">
|
|
<li>Where the filesystem allows it, annexed files are symlinks:
|
|
<pre><code>$ ls -l inputs/images/chinstrap_01.jpg
|
|
lrwxrwxrwx 1 adina adina 132 Apr 5 20:53 inputs/images/chinstrap_01.jpg -> ../../.git/annex/objects/1z/
|
|
xP/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg
|
|
</code></pre><small>(PS: especially useful in datasets with many identical files) </small></li>
|
|
<li>The symlink reveals git-annex's internal data organization, which is based on the content's identity hash:
|
|
<pre><code>$ md5sum inputs/images/chinstrap_01.jpg
|
|
2e043a5654cec96aadad554fda2a8b26 inputs/images/chinstrap_01.jpg
|
|
</code></pre></li>
|
|
<li class="fragment fade-in">git-annex write-protects files to keep this symlink functional.
Changing file contents without git-annex knowing would change the hash and leave the symlink pointing to nothing</li>
|
|
<li class="fragment fade-in">To (temporarily) remove the write-protection one can <em>unlock</em> the file</li>
|
|
</ul>
|
|
</section>
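<section data-markdown><script type="text/template">
### Sketch: a key derived from content

A minimal sketch of why an unnoticed modification would break the link. It assumes the key has the shape seen in the symlink target above (backend name, size in bytes, MD5 checksum, file extension). The helper `md5e_key` is hypothetical and keeps only the last extension, which is a simplification.

```python
import hashlib
from pathlib import PurePosixPath


def md5e_key(name: str, content: bytes) -> str:
    """Build a key shaped like the one in the symlink target."""
    digest = hashlib.md5(content).hexdigest()
    ext = PurePosixPath(name).suffix  # simplification: last extension only
    return f"MD5E-s{len(content)}--{digest}{ext}"


before = md5e_key("greeting.txt", b"hello")
after = md5e_key("greeting.txt", b"hello!")
print(before)
print(before != after)  # changed content means a different key
```

A symlink that still points at the old key would no longer match the modified content, which is what write-protection prevents.
</script></section>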
|
|
|
|
<section data-transition="fade">
|
|
<h2>Detour & Teaser: Reproducible data analysis</h2>
|
|
Your past self is the worst collaborator:
|
|
<img src="../pics/ownlegacycode_phd.png" height="500">
|
|
<imgcredit>Full comic at <a href="http://phdcomics.com/comics.php?f=1689">http://phdcomics.com/comics.php?f=1979</a></imgcredit>
|
|
|
|
<small>Code: <a href="https://psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#data-processing" target="_blank">
|
|
psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#data-processing</a> </small>
|
|
</section>
|
|
|
|
|
|
<section data-transition="None">
|
|
<h2>Reproducible execution & provenance capture</h2>
|
|
|
|
<p style="font-size:30px"><em>datalad run</em> wraps a command execution and records its impact on a dataset.
|
|
<img class="fragment fade-in" src="../pics/run_prov_0.svg">
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Reproducible execution & provenance capture</h2>
|
|
|
|
<p style="font-size:30px"><em>datalad run</em> wraps a command execution and records its impact on a dataset.
|
|
<pre style="max-height:none"><code style="max-height:none">commit 9fbc0c18133aa07b215d81b808b0a83bf01b1984 (HEAD -> main)
|
|
Author: Adina Wagner [adina.wagner@t-online.de]
|
|
Date: Mon Apr 18 12:31:47 2022 +0200
|
|
|
|
[DATALAD RUNCMD] Convert the second image to greyscale
|
|
|
|
=== Do not change lines below ===
|
|
{
|
|
"chain": [],
|
|
"cmd": "python code/greyscale.py inputs/images/chinstrap_02.jpg outputs/im>
|
|
"dsid": "418420aa-7ab7-4832-a8f0-21107ff8cc74",
|
|
"exit": 0,
|
|
"extra_inputs": [],
|
|
"inputs": [],
|
|
"outputs": [],
|
|
"pwd": "."
|
|
}
|
|
^^^ Do not change lines above ^^^
|
|
|
|
diff --git a/outputs/images_greyscale/chinstrap_02_grey.jpg b/outputs/images_gr>
|
|
new file mode 120000
|
|
index 0000000..5febc72
|
|
--- /dev/null
|
|
+++ b/outputs/images_greyscale/chinstrap_02_grey.jpg
|
|
@@ -0,0 +1 @@
|
|
+../../.git/annex/objects/19/mp/MD5E-s758168--8e840502b762b2e7a286fb5770f1ea69.>
|
|
\ No newline at end of file
|
|
</code></pre>
|
|
<p style="font-size:30px">The resulting commit's hash (or any other identifier) can be used
|
|
to automatically re-execute a computation (more on this tomorrow)</p> <!-- .element: class="fragment" -->
|
|
</section>
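<section data-markdown><script type="text/template">
### Sketch: reading a run record

The record in the commit message above is plain JSON between two marker lines, so it can be read by a machine. This is a sketch under assumptions: the marker strings are copied from the slide, `parse_run_record` is a hypothetical helper, and `in.jpg`/`out.jpg` are placeholders because the command on the slide is truncated.

```python
import json

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def parse_run_record(message: str) -> dict:
    """Return the JSON record found between the two marker lines."""
    body = message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)


message = """[DATALAD RUNCMD] Convert the second image to greyscale

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python code/greyscale.py in.jpg out.jpg",
 "exit": 0,
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""
record = parse_run_record(message)
print(record["cmd"], record["exit"])
```

The recorded command and exit code are the information a later re-execution needs.
</script></section>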
|
|
|
|
|
|
<section data-transition="None">
|
|
<h2>Data protection</h2>
|
|
Why are annexed contents write-protected? (part 2) <br><br>
|
|
<ul style="font-size:30px">
|
|
<li>When you try to modify an annexed file without unlocking you will see
|
|
"Permission denied" errors.
|
|
<pre><code>Traceback (most recent call last):
|
|
File "/home/bob/Documents/rdm-warmup/example-dataset/code/greyscale.py", line 20, in &lt;module&gt;
|
|
grey.save(args.output_file)
|
|
File "/home/bob/Documents/rdm-temporary/venv/lib/python3.9/site-packages/PIL/Image.py", line 2232, in save
|
|
fp = builtins.open(filename, "w+b")
|
|
PermissionError: [Errno 13] Permission denied: 'outputs/images_greyscale/chinstrap_02_grey.jpg'
|
|
</code></pre></li>
|
|
<li class="fragment fade-in">Use <em>datalad unlock</em> to make the file modifiable.
|
|
Underneath the hood (given the file system initially supported symlinks), this removes the symlink:
|
|
<pre><code>$ datalad unlock outputs/images_greyscale/chinstrap_02_grey.jpg
|
|
$ ls -l outputs/images_greyscale/chinstrap_02_grey.jpg
|
|
-rw-r--r-- 1 adina adina 758168 Apr 18 12:31 outputs/images_greyscale/chinstrap_02_grey.jpg</code></pre></li>
|
|
<li class="fragment fade-in"><em>datalad save</em> locks the file again.
Locking and unlocking ensure that git-annex always finds the right version of a file.</li>
|
|
</ul>
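<p style="font-size:30px" class="fragment fade-in">A minimal sketch of the full unlock-modify-save cycle, assuming the
file from the example above (the commit message is only a placeholder):</p>
<pre class="fragment fade-in"><code>$ datalad unlock outputs/images_greyscale/chinstrap_02_grey.jpg
# ... modify or regenerate the file here ...
$ datalad save -m "Update greyscale image" outputs/images_greyscale/chinstrap_02_grey.jpg</code></pre>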
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Reproducible execution & provenance capture</h2>
|
|
|
|
<p style="font-size:30px"><em>datalad run</em> wraps a command execution and records its impact on a dataset.
|
|
<br><strong>In addition, it can take care of data retrieval and unlocking</strong></p>
|
|
<img class="fragment fade-in" src="../pics/run_prov.svg" height="600"> <!-- .element: class="fragment" -->
|
|
</section>
|
|
|
|
<section>
|
|
<h2>datalad rerun</h2>
|
|
<ul style="font-size:30px">
|
|
<li>
|
|
<code>datalad rerun</code> spares others and yourself the effort of
remembering, or forensically reconstructing, how an analysis was performed
|
|
</li>
|
|
<li>
|
|
But it is also a digital, machine-readable provenance record
|
|
</li>
|
|
<li>
|
|
Important: The better the run command is specified, the better the
|
|
provenance record
|
|
</li>
|
|
<li>
|
|
Note: run and rerun only create an entry in the history if the command execution
|
|
leads to a change.
|
|
</li>
|
|
<br><br>
|
|
<li class="fragment fade-in">Task: Use <code>datalad rerun</code> to rerun the script execution.
|
|
Find out if the output changed</li>
|
|
</ul>
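<p style="font-size:30px" class="fragment fade-in">A sketch of a more completely specified run command, assuming the
script takes the input and output paths as its two arguments (the commit message is a placeholder).
The run record shown earlier lists empty <code>inputs</code> and <code>outputs</code>; declaring them lets DataLad
retrieve and unlock the files itself:</p>
<pre class="fragment fade-in"><code>$ datalad run -m "Convert image to greyscale" \
    --input inputs/images/chinstrap_02.jpg \
    --output outputs/images_greyscale/chinstrap_02_grey.jpg \
    "python code/greyscale.py {inputs} {outputs}"
# assumption: the run record sits in the most recent commit (HEAD)
$ datalad rerun HEAD</code></pre>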
|
|
</section>
|
|
|
|
<section>
|
|
<h3>Summary - Underneath the hood</h3>
|
|
|
|
<ul style="font-size:30px">
|
|
<dt class="fragment fade-in">Files are either kept in Git or in git-annex.</dt>
|
|
<dd class="fragment fade-in"><em>datalad save</em> is used for both, but configurations (e.g., <em>text2git</em>), dataset rules
(e.g., in a <em>.gitattributes</em> file), or flags change the default behavior
of annexing everything.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in">Annexed files behave differently from files kept in Git:</dt>
|
|
<dd class="fragment fade-in">They can be retrieved and dropped from local or remote locations, they are write-protected,
and their content is unknown to Git (and thus easy to keep private).</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><em>datalad clone</em> installs datasets from URLs or local or remote paths</dt>
|
|
<dd class="fragment fade-in">Annexed file contents can be retrieved or dropped on demand; the contents of
files stored in Git are available right away.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><em>datalad unlock</em> makes annexed files modifiable, <em>datalad save</em>
|
|
locks them again.</dt>
|
|
<dd class="fragment fade-in">(It is generally easier to get accidentally saved files out of the annex than out of Git -
|
|
see <a href="http://handbook.datalad.org/en/latest/basics/101-136-filesystem.html" target="_blank">handbook.datalad.org/basics/101-136-filesystem.html</a> for examples) </dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><em>datalad run</em> records the impact of any command execution in
|
|
a dataset. </dt>
|
|
<dd class="fragment fade-in">Data/directories specified as <code>--input</code>
are retrieved prior to command execution; data/directories specified as <code>--output</code> are unlocked.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>datalad rerun</code> can automatically re-execute run-records later.</dt>
|
|
<dd class="fragment fade-in">They can be identified with any commit-ish (hash, tag, range, ...)</dd>
|
|
|
|
</ul>
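<p style="font-size:30px" class="fragment fade-in">A short sketch of retrieving and dropping content on demand.
The clone source and the directory name <code>my-dataset</code> are placeholders, and the example assumes
the image file is annexed in that dataset:</p>
<pre class="fragment fade-in"><code>$ datalad clone &lt;url-or-path&gt; my-dataset
$ cd my-dataset
# retrieve the annexed file content
$ datalad get inputs/images/chinstrap_02.jpg
# free the space again; the content can be retrieved later from a remaining copy
$ datalad drop inputs/images/chinstrap_02.jpg</code></pre>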
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Questions!</h2>
|
|
<small>Awkward silence can be bridged with awkward MC questions :) </small>
|
|
<iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
style="border: 0" width="930" height="900"></iframe>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section>
|
|
<h2>Before we continue...</h2>
|
|
<small>Let your energy level define how we progress:</small><br>
|
|
<iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
style="border: 0" width="930" height="900"></iframe>
|
|
</section>
|
|
</section>
|
|
|
|
|
|
|
|
</div>
|
|
</div>
|
|
|
|
<script src="../reveal.js/dist/reveal.js"></script>
|
|
<script src="../reveal.js/plugin/notes/notes.js"></script>
|
|
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
|
|
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
|
|
<script>
|
|
// More info about initialization & config:
|
|
// - https://revealjs.com/initialization/
|
|
// - https://revealjs.com/config/
|
|
Reveal.initialize({
|
|
hash: true,
|
|
// The "normal" size of the presentation, aspect ratio will be preserved
|
|
// when the presentation is scaled to fit different resolutions. Can be
|
|
// specified using percentage units.
|
|
width: 1280,
|
|
height: 960,
|
|
// Factor of the display size that should remain empty around the content
|
|
margin: 0.3,
|
|
// Bounds for smallest/largest possible scale to apply to content
|
|
minScale: 0.2,
|
|
maxScale: 1.0,
|
|
|
|
controls: true,
|
|
progress: true,
|
|
history: true,
|
|
center: true,
|
|
slideNumber: 'c',
|
|
pdfSeparateFragments: false,
|
|
pdfMaxPagesPerSlide: 1,
|
|
pdfPageHeightOffset: -1,
|
|
transition: 'slide', // none/fade/slide/convex/concave/zoom
|
|
// Learn about plugins: https://revealjs.com/plugins/
|
|
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
|
|
});
|
|
</script>
|
|
</body>
|
|
</html>
|