<!doctype html>
|
|
<html>
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
|
|
|
|
<!-- Edit me start! -->
|
|
<title>This is where your title goes</title>
|
|
<meta name="description" content=" This is where you put a short description ">
|
|
<meta name="author" content=" Your Name ">
|
|
<!-- Edit me end! -->
|
|
|
|
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
|
|
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
|
|
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
|
|
|
|
<!-- Theme used for syntax highlighted code -->
|
|
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
|
|
</head>
|
|
<body>
|
|
<div class="reveal">
|
|
<div class="slides">
|
|
|
|
<!--...Datalad Basics...-->
|
|
|
|
<section>
|
|
|
|
|
|
<section>
|
|
<h2>DataLad - Research Data Management made easy</h2>
|
|
|
|
<div style="margin-top:1em;text-align:center">
|
|
<table style="border: none;">
|
|
<tr>
|
|
<td style="border: none;">Adina Wagner
|
|
<br><small>
|
|
<a href="https://twitter.com/AdinaKrik" target="_blank">
|
|
<img data-src="../pics/twitter.png" style="height:30px;margin:0px" />
|
|
@AdinaKrik</a></small></td>
|
|
<td style="border: none;"><img style="height:100px;margin-right:10px" data-src="../pics/fzj_logo.svg" />
|
|
<br></td>
|
|
</tr>
|
|
<tr>
|
|
<td style="border: none;">
|
|
<small><a href="http://psychoinformatics.de" target="_blank">Psychoinformatics lab</a>,
|
|
<br> Institute of Neuroscience and
|
|
Medicine, Brain & Behavior (INM-7)<br>
|
|
Research Center Jülich</small><br>
|
|
</td>
|
|
</tr>
|
|
<tr>
<td style="border: none;">Yaroslav Halchenko
|
|
<br><small>
|
|
<a href="https://twitter.com/yarikoptic" target="_blank">
|
|
<img data-src="../pics/twitter.png" style="height:30px;margin:0px" />
|
|
@yarikoptic</a></small></td>
|
|
<td style="border: none;"><img style="height:100px;margin-right:10px" data-src="../pics/dartmouth-logo.png" />
|
|
<br></td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<small><a href="http://psychoinformatics.de" target="_blank">Center for Open Neuroscience</a>,
|
|
<br> Department of Psychological and Brain Sciences<br>
|
|
Dartmouth College
|
|
</small><br>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</div>
|
|
|
|
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Live coding + hands-on</h2>
|
|
<ul>
|
|
<li>
|
|
Live-demonstration of DataLad examples and workflows
|
|
</li>
|
|
<li>
|
|
Code along with copy-paste code snippets and hands-on exercises at
|
|
<a href="http://handbook.datalad.org/r.html?Yale" target="_blank">
|
|
handbook.datalad.org/r.html?Yale</a>
|
|
|
|
</li>
|
|
<li>Requirements:
|
|
<ul>
|
|
<li>
|
|
Most recent DataLad version for your OS (installation instructions at
|
|
<a href="https://handbook.datalad.org/en/latest/intro/installation.html" target="_blank">
|
|
handbook.datalad.org</a>)
|
|
</li>
|
|
<li>
|
|
For containerized analyses: DataLad extension <a href="http://handbook.datalad.org/en/latest/extension_pkgs.html#extensions-intro" target="_blank">
|
|
datalad-container</a> (available via pip) + <a href="https://sylabs.io/guides/3.6/user-guide/" target="_blank">
|
|
Singularity</a> or <a href="https://www.docker.com/get-started" target="_blank"> Docker</a>
|
|
</li>
|
|
</ul></li>
|
|
</ul>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section>
|
|
<h2> <img src="../pics/datalad_logo_wide.svg"></h2>
|
|
<ul>
|
|
<li>A command-line tool, available for all major operating systems
|
|
(Linux, macOS/OSX, Windows), MIT-licensed</li>
|
|
<li>Built on top of <a href="https://git-scm.com/" target="_blank">Git</a>
|
|
and <a href="https://git-annex.branchable.com/" target="_blank">Git-annex</a></li>
|
|
<dt><li>Allows...</li></dt>
|
|
<dt>... version-controlling arbitrarily large content </dt>
|
|
<dd>Version-control data and software alongside code!</dd>
|
|
<dt>... transport mechanisms for sharing and obtaining data </dt>
|
|
<dd>consume and collaborate on data (analyses) like software</dd>
|
|
<dt>... (computationally) reproducible data analysis</dt>
|
|
<dd>Track and share provenance of all digital objects</dd>
|
|
<dt>... and <i>much</i> more </dt>
|
|
<li>Completely domain-agnostic</li>
|
|
<br>
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h3>
|
|
Examples of what DataLad can be used for:
|
|
</h3>
|
|
<ul>
|
|
<li class="fragment fade-in-then-semi-out"> <b>Publish or consume datasets</b> via GitHub, GitLab, OSF, or similar services</li>
|
|
<img height="850" class="fragment fade-in" src="../pics/get_hcpdata.gif" alt="a screenrecording of cloning studyforrest data from github">
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h3>
|
|
Examples of what DataLad can be used for:
|
|
</h3>
|
|
<ul>
|
|
<li class="fragment fade-in-then-semi-out">
|
|
Behind-the-scenes <b>infrastructure component for data transport and versioning</b>
|
|
(e.g., used by <a href="https://openneuro.org/" target="_blank"> OpenNeuro</a>,
|
|
<a href="https://brainlife.io/" target="_blank"> brainlife.io </a>,
|
|
the <a href="https://conp.ca/" target="_blank">Canadian Open Neuroscience Platform (CONP)</a>,
|
|
<a href="https://mcin.ca/technology/cbrain/" target="_blank"> CBRAIN</a>)</li>
|
|
<img height="850" class="fragment fade-in" src="../pics/openneuro2.gif" alt="a screenrecording of browsing open neuro">
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h3>
|
|
Examples of what DataLad can be used for:
|
|
</h3>
|
|
<ul>
|
|
<li class="fragment fade-in-then-semi-out"> <b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance </li>
|
|
<img height="850" class="fragment fade-in" src="../pics/shareresearch2.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
|
|
</ul>
|
|
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h3>
|
|
Examples of what DataLad can be used for:
|
|
</h3>
|
|
<ul>
|
|
<li> <b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance </li>
|
|
<img height="850" class="fragment fade-in" src="../pics/openscience.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
|
|
</ul>
|
|
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h3>
|
|
Examples of what DataLad can be used for:
|
|
</h3>
|
|
<ul>
|
|
<li class="fragment fade-in-then-semi-out"><b>Central data management</b> and archival system</li>
|
|
<img height="850" class="fragment fade-in" src="../pics/centralmanagement.gif">
|
|
</ul>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Prerequisites: Installation and Configuration</h2>
|
|
<div class="fragment fade-in">
|
|
<li>Your installed version of DataLad should be recent</li>
|
|
<pre><code>datalad --version
|
|
0.14.0</code></pre></div>
|
|
<div class="fragment fade-in">
|
|
<li>You should have a configured Git identity</li>
|
|
<pre><code class="bash">$ git config --list
|
|
user.name=Adina Wagner
|
|
user.email=adina.wagner@t-online.de
|
|
[...]
|
|
</code></pre></div>
|
|
<div class="fragment fade-in">Else, find installation and configuration
|
|
instructions at <a href="http://handbook.datalad.org/en/latest/intro/installation.html" target="_blank">
|
|
handbook.datalad.org</a> </div>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Using DataLad</h2>
|
|
|
|
<ul>
|
|
<div>
|
|
<li>DataLad can be used from the command line</li>
|
|
<pre><code>datalad create mydataset</code></pre></div>
|
|
<div>
|
|
<li>... or with its Python API</li>
|
|
<pre><code class="python">import datalad.api as dl
|
|
dl.create(path="mydataset")</code></pre></div>
|
|
<div class="fragment fade-in">
|
|
<li>... and other programming languages can use it via system call</li>
|
|
<pre><code class="r"># in R
|
|
> system("datalad create mydataset")
|
|
</code></pre></div>
|
|
</ul>
|
|
|
|
</section>
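<section data-markdown><script type="text/template">
## Using DataLad via system call

The system-call route shown for R works from any language. Below is a minimal sketch of the same idea in Python, assuming `datalad` is on the `PATH`. The helper name `datalad_cli` and its `dry_run` switch are illustrative and not part of DataLad.

```python
import shlex
import subprocess


def datalad_cli(*args, dry_run=True):
    """Build a `datalad` command line; execute it only if dry_run is False."""
    argv = ["datalad", *args]
    if not dry_run:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(argv, check=True)
    # Return the shell-quoted command so it can be logged or inspected.
    return shlex.join(argv)


print(datalad_cli("create", "mydataset"))
```

With the default `dry_run=True` nothing is executed, so the sketch can be tried without DataLad installed.
</script>
</section>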
|
|
|
|
|
|
<section>
|
|
<h2>DataLad Datasets</h2>
|
|
|
|
<ul>
|
|
<li>DataLad's core data structure</li>
|
|
<ul>
|
|
<li>Dataset = A directory managed by DataLad</li>
|
|
<li>Any directory on your computer can be managed by DataLad.</li>
|
|
<li class="fragment fade-in">Datasets can be <i>created</i> (from scratch) or <i>installed</i></li>
|
|
<li class="fragment fade-in">Datasets can be nested: <i>linked subdirectories</i></li>
|
|
</ul>
|
|
</ul>
|
|
|
|
<aside class="notes">
|
|
<li>anything can be managed: CV, website, music library, phd</li>
|
|
<li>show this on the manuscript repo: history, looks/feels</li>
|
|
</aside>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>DataLad Datasets</h2>
|
|
A DataLad dataset is a joint Git + git-annex repository
|
|
<img src="../pics/slides/pics/datalad_sandwhich_tuned/sandwhich03.svg">
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Version Control</h2>
|
|
|
|
<ul>
|
|
<li>DataLad knows two things: Datasets and files</li>
|
|
<img class="fragment fade-in" data-fragment-index="1" style="box-shadow: 5px 5px 3px #888888" src="../pics/artwork/src/dataset.svg" height="330"> <img style="box-shadow: 5px 5px 3px #888888" height="330" class="fragment fade-in" data-fragment-index="2" src="../pics/artwork/src/local_wf.svg">
|
|
</ul><br>
|
|
<li class="fragment fade-in">
|
|
Every file you put into a dataset can be easily version-controlled,
|
|
regardless of size, with the same command. </li>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Local version control</h2>
|
|
|
|
<p>Procedurally, version control is easy with DataLad!</p>
|
|
<img class="fragment fade-in" src="../pics/local_wf.svg" height="500"> <!-- .element: class="fragment" -->
|
|
<br>
|
|
|
|
<b class="fragment fade-in">Advice:</b>
|
|
<ul>
|
|
<li class="fragment fade-in">Save <i>meaningful</i> units of change</li>
|
|
<li class="fragment fade-in">Attach helpful commit messages</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-markdown><script type="text/template" >
|
|
|
|
### This means: You can also version control data! <!-- .element: class="fragment" -->
|
|
|
|
<pre><code class="bash" style="max-height:none">$ datalad save \
|
|
-m "Adding raw data from neuroimaging study 1" \
|
|
sub-*
|
|
add(ok): sub-1/anat/T1w.json (file)
|
|
add(ok): sub-1/anat/T1w.nii.gz (file)
|
|
add(ok): sub-1/anat/T2w.json (file)
|
|
add(ok): sub-1/anat/T2w.nii.gz (file)
|
|
add(ok): sub-1/func/sub-1-run-1_bold.json (file)
|
|
add(ok): sub-1/func/sub-1-run-1_bold.nii.gz (file)
|
|
add(ok): sub-10/anat/T1w.json (file)
|
|
add(ok): sub-10/anat/T1w.nii.gz (file)
|
|
add(ok): sub-10/anat/T2w.json (file)
|
|
add(ok): sub-10/anat/T2w.nii.gz (file)
|
|
[110 similar messages have been suppressed]
|
|
save(ok): . (dataset)
|
|
action summary:
|
|
add (ok: 120)
|
|
save (ok: 1)
|
|
</code></pre> <!-- .element: class="fragment" -->
|
|
|
|
</script>
|
|
</section>
|
|
|
|
<section data-markdown><script type="text/template" >
|
|
## Version Control
|
|
* Your dataset can be a complete research log, capturing everything that was done, when, by whom, and how
|
|

|
|
* Interact with the history:
|
|
* reset your dataset (or subset of it) to a previous state,
|
|
* throw out changes or bring them back,
|
|
* find out what was done when, how, why, and by whom
|
|
* Identify precise versions: Use data in the most recent version, or the one from 2018, or...
|
|
* ...
|
|
</script>
|
|
</section>
|
|
|
|
<section>
|
|
<h3>Summary - Local version control</h3>
|
|
|
|
<dl>
|
|
<dt class="fragment fade-in"><code>datalad create</code> creates an empty dataset.</dt> <dd class="fragment fade-in">Configurations (<b>-c yoda</b>, <b>-c text2git</b>) are useful (details soon).</dd>
|
|
<br>
|
|
<dt class="fragment fade-in">A dataset has a <i>history</i> to track files and their modifications. </dt><dd class="fragment fade-in">Explore it with Git (<b>git log</b>) or external tools (e.g., <b>tig</b>).</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>datalad save</code> records the dataset or file state to the history. </dt><dd class="fragment fade-in">Concise <b>commit messages</b> should summarize the change for future you and others.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>datalad download-url</code> obtains web content and records its origin. </dt><dd class="fragment fade-in">It even takes care of saving the change.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>datalad status</code> reports the current state of the dataset.</dt>
|
|
<dd class="fragment fade-in">A clean dataset status (no modifications, no untracked files) is good practice.</dd>
|
|
</dl>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-markdown><script type="text/template" >
|
|
## Consuming datasets
|
|
* A dataset can be created from scratch/existing directories:
|
|
<pre><code class="bash" style="max-height:none">$ datalad create mydataset
|
|
[INFO ] Creating a new annex repo at /home/adina/mydataset
|
|
create(ok): /home/adina/mydataset (dataset)
|
|
</code></pre>
|
|
* but datasets can also be installed from paths or from URLs:
|
|
<pre><code class="bash" style="max-height:none">$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess HCP
|
|
install(ok): /tmp/HCP (dataset)
|
|
</code></pre>
|
|
</script>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Plenty of data, but little disk-usage</h2>
|
|
<ul>
|
|
<li class="fragment fade-in-then-semi-out">Cloned datasets are lean.
|
|
"Meta data" (file names, availability) are present, but <b>no file content</b>:</li>
|
|
<pre class="fragment fade-in"><code>$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
|
|
install(ok): /tmp/studyforrest-data-phase2 (dataset)
|
|
$ cd studyforrest-data-phase2 && du -sh
|
|
18M .</code></pre>
|
|
|
|
<li class="fragment fade-in-then-semi-out"> File contents can be retrieved on demand:</li>
|
|
</ul>
|
|
<pre class="fragment fade-in"><code>$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
|
|
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]</code></pre>
|
|
|
|
<li class="fragment fade-in">Have access to more data on your computer than you have disk space:</li>
|
|
<pre class="fragment fade-in"><code># eNKI dataset (1.5TB, 34k files):
|
|
$ du -sh
|
|
1.5G .
|
|
# HCP dataset (80TB, 15 million files)
|
|
$ du -sh
|
|
48G .
|
|
</code></pre>
|
|
</section>
|
|
|
|
<section data-markdown> <script type="text/template">
|
|
## Plenty of data, but little disk-usage
|
|
|
|
Drop file content that is not needed:<!-- .element: class="fragment fade-in" -->
|
|
<pre class="fragment fade-in-then-semi-out"><code>$ datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
|
|
drop(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]</code></pre>
|
|
When files are dropped, only "meta data" stays behind, and they can be re-obtained on demand.
|
|
This allows disk-space aware computations: <!-- .element: class="fragment fade-in" -->
|
|
|
|
|
|
Install your input data <!-- .element: class="fragment fade-in" -->
|
|
*➡ get the data you need* <!-- .element: class="fragment fade-in" -->
|
|
*➡ compute your results* <!-- .element: class="fragment fade-in" -->
|
|
*➡ drop input data (and potentially all automatically re-computable results)* <!-- .element: class="fragment fade-in" -->
|
|
<pre><code class="python">import datalad.api as dl
dl.get('input/sub-01')
|
|
# [really complex analysis]
|
|
dl.drop('input/sub-01')
|
|
</code></pre><!-- .element: class="fragment fade-in" -->
|
|
</script></section>
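<section data-markdown><script type="text/template">
## Plenty of data, but little disk-usage

The get, compute, drop cycle can be wrapped so that input data is dropped even when the analysis fails. This is a minimal sketch, not a DataLad feature: the context manager `fetched` is a hypothetical helper, and the `get` and `drop` callables are stand-ins so the example runs without DataLad. In a real dataset one would pass `dl.get` and `dl.drop` from `import datalad.api as dl`.

```python
from contextlib import contextmanager


@contextmanager
def fetched(path, get, drop):
    """Retrieve `path` before the block runs and drop it afterwards."""
    get(path)
    try:
        yield path
    finally:
        # Free the disk space again, even if the analysis raised an error.
        drop(path)


# Stand-in callables that only record what would happen.
log = []
with fetched("input/sub-01",
             get=lambda p: log.append(("get", p)),
             drop=lambda p: log.append(("drop", p))) as data:
    log.append(("compute", data))

print(log)
```
</script>
</section>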
|
|
|
|
|
|
<section>
|
|
<h2>Prepare the next session...</h2>
|
|
|
|
Use your newly acquired DataLad skills to get ready for the next sessions.
|
|
Install data from datasets.datalad.org and retrieve it:
|
|
<pre><code class="bash">$ datalad clone ///adhd200/RawDataBIDS/Brown </code></pre>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Git versus Git-annex</h2>
|
|
<dl>
|
|
<dt>Data in datasets is either stored in Git or git-annex</dt>
|
|
<dd>By default, everything is <i>annexed</i>, i.e., stored in a dataset annex by git-annex</dd><br>
|
|
<img height="500" src="../pics/artwork/src/publishing/publishing_gitvsannex.svg">
|
|
<br><br>
|
|
<li class="fragment fade-in-then-semi-out">With annexed data, only content identity (hash)
|
|
and location information is put into Git, rather than file content.
|
|
The annex, and transport to and from it, is managed with <b>git-annex</b>.</li>
|
|
</dl>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Git versus Git-annex</h2>
|
|
Git and Git-annex handle files differently: annexed files are stored in an annex.
|
|
File content is hashed & only content-identity is committed to Git.
|
|
<ul>
|
|
<table>
|
|
<tr>
|
|
<td>
|
|
<li>Files stored in Git are modifiable, files stored in Git-annex are content-locked</li>
|
|
</td>
|
|
<td width="60%">
|
|
<img src="../pics/git_vs_gitannex.svg" height="500">
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
<li>Annexed contents are not available right after cloning,
|
|
only content identity and availability information (as they are stored in Git).
|
|
Everything that is annexed needs to be retrieved with <code>datalad get</code> from wherever it is stored.
|
|
</li>
|
|
</ul><br><br>
|
|
<small>Read
|
|
<a href="http://handbook.datalad.org/en/latest/basics/101-115-symlinks.html" target="_blank">
|
|
this handbook chapter</a> for details
|
|
</small>
|
|
</section>
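<section data-markdown><script type="text/template">
## Git versus Git-annex

What "content identity" means can be sketched with a hash over the file content. The function below builds a key in the style of the `MD5E-s7--...` names seen in annexed symlinks (backend, size, hash, extension). It is a simplified illustration: `md5e_key` is a hypothetical helper, and git-annex's own extension handling differs in detail.

```python
import hashlib
from pathlib import PurePath


def md5e_key(filename, content):
    """Sketch of an MD5E-style key: backend, content size, hash, extension."""
    digest = hashlib.md5(content).hexdigest()
    ext = PurePath(filename).suffix  # empty string if there is no extension
    return f"MD5E-s{len(content)}--{digest}{ext}"


key_a = md5e_key("myfile", b"1234567")
key_b = md5e_key("copy_of_myfile", b"1234567")

print(key_a)
# Identical content yields the identical key, whatever the file is called.
print(key_a == key_b)
```

Because only such a key and its availability information are committed to Git, the repository stays small no matter how large the file is.
</script>
</section>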
|
|
|
|
|
|
<section>
|
|
<h2>Git versus Git-annex</h2>
|
|
<ul>
|
|
Users can decide which files are annexed:
|
|
<br><br>
|
|
<li><b>Pre-made run-procedures</b>, provided by DataLad (e.g., <code>text2git</code>, <code>yoda</code>)
|
|
or created and shared by users
|
|
(<a href="http://handbook.datalad.org/en/latest/basics/101-124-procedures.html" target="_blank">Tutorial</a>) </li>
|
|
<li>Self-made configurations in <code>.gitattributes</code> (e.g., based on file type,
|
|
file/path name, size, ...; <a href="http://handbook.datalad.org/en/latest/basics/101-123-config2.html#gitattributes" target="_blank">
|
|
rules and examples
|
|
</a> )</li>
|
|
<li>Per-command basis (e.g., via <code>datalad save --to-git</code>)</li>
|
|
</ul>
|
|
</section>
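<section data-markdown><script type="text/template">
## Git versus Git-annex

Self-made rules in `.gitattributes` pair a path pattern with an `annex.largefiles` expression. The sketch below writes three such rules into a temporary directory standing in for a dataset root. The patterns and the helper `render_gitattributes` are illustrative assumptions; the first rule is similar in spirit to what `text2git` configures, and later lines take precedence over earlier ones.

```python
import tempfile
from pathlib import Path

# Path pattern -> git-annex "largefiles" expression (illustrative choices).
rules = {
    "*": "((mimeencoding=binary)and(largerthan=0))",  # annex binary files only
    "*.pdf": "anything",                              # always annex PDFs
    "code/*": "nothing",                              # keep code in Git
}


def render_gitattributes(rules):
    """Render one `pattern annex.largefiles=expression` line per rule."""
    return "".join(f"{pattern} annex.largefiles={expr}\n"
                   for pattern, expr in rules.items())


dataset = Path(tempfile.mkdtemp())  # stand-in for a dataset root
attributes = dataset / ".gitattributes"
attributes.write_text(render_gitattributes(rules))
print(attributes.read_text())
```

In a real dataset the file would be edited in place and the change recorded with `datalad save`.
</script>
</section>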
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<section data-transition="None">
|
|
<h2>Dataset nesting</h2>
|
|
|
|
<ul>
|
|
<li>Seamless nesting mechanisms:
|
|
<img height="330" src="../pics/artwork/src/linkage_subds.svg">
|
|
<br>
|
|
<li class="fragment fade-in" data-fragment-index="2">Overcomes scaling issues with large amounts of files</li>
|
|
<pre class="fragment fade-in" data-fragment-index="2"><code>adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
|
|
15530572 annex'd files (77.9 TB recorded total size)
|
|
nothing to save, working tree clean</code></pre>
|
|
<small><a class="fragment fade-in" data-fragment-index="2" href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">(github.com/datalad-datasets/human-connectome-project-openaccess)</a></small>
|
|
<li class="fragment fade-in">
|
|
Modularizes research components for transparency, reuse, and access
|
|
management
|
|
</li>
|
|
<li class="fragment fade-in">Create a light-weight, actionable link
|
|
to your input data or software environment</li>
|
|
</ul>
|
|
|
|
|
|
|
|
<aside class="notes">
|
|
Two advantages:
|
|
<ul>
|
|
<li>Scalable, size-independent version control</li>
|
|
<li>Modularization of research components to increase transparency
|
|
and aid component reuse, as individual components can be flexibly
|
|
puzzled together into new research objects, while being uniquely identified and versioned</li>
|
|
</ul>
|
|
|
|
At this point: Fixed data management, laid a foundation for updating data
|
|
</aside>
|
|
</section>
|
|
|
|
|
|
|
|
<section>
|
|
<h2>Dataset nesting</h2>
|
|
<img src="../pics/linkage.svg" height="500">
|
|
</section>
|
|
|
|
<section>
|
|
<h2>DataLad: Dataset linkage</h2>
|
|
<img data-src="../pics/linkage.svg" height="300">
|
|
<pre><code class="bash" style="font-size:115%;max-height:none">$ datalad clone --dataset . http://example.com/importantds inputs/rawdata
|
|
</code></pre>
|
|
|
|
<pre><code class="diff" style="max-height:none">$ git diff HEAD~1
|
|
diff --git a/.gitmodules b/.gitmodules
|
|
new file mode 100644
|
|
index 0000000..c3370ba
|
|
--- /dev/null
|
|
+++ b/.gitmodules
|
|
@@ -0,0 +1,3 @@
|
|
+[submodule "inputs/rawdata"]
|
|
+ path = inputs/rawdata
|
|
+ url = http://example.com/importantds
|
|
diff --git a/inputs/rawdata b/inputs/rawdata
|
|
new file mode 160000
|
|
index 0000000..fabf852
|
|
--- /dev/null
|
|
+++ b/inputs/rawdata
|
|
@@ -0,0 +1 @@
|
|
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
|
|
</code></pre>
|
|
<aside class="notes">weighs just a few bytes</aside>
|
|
</section>
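<section data-markdown><script type="text/template">
## DataLad: Dataset linkage

The `.gitmodules` entry in the diff is plain text, so the registered subdatasets can be listed with a few lines of standard-library Python. This is a minimal sketch for illustration: `registered_subdatasets` is a hypothetical helper, and the file content is copied from the example diff rather than read from a real dataset.

```python
import configparser

GITMODULES = """\
[submodule "inputs/rawdata"]
    path = inputs/rawdata
    url = http://example.com/importantds
"""


def registered_subdatasets(text):
    """Map each registered subdataset path to its source URL."""
    parser = configparser.ConfigParser(interpolation=None)
    # Strip indentation so option lines are not read as continuation lines.
    parser.read_string("\n".join(line.strip() for line in text.splitlines()))
    return {parser[name]["path"]: parser[name]["url"]
            for name in parser.sections()}


print(registered_subdatasets(GITMODULES))
```

The second half of the link, the exact subdataset commit, is not in this file; the superdataset records it separately as the "Subproject commit" shown in the diff.
</script>
</section>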
|
|
|
|
<section>
|
|
<h3>Summary - Dataset consumption & nesting</h3>
|
|
|
|
<ul>
|
|
<dt class="fragment fade-in"><code>datalad clone</code> installs a dataset.</dt><dd class="fragment fade-in"> It can be installed “on its own”:
|
|
Specify the source (url, path, ...) of the dataset, and an optional <b>path</b> for it to be installed to.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in">Datasets can be installed as subdatasets within an existing dataset. </dt> <dd class="fragment fade-in"> The <b>--dataset/-d</b> option needs a path to the root of the superdataset.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in">Only small files and metadata about file availability are present locally after an install. </dt>
|
|
<dd class="fragment fade-in">To retrieve actual file content of annexed files,
|
|
<code>datalad get </code> downloads file content on demand.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in">Datasets preserve their history.</dt> <dd class="fragment fade-in">The superdataset records only the <i>version state</i> of the subdataset.</dd>
|
|
|
|
</ul>
|
|
</section>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<section data-transition="fade">
|
|
<h2>Reproducible data analysis</h2>
|
|
Your past self is the worst collaborator:
|
|
<img src="../pics/ownlegacycode_phd.png" height="500">
|
|
<imgcredit>Full comic at <a href="http://phdcomics.com/comics.php?f=1689">http://phdcomics.com/comics.php?f=1979</a></imgcredit>
|
|
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Basic organizational principles for datasets</h2>
|
|
<dl>
|
|
<dt>Keep everything clean and modular</dt>
|
|
<li>An analysis is a superdataset, its components are subdatasets, and its structure modular</li>
|
|
<table>
|
|
<tr>
|
|
<td><img src="../pics/dataset_modules.png" height="400"></td>
|
|
<td><pre><code class="bash" style="max-height:none">├── code/
|
|
│ ├── tests/
|
|
│ └── myscript.py
|
|
├── docs
|
|
│ ├── build/
|
|
│ └── source/
|
|
├── envs
|
|
│ └── Singularity
|
|
├── inputs/
|
|
│ └─── data/
|
|
│ ├── dataset1/
|
|
│ │ └── datafile_a
|
|
│ └── dataset2/
|
|
│ └── datafile_a
|
|
├── outputs/
|
|
│ └── important_results/
|
|
│ └── figures/
|
|
└── README.md</code></pre></td>
|
|
</tr>
|
|
</table>
|
|
|
|
</dl>
|
|
<ul>
|
|
<li>Do not touch/modify raw data: save any results/computations <i>outside</i> of input datasets</li>
|
|
<li>Keep a superdataset self-contained: Scripts reference subdatasets or files with <i>relative paths</i></li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Basic organizational principles for datasets</h2>
|
|
<dl>
|
|
<dt>Record where you got it from, where it is now, and what you do to it</dt>
|
|
<li>Link datasets (as subdatasets), record data origin</li>
|
|
<li>Collect and store provenance of all contents of a dataset that you create</li>
|
|
<table style="vertical-align:middle">
|
|
<tr><img src="../pics/dataset_linkage_provenance.png"></tr>
|
|
</table>
|
|
<dl>
|
|
<dt>Document everything:</dt>
|
|
<li>Which script produced which output? From which data? In which software environment? ... </li>
|
|
</dl>
|
|
</dl>
|
|
<note>Find out more about organizational principles in
|
|
<a href="" target="_blank">the YODA principles</a>!</note>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Reproducible execution & provenance capture</h2>
|
|
|
|
<p>datalad run</p>
|
|
<img class="fragment fade-in" src="../pics/run_prov.svg" height="600"> <!-- .element: class="fragment" -->
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Computationally reproducible execution & provenance capture</h2>
|
|
<ul>
|
|
<li>Code may fail (to reproduce) if run with different software</li>
|
|
<li>Datasets can store (and share) software environments (Docker or Singularity containers)
|
|
and reproducibly execute code inside of the software container, capturing software as additional
|
|
provenance</li>
|
|
<li>DataLad extension: <code>datalad-container</code></li>
|
|
</ul>
|
|
|
|
<p>datalad containers-run</p>
|
|
<img class="fragment fade-in" src="../pics/containers-run.svg" height="600"> <!-- .element: class="fragment" -->
|
|
</section>
|
|
|
|
<section>
|
|
<h3>Summary - Reproducible execution</h3>
|
|
|
|
<ul>
|
|
<dt class="fragment fade-in"><code>datalad run</code> records a command and
|
|
its impact on the dataset.</dt>
|
|
<dd class="fragment fade-in">All dataset modifications are saved - use it
|
|
in a clean dataset.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in">Data/directories specified as <code>--input</code>
|
|
are retrieved prior to command execution.</dt>
|
|
<dd class="fragment fade-in"> Use one flag per input.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in">Data/directories specified as <code>--output</code>
|
|
will be unlocked for modifications prior to a rerun of the command. </dt>
|
|
<dd class="fragment fade-in">It's optional to specify, but helpful for recomputations.</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>datalad containers-run</code> can be used
|
|
to capture the software environment as provenance.</dt>
|
|
<dd class="fragment fade-in">It ensures computations are run in the desired software setup.
|
|
Supports Docker and Singularity containers</dd>
|
|
<br>
|
|
<dt class="fragment fade-in"><code>datalad rerun</code> can automatically re-execute run-records later.</dt>
|
|
<dd class="fragment fade-in">They can be identified with any commit-ish (hash, tag, range, ...)</dd>
|
|
|
|
</ul>
|
|
</section>
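<section data-markdown><script type="text/template">
## Assembling a datalad run call

The "one flag per input" rule is easy to get wrong when a command line is typed by hand. The sketch below assembles a `datalad run` call from lists of inputs and outputs. `build_run_command` is a hypothetical helper, and the paths are borrowed from the example directory layout; nothing is executed.

```python
import shlex


def build_run_command(command, inputs=(), outputs=(), message=None):
    """Assemble a `datalad run` call with one flag per input and output."""
    argv = ["datalad", "run"]
    if message:
        argv += ["-m", message]
    for path in inputs:
        argv += ["--input", path]
    for path in outputs:
        argv += ["--output", path]
    # The command to record is passed as the final argument.
    argv.append(command)
    return argv


argv = build_run_command(
    "python code/myscript.py",
    inputs=["inputs/data/dataset1", "inputs/data/dataset2"],
    outputs=["outputs/important_results"],
    message="Compute results",
)
print(shlex.join(argv))
```

The printed line can be pasted into a terminal inside a clean dataset, or the list can be handed to `subprocess.run`.
</script>
</section>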
|
|
|
|
</section>
|
|
|
|
<section>
|
|
<section data-transition="None">
|
|
<h2>Interoperability</h2>
|
|
<ul>
|
|
<li>DataLad is built to maximize interoperability and use with hosting and
|
|
storage technology</li>
|
|
</ul>
|
|
<img class="fragment fade-in" src="../pics/services_only.png" height="650">
|
|
</section>
|
|
|
|
|
|
<section data-transition="None">
|
|
<h2>Interoperability</h2>
|
|
<ul>
|
|
<li>DataLad is built to maximize interoperability and use with hosting and
|
|
storage technology</li>
|
|
</ul>
|
|
<img src="../pics/services_connected.png" height="650">
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Publishing datasets</h2>
|
|
I have a dataset on my computer. How can I share it, or collaborate on it?
|
|
<br>
|
|
General information: <a href="http://handbook.datalad.org/r.html?publish" target="_blank">
|
|
handbook.datalad.org/r.html?publish</a>
|
|
<img height="900" src="../pics/artwork/src/publishing/startingpoint.svg">
|
|
<ul class="fragment fade-in">
|
|
Today: <a href="http://handbook.datalad.org/r.html?GIN" target="_blank">
|
|
Publishing a dataset to Gin</a>
|
|
</ul>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>gin.g-node.org</h2>
|
|
|
|
<a href="https://gin.g-node.org/" target="_blank">Gin</a> is a free repository hosting service.
|
|
<img src="../pics/screenshot-gin1.png">
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Step-by-Step: Web interface</h2>
|
|
<ul>
|
|
<li>Log into Gin</li>
|
|
<li>Upload your SSH key</li>
|
|
<img src="../pics/screenshot-gin3.png" height="300">
|
|
<li>Create a new repository</li>
|
|
<img src="../pics/screenshot-gin4.png" height="750">
|
|
</ul>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Step-by-Step: Command line</h2>
|
|
<ul>
|
|
<li>Add the Gin repository</li>
|
|
<img src="../pics/screenshot-gin5.png">
|
|
<pre><code>$ datalad siblings add -d . \
|
|
--name gin \
|
|
--url git@gin.g-node.org:/adswa/bids-data.git</code></pre>
|
|
<li>Push your data</li>
|
|
<pre><code>$ datalad push --to gin
|
|
</code></pre>
|
|
</ul>
|
|
</section>
|
|
<section>
|
|
<ul>
|
|
<img src="../pics/screenshot-gin6.png">
|
|
<li>Share it with your collaborators</li>
|
|
<pre><code >$ datalad clone https://gin.g-node.org/adswa/bids-data
|
|
</code></pre>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>datalad rerun</h2>
|
|
<ul>
|
|
<li>
|
|
<code>datalad rerun</code> is helpful to spare others and yourself
|
|
the short- or long-term memory task, or the forensic skills to figure
|
|
out how you performed an analysis
|
|
</li>
|
|
<li>
|
|
But it is also a digital and machine-readable provenance record
|
|
</li>
|
|
<li>
|
|
Important: The better the run command is specified, the better the
|
|
provenance record
|
|
</li>
|
|
<li>
|
|
Note: run and rerun only create an entry in the history if the command execution
|
|
leads to a change.
|
|
</li>
|
|
</ul>
|
|
</section>
|
|
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<section data-transition="None">
|
|
<h2>Why use DataLad?</h2>
|
|
<ul>
|
|
<li class="fragment fade-in">Mistakes are not forever anymore: Easy version control, regardless of file size</li>
|
|
<li class="fragment fade-in">Who needs short-term memory when you can have run-records?</li>
|
|
<li class="fragment fade-in">Disk-usage magic: Have access to more data than your hard drive has space</li>
|
|
<li class="fragment fade-in">Collaboration and updating mechanisms: Alice shares her data with Bob. Alice fixes a mistake and pushes the fix.
|
|
Bob says "datalad update" and gets her changes. And vice-versa.</li>
|
|
<li class="fragment fade-in">Transparency: Shared datasets keep their history. No need to track down a former student,
|
|
ask their project what was done.</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Acknowledgements</h2>
|
|
<table>
|
|
<tr style="vertical-align:middle">
|
|
<td style="vertical-align:middle">
|
|
<dl>
|
|
<dt>Software</dt>
|
|
<dd style="margin-left:5px!important">
|
|
<ul style="margin-left:5px!important">
|
|
<li>Michael Hanke (INM-7)</li>
|
|
<li>Yaroslav Halchenko</li>
|
|
<li>Joey Hess (git-annex)</li>
|
|
<li>Kyle Meyer</li>
|
|
<li>Benjamin Poldrack (INM-7)</li>
|
|
<li><em>26 additional contributors</em></li>
|
|
</ul>
|
|
</dd>
|
|
<dt style="margin-top:20px">Documentation project </dt>
|
|
<dd style="margin-left:5px!important">
|
|
<ul style="margin-left:5px!important">
|
|
<li>Michael Hanke (INM-7)</li>
|
|
<li>Laura Waite (INM-7)</li>
|
|
<li><em>28 additional contributors</em></li>
|
|
</ul>
|
|
</dd>
|
|
</dl>
|
|
</td>
|
|
<td style="vertical-align:middle">
|
|
<div style="margin-bottom:-20px;text-align:center"><strong>Funders</strong></div>
|
|
<img style="height:150px;margin-right:50px" data-src="../pics/nsf.png" />
|
|
<img style="height:150px;margin-right:50px;margin-left:50px" data-src="../pics/binc.png" />
|
|
<img style="height:150px;margin-left:50px" data-src="../pics/bmbf.png" />
|
|
<br />
|
|
<img style="height:80px;margin-top:-40px;margin-left:auto;margin-right:auto;width:100%" data-src="../pics/fzj_logo.svg" />
|
|
<div style="margin-top:-20px">
|
|
<img style="height:60px;margin-right:20px" data-src="../pics/erdf.png" />
|
|
<img style="height:60px;margin-right:20px" data-src="../pics/cbbs_logo.png" />
|
|
<img style="height:60px" data-src="../pics/LSA-Logo.png" />
|
|
</div>
|
|
<div style="margin-top:40px;margin-bottom:20px;text-align:center"><strong>Collaborators</strong></div>
|
|
<div style="margin-top:-20px">
|
|
<img style="height:100px;margin:20px" data-src="../pics/hbp_logo.png" />
|
|
<img style="height:100px;margin:20px" data-src="../pics/conp_logo.png" />
|
|
<img style="height:100px;margin:20px" data-src="../pics/vbc_logo.png" />
|
|
</div>
|
|
<div style="margin-top:-40px">
|
|
<img style="height:120px;margin:20px" data-src="../pics/openneuro_logo.png" />
|
|
<img style="height:120px;margin:20px" data-src="../pics/cbrain_logo.png" />
|
|
<img style="height:140px;margin:20px" data-src="../pics/brainlife_logo.png" />
|
|
</div>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Unlocking things</h2>
|
|
<ul>
|
|
<li><code>datalad run</code> "unlocks" everything specified as <code>--output</code></li>
|
|
<li class="fragment fade-in" data-fragment-index="1">Outside of <code>datalad run</code>, you can use <code>datalad unlock</code></li>
|
|
<li class="fragment fade-in" data-fragment-index="1">This makes annexed files <i>writeable</i>:</li>
|
|
</ul>
|
|
<pre class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">$ ls -l myfile
|
|
lrwxrwxrwx 1 adina adina 108 Nov 17 07:08 myfile -> .git/annex/objects/22/Gw/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af
|
|
|
|
# unlocking
|
|
$ datalad unlock myfile
|
|
unlock(ok): myfile (file)
|
|
$ ls -l myfile
|
|
-rw-r--r-- 1 adina adina 7 Nov 17 07:08 myfile # not a symlink anymore!
|
|
</code></pre>
|
|
<ul>
|
|
<li class="fragment fade-in" data-fragment-index="2"><code>datalad save</code> "locks" the file again</li>
|
|
</ul>
|
|
<pre class="fragment fade-in" data-fragment-index="2"><code class="fragment fade-in" data-fragment-index="2">$ datalad save
|
|
add(ok): myfile (file)
|
|
action summary:
|
|
add (ok: 1)
|
|
save (notneeded: 1)
|
|
|
|
$ ls -l myfile
|
|
lrwxrwxrwx 1 adina adina 108 Nov 17 07:08 myfile -> .git/annex/objects/22/Gw/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af</code></pre>
|
|
<div class="fragment fade-in" data-fragment-index="3">Some tools (e.g., MATLAB) don't handle
symlinks well. Unlocking the files or running MATLAB with <code>datalad run</code> helps!</div>
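<div class="fragment fade-in" data-fragment-index="4">A minimal sketch for checking a file's state, based on the listings above: a locked annexed file is a symlink, an unlocked one is a regular file. The helper name <code>annex_state</code> is hypothetical and not part of DataLad or git-annex. It only inspects the file type, so a regular file tracked directly in Git is also reported as "unlocked or plain file".</div>
<pre class="fragment fade-in" data-fragment-index="4"><code># Hypothetical helper, not a DataLad or git-annex command
annex_state() {
    # test -L first: -f follows symlinks and would also match a locked file
    if [ -L "$1" ]; then
        echo "locked (symlink)"
    elif [ -f "$1" ]; then
        echo "unlocked or plain file"
    else
        echo "missing"
    fi
}

# usage
annex_state myfile
</code></pre>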
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Removing datasets</h2>
|
|
<ul>
|
|
<li>
|
|
As mentioned before, annexed data is write-protected.
|
|
So when you try to <code>rm -rf</code> a dataset, this happens:
|
|
</li>
|
|
</ul>
|
|
<pre class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">$ rm -rf mydataset
|
|
rm: cannot remove 'mydataset/.git/annex/objects/70/GM/MD5E-s27246--8b7ea027f6db1cda7af496e97d4eb7c9.png/MD5E-s27246--8b7ea027f6db1cda7af496e97d4eb7c9.png': Permission denied
|
|
rm: cannot remove 'mydataset/.git/annex/objects/70/GM/MD5E-s35756--af496e97d4eb7c98b7ea027f6db1cda7.png/MD5E-s35756--af496e97d4eb7c98b7ea027f6db1cda7.png': Permission denied
|
|
[...]
|
|
</code></pre>
|
|
😱
|
|
|
|
<li class="fragment fade-in">
|
|
(If you ever do this accidentally, you need to restore write permissions recursively
for all files before you can delete them.)
|
|
<pre><code>$ chmod -R +w mydataset
|
|
$ rm -rf mydataset # success!
|
|
</code></pre>
|
|
</li>
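<div class="fragment fade-in">The two commands above can be wrapped in a small shell function. This is only a sketch: the name <code>force_remove_dir</code> is hypothetical and not part of DataLad, and it uses <code>u+w</code> (owner only) rather than the plain <code>+w</code> shown above. It deletes the directory without any safety checks on annexed content, so the data is gone afterwards.</div>
<pre class="fragment fade-in"><code># Hypothetical helper, not a DataLad command
force_remove_dir() {
    target="$1"
    # refuse empty or non-directory arguments to avoid deleting the wrong thing
    if [ -z "$target" ] || [ ! -d "$target" ]; then
        echo "not a directory: $target"
        return 1
    fi
    # restore write permission for the owner, then delete everything
    chmod -R u+w "$target"
    rm -rf "$target"
}

# usage
force_remove_dir mydataset
</code></pre>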
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Removing datasets</h2>
|
|
<li>
|
|
The correct way to remove a dataset is using <code>datalad remove</code>:
|
|
</li>
|
|
<pre><code>$ datalad remove -d ds001241
|
|
remove(ok): . (dataset)
|
|
action summary:
|
|
drop (notneeded: 1)
|
|
remove (ok: 1)
|
|
</code></pre>
|
|
<li class="fragment fade-in" data-fragment-index="2">
|
|
If a dataset contains files for which no other copy is known,
<code>datalad remove</code> refuses to drop them and reports an error:
|
|
</li>
|
|
<pre class="fragment fade-in" data-fragment-index="2"><code class="fragment fade-in" data-fragment-index="2">$ datalad remove -d mydataset
|
|
[WARNING] Running drop resulted in stderr output: git-annex: drop: 1 failed
|
|
|
|
[ERROR ] unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.) [drop(/tmp/mydataset/interdisciplinary.png)]
|
|
drop(error): interdisciplinary.png (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.)]
|
|
[WARNING] could not drop some content in /tmp/mydataset ['/tmp/mydataset/interdisciplinary.png'] [drop(/tmp/mydataset)]
|
|
drop(impossible): . (directory) [could not drop some content in /tmp/mydataset ['/tmp/mydataset/interdisciplinary.png']]
|
|
action summary:
|
|
drop (error: 1, impossible: 1)</code></pre>
|
|
<li class="fragment fade-in" data-fragment-index="3">
|
|
In that case, use <code>--nocheck</code> to force removal:
|
|
</li>
|
|
<pre class="fragment fade-in" data-fragment-index="3"><code class="fragment fade-in" data-fragment-index="3">$ datalad remove -d mydataset --nocheck
|
|
remove(ok): . (dataset)
|
|
</code></pre>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Removing datasets</h2>
|
|
<li>
|
|
If a dataset contains subdatasets, <code>datalad remove</code> will also fail with an error:
|
|
</li>
|
|
<pre class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">$ datalad remove -d myds
|
|
drop(ok): README.md (file) [locking gin...]
|
|
drop(ok): . (directory)
|
|
[ERROR ] to be uninstalled dataset Dataset(/tmp/myds) has present subdatasets, forgot --recursive? [remove(/tmp/myds)]
|
|
remove(error): . (dataset) [to be uninstalled dataset Dataset(/tmp/myds) has present subdatasets, forgot --recursive?]
|
|
action summary:
|
|
drop (ok: 3)
|
|
remove (error: 1)</code></pre>
|
|
<li class="fragment fade-in" data-fragment-index="2">
|
|
In that case, use <code>--recursive</code> to remove all subdatasets, too:
|
|
</li>
|
|
<pre class="fragment fade-in" data-fragment-index="2"><code class="fragment fade-in" data-fragment-index="2">$ datalad remove -d myds --recursive
|
|
uninstall(ok): input (dataset)
|
|
remove(ok): . (dataset)
|
|
action summary:
|
|
drop (notneeded: 2)
|
|
remove (ok: 1)
|
|
uninstall (ok: 1)
|
|
</code></pre>
|
|
<li class="fragment fade-in">
|
|
A complete overview of file system operations is in
|
|
<a href="http://handbook.datalad.org/en/latest/basics/101-136-filesystem.html" target="_blank">
|
|
handbook.datalad.org/en/latest/basics/101-136-filesystem.html
|
|
</a>
|
|
</li>
|
|
</section>
|
|
|
|
</section>
|
|
|
|
|
|
</div>
|
|
</div>
|
|
|
|
<script src="../reveal.js/dist/reveal.js"></script>
|
|
<script src="../reveal.js/plugin/notes/notes.js"></script>
|
|
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
|
|
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
|
|
<script>
|
|
// More info about initialization & config:
|
|
// - https://revealjs.com/initialization/
|
|
// - https://revealjs.com/config/
|
|
Reveal.initialize({
|
|
hash: true,
|
|
// The "normal" size of the presentation, aspect ratio will be preserved
|
|
// when the presentation is scaled to fit different resolutions. Can be
|
|
// specified using percentage units.
|
|
width: 1280,
|
|
height: 960,
|
|
// Factor of the display size that should remain empty around the content
|
|
margin: 0.3,
|
|
// Bounds for smallest/largest possible scale to apply to content
|
|
minScale: 0.2,
|
|
maxScale: 1.0,
|
|
|
|
controls: true,
|
|
progress: true,
|
|
history: true,
|
|
center: true,
|
|
slideNumber: 'c',
|
|
pdfSeparateFragments: false,
|
|
pdfMaxPagesPerSlide: 1,
|
|
pdfPageHeightOffset: -1,
|
|
transition: 'slide', // none/fade/slide/convex/concave/zoom
|
|
// Learn about plugins: https://revealjs.com/plugins/
|
|
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
|
|
});
|
|
</script>
|
|
</body>
|
|
</html>
|