<!doctype html>
|
|
<html>
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
|
|
|
|
<!-- Edit me start! -->
|
|
<title>An Introduction to DataLad</title>
<meta name="description" content="OHBM Brainhack Traintrack: An Introduction to DataLad">
<meta name="author" content="Adina Wagner">
|
|
<!-- Edit me end! -->
|
|
|
|
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
|
|
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
|
|
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
|
|
|
|
<!-- Theme used for syntax highlighted code -->
|
|
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
|
|
</head>
|
|
<body>
|
|
<div class="reveal">
|
|
<div class="slides">
|
|
|
|
<section>
|
|
<section>
|
|
<h2><small>OHBM Brainhack Traintrack</small><br>An Introduction to DataLad</h2>
|
|
|
|
<div style="margin-top:1em;text-align:center">
|
|
<table style="border: none;">
|
|
<tr>
|
|
<td>Adina Wagner
|
|
<br><small>
|
|
<a href="https://twitter.com/AdinaKrik" target="_blank">
|
|
<img data-src="../pics/twitter.png" style="height:30px;margin:0px" />
|
|
@AdinaKrik</a></small></td>
|
|
<td><img style="height:100px;margin-right:10px" data-src="../pics/fzj_logo.svg" />
|
|
<br></td>
|
|
</tr>
|
|
<tr>
|
|
<td>
|
|
<small><a href="http://psychoinformatics.de" target="_blank">Psychoinformatics lab</a>,
|
|
<br> Institute of Neuroscience and
|
|
Medicine, Brain & Behavior (INM-7)<br>
|
|
Research Center Jülich</small><br>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</div>
|
|
<br><br><small>
|
|
Slides: <a
|
|
href="https://github.com/datalad-handbook/course/blob/master/talks/PDFs/OHBM.pdf" target="_blank">
|
|
https://github.com/datalad-handbook/course/</a></small>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Learn all about DataLad at handbook.datalad.org</h2>
|
|
<img src="../pics/handbook_frontpage_new.png"> <br>
|
|
</section>
|
|
</section>
|
|
<!--...INTRODUCTION...-->
|
|
|
|
<section>
|
|
<section>
|
|
<h2> <img src="../pics/datalad_logo_wide.svg"> in brief</h2>
|
|
<ul>
|
|
<li>A command-line tool with a Python API</li>
<li>Built on top of <a href="https://git-scm.com/" target="_blank">Git</a>
and <a href="https://git-annex.branchable.com/" target="_blank">git-annex</a></li>
<li>Allows...
<dl>
<dd>... version-controlling arbitrarily large content,</dd>
<dd>... easily sharing and obtaining data (note: no data hosting!),</dd>
<dd>... (computationally) reproducible data analysis,</dd>
<dd>... and <i>much</i> more</dd>
</dl>
</li>
<li>Completely domain-agnostic</li>
<li>Available for all major operating systems (Linux, macOS/OSX, Windows)</li>
|
|
<br>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Step 1: Install DataLad</h2>
|
|
<img src="../pics/installhandbook.png">
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Step 2: Configure your Git identity</h2>
|
|
<pre><code class="bash" style="max-height:none">git config --global --add user.name "Firstname Lastname"
git config --global --add user.email "some@email.com"
</code></pre>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Let's start!</h2>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Follow along!</h2>
|
|
|
|
<img src="../pics/handbook_frontpage_new.png"> <br>
|
|
Code to follow along:
|
|
<a href="http://handbook.datalad.org/en/latest/code_from_chapters/OHBM.html" target="_blank">
|
|
http://handbook.datalad.org/en/latest/code_from_chapters/OHBM.html
|
|
</a>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>DataLad Datasets</h2>
|
|
|
|
<ul>
|
|
<li>DataLad's core data structure
<ul>
<li>Dataset = A directory managed by DataLad</li>
<li>Any directory on your computer can be managed by DataLad.</li>
<li>Datasets can be <i>created</i> (from scratch) or <i>installed</i></li>
<li>Datasets can be nested: <i>linked subdirectories</i></li>
</ul>
</li>
|
|
</ul>
|
|
|
|
<aside class="notes">
|
|
<ul>
<li>Anything can be managed: CV, website, music library, PhD</li>
<li>Show this on the manuscript repo: history, looks/feels</li>
</ul>
|
|
</aside>
|
|
</section>
|
|
|
|
<section>
|
|
<h1>Local version control</h1>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Local version control</h2>
|
|
|
|
<p>Procedurally, version control is easy with DataLad!</p>
|
|
<img src="../pics/local_wf.svg" height="500"> <!-- .element: class="fragment" -->
|
|
<br>
|
|
|
|
<b>Advice:</b>
|
|
<ul>
|
|
<li>Save <i>meaningful</i> units of change</li>
|
|
<li>Attach helpful commit messages</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h3>Summary - Local version control</h3>
|
|
|
|
<dl>
|
|
<dt> <code>datalad create</code> creates an empty dataset.</dt> <dd>Configurations (<b>-c yoda</b>, <b>-c text2git</b>) are useful.</dd>
|
|
<br>
|
|
<dt>A dataset has a <i>history</i> to track files and their modifications. </dt><dd >Explore it with Git (<b>git log</b>) or external tools (e.g., <b>tig</b>).</dd>
|
|
<br>
|
|
<dt><code>datalad save</code> records the dataset or file state to the history. </dt><dd >Concise <b>commit messages</b> should summarize the change for future you and others.</dd>
|
|
<br>
|
|
<dt><code>datalad status</code> reports the current state of the dataset.</dt> <dd>A clean dataset status is good practice.</dd>
|
|
</dl>
|
|
</section>
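<section data-markdown><script type="text/template">
### Sketch: local version control on the command line

The commands summarized above can be tried in a scratch directory. This is a minimal sketch, not output from the talk: it assumes DataLad is installed and a Git identity is configured (Steps 1 and 2), and `my-dataset` and `notes.txt` are placeholder names.

```shell
# Work in a throwaway directory so nothing existing is touched
scratch=$(mktemp -d) && cd "$scratch"

# Create an empty dataset; -c text2git keeps text files in Git instead of git-annex
datalad create -c text2git my-dataset
cd my-dataset

# Add a file and check the dataset state (the new file is reported as untracked)
echo "First notes on the project" > notes.txt
datalad status

# Record this unit of change with a helpful commit message
datalad save -m "Add first project notes"

# Check the state again and look at the history
datalad status
git log --oneline
```
</script>
</section>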
|
|
|
|
|
|
<section data-markdown><script type="text/template">
## From here <span data-fragment-index="1" style="margin-left:350px">to this:</span>

<!-- .element: height="780" style="box-shadow: 10px 10px 8px #888888" -->
<!-- .element: data-fragment-index="1" height="780" style="box-shadow: 10px 10px 8px #888888" -->
<imgcredit>www.phdcomics.com; www.linode.com</imgcredit>

<aside class="notes">
Note to self
</aside>
</script>
</section>
|
|
|
|
<section>
|
|
<h1>Consuming datasets and dataset nesting</h1>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Consuming datasets</h2>
|
|
<img src="../pics/virtual_dstree_dl101.svg" height="600">
|
|
<ul>
|
|
<li>Datasets are lightweight: Upon installation, only small files and metadata about file availability are retrieved.</li>
|
|
<li>Content can be obtained on demand via <code>datalad get</code>.</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<img src="../pics/HCP.png">
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Dataset nesting</h2>
|
|
<img src="../pics/linkage.svg" height="500">
|
|
</section>
|
|
|
|
<section>
|
|
<h3>Summary - Dataset consumption & nesting</h3>
|
|
|
|
<dl>
<dt><code>datalad clone</code> installs a dataset.</dt><dd>It can be installed “on its own”: Specify the source (URL, path, ...) of the dataset, and an optional <b>path</b> for it to be installed to.</dd>
<br>
<dt>Datasets can be installed as subdatasets within an existing dataset.</dt> <dd>The <b>--dataset/-d</b> option needs a path to the root of the superdataset.</dd>
<br>
<dt>Only small files and metadata about file availability are present locally after an install.</dt><dd><code>datalad get</code> downloads the content of larger files on demand.</dd>
<br>
<dt>Content can be dropped to save disk space with <code>datalad drop</code>.</dt><dd>Do this only if content can be easily reobtained.</dd>
<br>
<dt>Datasets preserve their history.</dt> <dd>In nested datasets, the superdataset records only the <i>version state</i> of the subdataset.</dd>
</dl>
|
|
</section>
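<section data-markdown><script type="text/template">
### Sketch: consuming and nesting datasets

The summary above can be tried offline by cloning from a local path instead of a URL. This is a minimal sketch under stated assumptions: DataLad is installed, a Git identity is configured, and all dataset and file names (`source-ds`, `analysis`, `data.dat`, ...) are placeholders invented for illustration.

```shell
# Work in a throwaway directory
scratch=$(mktemp -d) && cd "$scratch"

# Stand-in for a published dataset: one annexed file, saved to its history
datalad create source-ds
echo "pretend this is a large file" > source-ds/data.dat
datalad save -d source-ds -m "Add example data"

# Install it "on its own": the source first, then an optional target path
datalad clone "$scratch/source-ds" standalone-clone

# Install it as a subdataset: -d points to the root of the superdataset
datalad create analysis
cd analysis
datalad clone -d . "$scratch/source-ds" inputs/data

# File content is retrieved on demand ...
datalad get inputs/data/data.dat
# ... and can be dropped again, because the source still holds a copy
datalad drop inputs/data/data.dat

# List the subdatasets registered in the superdataset
datalad subdatasets
```
</script>
</section>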
|
|
|
|
<section>
|
|
<h2>Example: reproducible research objects</h2>
|
|
<img src="../pics/remodnavrepo.png"> <br>
|
|
Find this repo at <a href="https://github.com/psychoinformatics-de/paper-remodnav" target="_blank">github.com/psychoinformatics-de/paper-remodnav</a><br>
|
|
Read all about it at <a href="http://handbook.datalad.org/en/latest/usecases/reproducible-paper.html" target="_blank">handbook.datalad.org/en/latest/usecases/reproducible-paper.html</a>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Advantages of nesting</h2>
|
|
<ul>
|
|
<li>A modular structure makes individual components (with their respective provenance) reusable.</li>
|
|
<li>Nesting can flexibly link all components and allows recursive operations across dataset boundaries</li>
|
|
<li>Read all about this in the <a href="http://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">chapter on YODA principles</a></li>
|
|
</ul>
|
|
<img src="../pics/linkage_subds.png" height="400">
|
|
</section>
|
|
|
|
<section data-transition="fade">
|
|
<h2>Reproducible data analysis</h2>
|
|
<img src="../pics/ownlegacycode_phd.png" height="500">
|
|
<imgcredit>Full comic at <a href="http://phdcomics.com/comics.php?f=1689">http://phdcomics.com/comics.php?f=1979</a></imgcredit>
|
|
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Basic organizational principles for datasets</h2>
|
|
Read all about this in the <a href="http://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">chapter on YODA principles</a>
|
|
|
|
<dl>
|
|
<dt>Keep everything clean and modular</dt>
|
|
<table>
|
|
<tr>
|
|
<td><img src="../pics/dataset_modules.png" height="400"></td>
|
|
<td><pre><code class="bash" style="max-height:none">├── code/
│   ├── tests/
│   └── myscript.py
├── docs
│   ├── build/
│   └── source/
├── envs
│   └── Singularity
├── inputs/
│   └── data/
│       ├── dataset1/
│       │   └── datafile_a
│       └── dataset2/
│           └── datafile_a
├── outputs/
│   └── important_results/
│       └── figures/
└── README.md</code></pre></td>
|
|
</tr>
|
|
</table>
|
|
|
|
</dl>
|
|
<ul>
|
|
<li>Do not touch or modify raw data: save any results/computations <i>outside</i> of input datasets</li>
|
|
<li>Keep a superdataset self-contained: Scripts reference subdatasets or files with <i>relative paths</i></li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Basic organizational principles for datasets</h2>
|
|
<dl>
|
|
<dt>Record where you got it from, where it is now, and what you do to it</dt>
|
|
<li>Link datasets (as subdatasets), record data origin</li>
|
|
<li>Collect and store provenance of all contents of a dataset that you create</li>
|
|
<table style="vertical-align:middle">
<tr><td><img src="../pics/dataset_linkage_provenance.png"></td></tr>
|
|
</table>
|
|
<li>Record command execution: Which script produced which output? From which data? In which software environment? ... </li>
|
|
|
|
</dl>
|
|
|
|
</section>
|
|
|
|
<section>
|
|
<h2>A classification analysis on the iris flower dataset</h2>
|
|
<img src="../pics/iris-machinelearning.png" height="300">
|
|
<img src="../pics/iris_cluster.png" height="450">
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Reproducible execution & provenance capture</h2>
|
|
|
|
<p><code>datalad run</code></p>
|
|
<img src="../pics/run_prov.svg" height="600"> <!-- .element: class="fragment" -->
|
|
</section>
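<section data-markdown><script type="text/template">
### Sketch: recording a command with datalad run

A toy illustration of provenance capture with `datalad run`. This is a sketch under stated assumptions, not the analysis from the talk: DataLad is installed, a Git identity is configured, and the dataset `run-demo` and the files `numbers.txt` and `sorted.txt` are placeholders.

```shell
# Work in a throwaway dataset
scratch=$(mktemp -d) && cd "$scratch"
datalad create -c text2git run-demo
cd run-demo

# Save some input data first: datalad run expects a clean dataset
printf '3\n1\n2\n' > numbers.txt
datalad save -m "Add input numbers"

# Execute a command and record it together with its inputs and outputs
datalad run -m "Sort the numbers" \
  --input numbers.txt \
  --output sorted.txt \
  "sort -n numbers.txt > sorted.txt"

# The run record is stored in the commit message of the resulting commit
git log -1

# The recorded command can be executed again from that record
datalad rerun
```
</script>
</section>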
|
|
|
|
|
|
<section>
|
|
<h2>Computational reproducibility</h2>
|
|
<ul>
|
|
<li>Code may produce different results or fail with different software</li>
|
|
<li>Datasets can store & share software environments and execute code inside a software container</li>
|
|
<li>DataLad extension: <code>datalad-container</code></li>
|
|
</ul>
|
|
|
|
<p><code>datalad containers-run</code></p>
|
|
<img src="../pics/containers-run.svg" height="600">
|
|
</section>
|
|
|
|
</section>
|
|
|
|
<!--...SUMMARY...-->
|
|
<section>
|
|
<section>
|
|
<h2>How to get started with DataLad</h2>
|
|
<dl>
|
|
<dt>Read <a href="https://handbook.datalad.org">the DataLad handbook</a></dt>
|
|
<dd>An interactive, hands-on crash-course (free and open source)</dd>
|
|
<dt>Check out or use public DataLad datasets, e.g., from OpenNeuro</dt>
|
|
<dd>
|
|
<pre><code style="max-height:none">$ datalad clone ///openneuro/ds000001
[INFO ] Cloning http://datasets.datalad.org/openneuro/ds000001 [1 other candidates] into '/tmp/ds000001'
[INFO ] access to 1 dataset sibling s3-PRIVATE not auto-enabled, enable with:
| datalad siblings -d "/tmp/ds000001" enable -s s3-PRIVATE
install(ok): /tmp/ds000001 (dataset)

$ cd ds000001
$ ls sub-01/*
sub-01/anat:
sub-01_inplaneT2.nii.gz sub-01_T1w.nii.gz

sub-01/func:
sub-01_task-balloonanalogrisktask_run-01_bold.nii.gz
sub-01_task-balloonanalogrisktask_run-01_events.tsv
sub-01_task-balloonanalogrisktask_run-02_bold.nii.gz
sub-01_task-balloonanalogrisktask_run-02_events.tsv
sub-01_task-balloonanalogrisktask_run-03_bold.nii.gz
sub-01_task-balloonanalogrisktask_run-03_events.tsv
</code></pre>
|
|
</dd>
|
|
|
|
|
|
</dl>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Acknowledgements</h2>
|
|
<table>
|
|
<tr style="vertical-align:middle">
|
|
<td style="vertical-align:middle">
|
|
</td>
|
|
<td style="vertical-align:middle">
|
|
<ul>
|
|
<img src="../pics/datalad_logo_wide.svg" height="120">
|
|
<li>Michael Hanke</li>
|
|
<li>Yaroslav Halchenko</li>
|
|
<li>Joey Hess (git-annex)</li>
|
|
<li>Benjamin Poldrack</li>
|
|
<li>Kyle Meyer</li>
|
|
<li>22+ additional contributors</li>
|
|
</ul>
|
|
</td>
|
|
<td style="vertical-align:middle">
|
|
<ul>
|
|
<dt>The DataLad Handbook</dt>
|
|
<li>Laura Waite</li>
|
|
<li>Michael Hanke</li>
|
|
<li>17+ additional contributors</li>
|
|
<img src="../pics/logo.svg" height="200">
|
|
</ul>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
Reach out, get to know the team, contribute: <br>
|
|
<a href="https://matrix.to/#/!SaWRuXhTcCDulfttET:matrix.org?via=matrix.org&via=inm7.de" target="_blank"> DataLad on Riot</a>,
|
|
<br>
|
|
<a href="https://github.com/datalad-handbook/book" target="_blank">DataLad Handbook @ GitHub</a>
|
|
</section>
|
|
|
|
<section>
|
|
<h3>Thank you!</h3>
|
|
<h1>Questions?</h1>
|
|
</section>
|
|
</section>
|
|
|
|
|
|
</div>
|
|
</div>
|
|
|
|
<script src="../reveal.js/dist/reveal.js"></script>
|
|
<script src="../reveal.js/plugin/notes/notes.js"></script>
|
|
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
|
|
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
|
|
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
hash: true,
// The "normal" size of the presentation, aspect ratio will be preserved
// when the presentation is scaled to fit different resolutions. Can be
// specified using percentage units.
width: 1280,
height: 960,
// Factor of the display size that should remain empty around the content
margin: 0.3,
// Bounds for smallest/largest possible scale to apply to content
minScale: 0.2,
maxScale: 1.0,

controls: true,
progress: true,
history: true,
center: true,
slideNumber: 'c',
pdfSeparateFragments: false,
pdfMaxPagesPerSlide: 1,
pdfPageHeightOffset: -1,
transition: 'slide', // none/fade/slide/convex/concave/zoom
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
});
</script>
|
|
</body>
|
|
</html>
|