<!doctype html>
|
||
<html>
|
||
<head>
|
||
<meta charset="utf-8">
|
||
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
|
||
|
||
<!-- Edit me start! -->
|
||
<title>DataLad 4 SFB 1280</title>
|
||
<meta name="description" content=" Virtual DataLad course for the SFB 1280 Bochum/Essen/Dortmund ">
|
||
<meta name="author" content=" Adina Wagner ">
|
||
<!-- Edit me end! -->
|
||
|
||
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
|
||
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
|
||
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
|
||
<link rel="stylesheet" href="../css/main.css">
|
||
<!-- Theme used for syntax highlighted code -->
|
||
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
|
||
</head>
|
||
<body>
|
||
<div class="reveal">
|
||
<div class="slides">
|
||
|
||
|
||
<section>
|
||
|
||
<section>
|
||
<script src="https://cdn.logwork.com/widget/countdown.js"></script>
|
||
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
|
||
data-style="columns" data-timezone="Europe/Berlin" data-date="2023-09-28 13:00">
|
||
Workshop starts in
|
||
</a>
|
||
Have a ☕!
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Research data management<br />👩‍💻👨‍💻<br />with DataLad</h2>
|
||
<div style="margin-top:1em;text-align:center">
|
||
<table style="border: none;">
|
||
<tr>
|
||
<td>
|
||
Adina Wagner<br><small><a href="https://mas.to/@adswa" target="_blank">
|
||
<img data-src="../pics/mastodon.svg" style="height:30px;margin:0px" /> mas.to/@adswa</a></small>
|
||
</td>
|
||
<td>
|
||
<br>
|
||
</td>
|
||
</tr>
|
||
<tr>
|
||
<td>
|
||
<img style="height:70px;margin-right:10px" data-src="../pics/fzj_logo.svg" /><br>
|
||
</td>
|
||
<td style="vertical-align:top">
|
||
<small><a href="http://psychoinformatics.de" target="_blank">Psychoinformatics lab</a>,
|
||
<br> Institute of Neuroscience and Medicine (INM-7)<br>
|
||
Research Center Jülich</small><br>
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
</div>
|
||
|
||
<br><br><small>
|
||
Interactive Slides: <a href="https://files.inm7.de/adina/talks/html/sfb-1280.html" target="_blank">files.inm7.de/adina/talks/html/sfb-1280.html</a><br>
|
||
PDF for download: <a href="https://files.inm7.de/adina/talks/pdfs/sfb-1280.pdf" target="_blank">files.inm7.de/adina/talks/pdfs/sfb-1280.pdf</a><br>
|
||
Sources: <a href="https://github.com/datalad-handbook/datalad-course/blob/main/html/sfb-1280.html" target="_blank">
|
||
https://github.com/datalad-handbook/datalad-course</a></small>
|
||
</section>
|
||
|
||
</section>
|
||
|
||
<!--...INTRODUCTION AND LOGISTICS (30 Mins)...-->
|
||
|
||
<section>
|
||
|
||
<section>
|
||
<h2>Welcome & Logistics!</h2>
|
||
<ul style="font-size:35px">
|
||
<li class="fragment fade-in-then-semi-out">
|
||
An approximate schedule for today:
|
||
<ul>
|
||
<li>1.00 pm: Introduction & Logistics</li>
|
||
<li>1.30 pm: Overview of DataLad + break ☕</li>
|
||
<li>2.00 pm: What's version control, and why should I care?</li>
|
||
<li>2.45 pm: Reproducibility features + break</li>
|
||
<li>3.30 pm: Data publication to the OSF + break ☕</li>
|
||
<li>4.30 pm: Outlook and/or Your Questions and Usecases</li>
|
||
</ul>
|
||
</li>
|
||
<li class="fragment fade-in-then-semi-out">
|
||
Collaborative notes & anonymous questions: <a href="https://etherpad.wikimedia.org/p/Datalad@sfb1280" target="_blank">
|
||
etherpad.wikimedia.org/p/Datalad@sfb1280</a>.
|
||
</li>
|
||
<li class="fragment fade-in-then-semi-out">
|
||
Slides are CC-BY and will be shared after the workshop. Additional
|
||
workshop contents: <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">
|
||
psychoinformatics-de.github.io/rdm-course</a>
|
||
</li>
|
||
<li class="fragment fade-in-then-semi-out">
|
||
Some guidelines for the virtual workshop venue...
|
||
</li>
|
||
<ul>
|
||
<li class="fragment fade-in">
|
||
Please mute yourself when you are not speaking
|
||
</li>
|
||
<li class="fragment fade-in">
|
||
Ask questions anytime, but make use of the "Raise hand" feature
|
||
</li>
|
||
<li class="fragment fade-in">
|
||
Drop out and re-join as you please
|
||
</li>
|
||
</ul>
|
||
</ul>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Questions/interaction throughout the workshop</h2>
|
||
<ul style="font-size:35px">
|
||
<li>
|
||
There are no stupid questions :)
|
||
</li>
|
||
<li>
|
||
Lively discussions are wonderful. Unless it would interrupt others,
please feel encouraged to unmute or turn on your video to interact.
|
||
</li>
|
||
<li>
|
||
There is room to discuss specific or advanced use cases at the end. Please make a note of them in
|
||
the <a href="https://etherpad.wikimedia.org/p/Datalad@sfb1280" target="_blank">Etherpad</a>.
|
||
</li>
|
||
</ul>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Questions/interaction after the workshop</h2>
|
||
<ul>
|
||
If you have a question after the workshop, you can reach out for help:<br>
|
||
<ul style="font-size:30px">
|
||
<dt>Reach out to the <b>DataLad</b> team via</dt>
|
||
<li>
|
||
<a href="https://matrix.to/#/!NaMjKIhMXhSicFdxAj:matrix.org?via=matrix.waite.eu&via=matrix.org&via=inm7.de" target="_blank">
|
||
Matrix</a> (free, decentralized communication platform; no app needed).
|
||
We run a weekly Zoom office hour (Tuesday, 4pm Berlin time) from this room as well.
|
||
</li>
|
||
<li>
|
||
<a href="https://github.com/datalad/datalad" target="_blank">
|
||
the development repository on GitHub</a>
|
||
</li><br>
|
||
<dt>Reach out to the user community with</dt>
|
||
<li>
|
||
A question on <a href="https://neurostars.org/" target="_blank">neurostars.org</a>
|
||
with a <code>datalad</code> tag
|
||
</li><br>
|
||
<dt>Find more user tutorials or workshop recordings</dt>
|
||
<li>
|
||
On <a href="https://www.youtube.com/datalad" target="_blank">
|
||
DataLad's YouTube channel</a>
|
||
</li>
|
||
<li>
|
||
In the <a href="http://handbook.datalad.org/en/latest/" target="_blank">
|
||
DataLad Handbook </a>
|
||
</li>
|
||
<li>
|
||
In the <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">DataLad RDM course</a>
|
||
</li>
|
||
<li>
|
||
In the <a href="http://docs.datalad.org" target="_blank">Official API documentation</a>
|
||
</li>
|
||
</ul>
|
||
</ul>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Resources and Further Reading</h2>
|
||
<table style="font-size:30px">
|
||
<tr>
|
||
<td>
|
||
Comprehensive user documentation in the<br>
|
||
DataLad Handbook
|
||
<a href="http://handbook.datalad.org" target="_blank">(handbook.datalad.org)</a>
|
||
</td>
|
||
<td>
|
||
<img src="../pics/logo.svg" height="150">
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
|
||
<table style="font-size:30px">
|
||
<tr>
|
||
<td><img src="../pics/artwork/src/enter.svg" height="100"></td>
|
||
<td>
|
||
<ul>
|
||
<li>High-level function/command overviews, <br>
|
||
Installation, Configuration, Cheatsheet
|
||
</li>
|
||
</ul>
|
||
</td>
|
||
</tr>
|
||
<tr>
|
||
<td><img src="../pics/artwork/src/basics.svg" height="100"></td>
|
||
<td>
|
||
<ul>
|
||
<li>Narrative-based code-along course</li>
|
||
<li>Independent of background/skill level, <br>
|
||
suitable for data management novices
|
||
</li>
|
||
</ul>
|
||
</td>
|
||
</tr>
|
||
<tr>
|
||
<td><img src="../pics/artwork/src/usecases.svg" height="100"></td>
|
||
<td>
|
||
<ul>
|
||
<li>Step-by-step solutions to common <br>
|
||
data management problems, like<br />how to
|
||
make a reproducible paper
|
||
</li>
|
||
</ul>
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
<p style="font-size:30px">
|
||
Overview of most tutorials, talks, videos, ... at
|
||
<a href="https://github.com/datalad/tutorials" target="_blank">
|
||
github.com/datalad/tutorials</a>
|
||
</p>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Live polling system</h2>
|
||
Please use your phone to scan the QR code, or open the link in a new browser window <br>
<iframe src="https://directpoll.com/r?XDbzPBd3ixYqg84Gif8nU69RJWPkCXwpVvMnElD"
|
||
style="border: 0" width="900" height="800"></iframe>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>What's your mood today?</h2>
|
||
<img src="../pics/sheepscale.png" height="600"><iframe src="https://directpoll.com/r?XDbzPBd3ixYqg84Gif8nU69RJWPkCXwpVvMnElD"
|
||
style="border: 0" width="400" height="600"></iframe>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Practical aspects</h2>
|
||
<img width="200" src="../pics/jupyter_logo.png" alt="jupyterlogo"><br>
|
||
<ul>
|
||
<li>
|
||
We'll work in the browser on a cloud server with JupyterHub
|
||
</li>
|
||
<li class="fragment">
|
||
Cloud-computing environment:<br>
|
||
- <a href="https://datalad-hub.inm7.de">datalad-hub.inm7.de</a>
|
||
</li>
|
||
<li class="fragment">
|
||
We have pre-installed DataLad and other requirements
|
||
</li>
|
||
<li class="fragment">
|
||
We will work via the terminal
|
||
</li>
|
||
<li class="fragment">
|
||
Your username is all lower-case and follows this pattern: Firstname + Lastname initial (Adina Wagner -> adinaw)
|
||
</li>
|
||
<li class="fragment">
|
||
Pick any password with at least 8 characters at first log-in (and remember it)
|
||
</li>
|
||
</ul>
|
||
<p class="fragment"> Please try to log in now</p>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Prerequisites: Using DataLad</h2>
|
||
<ul style="font-size:30px">
|
||
<li>Every DataLad command consists of a main
|
||
command followed by a sub-command. The main and the sub-command can have options.
|
||
<img height="280px" src="../pics/command-structure.png">
|
||
</li>
|
||
<li> Example (main command, subcommand, several subcommand options):
|
||
<pre><code>$ datalad save -m "Saving changes" --recursive </code></pre>
|
||
</li>
|
||
<li>
|
||
Use <em>--help</em> to find out more about any (sub)command and its
options, including a detailed description and examples (press <em>q</em> to close).
Use <em>-h</em> to get a short overview of all options:
<pre><code>$ datalad save -h
Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
                    [-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
                    [--version]
                    [PATH ...]

Use '--help' to get more comprehensive information.
</code></pre></li>
|
||
</ul>
|
||
</section>
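<section data-markdown>
<textarea data-template>
The main command / subcommand / options layout described above can be sketched with Python's standard `argparse` module. This is a toy illustration only, not DataLad's own parser: the program name `toy` and the `--verbose` option are made up for the example, and just the `-m` and `--recursive` options are borrowed from the `datalad save` call shown earlier.

```python
import argparse


def build_parser():
    """Build a toy parser with one main command and one subcommand."""
    # Main command: its options apply regardless of the subcommand.
    parser = argparse.ArgumentParser(prog="toy")
    parser.add_argument("--verbose", action="store_true", help="main-command option")
    # Subcommands: each one brings its own set of options.
    subcommands = parser.add_subparsers(dest="subcommand", required=True)
    save = subcommands.add_parser("save", help="record changes")
    save.add_argument("-m", "--message", help="subcommand option that takes a value")
    save.add_argument("-r", "--recursive", action="store_true", help="subcommand flag")
    save.add_argument("path", nargs="*", help="optional positional arguments")
    return parser


def parse(argv):
    """Parse a list of command line words into a namespace."""
    return build_parser().parse_args(argv)


args = parse(["save", "-m", "Saving changes", "--recursive"])
print(args.subcommand, args.message, args.recursive)
```

In this sketch, options placed before the subcommand belong to the main command, and options placed after it belong to the subcommand.
</textarea>
</section>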
|
||
|
||
<section style="text-align: left;">
|
||
<h3>Using DataLad in the Terminal</h3>
|
||
Check the installed version:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad --version
</code>
|
||
<p id="displayArea"></p>
|
||
</pre>
|
||
|
||
<div class="fragment">
|
||
For help on using DataLad from the command line (press q to exit):
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad --help
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
For extensive info about the installed package, its dependencies, and extensions, use <code>datalad wtf</code>.
|
||
Let's find out what kind of system we're on:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad wtf -S system
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section style="text-align: left;">
|
||
<h3>git identity</h3>
|
||
Check git identity:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --get user.name
git config --get user.email
</code>
|
||
</pre>
|
||
|
||
<div class="fragment">
|
||
Configure git identity:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --global user.name "Adina Wagner"
git config --global user.email "adina.wagner@t-online.de"
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Use the latest datalad features:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --global --add datalad.extensions.load next
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section style="text-align: left;">
|
||
<h3>Using datalad via its Python API</h3>
|
||
Open a Python environment:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
ipython
</code>
|
||
</pre>
|
||
<div class="fragment">
|
||
Import and start using:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-python" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
import datalad.api as dl
dl.create(path='mydataset')
</code>
|
||
</pre>
|
||
</div>
|
||
<div class="fragment">
|
||
Exit the Python environment:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-python" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
exit
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Different ways to use DataLad</h2>
|
||
<ul>
|
||
<div>
|
||
<li>DataLad can be used from the command line</li>
|
||
<pre><code>datalad create mydataset</code></pre>
|
||
</div>
|
||
<div class="fragment fade-in">
|
||
<li>... or with its Python API</li>
|
||
<pre><code class="python">import datalad.api as dl
dl.create(path="mydataset")</code></pre>
|
||
</div>
|
||
<div class="fragment fade-in">
|
||
<li>... and other programming languages can use it via system calls</li>
<pre><code class="python"># in R
> system("datalad create mydataset")</code></pre>
|
||
</div>
|
||
<li class="fragment fade-in">... or via a graphical user interface
|
||
<a href="https://github.com/datalad/datalad-gooey" target="_blank">"DataLad Gooey"</a>
|
||
</li>
|
||
<br><br>
|
||
</ul>
|
||
</section>
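<section data-markdown>
<textarea data-template>
The system-call route shown for R works from most languages. Below is a minimal Python sketch of the same idea, assuming only that a `datalad` executable may or may not be on the PATH. The helper names `datalad_command` and `run_if_available` are hypothetical and not part of DataLad; in Python itself, the `datalad.api` module shown above is usually the more convenient choice.

```python
import shutil
import subprocess


def datalad_command(subcommand, *args):
    """Build the argument list for a call such as `datalad create mydataset`."""
    return ["datalad", subcommand, *args]


def run_if_available(argv):
    """Run the command if its executable is on the PATH, otherwise return None."""
    if shutil.which(argv[0]) is None:
        return None
    # Passing a list instead of one long string avoids shell quoting problems.
    return subprocess.run(argv, capture_output=True, text=True, check=False)


# Only build and show the command here; nothing is executed yet.
command = datalad_command("create", "mydataset")
print(command)
```

Calling `run_if_available(command)` would then execute the command and return an object with its exit code and output, or `None` on a machine without DataLad.
</textarea>
</section>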
|
||
|
||
</section>
|
||
|
||
<!----------- OVERVIEW OF DATALAD ---------->
|
||
|
||
<section>
|
||
|
||
<section>
|
||
<h2>Acknowledgements</h2>
|
||
<table>
|
||
<tr style="vertical-align:top">
|
||
<td style="vertical-align:top">
|
||
<dl>
|
||
<dt>Software</dt>
|
||
<dd style="margin-left:5px!important">
|
||
<ul style="margin-left:5px!important">
|
||
<li>Joey Hess (git-annex)</li>
|
||
<li>The DataLad team &
|
||
contributors</li>
|
||
</ul>
|
||
</dd>
|
||
<dt style="margin-top:20px">Illustrations </dt>
|
||
<dd style="margin-left:5px!important">
|
||
<ul style="margin-left:5px!important">
|
||
<li>The Turing Way <br>
|
||
project & Scriberia</li>
|
||
<img src="../pics/bannerthanks.svg">
|
||
</ul>
|
||
</dd>
|
||
</dl>
|
||
</td>
|
||
<td style="vertical-align:top">
|
||
<div style="margin-bottom:-20px;text-align:center"><strong>Funders</strong></div>
|
||
<img style="height:150px;margin-right:50px" data-src="../pics/nsf_2020.png" />
|
||
<img style="height:150px;margin-right:50px;margin-left:50px" data-src="../pics/binc.png" />
|
||
<img style="height:150px;margin-left:50px" data-src="../pics/bmbf_2020.png" />
|
||
<img style="height:80px;margin-top:-40px;margin-left:auto;margin-right:auto;width:100%" data-src="../pics/fzj_logo.svg" />
|
||
<div style="margin-top:-20px">
|
||
<img style="height:60px;margin-right:20px" data-src="../pics/erdf.png" />
|
||
<img style="height:60px;margin-right:20px" data-src="../pics/cbbs_logo.png" />
|
||
<img style="height:60px" data-src="../pics/LSA-Logo.png" />
|
||
</div>
|
||
<div style="margin-top:40px;margin-bottom:20px;text-align:center"><strong>Collaborators</strong></div>
|
||
<div style="margin-top:-20px">
|
||
<img style="height:100px;margin:20px" data-src="../pics/hbp_logo.png" />
|
||
<img style="height:100px;margin:20px" data-src="../pics/conp_logo.png" />
|
||
<img style="height:100px;margin:20px" data-src="../pics/vbc_logo.png" />
|
||
</div>
|
||
<div style="margin-top:-40px">
|
||
<img style="height:120px;margin:20px" data-src="../pics/openneuro_logo.png" />
|
||
<img style="height:120px;margin:20px" data-src="../pics/cbrain_logo.png" />
|
||
<img style="height:140px;margin:20px" data-src="../pics/brainlife_logo.png" />
|
||
</div>
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
</section>
|
||
|
||
<section>
|
||
<h2><img src="../pics/datalad_logo_wide.svg" height="150">Core Features:</h2>
|
||
<ul>
|
||
<li class="fragment fade-in-then-semi-out">
|
||
Joint <b>version control</b> (<a href="https://git-scm.com/" target="_blank">Git</a>,
|
||
<a href="https://git-annex.branchable.com/" target="_blank">git-annex</a>): version control data & software alongside your code
|
||
</li>
|
||
<li class="fragment fade-in-then-semi-out">
|
||
<b>Provenance capture</b>:
|
||
Create and share machine-readable, re-executable provenance records for reproducible, transparent, and FAIR research
|
||
</li>
|
||
<li class="fragment fade-in-then-semi-out">
|
||
Decentralized <b>data transport</b> mechanisms:
Install, share and collaborate on scientific projects; publish,
update, and retrieve their contents in a streamlined fashion on demand,
and distribute files in a decentralized network on the services or infrastructures
of your choice
|
||
</li>
|
||
</ul><br>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h3>Examples of what DataLad can be used for:</h3>
|
||
<ul>
|
||
<li class="fragment fade-in-then-semi-out">
|
||
<b>Publish or consume datasets</b>
|
||
via GitHub, GitLab, OSF, the European Open Science Cloud, or similar services
|
||
</li>
|
||
</ul>
|
||
<img height="700" class="fragment fade-in" src="../pics/getdata_studyforrest.gif" alt="a screenrecording of cloning studyforrest data from github">
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h3>Examples of what DataLad can be used for:</h3>
|
||
<ul>
|
||
<li class="fragment fade-in-then-semi-out">
|
||
Behind-the-scenes <b>infrastructure component for data transport and versioning</b>
|
||
(e.g., used by <a href="https://openneuro.org/" target="_blank"> OpenNeuro</a>,
|
||
<a href="https://brainlife.io/" target="_blank"> brainlife.io </a>,
|
||
the <a href="https://conp.ca/" target="_blank">Canadian Open Neuroscience Platform (CONP)</a>,
|
||
<a href="https://mcin.ca/technology/cbrain/" target="_blank"> CBRAIN</a>)
|
||
</li>
|
||
</ul>
|
||
<img height="700" class="fragment fade-in" src="../pics/openneuro_new_2.gif" alt="a screenrecording of browsing open neuro">
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h3>Examples of what DataLad can be used for:</h3>
|
||
<ul>
|
||
<li class="fragment fade-in-then-semi-out">
|
||
<b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance
|
||
</li>
|
||
</ul>
|
||
<img height="700" class="fragment fade-in" src="../pics/remodnavpaper_2.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h3>Examples of what DataLad can be used for:</h3>
|
||
<ul>
|
||
<li>
|
||
<b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance
|
||
</li>
|
||
<img height="800" class="fragment fade-in" src="../pics/openscience.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
|
||
</ul>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h3>Examples of what DataLad can be used for:</h3>
|
||
<ul>
|
||
<li class="fragment fade-in-then-semi-out"><b>Central data management</b> and archival system</li>
|
||
</ul>
|
||
<img height="700" class="fragment fade-in" src="../pics/centralmanagement2.gif">
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h3>Examples of what DataLad can be used for:</h3>
|
||
<ul>
|
||
<li class="fragment fade-in-then-semi-out">
|
||
<b>Scalable computing framework</b> for reproducible science
|
||
</li>
|
||
<img height="350" class="fragment fade-in" src="../pics/fairly-big.png">
|
||
<img height="500" class="fragment fade-in" src="../pics/ukb_datasets.svg">
|
||
</ul>
|
||
</section>
|
||
|
||
<section><script src="https://cdn.logwork.com/widget/countdown.js"></script>
|
||
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
|
||
data-style="columns" data-timezone="Europe/Berlin" data-date="2023-09-28 14:00">
|
||
Quick break
|
||
</a><br>
|
||
We'll be back shortly
|
||
</section>
|
||
|
||
</section>
|
||
|
||
<!----- WHAT'S VERSION CONTROL, AND WHY SHOULD I CARE? ----->
|
||
|
||
<section>
|
||
|
||
<section>
|
||
<h2>What's version control, and why should I care?</h2><br>
|
||
<iframe src="https://directpoll.com/r?XDbzPBd3ixYqg84Gif8nU69RJWPkCXwpVvMnElD"
|
||
style="border: 0" width="900" height="800"></iframe>
|
||
</section>
|
||
|
||
|
||
<section>
|
||
<h2>Everything happens in DataLad datasets</h2>
|
||
<img src="../pics/artwork/src/dataset_extended.svg" width="800"> <br><br><br>
|
||
<table class="fragment fade-in-then-semi-out" >
|
||
<tr>
|
||
<td style="vertical-align:middle">
|
||
<ul style="font-size:30px">
|
||
<li>Look and feel like a directory on your computer</li>
|
||
<li>content agnostic</li>
|
||
<li>no custom data structures</li>
|
||
<img src="../pics/remodnav-ds-terminal.png" width="500"><br><small><br>Terminal view</small>
|
||
</ul>
|
||
</td>
|
||
<td style="font-size:30px; vertical-align:top">
|
||
<img src="../pics/remodnav-ds-nautilus.png" width="500"><br>
|
||
<small>File viewer</small>
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
</section>
|
||
|
||
<section style="text-align: left;">
|
||
<h3>...DataLad datasets</h3>
|
||
Create a dataset (here, with the <code>text2git</code> configuration, which keeps
text files in Git rather than git-annex): <br>
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad create -c text2git my-analysis
</code>
|
||
</pre>
|
||
|
||
<div class="fragment">
|
||
Let's have a look inside. Navigate using <code>cd</code> (change directory):
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cd my-analysis
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
List the directory content, including hidden files, with <code>ls</code>:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
ls -la .
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Dataset = Git/git-annex repository</h2>
|
||
<li>version control files regardless of size or type</li>
|
||
<img src="../pics/artwork/src/local_wf.svg" width="600"> <br>
|
||
<ul>
|
||
<p class="fragment fade-in">
|
||
Stay flexible:
|
||
<li class="fragment fade-in">
|
||
Simple DataLad core API (easy for data management novices)
|
||
</li>
|
||
<li class="fragment fade-in">
|
||
Pure Git or git-annex commands (for regular Git or git-annex users, or to use specific functionality)
|
||
</li>
|
||
</p>
|
||
</ul>
|
||
</section>
|
||
|
||
<section style="text-align: left;">
|
||
<h3>...Version control</h3>
|
||
Let’s build a dataset for an analysis by adding a README. The command below writes a simple header into a new file README.md:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
echo "# My example DataLad dataset" > README.md
</code>
|
||
</pre>
|
||
|
||
<div class="fragment">
|
||
Now we can check the <code>status</code> of the dataset:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad status
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
We can save the state with <code>save</code>:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad save -m "Create a short README"
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Further modifications:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
echo "This dataset contains a toy data analysis" >> README.md
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
You can also check out what has changed:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git diff
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Save again:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad save -m "Add information on the dataset contents to the README"
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
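<section data-markdown>
<textarea data-template>
In the two `echo` commands above, `>` creates or overwrites README.md, while `>>` appends to it. A rough Python equivalent is sketched below. It works in a temporary directory so that nothing in your dataset is touched, and the helper names `overwrite` and `append` are made up for this sketch.

```python
import tempfile
from pathlib import Path


def overwrite(path, line):
    """Like `echo "line" > path`: replace whatever the file contained."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(line + "\n")


def append(path, line):
    """Like `echo "line" >> path`: add a line at the end of the file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")


with tempfile.TemporaryDirectory() as workdir:
    readme = Path(workdir) / "README.md"
    overwrite(readme, "# My example DataLad dataset")
    append(readme, "This dataset contains a toy data analysis")
    readme_lines = readme.read_text(encoding="utf-8").splitlines()

print(readme_lines)
```

Mixing the two up is a common slip: using `>` for the second command would have replaced the header instead of adding a line below it.
</textarea>
</section>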
|
||
|
||
<section style="text-align: left;">
|
||
<h3>...Version control</h3>
|
||
<div class="fragment">
|
||
Now, let's check the dataset history:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git log
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
We can also make the history prettier:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
tig
</code>
|
||
(navigate with arrow keys and enter, press "q" to go back and exit the program)
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Exhaustive tracking</h2>
|
||
<dl style="font-size:35px">
|
||
<dt>The building blocks of a scientific result are rarely static</dt>
|
||
<table>
|
||
<tr>
|
||
<td style="vertical-align:middle">Analysis code evolves<br>
|
||
<small>(Fix bugs, add functions, refactor, ...)</small>
|
||
</td>
|
||
<td>
|
||
<img src="../pics/final.png" height="500">
|
||
<imgcredit>Based on Piled Higher and Deeper
|
||
<a href="https://phdcomics.com/comics/archive_print.php?comicid=1531" target="_blank">1531
|
||
</a>
|
||
</imgcredit></td>
|
||
</tr>
|
||
</table>
|
||
</dl>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Exhaustive tracking</h2>
|
||
<dl style="font-size:35px">
|
||
<dt>The building blocks of a scientific result are rarely static</dt>
|
||
<table>
|
||
<tr>
|
||
<td style="vertical-align:middle">Data changes <br>
|
||
<small>(errors are fixed, data is extended,<br>
|
||
naming standards change, an analysis <br>
|
||
requires only a subset of your data...)</small></td>
|
||
<td><img src="../pics/phd052810s.png" height="500">
|
||
<imgcredit>Piled Higher and Deeper
|
||
<a href="https://phdcomics.com/comics/archive_print.php?comicid=1323" target="_blank">1323
|
||
</a>
|
||
</imgcredit>
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
</dl>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Exhaustive tracking</h2>
|
||
<dl style="font-size:35px">
|
||
<dt>The building blocks of a scientific result are rarely static</dt><br>
|
||
</dl>
|
||
<table>
|
||
<tr>
|
||
<td style="vertical-align: top">
|
||
Data changes (for real) <br>
|
||
<small>(errors are fixed, data is extended,<br>
|
||
naming standards change, ...)</small>
|
||
<img height="180px" src="../pics/abcdtwitter.png">
|
||
</td>
|
||
<td>
|
||
<img width="1000px" src="../pics/abcd.png">
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Exhaustive tracking</h2>
|
||
"Shit, which version of which script produced these outputs from which version
|
||
of what data... and which software version?"<br>
|
||
<img src="../pics/manuallabor.png">
|
||
<img src="../pics/findfiles.png" height="400">
|
||
<img src="../pics/projectstack.png" height="350">
|
||
<imgcredit>CC-BY Scriberia and <a href="https://the-turing-way.netlify.app/reproducible-research/rdm.html" target="_blank">
|
||
The Turing Way</a>
|
||
</imgcredit>
|
||
</section>
|
||
|
||
|
||
<section data-transition="None">
|
||
<h3>Exhaustive tracking</h3>
|
||
Once you track changes to data with version control tools,
|
||
you can find out <em>why</em> it changed, <em>what</em> has changed, <em>when</em> it changed,
|
||
and <em>which version</em> of your data was used at which point in time.
|
||
<div class="r-stack">
|
||
<img class="fragment fade-out" data-fragment-index="1" src="../pics/tigdata.png">
|
||
<img class="fragment" data-fragment-index="1" src="../pics/tigdata3.png">
|
||
<img class="fragment" src="../pics/tigdata2.png">
|
||
</div>
|
||
</section>
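<section data-markdown>
<textarea data-template>
The "why", "when", and "which version" questions above can be made concrete with a toy history. The sketch below is only a mental model and not how Git or git-annex store things internally: the class `ToyHistory` and its field names are invented for illustration, and the byte strings stand in for the content of a data file.

```python
import hashlib
from datetime import datetime, timezone


class ToyHistory:
    """Append-only list of saved states of one file's content."""

    def __init__(self):
        self.entries = []

    def save(self, content, message):
        """Record a new state; return None if the content is unchanged."""
        version = hashlib.sha256(content).hexdigest()
        if self.entries and self.entries[-1]["version"] == version:
            return None
        entry = {
            "version": version,  # which version: identifier derived from the content
            "why": message,  # why it changed: the message given when saving
            "when": datetime.now(timezone.utc).isoformat(),  # when it was saved
        }
        self.entries.append(entry)
        return entry


history = ToyHistory()
history.save(b"subject,score\n1,10\n", "Add first measurements")
history.save(b"subject,score\n1,10\n2,12\n", "Add second subject")
unchanged = history.save(b"subject,score\n1,10\n2,12\n", "No real change")
print(len(history.entries), unchanged)
```

The "what has changed" question is answered by comparing the content behind two versions, which is what `git diff` did for the README earlier.
</textarea>
</section>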
|
||
|
||
<section style="text-align: left;">
|
||
<h3>Exhaustive tracking</h3>
|
||
<div class="fragment">
|
||
With the <code>datalad-container</code> extension, we can add not only code or data to datasets,
but also software containers, and work with them.
Let's add a software container with Python software for later:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad containers-add nilearn \
  --url shub://adswa/nilearn-container:latest
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Inspect the list of registered containers:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad containers-list
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
</section>
|
||
|
||
<!-- REPRODUCIBILITY FEATURES -->
|
||
|
||
|
||
<section>
|
||
|
||
|
||
<section>
|
||
<h2>Digital provenance</h2>
|
||
<ul>
|
||
<p >
|
||
= <i>"The tools and processes used to create a
|
||
digital file, the responsible entity, and when and where the process
|
||
events occurred"</i>
|
||
</p>
|
||
<li class="fragment fade-in">
|
||
Have you ever saved a PDF on your computer to read later, but forgot
where you got it from? Or have you ever found a figure in your project,
but forgot which analysis step produced it?
|
||
</li>
|
||
</ul>
|
||
</section>
|
||
|
||
<section style="text-align: left;">
|
||
<h3>Digital provenance</h3>
|
||
<div class="fragment">
|
||
Imagine that you are getting a script from a colleague to perform your analysis, but they email it to you or upload it to a random place for you to download:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">wget -P code/ \
  https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
The <code>wget</code> command downloaded a script for extracting a brain mask:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad status
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Save it into your dataset to have the script ready:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad save -m "Adding a nilearn-based script for brain masking"
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Convenience functions make downloads easier. Let's add a nilearn tutorial, and also register the original location of this file as digital provenance:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad download-url -m "Add a tutorial on nilearn" \
|
||
-O code/nilearn-tutorial.pdf \
|
||
https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Notice how it's automatically saved:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad status
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Check out the file's history:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git log code/nilearn-tutorial.pdf</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Provenance and reproducibility</h2>
|
||
<strong>datalad run</strong> wraps around anything expressed in a command
|
||
line call and saves the dataset modifications resulting from the execution
|
||
<img src="../pics/run_basic.svg" height="600"> <!-- .element: class="fragment" -->
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Provenance and reproducibility</h2>
|
||
<strong>datalad rerun</strong> repeats captured executions. <br>
|
||
If the outcomes
|
||
differ, it saves a new state of them.
|
||
<img src="../pics/rerun.svg" height="350"> <!-- .element: class="fragment" -->
|
||
</section>
|
||
|
||
|
||
<section style="text-align:left;">
|
||
<h3>... Computationally reproducible execution I</h3>
|
||
<div class="fragment">
|
||
A variety of processes can modify files. A simple example: Code formatting
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">black code/get_brainmask.py</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Version control makes changes transparent:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git diff</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
But it's useful to keep track beyond that. Let's discard the latest changes...
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git restore code/get_brainmask.py</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
... and record precisely what we did
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad run -m "Reformat code with black" \
|
||
"black code/get_brainmask.py"</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Let's take a look (press q to exit):
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git show</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
... and repeat!
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad rerun</code>
|
||
</pre>
|
||
</div>
|
||
</section>
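<section data-markdown><script type="text/template">
The record-and-repeat idea behind `datalad run` and `datalad rerun` can be illustrated without DataLad. This is a conceptual sketch only: the helper names (`file_digest`, `run_and_record`, `rerun`), the dictionary layout, and the `result.txt` stand-in command are assumptions for illustration and do not reflect DataLad's actual record format or implementation.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_and_record(cmd, outputs, cwd):
    """Execute cmd in cwd and return a record of the command and its outputs."""
    subprocess.run(cmd, cwd=cwd, check=True)
    digests = {out: file_digest(Path(cwd) / out) for out in outputs}
    return {"cmd": cmd, "outputs": digests}


def rerun(record, cwd):
    """Repeat a recorded command and return the outputs whose content changed."""
    subprocess.run(record["cmd"], cwd=cwd, check=True)
    return [
        out
        for out, old_digest in record["outputs"].items()
        if file_digest(Path(cwd) / out) != old_digest
    ]


# Hypothetical stand-in for an analysis step: write a fixed string to a file
workdir = tempfile.mkdtemp()
command = [sys.executable, "-c", "open('result.txt', 'w').write('42')"]
record = run_and_record(command, ["result.txt"], workdir)
changed = rerun(record, workdir)
print(changed)
```

Because the stand-in command is deterministic, the repeated execution reports no changed outputs. A command whose output varies between executions would show up in the returned list instead.
</script></section>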
|
||
|
||
<section data-transition="None">
|
||
<h2>Seamless dataset nesting & linkage</h2>
|
||
<img src="../pics/dataflow.jpg">
|
||
<imgcredit><a href="https://www.frontiersin.org/articles/10.3389/fninf.2012.00009/full" target="_blank">
|
||
Poline et al., 2011</a>
|
||
</imgcredit>
|
||
<img src="../pics/artwork/src/linkage_subds.svg" width="900"> <br>
|
||
|
||
<!-- <ul>
|
||
<li class="fragment fade-in" data-fragment-index="2">Overcomes scaling issues with large amounts of files</li>
|
||
<pre class="fragment fade-in" data-fragment-index="2"><code>adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
|
||
15530572 annex'd files (77.9 TB recorded total size)
|
||
nothing to save, working tree clean</code></pre>
|
||
<small><a class="fragment fade-in" data-fragment-index="2" href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">(github.com/datalad-datasets/human-connectome-project-openaccess)</a></small>
|
||
<li class="fragment fade-in">Modularizes research components for transparency, reuse, and access management</li>
|
||
</ul>
|
||
-->
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Seamless dataset nesting & linkage</h2>
|
||
<img data-src="../pics/linkage.svg" height="300">
|
||
<pre><code class="bash" style="font-size:115%;max-height:none">
|
||
$ datalad clone --dataset . http://example.com/importantds inputs/rawdata
|
||
</code></pre>
|
||
|
||
<pre><code class="diff" style="max-height:none">$ git diff HEAD~1
|
||
diff --git a/.gitmodules b/.gitmodules
|
||
new file mode 100644
|
||
index 0000000..c3370ba
|
||
--- /dev/null
|
||
+++ b/.gitmodules
|
||
@@ -0,0 +1,3 @@
|
||
+[submodule "inputs/rawdata"]
|
||
+ path = inputs/rawdata
|
||
+ datalad-id = 68bdb3f3-eafa-4a48-bddd-31e94e8b8242
|
||
+ datalad-url = http://example.com/importantds
|
||
diff --git a/inputs/rawdata b/inputs/rawdata
|
||
new file mode 160000
|
||
index 0000000..fabf852
|
||
--- /dev/null
|
||
+++ b/inputs/rawdata
|
||
@@ -0,0 +1 @@
|
||
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
|
||
</code></pre>
|
||
<aside class="notes">weighs just a few bytes</aside>
|
||
</section>
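<section data-markdown><script type="text/template">
The subdataset record in `.gitmodules` is plain text in Git's config syntax, so it can be inspected with standard tools. A minimal sketch, assuming only the three fields shown in the diff; the function name `read_subdatasets` is hypothetical, and Python's `configparser` only approximates Git's config format:

```python
import configparser

# The record from the diff, without the leading "+" markers
GITMODULES = """[submodule "inputs/rawdata"]
    path = inputs/rawdata
    datalad-id = 68bdb3f3-eafa-4a48-bddd-31e94e8b8242
    datalad-url = http://example.com/importantds
"""


def read_subdatasets(text):
    """Return one dictionary per registered subdataset."""
    # Strip indentation so that every line starts in the first column
    flat = "\n".join(line.strip() for line in text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(flat)
    records = []
    for section in parser.sections():
        # A section name looks like: submodule "inputs/rawdata"
        record = dict(parser[section])
        record["name"] = section.split('"')[1]
        records.append(record)
    return records


subdatasets = read_subdatasets(GITMODULES)
for sub in subdatasets:
    print(sub["path"], sub["datalad-id"], sub["datalad-url"])
```

In a real dataset, `datalad subdatasets` reports this information directly; the sketch only shows that the linkage is a small, human-readable record.
</script></section>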
|
||
|
||
|
||
<section style="text-align: left;">
|
||
<h3>...Dataset nesting</h3>
|
||
|
||
Let's make a nest!
|
||
<div class="fragment">
|
||
Clone a dataset with analysis data into a specific
|
||
location ("input/") in the existing dataset,
|
||
making it a <em>sub</em>dataset:
|
||
<pre style="margin-left: 0;">
|
||
<code class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad clone -d . \
|
||
https://gin.g-node.org/adswa/bids-data \
|
||
input</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Let's see what changed in the dataset, using the <code>subdatasets</code> command:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad subdatasets
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
<div class="fragment">
|
||
... and also <code>git show</code>:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
git show
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section style="text-align:left;">
|
||
<div class="fragment">
|
||
We can now view the cloned dataset's file tree:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
cd input
|
||
ls
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
...and also its history
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
tig
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Let's check the dataset size (with the <code>du</code> disk-usage command):
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
du -sh
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
Let's check the <em>actual</em> dataset size:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad status --annex
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
You can <code>get</code> or <code>drop</code> annexed file contents depending on your needs:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad get sub-02
|
||
</code>
|
||
</pre>
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad drop sub-02
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section style="text-align: left;">
|
||
<h3>...Computationally reproducible execution...</h3>
|
||
|
||
Try to execute the downloaded analysis script. Does it work?
|
||
<div><pre style="margin-left: 0;"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
cd ..
|
||
datalad run -m "Compute brain mask" \
|
||
--input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
|
||
--output "figures/*" \
|
||
--output "sub-02*" \
|
||
"python code/get_brainmask.py"</code></pre></div>
|
||
|
||
<ul class="fragment">
|
||
<li>
|
||
Software can be difficult or impossible to install (e.g. conflicts with existing software,
|
||
or on HPC) for you or your collaborators
|
||
</li>
|
||
<li>
|
||
Different software versions/operating systems can produce different results:
|
||
<a href="https://doi.org/10.3389/fninf.2015.00012" target="_blank">Glatard et al., doi.org/10.3389/fninf.2015.00012</a>
|
||
</li>
|
||
<li class="fragment fade-in">
|
||
<strong>Software containers</strong> encapsulate a software environment and isolate it from
|
||
a surrounding operating system. Two common solutions: Docker, Singularity
|
||
</li>
|
||
</ul>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Software containers</h2><br>
|
||
<iframe src="https://directpoll.com/r?XDbzPBd3ixYqg84Gif8nU69RJWPkCXwpVvMnElD"
|
||
style="border: 0" width="900" height="800"></iframe>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Computational provenance</h2>
|
||
<ul style="font-size:30px">
|
||
<li>
|
||
The <code>datalad-container</code> extension gives DataLad commands to register software containers as "just another file" to your
|
||
dataset, and to execute analyses inside the container with <strong>datalad containers-run</strong>, capturing software as additional
|
||
provenance
|
||
</li>
|
||
</ul>
|
||
<img class="fragment fade-in" src="../pics/containers-run.svg" height="600"> <!-- .element: class="fragment" -->
|
||
</section>
|
||
|
||
<section style="text-align: left;">
|
||
<h3>...Computationally reproducible execution</h3>
|
||
|
||
<div class="fragment">
|
||
Let's try out the <code>containers-run</code> command:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad containers-run -m "Compute brain mask" \
|
||
-n nilearn \
|
||
--input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
|
||
--output "figures/*" \
|
||
--output "sub-02*" \
|
||
"python code/get_brainmask.py"
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
<div class="fragment">
|
||
You can now ask of any individual file how it came to be…
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
git log sub-02_brain-mask.nii.gz
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
|
||
<div class="fragment">
|
||
… and the computation can be redone automatically and checked for computational reproducibility, based on the recorded provenance, using <code>datalad rerun</code>:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad rerun
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
|
||
<section><script src="https://cdn.logwork.com/widget/countdown.js"></script>
|
||
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
|
||
data-style="columns" data-timezone="Europe/Berlin" data-date="2023-09-28 14:00">
|
||
Quick break </a><br>
|
||
we're back shortly
|
||
</section>
|
||
|
||
|
||
</section>
|
||
|
||
<!-------- DATA PUBLICATION & OSF -------->
|
||
|
||
<section>
|
||
|
||
<section>
|
||
<h2>Sharing datasets</h2>
|
||
<div class="r-stack">
|
||
<img class="fragment fade-out" data-fragment-index="1" src="../pics/services_only.png">
|
||
<img class="fragment fade-in" data-fragment-index="1" src="../pics/services_connected.png">
|
||
</div>
|
||
<small>Apart from <b>local computing infrastructure</b> (from private laptops to computational clusters),
|
||
datasets can be hosted in major <b>third party repository hosting and cloud storage</b> services.
|
||
More info: Chapter on <a href="http://handbook.datalad.org/en/latest/basics/basics-thirdparty.html" target="_blank">
|
||
Third party infrastructure</a>.</small>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Sharing datasets</h2><br>
|
||
There are lots of available services, but we will focus on the Open Science Framework.<br>
|
||
<iframe src="https://directpoll.com/r?XDbzPBd3ixYqg84Gif8nU69RJWPkCXwpVvMnElD"
|
||
style="border: 0" width="900" height="800"></iframe>
|
||
</section>
|
||
|
||
<section>
|
||
<h3>Transport logistics: Lots of data, little disk-usage</h3>
|
||
<ul>
|
||
<li class="fragment fade-in">
|
||
Cloned datasets are lean.
|
||
"Meta data" (file names, availability) are present, but <b>no file content</b>:</li>
|
||
<pre class="fragment fade-in"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
|
||
install(ok): /tmp/studyforrest-data-phase2 (dataset)
|
||
$ cd studyforrest-data-phase2 && du -sh
|
||
18M .</code></pre>
|
||
|
||
<li class="fragment fade-in">
|
||
File contents can be retrieved on demand:
|
||
</li>
|
||
</ul>
|
||
<pre class="fragment fade-in"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
|
||
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]</code></pre>
|
||
|
||
<li class="fragment fade-in">Have access to more data on your computer than you have disk-space:</li>
|
||
<pre class="fragment fade-in"><code># eNKI dataset (1.5TB, 34k files):
|
||
$ du -sh
|
||
1.5G .
|
||
# HCP dataset (~200TB, >15 million files)
|
||
$ du -sh
|
||
48G . </code></pre>
|
||
</section>
|
||
|
||
<section data-markdown data-transition="None"> <script type="text/template">
|
||
## Plenty of data, but little disk-usage
|
||
|
||
Drop file content that is not needed:<!-- .element: class="fragment fade-in" -->
|
||
<pre class="fragment fade-in"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">$ datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
|
||
drop(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]</code></pre>
|
||
When files are dropped, only "meta data" stays behind, and they can be re-obtained on demand.<!-- .element: class="fragment fade-in" -->
|
||
<pre><code class="python">dl.get('input/sub-01')
|
||
[really complex analysis]
|
||
dl.drop('input/sub-01')
|
||
</code></pre><!-- .element: class="fragment fade-in" -->
|
||
</script></section>
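<section data-markdown><script type="text/template">
The get, analyze, drop pattern can be wrapped so that content is dropped even when the analysis fails. A minimal sketch, assuming a hypothetical helper `fetched` that receives the retrieval and removal functions as arguments; the lambdas below are stand-ins that only log what would happen:

```python
from contextlib import contextmanager


@contextmanager
def fetched(path, get, drop):
    """Make file content available for the duration of a with-block."""
    get(path)
    try:
        yield path
    finally:
        # Runs even if the analysis raises, so disk space is always freed
        drop(path)


# Hypothetical stand-ins that record calls instead of moving data
calls = []
with fetched(
    "input/sub-01",
    get=lambda p: calls.append(("get", p)),
    drop=lambda p: calls.append(("drop", p)),
) as data:
    calls.append(("analyze", data))  # placeholder for the real analysis
print(calls)
```

With DataLad installed, `datalad.api.get` and `datalad.api.drop` should fit as the `get` and `drop` arguments, since both take a path as their first argument.
</script></section>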
|
||
|
||
<section data-transition="None" style="vertical-align:top">
|
||
<h3>There are two version control tools at work - why?</h3>
|
||
<p class="fragment fade-in">Git does not handle large files well.
|
||
<div class="r-stack">
|
||
<img class="fragment" src="../pics/gitsnapshot.png">
|
||
</div>
|
||
</p>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h3>There are two version control tools at work - why?</h3>
|
||
<p>Git does not handle large files well.
|
||
<img src="../pics/gitsnapshot2.png">
|
||
</p>
|
||
<p class="fragment fade-in">
|
||
And repository hosting services refuse to handle large files:
|
||
<img src="../pics/pushing_large_files_to_Git.png"></p>
|
||
<p style="z-index: 100;position: fixed; font-size:35px;margin-top:-450px;margin-bottom:300px;margin-left:1000px">
|
||
<img class="fragment" src="../pics/horrofied.png" height="380px"></p>
|
||
<p class="fragment fade-in">git-annex to the rescue! Let's take a look at how it works</p>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Git versus Git-annex</h2>
|
||
<img height="500" src="../pics/artwork/src/publishing/publishing_gitvsannex.svg">
|
||
</section>
|
||
|
||
|
||
<section>
|
||
<h2>Dataset internals</h2>
|
||
<ul style="font-size:35px">
|
||
<li>Where the filesystem allows it, annexed files are symlinks:
|
||
<pre><code>$ ls -l sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
|
||
lrwxrwxrwx 1 adina adina 142 Jul 22 19:45 sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz ->
|
||
../../.git/annex/objects/kZ/K5/MD5E-s24180157--aeb0e5f2e2d5fe4ade97117a8cc5232f.nii.gz/MD5E-s24180157
|
||
--aeb0e5f2e2d5fe4ade97117a8cc5232f.nii.gz
|
||
</code></pre><small>(PS: especially useful in datasets with many identical files) </small></li>
|
||
<li>The symlink reveals this internal data organization based on identity hash:
|
||
<pre><code>$ md5sum sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
|
||
aeb0e5f2e2d5fe4ade97117a8cc5232f sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
|
||
</code></pre></li>
|
||
<li class="fragment fade-in">The (tiny) symlink instead of the (potentially large) file content is
|
||
committed - version controlling precise file identity without checking contents into Git
|
||
<img src="../pics/annex-commit.png"></li>
|
||
<li class="fragment fade-in">File contents can be shared via almost all
|
||
standard infrastructure. File availability information forms a decentralized network.
|
||
A file can exist in multiple different locations.</li>
|
||
<pre class="fragment fade-in" ><code class="fragment fade-in" data-fragment-index="1">$ git annex whereis code/nilearn-tutorial.pdf
|
||
whereis code/nilearn-tutorial.pdf (2 copies)
|
||
cf13d535-b47c-5df6-8590-0793cb08a90a -- [datalad]
|
||
e763ba60-7614-4b3f-891d-82f2488ea95a -- jovyan@jupyter-adswa:~/my-analysis [here]
|
||
|
||
datalad: https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf
|
||
</code></pre>
|
||
</ul>
|
||
<small><p >Delineation and advantages of decentralized versus centralized RDM: <a href="https://doi.org/10.1515/nf-2020-0037" target="_blank">
|
||
Hanke et al. (2021). In defense of decentralized research data management</a></p></small>
|
||
</section>
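<section data-markdown><script type="text/template">
The name in the symlink target combines a backend label, the file size, the MD5 checksum, and the file extension. A minimal sketch that rebuilds a name of this shape, assuming the layout visible in the listing (`MD5E-s<size>--<md5><extension>`); the function `annex_style_key` is hypothetical, and git-annex's own rules for extensions are more detailed than "everything from the first dot":

```python
import hashlib
import os


def annex_style_key(content, filename):
    """Build a key shaped like the MD5E names in the symlink target."""
    digest = hashlib.md5(content).hexdigest()
    base = os.path.basename(filename)
    # Assumption: keep everything from the first dot, e.g. ".nii.gz"
    extension = base[base.index("."):] if "." in base else ""
    return f"MD5E-s{len(content)}--{digest}{extension}"


# Two files with identical content map to the same key
key_a = annex_style_key(b"same bytes", "sub-01/func/sub-01_bold.nii.gz")
key_b = annex_style_key(b"same bytes", "sub-02/func/sub-02_bold.nii.gz")
print(key_a)
print(key_a == key_b)
```

Since the key depends on content rather than on the file's location, identical files in different places can point to a single stored object.
</script></section>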
|
||
|
||
<section>
|
||
<h2>Git versus Git-annex</h2>
|
||
<dl>
|
||
<dt>Data in datasets is either stored in Git or git-annex</dt>
|
||
<dd>By default, everything is <i>annexed</i>.</dd>
|
||
<small>
|
||
<table class="fragment fade-in">
|
||
<tr>
|
||
<td style="vertical-align: middle">
|
||
<strong>Two consequences:</strong>
|
||
<li>Annexed contents are not available right after cloning,
|
||
only content identity and availability information (as they are stored in Git).
|
||
Everything that is annexed needs to be retrieved with <code>datalad get</code>
|
||
from wherever it is stored.
|
||
</li>
|
||
<li>Files stored in Git are modifiable, annexed files are protected against accidental modifications</li>
|
||
</td>
|
||
<td width="60%">
|
||
<img src="../pics/git_vs_gitannex.svg" height="500">
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
<table class="fragment fade-in">
|
||
<tr>
|
||
<td><b>Git</b></td>
|
||
<td><b>git-annex</b></td>
|
||
</tr>
|
||
<tr>
|
||
<td>handles <b>small</b> files well (text, code)</td>
|
||
<td>handles <b>all</b> types and sizes of files well</td>
|
||
</tr>
|
||
<tr>
|
||
<td>file contents are in the Git history
|
||
and will be <b>shared</b> upon git/datalad push</td>
|
||
<td>file contents are in the annex. Not necessarily shared</td>
|
||
</tr>
|
||
<tr>
|
||
<td>Shared with every dataset clone</td>
|
||
<td><b>Can be kept private</b> on a per-file level when sharing the dataset</td>
|
||
</tr>
|
||
<tr>
|
||
<td>Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files </td>
|
||
<td>Useful: Large files, private files</td>
|
||
</tr>
|
||
</table>
|
||
</small>
|
||
<br><br><small>Useful background information for demo later. Read
|
||
<a href="http://handbook.datalad.org/en/latest/basics/101-115-symlinks.html" target="_blank">
|
||
this handbook chapter</a> for details
|
||
</small>
|
||
</dl>
|
||
</section>
|
||
|
||
<section>
|
||
<h2>Git versus Git-annex</h2>
|
||
<ul>
|
||
Users can decide which files are annexed:
|
||
<br><br>
|
||
<li><b>Pre-made run-procedures</b>, provided by DataLad (e.g., <code>text2git</code>, <code>yoda</code>)
|
||
or created and shared by users
|
||
(<a href="http://handbook.datalad.org/en/latest/basics/101-124-procedures.html" target="_blank">Tutorial</a>) </li>
|
||
<li>Self-made configurations in <code>.gitattributes</code> (e.g., based on file type,
|
||
file/path name, size, ...; <a href="http://handbook.datalad.org/en/latest/basics/101-123-config2.html#gitattributes" target="_blank">
|
||
rules and examples
|
||
</a> )</li>
|
||
<li>Per-command basis (e.g., via <code>datalad save --to-git</code>)</li>
|
||
</ul>
|
||
</section>
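<section data-markdown><script type="text/template">
A procedure such as `text2git` keeps text files in Git and annexes everything else. The decision can be approximated in a few lines. This is a rough sketch under stated assumptions: the NUL-byte check is a common heuristic for binary content, not the test git-annex performs, and the names `looks_binary` and `storage_for` are hypothetical:

```python
def looks_binary(content, sample_size=8192):
    """Heuristic: treat content as binary if its first bytes contain NUL."""
    return b"\x00" in content[:sample_size]


def storage_for(content):
    """Return 'annex' for non-empty binary content, otherwise 'git'."""
    if len(content) > 0 and looks_binary(content):
        return "annex"
    return "git"


print(storage_for(b"print('hello')\n"))   # a small script
print(storage_for(b"\x00\x01\x02\x03"))    # binary-looking bytes
```

In a dataset, the actual rule lives in `.gitattributes` and is evaluated by git-annex whenever files are saved.
</script></section>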
|
||
|
||
|
||
<section data-transition="None">
|
||
<h2>Publishing datasets</h2>
|
||
I have a dataset on my computer. How can I share it, or collaborate on it?
|
||
<img height="900" src="../pics/startingpoint.svg">
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Glossary</h2>
|
||
<dl style="font-size:30px">
|
||
<dt class="fragment fade-in" data-fragment-index="1">
|
||
Sibling (remote)</dt>
|
||
<dd class="fragment fade-in" data-fragment-index="1">
|
||
Linked clones of a dataset. You can usually update (from) siblings to keep all your siblings in sync
|
||
(e.g., ongoing data acquisition stored on an experiment computer and backed up on a cluster and an external hard drive)
|
||
</dd>
|
||
<dt class="fragment fade-in" data-fragment-index="2">
|
||
Repository hosting service</dt>
|
||
<dd class="fragment fade-in" data-fragment-index="2">
|
||
Webservices to host Git repositories, such as GitHub, GitLab, Bitbucket, Gin, ...</dd>
|
||
<dt class="fragment fade-in" data-fragment-index="3">
|
||
Third-party storage</dt>
|
||
<dd class="fragment fade-in" data-fragment-index="3">
|
||
Infrastructure (private/commercial/free/...) that can host data. A "special remote" protocol
|
||
is used to publish or pull data to and from it
|
||
</dd>
|
||
<dt class="fragment fade-in" data-fragment-index="4">
|
||
Publishing datasets</dt>
|
||
<dd class="fragment fade-in" data-fragment-index="4">
|
||
<em>Pushing</em> dataset contents (Git and/or annex) to a sibling using <strong>datalad push</strong></dd>
|
||
<dt class="fragment fade-in" data-fragment-index="5">
|
||
Updating datasets</dt>
|
||
<dd class="fragment fade-in" data-fragment-index="5">
|
||
<em>Pulling</em> new changes from a sibling using <strong>datalad update --merge</strong></dd>
|
||
</dl>
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Publishing datasets</h2>
|
||
<ul>
|
||
<li>Most public datasets separate content in Git versus git-annex behind the scenes</li>
|
||
</ul>
|
||
<img height="900" src="../pics/artwork/src/publishing/publishing_network_gitvsannex.svg">
|
||
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Publishing datasets</h2>
|
||
<img height="900" src="../pics/artwork/src/publishing/publishing_network_publishparts.svg">
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Publishing datasets</h2>
|
||
<img height="900" src="../pics/artwork/src/publishing/publishing_network_publishparts2.svg">
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Publishing datasets</h2>
|
||
Typical case:
|
||
<ul style="font-size:30px">
|
||
<li class="fragment fade-in">
|
||
Datasets are exposed via a private or public repository on a
|
||
repository hosting service
|
||
</li>
|
||
<li class="fragment fade-in">
|
||
Data can't be stored in the repository hosting service, but can be
|
||
kept in almost any third party storage
|
||
</li>
|
||
<li class="fragment fade-in">
|
||
Publication dependencies automate pushing to the correct place, e.g.,
|
||
<pre>
|
||
<code class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
$ git config --local remote.github.datalad-publish-depends gdrive
|
||
# or
|
||
$ datalad siblings add --name origin --url git@git.jugit.fzj.de:adswa/experiment-data.git --publish-depends s3
|
||
</code>
|
||
</pre>
|
||
</li>
|
||
</ul>
|
||
<img src="../pics/artwork/src/publishing/publishing_network_publishdepends.svg">
|
||
</section>
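<section data-markdown><script type="text/template">
A publication dependency means that a push to one sibling first triggers a push to the siblings it depends on. The resulting order can be sketched as a depth-first walk over the dependency mapping. This illustrates the concept only, not DataLad's implementation; the function `push_order` and the dictionary layout are assumptions:

```python
def push_order(target, depends):
    """List siblings in the order they would need to be pushed to."""
    order = []
    in_progress = set()

    def visit(sibling):
        if sibling in order:
            return
        if sibling in in_progress:
            raise ValueError(f"circular publication dependency at {sibling!r}")
        in_progress.add(sibling)
        # Dependencies are handled before the sibling itself
        for dependency in depends.get(sibling, []):
            visit(dependency)
        in_progress.discard(sibling)
        order.append(sibling)

    visit(target)
    return order


# Mirrors the first example: the sibling "github" depends on "gdrive"
depends = {"github": ["gdrive"]}
print(push_order("github", depends))
```

With this mapping, annexed data would reach `gdrive` before the Git history reaches `github`, so every clone from `github` can find the file content it refers to.
</script></section>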
|
||
|
||
|
||
<section data-transition="None">
|
||
<h2>Publishing datasets</h2>
|
||
<p style="font-size:30px"> Special case 1: repositories with annex support</p>
|
||
<img height="850" class="fragment fade-in" src="../pics/artwork/src/publishing/publishing_network_publishgin.svg">
|
||
</section>
|
||
|
||
<section data-transition="None">
|
||
<h2>Publishing datasets</h2>
|
||
<p style="font-size:30px">Special case 2: Special remotes with repositories</p>
|
||
<img height="850" src="../pics/artwork/src/publishing/publishing_network_publishosf.svg">
|
||
</section>
|
||
|
||
|
||
<section>
|
||
<h2><code>Publishing to OSF</code></h2>
|
||
<p><a href="https://osf.io/">https://osf.io/</a></p>
|
||
<img src="../pics/git-annex-osf-logo.png" alt="datalad-osf-logo" width="50%">
|
||
</section>
|
||
|
||
<section style="text-align: left;">
|
||
<div style="display: flex !important; align-items: center">
|
||
<h2>create-sibling-osf</h2> <a href="https://docs.datalad.org/projects/osf/en/latest/" target="_blank">(docs)</a>
|
||
</div>
|
||
Requires the DataLad extensions <code>datalad-osf</code> and <code>datalad-next</code><br><br>
|
||
|
||
<ol>Prerequisites:
|
||
<li class="fragment">Log into OSF</li>
|
||
<li class="fragment">Create personal access token</li>
|
||
<li class="fragment">Enter credentials using <code>datalad osf-credentials</code>:</li>
|
||
</ol>
|
||
<div class="fragment">
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad osf-credentials
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section style="text-align: left;">
|
||
<div style="display: flex !important; align-items: center">
|
||
<h2>create-sibling-osf</h2> <a href="https://docs.datalad.org/projects/osf/en/latest/" target="_blank">(docs)</a>
|
||
</div>
|
||
|
||
<div>
|
||
Create the sibling in your dataset (different modes are possible):
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad create-sibling-osf -d . -s my-osf-sibling \
|
||
--title 'my-osf-project-title' --mode export --public
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
<div class="fragment">
|
||
Push to the sibling:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
datalad push -d . --to my-osf-sibling
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
<div class="fragment">
|
||
Clone from the sibling:
|
||
<pre style="margin-left: 0;">
|
||
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
||
cd ..
|
||
datalad clone osf://my-osf-project-id my-osf-clone
|
||
</code>
|
||
</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section><script src="https://cdn.logwork.com/widget/countdown.js"></script>
|
||
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
|
||
data-style="columns" data-timezone="Europe/Berlin" data-date="2023-09-28 15:30">
|
||
Quick break </a><br>
|
||
Next up: Your Questions and Use Cases
|
||
</section>
|
||
|
||
</section>
|
||
|
||
<!-- QUESTIONS -->
|
||
|
||
<section>
|
||
|
||
|
||
<section>
|
||
<h2>Summary and Take-Home Messages</h2>
|
||
</section>
|
||
|
||
<section data-markdown data-transition="none"><script type="text/template">
|
||
## Exhaustive tracking of research components
|
||
<!-- .element: width="100%" -->
|
||
Well-structured datasets (using community standards), and portable computational environments — and their evolution — are the precondition for reproducibility
|
||
|
||
<table width=100% style="padding:0px">
|
||
<tr><td style="padding:0px">
|
||
<code><pre>
|
||
# turn any directory into a dataset
|
||
# with version control
|
||
|
||
% datalad create <directory>
|
||
</pre></code>
|
||
</td><td style="padding:0px">
|
||
<code><pre>
|
||
# save a new state of a dataset with
|
||
# file content of any size
|
||
|
||
% datalad save
|
||
</pre></code>
|
||
</td></tr></table>
|
||
Note:
|
||
- link to prev. statements on description standards
|
||
- your community could be really small (your lab); when data are precious, resources
|
||
will be spent to understand it, but information must be captured to make this possible
|
||
</script></section>
|
||
|
||
<section data-markdown data-transition="none"><script type="text/template">
|
||
## Capture computational provenance
|
||
<!-- .element: width="100%" -->
|
||
Which data was needed at which version, as input into which code, running with what parameterization in which
|
||
computational environment, to generate an outcome?
|
||
|
||
<table width=100% style="padding:0px">
|
||
<tr><td style="padding:0px">
|
||
<code><pre>
|
||
# execute any command and capture its output
|
||
# while recording all input versions too
|
||
|
||
% datalad run --input ... --output ... <command>
|
||
</pre></code>
|
||
</td></tr></table>
|
||
|
||
Note:
|
||
The missing link: even when everything is shared, we still don't know how to start.
|
||
README is minimum, but executable prov-records are much better.
|
||
</script></section>

<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive capture enables portability
![](../pics/dataset_linkage_provenance_pushtocloud.png)<!-- .element: width="100%" -->
Precise identification of data and computational environments
combined with provenance records forms a comprehensive and portable
data structure, capturing all aspects of an investigation.

<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# transfer data and metadata to other sites and services
# with fine-grained access control for dataset components

% datalad push --to <site-or-service>
</pre></code>
</td></tr></table>

Note:
Does it fly? Can you give it to someone? Or can you take it with you to your new lab?
</script></section>

<section data-markdown data-transition="none"><script type="text/template">
## Reproducibility strengthens trust
![](../pics/dataset_linkage_provenance_rerun.png)<!-- .element: width="100%" -->
Outcomes of computational transformations can be validated by authorized third parties. This enables audits, promotes accountability, and streamlines automated "upgrades" of outputs.

<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# obtain dataset (initially only identity,
# availability, and provenance metadata)

% datalad clone <url>
</pre></code>
</td><td style="padding:0px">
<code><pre>
# immediately actionable provenance records
# full abstraction of input data retrieval

% datalad rerun <commit|tag|range>
</pre></code>
</td></tr></table>
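
The two steps above could be chained as follows. This is a sketch only: the URL, the clone path, and the revision are made-up placeholders, and the commands are printed rather than executed.

```python
import shlex

# Hypothetical values: replace with a real dataset URL and a revision
# (commit hash, tag, or range) that holds a `datalad run` record.
url = "https://example.com/my-analysis.git"
clone_path = "my-analysis"
revision = "HEAD"

steps = [
    ["datalad", "clone", url, clone_path],
    # -d points rerun at the freshly cloned dataset
    ["datalad", "rerun", "-d", clone_path, revision],
]

for step in steps:
    print(shlex.join(step))
```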
Note:
The goal is automated reproducibility, which enables assessing robustness and benchmarking algorithmic developments
</script></section>

<section data-markdown data-transition="none"><script type="text/template">
## Ultimate goal: (re-)usability
![](../pics/dataset_linkage_provenance_rerun_reuse.png)<!-- .element: width="100%" -->
Verifiable, portable, self-contained data structures that track all aspects of an investigation exhaustively can be (re-)used as modular components in larger contexts, propagating their traits

<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# declare a dependency on another dataset and
# re-use it at a particular state in a new context

% datalad clone -d <superdataset> <url> <path-in-dataset>
</pre></code>
</td></tr></table>

Note:
With these in place, re-usability is a small(er) step
</script></section>

<section>
<h2>Your Questions and Use Cases</h2>
</section>


<section>
<h2>Post-Workshop Contact</h2>
<ul>
<li class="fragment fade-in">Slides are CC-BY. They will stay online and will be made available as a PDF as well</li>
<li class="fragment fade-in">Contact the DataLad Team anytime via GitHub issue, Matrix chat message, or in our office hour video call</li>
<li class="fragment fade-in">Find more DataLad content and tutorials at <a href="https://handbook.datalad.org" target="_blank">handbook.datalad.org</a></li>
<br>
<li class="fragment fade-in">Join us at our first conference for distributed data management:
<a href="https://distribits.live/" target="_blank">distribits.live</a> (April 2024, registration closes October 15th)</li>
</ul>
<br><br>
<h3 class="fragment fade-in">Thanks for your attention!</h3>
</section>

<section style="text-align:left">
<h2>List of installed software on Jupyter</h2>
The JupyterHub runs on Ubuntu 22.04 on an AWS EC2 instance. The following packages were installed with different package managers:
<br><br>
<ul>
<li>apt: Git, git-annex, tree, tig, zsh, singularity</li>
<li>pip: datalad, datalad-next, datalad-container, datalad-osf, black</li>
</ul>
<br><br>
Instructions to set up and configure your own JupyterHub are publicly available at <a href="https://psychoinformatics-de.github.io/rdm-course/for_instructors/index.html" target="_blank">
psychoinformatics-de.github.io/rdm-course/for_instructors
</a>
<ul></ul>
</section>

</section>

<!--- OUTLOOK --->

<section>

<section>
<h2>Outlook</h2>
</section>

<section data-markdown data-transition="None"><script type="text/template">
## FAIRly big: Scaling up

Objective: Process the UK Biobank (imaging data)
![](../pics/ukb_overview.png)<!-- .element: height="400" -->

- 76 TB in 43 million files in total
- 42,715 participants contributed personal health data
- Strict DUA
- Custom binary-only downloader
- Most data records offered as (unversioned) ZIP files
</script></section>

<section data-markdown data-transition="None"><script type="text/template">
## Challenges

- Process data such that
  - Results are computationally reproducible (without the original compute infrastructure)
  - There is complete linkage from results to an individual data record download
  - It scales with the amount of available compute resources

- Data processing pipeline
  - Compiled MATLAB blob
  - 1h processing time per image, with 41k images to process
  - 1.2 M output files (30 output files per input file)
  - 1.2 TB total size of outputs
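
A back-of-the-envelope check of the scale, using only the numbers on this slide. Reading "41k" as exactly 41,000 images and 1 TB as 1,000,000 MB are simplifying assumptions.

```python
# Figures taken from the slide; everything derived below is plain arithmetic.
images = 41_000           # images to process (assumes "41k" means 41,000)
hours_per_image = 1       # processing time per image
outputs_per_image = 30    # output files per input file
total_output_tb = 1.2     # total size of outputs

total_cpu_hours = images * hours_per_image
serial_years = total_cpu_hours / (24 * 365)
output_files = images * outputs_per_image
mean_file_mb = total_output_tb * 1_000_000 / output_files  # 1 TB = 1,000,000 MB

print(f"{total_cpu_hours} CPU hours, about {serial_years:.1f} years if run serially")
print(f"{output_files} output files, about {mean_file_mb:.2f} MB each on average")
```

Run serially, the pipeline would take several years, which is why the computation has to scale with the available compute resources.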
</script></section>

<section data-transition="None">
<h2>FAIRly big setup</h2>
<img src="../pics/fairlybig_ukbsetup.png" width="1200" style="margin-top:-35px;margin-bottom:-30px">

<ul style="font-size:30px">
<strong>Exhaustive tracking</strong>
<li><a href="https://github.com/datalad/datalad-ukbiobank" target="_blank">datalad-ukbiobank</a>
extension downloads, transforms & tracks the evolution of the complete data release
in DataLad datasets
</li>
<li>Native and BIDSified data layout (at no additional disk space usage)</li>
<li>Structured in 42k individual datasets, combined into one superdataset</li>
<li>Pipeline packaged in a software container</li>
<li>Link input data & computational pipeline as dependencies</li>
</ul>
<br><br>
<small><a href="https://www.nature.com/articles/s41597-022-01163-2" target="_blank">
Wagner, Waite, Wierzba et al. (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data.</a>
</small>
</section>

<section data-transition="None">
<h2>FAIRly big workflow</h2>
<div class="r-stack">
<img class="fragment fade-out" src="../pics/fairlybig_workflow.png" width="1200" style="margin-top:-35px;margin-bottom:-30px">
<img src="../pics/htcondor.svg" class="fragment fade-in">
</div>
<br>
<ul style="font-size:30px">
<strong>Portability</strong>
<li>Parallel processing: 1 job = 1 subject
(number of concurrent jobs capped at the capacity of the compute cluster)
</li>
<li>Each job is computed in an ephemeral (short-lived) dataset clone and results are pushed back,
ensuring exhaustive tracking &
portability during computation</li>
<li>Content-agnostic persistent (encrypted) storage (minimizing storage and inodes)</li>
<li>Common data representation in secure environments</li>
</ul>
<br><br>
<small><a href="https://www.nature.com/articles/s41597-022-01163-2" target="_blank">
Wagner, Waite, Wierzba et al. (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data.</a>
</small></section>

<section data-transition="None">
<h2>FAIRly big provenance capture</h2>
<img src="../pics/fairlybig_prov.png" width="1200" style="margin-top:-35px;margin-bottom:-30px">
<br><br>
<ul style="font-size:30px">
<strong>Provenance</strong>
<li>Every single pipeline execution is tracked</li>
<li>Execution in ephemeral workspaces ensures that results are
individually reproducible without HPC access</li>
</ul>
<br><br>
<small><a href="https://www.nature.com/articles/s41597-022-01163-2" target="_blank">
Wagner, Waite, Wierzba et al. (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data.</a>
</small></section>

<section data-markdown><script type="text/template">
## FAIRly big movie

<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/UsW6xN2f2jc?start=17" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

- Two computations on clusters of different scale (small cluster, supercomputer). Full video: https://youtube.com/datalad
- Two full (re-)computations, programmatically comparable, verifiable, reproducible -- on any system with data access
</script></section>

</section>


</div>
</div>

<script src="../reveal.js/dist/reveal.js"></script>
<script src="../reveal.js/plugin/notes/notes.js"></script>
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
<script src="../custom_functions.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
  hash: true,
  // The "normal" size of the presentation, aspect ratio will be preserved
  // when the presentation is scaled to fit different resolutions. Can be
  // specified using percentage units.
  width: 1280,
  height: 960,
  // Factor of the display size that should remain empty around the content
  margin: 0.3,
  // Bounds for smallest/largest possible scale to apply to content
  minScale: 0.2,
  maxScale: 1.0,

  controls: true,
  progress: true,
  history: true,
  center: true,
  slideNumber: 'c',
  pdfSeparateFragments: false,
  pdfMaxPagesPerSlide: 1,
  pdfPageHeightOffset: -1,
  transition: 'slide', // none/fade/slide/convex/concave/zoom
  // Learn about plugins: https://revealjs.com/plugins/
  plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
});
</script>
</body>
</html>