<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<!-- Edit me start! -->
<title>DataLad 4 SFB 1280</title>
<meta name="description" content=" Virtual DataLad course for the SFB 1280 Bochum/Essen/Dortmund ">
<meta name="author" content=" Adina Wagner ">
<!-- Edit me end! -->
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
<link rel="stylesheet" href="../css/main.css">
<!-- Theme used for syntax highlighted code -->
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
</head>
<body>
<div class="reveal">
<div class="slides">
<section>
<section>
<script src="https://cdn.logwork.com/widget/countdown.js"></script>
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
data-style="columns" data-timezone="Europe/Berlin" data-date="2023-09-28 13:00">
Workshop starts in
</a>
Have a ☕!
</section>
<section>
<h2>Research data management<br />👩‍💻👨‍💻<br />with DataLad</h2>
<div style="margin-top:1em;text-align:center">
<table style="border: none;">
<tr>
<td>
Adina Wagner<br><small><a href="https://mas.to/@adswa" target="_blank">
<img data-src="../pics/mastodon.svg" style="height:30px;margin:0px" /> mas.to/@adswa</a></small>
</td>
<td>
<br>
</td>
</tr>
<tr>
<td>
<img style="height:70px;margin-right:10px" data-src="../pics/fzj_logo.svg" /><br>
</td>
<td style="vertical-align:top">
<small><a href="http://psychoinformatics.de" target="_blank">Psychoinformatics lab</a>,
<br> Institute of Neuroscience and Medicine (INM-7)<br>
Research Center Jülich</small><br>
</td>
</tr>
</table>
</div>
<br><br><small>
Interactive Slides: <a href="https://files.inm7.de/adina/talks/html/sfb-1280.html" target="_blank">files.inm7.de/adina/talks/html/sfb-1280.html</a><br>
PDF for download: <a href="https://files.inm7.de/adina/talks/pdfs/sfb-1280.pdf" target="_blank">files.inm7.de/adina/talks/pdfs/sfb-1280.pdf</a><br>
Sources: <a href="https://github.com/datalad-handbook/datalad-course/blob/main/html/sfb-1280.html" target="_blank">
https://github.com/datalad-handbook/datalad-course</a></small>
</section>
</section>
<!--...INTRODUCTION AND LOGISTICS (30 Mins)...-->
<section>
<section>
<h2>Welcome & Logistics!</h2>
<ul style="font-size:35px">
<li class="fragment fade-in-then-semi-out">
An approximate schedule for today:
<ul>
<li>1.00 pm: Introduction & Logistics</li>
<li>1.30 pm: Overview of DataLad + break ☕</li>
<li>2.00 pm: What's version control, and why should I care?</li>
<li>2.45 pm: Reproducibility features + break</li>
<li>3.30 pm: Data publication to the OSF + break ☕</li>
<li>4.30 pm: Outlook and/or Your Questions and Usecases</li>
</ul>
</li>
<li class="fragment fade-in-then-semi-out">
Collaborative notes & anonymous questions: <a href="https://etherpad.wikimedia.org/p/Datalad@sfb1280" target="_blank">
etherpad.wikimedia.org/p/Datalad@sfb1280</a>.
</li>
<li class="fragment fade-in-then-semi-out">
Slides are CC-BY and will be shared after the workshop. Additional
workshop contents: <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">
psychoinformatics-de.github.io/rdm-course</a>
</li>
<li class="fragment fade-in-then-semi-out">
Some guidelines for the virtual workshop venue...
</li>
<ul>
<li class="fragment fade-in">
Please mute yourself when you are not speaking
</li>
<li class="fragment fade-in">
Ask questions anytime, but make use of the "Raise hand" feature
</li>
<li class="fragment fade-in">
Drop out and re-join as you please
</li>
</ul>
</ul>
</section>
<section>
<h2>Questions/interaction throughout the workshop</h2>
<ul style="font-size:35px">
<li>
There are no stupid questions :)
</li>
<li>
Lively discussions are wonderful - unless it interrupts others,
please feel encouraged to unmute/turn on your video to interact.
</li>
<li>
There is room to discuss specific or advanced use cases at the end. Please make a note about them in
the <a href="https://etherpad.wikimedia.org/p/Datalad@sfb1280" target="_blank">Etherpad</a>.
</li>
</ul>
</section>
<section>
<h2>Questions/interaction after the workshop</h2>
<ul>
If you have a question after the workshop, you can reach out for help:<br>
<ul style="font-size:30px">
<dt>Reach out to the <b>DataLad</b> team via</dt>
<li>
<a href="https://matrix.to/#/!NaMjKIhMXhSicFdxAj:matrix.org?via=matrix.waite.eu&via=matrix.org&via=inm7.de" target="_blank">
Matrix</a> (free, decentralized chat service, no app installation needed).
We run a weekly Zoom office hour (Tuesday, 4pm Berlin time) from this room as well.
</li>
<li>
<a href="https://github.com/datalad/datalad" target="_blank">
the development repository on GitHub</a>
</li><br>
<dt>Reach out to the user community with</dt>
<li>
A question on <a href="https://neurostars.org/" target="_blank">neurostars.org</a>
with a <code>datalad</code> tag
</li><br>
<dt>Find more user tutorials or workshop recordings</dt>
<li>
On <a href="https://www.youtube.com/datalad" target="_blank">
DataLad's YouTube channel</a>
</li>
<li>
In the <a href="http://handbook.datalad.org/en/latest/" target="_blank">
DataLad Handbook </a>
</li>
<li>
In the <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">DataLad RDM course</a>
</li>
<li>
In the <a href="http://docs.datalad.org" target="_blank">Official API documentation</a>
</li>
</ul>
</ul>
</section>
<section>
<h2>Resources and Further Reading</h2>
<table style="font-size:30px">
<tr>
<td>
Comprehensive user documentation in the<br>
DataLad Handbook
<a href="http://handbook.datalad.org" target="_blank">(handbook.datalad.org)</a>
</td>
<td>
<img src="../pics/logo.svg" height="150">
</td>
</tr>
</table>
<table style="font-size:30px">
<tr>
<td><img src="../pics/artwork/src/enter.svg" height="100"></td>
<td>
<ul>
<li>High-level function/command overviews, <br>
Installation, Configuration, Cheatsheet
</li>
</ul>
</td>
</tr>
<tr>
<td><img src="../pics/artwork/src/basics.svg" height="100"></td>
<td>
<ul>
<li>Narrative-based code-along course</li>
<li>Independent of background/skill level, <br>
suitable for data management novices
</li>
</ul>
</td>
</tr>
<tr>
<td><img src="../pics/artwork/src/usecases.svg" height="100"></td>
<td>
<ul>
<li>Step-by-step solutions to common <br>
data management problems, like<br />how to
make a reproducible paper
</li>
</ul>
</td>
</tr>
</table>
<p style="font-size:30px">
Overview of most tutorials, talks, videos, ... at
<a href="https://github.com/datalad/tutorials" target="_blank">
github.com/datalad/tutorials</a>
</p>
</section>
<section>
<h2>Live polling system</h2>
Please use your phone to scan the QR code, or open the link in a new browser window <br>
<iframe src="https://directpoll.com/r?XDbzPBd3ixYqg84Gif8nU69RJWPkCXwpVvMnElD"
style="border: 0" width="900" height="800"></iframe>
</section>
<section>
<h2>What's your mood today?</h2>
<img src="../pics/sheepscale.png" height="600"><iframe src="https://directpoll.com/r?XDbzPBd3ixYqg84Gif8nU69RJWPkCXwpVvMnElD"
style="border: 0" width="400" height="600"></iframe>
</section>
<section>
<h2>Practical aspects</h2>
<img width="200" src="../pics/jupyter_logo.png" alt="jupyterlogo"><br>
<ul>
<li>
We'll work in the browser on a cloud server with JupyterHub
</li>
<li class="fragment">
Cloud-computing environment:<br>
&nbsp;&nbsp;&nbsp;- <a href="https://datalad-hub.inm7.de">datalad-hub.inm7.de</a>
</li>
<li class="fragment">
We have pre-installed DataLad and other requirements
</li>
<li class="fragment">
We will work via the terminal
</li>
<li class="fragment">
Your username is all lower-case and follows this pattern: Firstname + Lastname initial (Adina Wagner -> adinaw)
</li>
<li class="fragment">
Pick any password with at least 8 characters at first log-in (and remember it)
</li>
</ul>
<p class="fragment"> Please try to log in now</p>
</section>
<section data-transition="None">
<h2>Prerequisites: Using DataLad</h2>
<ul style="font-size:30px">
<li>Every DataLad command consists of a main
command followed by a subcommand. Both the main command and the subcommand can have options.
<img height="280px" src="../pics/command-structure.png">
</li>
<li> Example (main command, subcommand, several subcommand options):
<pre><code>$ datalad save -m "Saving changes" --recursive </code></pre>
</li>
<li>
Use <em>--help</em> to find out more about any (sub)command and its
options, including detailed description and examples (<em>q</em> to close).
Use <em>-h</em> to get a short overview of all options.
<pre><code>$ datalad save -h
Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
[-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
[--version]
[PATH ...]
Use '--help' to get more comprehensive information.
</code></pre></li>
</ul>
</section>
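<section data-markdown style="text-align: left;">
<textarea data-template>
### Sketch: the command structure in code

The structure described above can be sketched in a few lines of Python. This is only an illustration: the helper `build_datalad_call` is a hypothetical name and not part of DataLad, and the snippet only assembles and prints the call, it does not execute it.

```python
import shlex


def build_datalad_call(subcommand, *paths, main_options=(), sub_options=()):
    """Assemble a call as: main command, main options, subcommand, subcommand options, paths."""
    return ["datalad", *main_options, subcommand, *sub_options, *paths]


# The example from the slide: everything after "save" is a subcommand option.
call = build_datalad_call("save", sub_options=["-m", "Saving changes", "--recursive"])

# shlex.join quotes arguments that contain spaces, so the line can be pasted into a shell.
print(shlex.join(call))
```

Options placed before the subcommand belong to the main command, options placed after it belong to the subcommand.
</textarea>
</section>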
<section style="text-align: left;">
<h3>Using DataLad in the Terminal</h3>
Check the installed version:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad --version
</code>
<p id="displayArea"></p>
</pre>
<div class="fragment">
For help on using DataLad from the command line (press q to exit):
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad --help
</code>
</pre>
</div>
<div class="fragment">
For extensive info about the installed package, its dependencies, and extensions, use <code>datalad wtf</code>.
Let's find out what kind of system we're on:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad wtf -S system
</code>
</pre>
</div>
</section>
<section style="text-align: left;">
<h3>Git identity</h3>
Check your Git identity:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --get user.name
git config --get user.email
</code>
</pre>
<div class="fragment">
Configure your Git identity:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --global user.name "Adina Wagner"
git config --global user.email "adina.wagner@t-online.de"
</code>
</pre>
</div>
<div class="fragment">
Enable the <code>datalad-next</code> extension to use the latest DataLad features:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --global --add datalad.extensions.load next
</code>
</pre>
</div>
</section>
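<section data-markdown style="text-align: left;">
<textarea data-template>
### Sketch: checking the Git identity from Python

The same check can be scripted, for example to verify a setup before a workshop. This is a minimal sketch: the function names `git_config_get` and `missing_identity_keys` are made up for this example, and it assumes that `git config --get` exits with a non-zero code when a key is unset.

```python
import shutil
import subprocess


def git_config_get(key):
    """Return the value of a Git config key, or None if it is unset or Git is unavailable."""
    if shutil.which("git") is None:
        return None
    result = subprocess.run(
        ["git", "config", "--get", key], capture_output=True, text=True, check=False
    )
    value = result.stdout.strip()
    return value if result.returncode == 0 and value else None


def missing_identity_keys(lookup=git_config_get):
    """List which of user.name and user.email still need to be configured."""
    return [key for key in ("user.name", "user.email") if not lookup(key)]


# An empty list means both values are set and commits can be attributed to you.
print(missing_identity_keys())
```

The `lookup` parameter exists only so that the check can be tried out without touching the real Git configuration.
</textarea>
</section>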
<section style="text-align: left;">
<h3>Using DataLad via its Python API</h3>
Open a Python environment:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
ipython
</code>
</pre>
<div class="fragment">
Import and start using:
<pre style="margin-left: 0;">
<code data-trim class="language-python" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
import datalad.api as dl
dl.create(path='mydataset')
</code>
</pre>
</div>
<div class="fragment">
Exit the Python environment:
<pre style="margin-left: 0;">
<code data-trim class="language-python" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
exit
</code>
</pre>
</div>
</section>
<section data-transition="None">
<h2>Different ways to use DataLad</h2>
<ul>
<div>
<li>DataLad can be used from the command line</li>
<pre><code>datalad create mydataset</code></pre>
</div>
<div class="fragment fade-in">
<li>... or with its Python API</li>
<pre><code class="python">import datalad.api as dl
dl.create(path="mydataset")</code></pre>
</div>
<div class="fragment fade-in">
<li>... and other programming languages can use it via system calls</li>
<pre><code class="r"># in R
> system("datalad create mydataset")</code></pre>
</div>
<li class="fragment fade-in">... or via a graphical user interface
<a href="https://github.com/datalad/datalad-gooey" target="_blank">"DataLad Gooey"</a>
</li>
<br><br>
</ul>
</section>
</section>
<!----------- OVERVIEW OF DATALAD ---------->
<section>
<section>
<h2>Acknowledgements</h2>
<table>
<tr style="vertical-align:top">
<td style="vertical-align:top">
<dl>
<dt>Software</dt>
<dd style="margin-left:5px!important">
<ul style="margin-left:5px!important">
<li>Joey Hess (git-annex)</li>
<li>The DataLad team &
contributors</li>
</ul>
</dd>
<dt style="margin-top:20px">Illustrations </dt>
<dd style="margin-left:5px!important">
<ul style="margin-left:5px!important">
<li>The Turing Way <br>
project & Scriberia</li>
<img src="../pics/bannerthanks.svg">
</ul>
</dd>
</dl>
</td>
<td style="vertical-align:top">
<div style="margin-bottom:-20px;text-align:center"><strong>Funders</strong></div>
<img style="height:150px;margin-right:50px" data-src="../pics/nsf_2020.png" />
<img style="height:150px;margin-right:50px;margin-left:50px" data-src="../pics/binc.png" />
<img style="height:150px;margin-left:50px" data-src="../pics/bmbf_2020.png" />
<img style="height:80px;margin-top:-40px;margin-left:auto;margin-right:auto;width:100%" data-src="../pics/fzj_logo.svg" />
<div style="margin-top:-20px">
<img style="height:60px;margin-right:20px" data-src="../pics/erdf.png" />
<img style="height:60px;margin-right:20px" data-src="../pics/cbbs_logo.png" />
<img style="height:60px" data-src="../pics/LSA-Logo.png" />
</div>
<div style="margin-top:40px;margin-bottom:20px;text-align:center"><strong>Collaborators</strong></div>
<div style="margin-top:-20px">
<img style="height:100px;margin:20px" data-src="../pics/hbp_logo.png" />
<img style="height:100px;margin:20px" data-src="../pics/conp_logo.png" />
<img style="height:100px;margin:20px" data-src="../pics/vbc_logo.png" />
</div>
<div style="margin-top:-40px">
<img style="height:120px;margin:20px" data-src="../pics/openneuro_logo.png" />
<img style="height:120px;margin:20px" data-src="../pics/cbrain_logo.png" />
<img style="height:140px;margin:20px" data-src="../pics/brainlife_logo.png" />
</div>
</td>
</tr>
</table>
</section>
<section>
<h2><img src="../pics/datalad_logo_wide.svg" height="150">Core Features:</h2>
<ul>
<li class="fragment fade-in-then-semi-out">
Joint <b>version control</b> (<a href="https://git-scm.com/" target="_blank">Git</a>,
<a href="https://git-annex.branchable.com/" target="_blank">git-annex</a>): version control data & software alongside your code
</li>
<li class="fragment fade-in-then-semi-out">
<b>Provenance capture</b>:
Create and share machine-readable, re-executable provenance records for reproducible, transparent, and FAIR research
</li>
<li class="fragment fade-in-then-semi-out">
Decentralized <b>data transport</b> mechanisms:
Install, share and collaborate on scientific projects; publish,
update, and retrieve their contents in a streamlined fashion on demand,
and distribute files in a decentralized network on the services or infrastructures
of your choice
</li>
</ul><br>
</section>
<section data-transition="None">
<h3>Examples of what DataLad can be used for:</h3>
<ul>
<li class="fragment fade-in-then-semi-out">
<b>Publish or consume datasets</b>
via GitHub, GitLab, OSF, the European Open Science Cloud, or similar services
</li>
</ul>
<img height="700" class="fragment fade-in" src="../pics/getdata_studyforrest.gif" alt="a screenrecording of cloning studyforrest data from github">
</section>
<section data-transition="None">
<h3>Examples of what DataLad can be used for:</h3>
<ul>
<li class="fragment fade-in-then-semi-out">
Behind-the-scenes <b>infrastructure component for data transport and versioning</b>
(e.g., used by <a href="https://openneuro.org/" target="_blank"> OpenNeuro</a>,
<a href="https://brainlife.io/" target="_blank"> brainlife.io </a>,
the <a href="https://conp.ca/" target="_blank">Canadian Open Neuroscience Platform (CONP)</a>,
<a href="https://mcin.ca/technology/cbrain/" target="_blank"> CBRAIN</a>)
</li>
</ul>
<img height="700" class="fragment fade-in" src="../pics/openneuro_new_2.gif" alt="a screenrecording of browsing open neuro">
</section>
<section data-transition="None">
<h3>Examples of what DataLad can be used for:</h3>
<ul>
<li class="fragment fade-in-then-semi-out">
<b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance
</li>
</ul>
<img height="700" class="fragment fade-in" src="../pics/remodnavpaper_2.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
</section>
<section data-transition="None">
<h3>Examples of what DataLad can be used for:</h3>
<ul>
<li>
<b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance
</li>
<img height="800" class="fragment fade-in" src="../pics/openscience.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
</ul>
</section>
<section data-transition="None">
<h3>Examples of what DataLad can be used for:</h3>
<ul>
<li class="fragment fade-in-then-semi-out"><b>Central data management</b> and archival system</li>
</ul>
<img height="700" class="fragment fade-in" src="../pics/centralmanagement2.gif">
</section>
<section data-transition="None">
<h3>Examples of what DataLad can be used for:</h3>
<ul>
<li class="fragment fade-in-then-semi-out">
<b>Scalable computing framework</b> for reproducible science
</li>
<img height="350" class="fragment fade-in" src="../pics/fairly-big.png">
<img height="500" class="fragment fade-in" src="../pics/ukb_datasets.svg">
</ul>
</section>
<section><script src="https://cdn.logwork.com/widget/countdown.js"></script>
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
data-style="columns" data-timezone="Europe/Berlin" data-date="2023-09-28 14:00">
Quick break
</a><br>
We'll be back shortly
</section>
</section>
<!----- WHAT'S VERSION CONTROL, AND WHY SHOULD I CARE? ----->
<section>
<section>
<h2>What's version control, and why should I care?</h2><br>
<iframe src="https://directpoll.com/r?XDbzPBd3ixYqg84Gif8nU69RJWPkCXwpVvMnElD"
style="border: 0" width="900" height="800"></iframe>
</section>
<section>
<h2>Everything happens in DataLad datasets</h2>
<img src="../pics/artwork/src/dataset_extended.svg" width="800"> <br><br><br>
<table class="fragment fade-in-then-semi-out" >
<tr>
<td style="vertical-align:middle">
<ul style="font-size:30px">
<li>Look and feel like a directory on your computer</li>
<li>content agnostic</li>
<li>no custom data structures</li>
<img src="../pics/remodnav-ds-terminal.png" width="500"><br><small><br>Terminal view</small>
</ul>
</td>
<td style="font-size:30px; vertical-align:top">
<img src="../pics/remodnav-ds-nautilus.png" width="500"><br>
<small>File viewer</small>
</td>
</tr>
</table>
</section>
<section style="text-align: left;">
<h3>...DataLad datasets</h3>
Create a dataset (here, with the <code>text2git</code> configuration, which keeps
text files in Git instead of git-annex): <br>
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad create -c text2git my-analysis
</code>
</pre>
<div class="fragment">
Let's have a look inside. Navigate using <code>cd</code> (change directory):
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cd my-analysis
</code>
</pre>
</div>
<div class="fragment">
List the directory content, including hidden files, with <code>ls</code>:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
ls -la .
</code>
</pre>
</div>
</section>
<section data-transition="None">
<h2>Dataset = Git/git-annex repository</h2>
<li>version control files regardless of size or type</li>
<img src="../pics/artwork/src/local_wf.svg" width="600"> <br>
<ul>
<p class="fragment fade-in">
Stay flexible:
<li class="fragment fade-in">
Simple DataLad core API (easy for data management novices)
</li>
<li class="fragment fade-in">
Pure Git or git-annex commands (for regular Git or git-annex users, or to use specific functionality)
</li>
</p>
</ul>
</section>
<section style="text-align: left;">
<h3>...Version control</h3>
Let's build a dataset for an analysis by adding a README. The command below writes a simple header into a new file, README.md:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
echo "# My example DataLad dataset" > README.md
</code>
</pre>
<div class="fragment">
Now we can check the <code>status</code> of the dataset:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad status
</code>
</pre>
</div>
<div class="fragment">
We can save the state with <code>save</code>:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad save -m "Create a short README"
</code>
</pre>
</div>
<div class="fragment">
Further modifications:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
echo "This dataset contains a toy data analysis" >> README.md
</code>
</pre>
</div>
<div class="fragment">
You can also check what has changed:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git diff
</code>
</pre>
</div>
<div class="fragment">
Save again:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad save -m "Add information on the dataset contents to the README"
</code>
</pre>
</div>
</section>
<section style="text-align: left;">
<h3>...Version control</h3>
<div class="fragment">
Now, let's check the dataset history:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git log
</code>
</pre>
</div>
<div class="fragment">
We can also make the history prettier:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
tig
</code>
(navigate with arrow keys and enter, press "q" to go back and exit the program)
</pre>
</div>
</section>
<section data-transition="None">
<h2>Exhaustive tracking</h2>
<dl style="font-size:35px">
<dt>The building blocks of a scientific result are rarely static</dt>
<table>
<tr>
<td style="vertical-align:middle">Analysis code evolves<br>
<small>(Fix bugs, add functions, refactor, ...)</small>
</td>
<td>
<img src="../pics/final.png" height="500">
<imgcredit>Based on Piled Higher and Deeper
<a href="https://phdcomics.com/comics/archive_print.php?comicid=1531" target="_blank">1531
</a>
</imgcredit></td>
</tr>
</table>
</dl>
</section>
<section data-transition="None">
<h2>Exhaustive tracking</h2>
<dl style="font-size:35px">
<dt>The building blocks of a scientific result are rarely static</dt>
<table>
<tr>
<td style="vertical-align:middle">Data changes <br>
<small>(errors are fixed, data is extended,<br>
naming standards change, an analysis <br>
requires only a subset of your data...)</small></td>
<td><img src="../pics/phd052810s.png" height="500">
<imgcredit>Piled Higher and Deeper
<a href="https://phdcomics.com/comics/archive_print.php?comicid=1323" target="_blank">1323
</a>
</imgcredit>
</td>
</tr>
</table>
</dl>
</section>
<section data-transition="None">
<h2>Exhaustive tracking</h2>
<dl style="font-size:35px">
<dt>The building blocks of a scientific result are rarely static</dt><br>
</dl>
<table>
<tr>
<td style="vertical-align: top">
Data changes (for real) <br>
<small>(errors are fixed, data is extended,<br>
naming standards change, ...)</small>
<img height="180px" src="../pics/abcdtwitter.png">
</td>
<td>
<img width="1000px" src="../pics/abcd.png">
</td>
</tr>
</table>
</section>
<section data-transition="None">
<h2>Exhaustive tracking</h2>
"Shit, which version of which script produced these outputs from which version
of what data... and which software version?"<br>
<img src="../pics/manuallabor.png">
<img src="../pics/findfiles.png" height="400">
<img src="../pics/projectstack.png" height="350">
<imgcredit>CC-BY Scriberia and <a href="https://the-turing-way.netlify.app/reproducible-research/rdm.html" target="_blank">
The Turing Way</a>
</imgcredit>
</section>
<section data-transition="None">
<h3>Exhaustive tracking</h3>
Once you track changes to data with version control tools,
you can find out <em>why</em> it changed, <em>what</em> has changed, <em>when</em> it changed,
and <em>which version</em> of your data was used at which point in time.
<div class="r-stack">
<img class="fragment fade-out" data-fragment-index="1" src="../pics/tigdata.png">
<img class="fragment" data-fragment-index="1" src="../pics/tigdata3.png">
<img class="fragment" src="../pics/tigdata2.png">
</div>
</section>
<section style="text-align: left;">
<h3>Exhaustive tracking</h3>
<div class="fragment">
With the <code>datalad-container</code> extension, we can not only add code or data, but also
software containers to datasets and work with them.
Let's add a software container with Python software for later:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad containers-add nilearn \
--url shub://adswa/nilearn-container:latest
</code>
</pre>
</div>
<div class="fragment">
Inspect the list of registered containers:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad containers-list
</code>
</pre>
</div>
</section>
</section>
<!-- REPRODUCIBILITY FEATURES -->
<section>
<section>
<h2>Digital provenance</h2>
<ul>
<p >
= <i>"The tools and processes used to create a
digital file, the responsible entity, and when and where the process
events occurred"</i>
</p>
<li class="fragment fade-in">
Have you ever saved a PDF to read later onto your computer, but forgot
where you got it from? Or did you ever find a figure in your project,
but forgot which analysis step produced it?
</li>
</ul>
</section>
<section style="text-align: left;">
<h3>Digital provenance</h3>
<div class="fragment">
Imagine that you are getting a script from a colleague to perform your analysis, but they email it to you or upload it to a random place for you to download:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">wget -P code/ \
https://raw.githubusercontent.com/datalad-handbook/resources/master/get_brainmask.py
</code>
</pre>
</div>
<div class="fragment">
The <code>wget</code> command downloaded a script for extracting a brain mask:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad status
</code>
</pre>
</div>
<div class="fragment">
Save it into your dataset to have the script ready:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad save -m "Adding a nilearn-based script for brain masking"
</code>
</pre>
</div>
<div class="fragment">
Convenience functions make downloads easier. Let's add a nilearn tutorial, and also register the original location of this file as digital provenance:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad download-url -m "Add a tutorial on nilearn" \
-O code/nilearn-tutorial.pdf \
https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf
</code>
</pre>
</div>
<div class="fragment">
Notice how it's automatically saved:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad status
</code>
</pre>
</div>
<div class="fragment">
Check out the file's history:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git log code/nilearn-tutorial.pdf</code>
</pre>
</div>
</section>
<section data-transition="None">
<h2>Provenance and reproducibility</h2>
<strong>datalad run</strong> wraps around anything expressed in a command
line call and saves the dataset modifications resulting from the execution
<img src="../pics/run_basic.svg" height="600"> <!-- .element: class="fragment" -->
</section>
<section data-transition="None">
<h2>Provenance and reproducibility</h2>
<strong>datalad rerun</strong> repeats captured executions. <br>
If the outcomes
differ, it saves a new state of them.
<img src="../pics/rerun.svg" height="350"> <!-- .element: class="fragment" -->
</section>
<section style="text-align:left;">
<h3>... Computationally reproducible execution I</h3>
<div class="fragment">
A variety of processes can modify files. A simple example: Code formatting
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">black code/get_brainmask.py</code>
</pre>
</div>
<div class="fragment">
Version control makes changes transparent:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git diff</code>
</pre>
</div>
<div class="fragment">
But it's useful to keep track beyond that. Let's discard the latest changes...
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git restore code/get_brainmask.py</code>
</pre>
</div>
<div class="fragment">
... and record precisely what we did
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad run -m "Reformat code with black" \
"black code/get_brainmask.py"</code>
</pre>
</div>
<div class="fragment">
Let's take a look (press q to exit):
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git show</code>
</pre>
</div>
<div class="fragment">
... and repeat!
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad rerun</code>
</pre>
</div>
</section>
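<section data-markdown><script type="text/template">
The commit that `git show` displays after `datalad run` carries a machine-readable record of the executed command. The sketch below assumes that this record is a JSON block placed between two marker lines in the commit message. The sample message is hand-written for illustration (the `dsid` value is a placeholder), and `extract_run_record` is a hypothetical helper, not a DataLad function.

```python
import json

# Hand-written sample that imitates a `datalad run` commit message.
# All values are illustrative assumptions, not output from a real dataset.
SAMPLE_MESSAGE = """[DATALAD RUNCMD] Reformat code with black

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "black code/get_brainmask.py",
 "dsid": "00000000-0000-0000-0000-000000000000",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(message: str) -> dict:
    """Return the JSON run record embedded between the two marker lines."""
    _, _, rest = message.partition(START)
    body, found_end, _ = rest.partition(END)
    if not found_end:
        raise ValueError("no run record found in commit message")
    return json.loads(body)


record = extract_run_record(SAMPLE_MESSAGE)
print(record["cmd"])  # the command that a rerun would execute again
```

Because the record is plain text inside the Git history, it travels with every clone of the dataset.
</script></section>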
<section data-transition="None">
<h2>Seamless dataset nesting & linkage</h2>
<img src="../pics/dataflow.jpg">
<imgcredit><a href="https://www.frontiersin.org/articles/10.3389/fninf.2012.00009/full" target="_blank">
Poline et al., 2011</a>
</imgcredit>
<img src="../pics/artwork/src/linkage_subds.svg" width="900"> <br>
<!-- <ul>
<li class="fragment fade-in" data-fragment-index="2">Overcomes scaling issues with large amounts of files</li>
<pre class="fragment fade-in" data-fragment-index="2"><code>adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
15530572 annex'd files (77.9 TB recorded total size)
nothing to save, working tree clean</code></pre>
<small><a class="fragment fade-in" data-fragment-index="2" href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">(github.com/datalad-datasets/human-connectome-project-openaccess)</a></small>
<li class="fragment fade-in">Modularizes research components for transparency, reuse, and access management</li>
</ul>
-->
</section>
<section data-transition="None">
<h2>Seamless dataset nesting & linkage</h2>
<img data-src="../pics/linkage.svg" height="300">
<pre><code class="bash" style="font-size:115%;max-height:none">
$ datalad clone --dataset . http://example.com/ds inputs/rawdata
</code></pre>
<pre><code class="diff" style="max-height:none">$ git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..c3370ba
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,4 @@
+[submodule "inputs/rawdata"]
+ path = inputs/rawdata
+ datalad-id = 68bdb3f3-eafa-4a48-bddd-31e94e8b8242
+ datalad-url = http://example.com/ds
diff --git a/inputs/rawdata b/inputs/rawdata
new file mode 160000
index 0000000..fabf852
--- /dev/null
+++ b/inputs/rawdata
@@ -0,0 +1 @@
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
</code></pre>
<aside class="notes">weighs just a few bytes</aside>
</section>
<section style="text-align: left;">
<h3>...Dataset nesting</h3>
Let's make a nest!
<div class="fragment">
Clone a dataset with analysis data into a specific
location ("input/") in the existing dataset,
making it a <em>sub</em>dataset:
<pre style="margin-left: 0;">
<code class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad clone -d . \
https://gin.g-node.org/adswa/bids-data \
input</code>
</pre>
</div>
<div class="fragment">
Let's see what changed in the dataset, using the <code>subdatasets</code> command:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad subdatasets
</code>
</pre>
</div>
<div class="fragment">
... and also <code>git show</code>:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git show
</code>
</pre>
</div>
</section>
<section style="text-align:left;">
<div class="fragment">
We can now view the cloned dataset's file tree:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cd input
ls
</code>
</pre>
</div>
<div class="fragment">
...and also its history
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
tig
</code>
</pre>
</div>
<div class="fragment">
Let's check the dataset size (with the <code>du</code> disk-usage command):
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
du -sh
</code>
</pre>
</div>
<div class="fragment">
Let's check the <em>actual</em> dataset size:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad status --annex
</code>
</pre>
</div>
<div class="fragment">
You can <code>get</code> or <code>drop</code> annexed file contents depending on your needs:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad get sub-02
</code>
</pre>
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad drop sub-02
</code>
</pre>
</div>
</section>
<section style="text-align: left;">
<h3>...Computationally reproducible execution...</h3>
Try to execute the downloaded analysis script. Does it work?
<div><pre style="margin-left: 0;"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cd ..
datalad run -m "Compute brain mask" \
--input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
--output "figures/*" \
--output "sub-02*" \
"python code/get_brainmask.py"</code></pre></div>
<ul class="fragment">
<li>
Software can be difficult or impossible to install (e.g. conflicts with existing software,
or on HPC) for you or your collaborators
</li>
<li>
Different software versions/operating systems can produce different results:
<a href="https://doi.org/10.3389/fninf.2015.00012" target="_blank">Glatard et al., doi.org/10.3389/fninf.2015.00012</a>
</li>
<li class="fragment fade-in">
<strong>Software containers</strong> encapsulate a software environment and isolate it from
a surrounding operating system. Two common solutions: Docker, Singularity
</li>
</ul>
</section>
<section>
<h2>Software containers</h2><br>
<iframe src="https://directpoll.com/r?XDbzPBd3ixYqg84Gif8nU69RJWPkCXwpVvMnElD"
style="border: 0" width="900" height="800"></iframe>
</section>
<section>
<h2>Computational provenance</h2>
<ul style="font-size:30px">
<li>
The <code>datalad-container</code> extension adds DataLad commands to register software containers as "just another file" in your
dataset, and to execute analyses inside the container with <strong>datalad containers-run</strong>, capturing the software as additional
provenance
</li>
</ul>
<img class="fragment fade-in" src="../pics/containers-run.svg" height="600"> <!-- .element: class="fragment" -->
</section>
<section style="text-align: left;">
<h3>...Computationally reproducible execution</h3>
<div class="fragment">
Let's try out the <code>containers-run</code> command:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad containers-run -m "Compute brain mask" \
-n nilearn \
--input input/sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz \
--output "figures/*" \
--output "sub-02*" \
"python code/get_brainmask.py"
</code>
</pre>
</div>
<div class="fragment">
You can now ask how an individual file came to be…
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git log sub-02_brain-mask.nii.gz
</code>
</pre>
</div>
<div class="fragment">
… and the computation can be redone automatically and checked for computational reproducibility based on the recorded provenance using datalad rerun:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad rerun
</code>
</pre>
</div>
</section>
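<section data-markdown><script type="text/template">
The idea behind checking reproducibility with `datalad rerun` can be illustrated without DataLad: execute the recorded command again and compare checksums of the outputs. This is a minimal sketch under stated assumptions; the one-line "analysis" command and the file name `mask.txt` are hypothetical stand-ins, and DataLad's own comparison works on the dataset's version history rather than on ad-hoc checksums.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def checksum(path: Path) -> str:
    """SHA-256 checksum of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_and_checksum(command: list[str], output: Path) -> str:
    """Execute the recorded command and return the checksum of its output."""
    subprocess.run(command, check=True)
    return checksum(output)


with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp) / "mask.txt"
    # Hypothetical stand-in for the analysis: a deterministic one-liner.
    command = [
        sys.executable, "-c",
        f"import pathlib; pathlib.Path({str(output)!r}).write_text('brain mask')",
    ]
    first = run_and_checksum(command, output)   # original execution
    second = run_and_checksum(command, output)  # the "rerun"

print("reproduced" if first == second else "outcome changed")
```

A deterministic command yields identical checksums; a differing checksum is the case in which a rerun would record a new state.
</script></section>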
<section><script src="https://cdn.logwork.com/widget/countdown.js"></script>
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
data-style="columns" data-timezone="Europe/Berlin" data-date="2023-09-28 14:00">
Quick break </a><br>
we're back shortly
</section>
</section>
<!-------- DATA PUBLICATION & OSF -------->
<section>
<section>
<h2>Sharing datasets</h2>
<div class="r-stack">
<img class="fragment fade-out" data-fragment-index="1" src="../pics/services_only.png">
<img class="fragment fade-in" data-fragment-index="1" src="../pics/services_connected.png">
</div>
<small>Apart from <b>local computing infrastructure</b> (from private laptops to computational clusters),
datasets can be hosted in major <b>third party repository hosting and cloud storage</b> services.
More info: Chapter on <a href="http://handbook.datalad.org/en/latest/basics/basics-thirdparty.html" target="_blank">
Third party infrastructure</a>.</small>
</section>
<section>
<h2>Sharing datasets</h2><br>
There are lots of available services, but we will focus on the Open Science Framework.<br>
<iframe src="https://directpoll.com/r?XDbzPBd3ixYqg84Gif8nU69RJWPkCXwpVvMnElD"
style="border: 0" width="900" height="800"></iframe>
</section>
<section>
<h3>Transport logistics: Lots of data, little disk-usage</h3>
<ul>
<li class="fragment fade-in">
Cloned datasets are lean.
"Meta data" (file names, availability) are present, but <b>no file content</b>:</li>
<pre class="fragment fade-in"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
install(ok): /tmp/studyforrest-data-phase2 (dataset)
$ cd studyforrest-data-phase2 && du -sh
18M .</code></pre>
<li class="fragment fade-in">
File contents can be retrieved on demand:
</li>
</ul>
<pre class="fragment fade-in"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]</code></pre>
<li class="fragment fade-in">Have access to more data on your computer than you have disk-space:</li>
<pre class="fragment fade-in"><code># eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G .
# HCP dataset (~200TB, >15 million files)
$ du -sh
48G . </code></pre>
</section>
<section data-markdown data-transition="None"> <script type="text/template">
## Plenty of data, but little disk-usage
Drop file content that is not needed:<!-- .element: class="fragment fade-in" -->
<pre class="fragment fade-in"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">$ datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
drop(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]</code></pre>
When files are dropped, only "meta data" stays behind, and they can be re-obtained on demand.<!-- .element: class="fragment fade-in" -->
<pre><code class="python">import datalad.api as dl

dl.get('input/sub-01')    # retrieve file content on demand
# [really complex analysis]
dl.drop('input/sub-01')   # free up disk space again
</code></pre><!-- .element: class="fragment fade-in" -->
</script></section>
<section data-transition="None" style="vertical-align:top">
<h3>There are two version control tools at work - why?</h3>
<p class="fragment fade-in">Git does not handle large files well.
<div class="r-stack">
<img class="fragment" src="../pics/gitsnapshot.png">
</div>
</p>
</section>
<section data-transition="None">
<h3>There are two version control tools at work - why?</h3>
<p>Git does not handle large files well.
<img src="../pics/gitsnapshot2.png">
</p>
<p class="fragment fade-in">
And repository hosting services refuse to handle large files:
<img src="../pics/pushing_large_files_to_Git.png"></p>
<p style="z-index: 100;position: fixed; font-size:35px;margin-top:-450px;margin-bottom:300px;margin-left:1000px">
<img class="fragment" src="../pics/horrofied.png" height="380px"></p>
<p class="fragment fade-in">git-annex to the rescue! Let's take a look at how it works.</p>
</section>
<section>
<h2>Git versus Git-annex</h2>
<img height="500" src="../pics/artwork/src/publishing/publishing_gitvsannex.svg">
</section>
<section>
<h2>Dataset internals</h2>
<ul style="font-size:35px">
<li>Where the filesystem allows it, annexed files are symlinks:
<pre><code>$ ls -l sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
lrwxrwxrwx 1 adina adina 142 Jul 22 19:45 sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz ->
../../.git/annex/objects/kZ/K5/MD5E-s24180157--aeb0e5f2e2d5fe4ade97117a8cc5232f.nii.gz/MD5E-s24180157
--aeb0e5f2e2d5fe4ade97117a8cc5232f.nii.gz
</code></pre><small>(PS: especially useful in datasets with many identical files) </small></li>
<li>The symlink reveals this internal data organization based on identity hash:
<pre><code>$ md5sum sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
aeb0e5f2e2d5fe4ade97117a8cc5232f sub-02/func/sub-02_task-oneback_run-01_bold.nii.gz
</code></pre></li>
<li class="fragment fade-in">The (tiny) symlink instead of the (potentially large) file content is
committed - version controlling precise file identity without checking contents into Git
<img src="../pics/annex-commit.png"></li>
<li class="fragment fade-in">File contents can be shared via almost all
standard infrastructure. File availability information forms a decentralized network:
a file can exist in multiple different locations.</li>
<pre class="fragment fade-in" ><code class="fragment fade-in" data-fragment-index="1">$ git annex whereis code/nilearn-tutorial.pdf
whereis code/nilearn-tutorial.pdf (2 copies)
cf13d535-b47c-5df6-8590-0793cb08a90a -- [datalad]
e763ba60-7614-4b3f-891d-82f2488ea95a -- jovyan@jupyter-adswa:~/my-analysis [here]
datalad: https://raw.githubusercontent.com/datalad-handbook/resources/master/nilearn-tutorial.pdf
</code></pre>
</ul>
<small><p>Delineation and advantages of decentralized versus centralized RDM: <a href="https://doi.org/10.1515/nf-2020-0037" target="_blank">
Hanke et al. (2021). In defense of decentralized research data management</a></p></small>
</section>
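<section data-markdown><script type="text/template">
The symlink target shown earlier ends in a key of the form `MD5E-s<size>--<md5><extension>`. The following sketch derives such a key from file content. The helper `md5e_key` is hypothetical, and its extension handling (everything after the first dot of the file name) is a simplifying assumption rather than git-annex's exact rule.

```python
import hashlib
import tempfile
from pathlib import Path


def md5e_key(path: Path) -> str:
    """Build an MD5E-style key: backend, size in bytes, MD5 checksum, extension."""
    data = path.read_bytes()  # fine for a demo; large files should be read in chunks
    digest = hashlib.md5(data).hexdigest()
    # Simplification (assumption): treat everything after the first dot as extension.
    name = path.name
    extension = name[name.index("."):] if "." in name else ""
    return f"MD5E-s{len(data)}--{digest}{extension}"


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "sub-02_bold.nii.gz"
    second = Path(tmp) / "copy_of_bold.nii.gz"
    first.write_bytes(b"identical content")
    second.write_bytes(b"identical content")
    key_first = md5e_key(first)
    key_second = md5e_key(second)

print(key_first)
print(key_first == key_second)  # identical content maps to the same key
```

Since the key depends on content rather than on the file name, identical files can share one stored object, which is why this layout helps in datasets with many identical files.
</script></section>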
<section>
<h2>Git versus Git-annex</h2>
<dl>
<dt>Data in datasets is either stored in Git or git-annex</dt>
<dd>By default, everything is <i>annexed</i>.</dd>
<small>
<table class="fragment fade-in">
<tr>
<td style="vertical-align: middle">
<strong>Two consequences:</strong>
<li>Annexed contents are not available right after cloning,
only content identity and availability information (as they are stored in Git).
Everything that is annexed needs to be retrieved with <code>datalad get</code>
from wherever it is stored.
</li>
<li>Files stored in Git are modifiable; annexed files are protected against accidental modifications</li>
</td>
<td width="60%">
<img src="../pics/git_vs_gitannex.svg" height="500">
</td>
</tr>
</table>
<table class="fragment fade-in">
<tr>
<td><b>Git</b></td>
<td><b>git-annex</b></td>
</tr>
<tr>
<td>handles <b>small</b> files well (text, code)</td>
<td>handles <b>all</b> types and sizes of files well</td>
</tr>
<tr>
<td>file contents are in the Git history
and will be <b>shared</b> upon git/datalad push</td>
<td>file contents are in the annex. Not necessarily shared</td>
</tr>
<tr>
<td>Shared with every dataset clone</td>
<td><b>Can be kept private</b> on a per-file level when sharing the dataset</td>
</tr>
<tr>
<td>Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files </td>
<td>Useful: Large files, private files</td>
</tr>
</table>
</small>
<br><br><small>Useful background information for demo later. Read
<a href="http://handbook.datalad.org/en/latest/basics/101-115-symlinks.html" target="_blank">
this handbook chapter</a> for details.
</small>
</dl>
</section>
<section>
<h2>Git versus Git-annex</h2>
<ul>
Users can decide which files are annexed:
<br><br>
<li><b>Pre-made run-procedures</b>, provided by DataLad (e.g., <code>text2git</code>, <code>yoda</code>)
or created and shared by users
(<a href="http://handbook.datalad.org/en/latest/basics/101-124-procedures.html" target="_blank">Tutorial</a>) </li>
<li>Self-made configurations in <code>.gitattributes</code> (e.g., based on file type,
file/path name, size, ...; <a href="http://handbook.datalad.org/en/latest/basics/101-123-config2.html#gitattributes" target="_blank">
rules and examples
</a> )</li>
<li>Per-command basis (e.g., via <code>datalad save --to-git</code>)</li>
</ul>
</section>
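<section data-markdown><script type="text/template">
To make the decision "Git or git-annex?" more tangible, here is a minimal sketch loosely modeled on what a `text2git`-like configuration aims for: text files stay in Git, non-empty binary files are annexed. The function `goes_to_annex` and its UTF-8 test for "binary" are assumptions made for illustration. Real rules are expressions in `.gitattributes`, which this sketch does not parse.

```python
def goes_to_annex(content: bytes) -> bool:
    """Approximate a text2git-like rule: annex non-empty files that are not text."""
    if len(content) == 0:
        return False  # empty files stay in Git
    try:
        content.decode("utf-8")
    except UnicodeDecodeError:
        return True   # not readable as text: treat as binary and annex it
    return False      # readable text: keep in Git


# Hypothetical file contents, chosen only for illustration.
examples = {
    "README.md": b"# My analysis\n",
    "code/get_brainmask.py": b"print('hello')\n",
    "sub-02_bold.nii.gz": bytes([0x1F, 0x8B, 0x08, 0x00, 0xFF, 0xFE]),
    "empty.txt": b"",
}
for name, content in examples.items():
    print(name, "-> git-annex" if goes_to_annex(content) else "-> Git")
```
</script></section>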
<section data-transition="None">
<h2>Publishing datasets</h2>
I have a dataset on my computer. How can I share it, or collaborate on it?
<img height="900" src="../pics/startingpoint.svg">
</section>
<section data-transition="None">
<h2>Glossary</h2>
<dl style="font-size:30px">
<dt class="fragment fade-in" data-fragment-index="1">
Sibling (remote)</dt>
<dd class="fragment fade-in" data-fragment-index="1">
Linked clones of a dataset. You can usually update (from) siblings to keep all your siblings in sync
(e.g., ongoing data acquisition stored on the experiment computer and backed up on a cluster and an external hard drive)
</dd>
<dt class="fragment fade-in" data-fragment-index="2">
Repository hosting service</dt>
<dd class="fragment fade-in" data-fragment-index="2">
Web services to host Git repositories, such as GitHub, GitLab, Bitbucket, Gin, ...</dd>
<dt class="fragment fade-in" data-fragment-index="3">
Third-party storage</dt>
<dd class="fragment fade-in" data-fragment-index="3">
Infrastructure (private/commercial/free/...) that can host data. A "special remote" protocol
is used to publish or pull data to and from it
</dd>
<dt class="fragment fade-in" data-fragment-index="4">
Publishing datasets</dt>
<dd class="fragment fade-in" data-fragment-index="4">
<em>Pushing</em> dataset contents (Git and/or annex) to a sibling using <strong>datalad push</strong></dd>
<dt class="fragment fade-in" data-fragment-index="5">
Updating datasets</dt>
<dd class="fragment fade-in" data-fragment-index="5">
<em>Pulling</em> new changes from a sibling using <strong>datalad update --merge</strong></dd>
</dl>
</section>
<section data-transition="None">
<h2>Publishing datasets</h2>
<ul>
<li>Most public datasets separate content in Git versus git-annex behind the scenes</li>
</ul>
<img height="900" src="../pics/artwork/src/publishing/publishing_network_gitvsannex.svg">
</section>
<section data-transition="None">
<h2>Publishing datasets</h2>
<img height="900" src="../pics/artwork/src/publishing/publishing_network_publishparts.svg">
</section>
<section data-transition="None">
<h2>Publishing datasets</h2>
<img height="900" src="../pics/artwork/src/publishing/publishing_network_publishparts2.svg">
</section>
<section data-transition="None">
<h2>Publishing datasets</h2>
Typical case:
<ul style="font-size:30px">
<li class="fragment fade-in">
Datasets are exposed via a private or public repository on a
repository hosting service
</li>
<li class="fragment fade-in">
Data can't be stored in the repository hosting service, but can be
kept in almost any third party storage
</li>
<li class="fragment fade-in">
Publication dependencies automate pushing to the correct place, e.g.,
<pre>
<code class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
$ git config --local remote.github.datalad-publish-depends gdrive
# or
$ datalad siblings add --name origin --url git@git.jugit.fzj.de:adswa/experiment-data.git --publish-depends s3
</code>
</pre>
</li>
</ul>
<img src="../pics/artwork/src/publishing/publishing_network_publishdepends.svg">
</section>
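<section data-markdown><script type="text/template">
A publication dependency means: before pushing to a sibling, push to the siblings it depends on. The sketch below shows only that ordering. The function `push_order` and the sibling names are illustrative assumptions, and DataLad's own implementation may differ.

```python
def push_order(target: str, depends: dict[str, list[str]]) -> list[str]:
    """List siblings in the order they would be pushed to: dependencies first."""
    order: list[str] = []

    def visit(sibling: str, trail: tuple[str, ...]) -> None:
        if sibling in trail:
            raise ValueError(f"circular publication dependency at {sibling!r}")
        if sibling in order:
            return  # already scheduled
        for dependency in depends.get(sibling, []):
            visit(dependency, trail + (sibling,))
        order.append(sibling)

    visit(target, ())
    return order


# Mirrors the first configuration example: 'github' depends on 'gdrive'.
dependencies = {"github": ["gdrive"]}
print(push_order("github", dependencies))
```

With this ordering, a single push to the repository hosting service first sends data to the storage sibling and then updates the repository.
</script></section>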
<section data-transition="None">
<h2>Publishing datasets</h2>
<p style="font-size:30px"> Special case 1: repositories with annex support</p>
<img height="850" class="fragment fade-in" src="../pics/artwork/src/publishing/publishing_network_publishgin.svg">
</section>
<section data-transition="None">
<h2>Publishing datasets</h2>
<p style="font-size:30px">Special case 2: Special remotes with repositories</p>
<img height="850" src="../pics/artwork/src/publishing/publishing_network_publishosf.svg">
</section>
<section>
<h2><code>Publishing to OSF</code></h2>
<p><a href="https://osf.io/">https://osf.io/</a></p>
<img src="../pics/git-annex-osf-logo.png" alt="datalad-osf-logo" width="50%">
</section>
<section style="text-align: left;">
<div style="display: flex !important; align-items: center">
<h2>create-sibling-osf</h2>&nbsp;<a href="https://docs.datalad.org/projects/osf/en/latest/" target="_blank">(docs)</a>
</div>
Requires the DataLad extensions <code>datalad-osf</code> and <code>datalad-next</code><br><br>
<ol>Prerequisites:
<li class="fragment">Log into OSF</li>
<li class="fragment">Create personal access token</li>
<li class="fragment">Enter credentials using <code>datalad osf-credentials</code>:</li>
</ol>
<div class="fragment">
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad osf-credentials
</code>
</pre>
</div>
</section>
<section style="text-align: left;">
<div style="display: flex !important; align-items: center">
<h2>create-sibling-osf</h2>&nbsp;<a href="https://docs.datalad.org/projects/osf/en/latest/" target="_blank">(docs)</a>
</div>
<div>
Create the sibling in your dataset (different modes are possible):
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad create-sibling-osf -d . -s my-osf-sibling \
--title 'my-osf-project-title' --mode export --public
</code>
</pre>
</div>
<div class="fragment">
Push to the sibling:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad push -d . --to my-osf-sibling
</code>
</pre>
</div>
<div class="fragment">
Clone from the sibling:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cd ..
datalad clone osf://my-osf-project-id my-osf-clone
</code>
</pre>
</div>
</section>
<section><script src="https://cdn.logwork.com/widget/countdown.js"></script>
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
data-style="columns" data-timezone="Europe/Berlin" data-date="2023-09-28 15:30">
Quick break </a><br>
Next up: Your Questions and Usecases
</section>
</section>
<!-- QUESTIONS -->
<section>
<section>
<h2>Summary and Take-Home Messages</h2>
</section>
<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive tracking of research components
![](../pics/vamp_0_start.png)<!-- .element: width="100%" -->
Well-structured datasets (using community standards), and portable computational environments &mdash; and their evolution &mdash; are the precondition for reproducibility
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# turn any directory into a dataset
# with version control
% datalad create &lt;directory&gt;
</pre></code>
</td><td style="padding:0px">
<code><pre>
# save a new state of a dataset with
# file content of any size
% datalad save
</pre></code>
</td></tr></table>
Note:
- link to prev. statements on description standards
- your community could be really small (your lab); when data are precious, resources
will be spent to understand them, but information must be captured to make this possible
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Capture computational provenance
![](../pics/vamp_1_provcapture.png)<!-- .element: width="100%" -->
Which data was needed at which version, as input into which code, running with what parameterization in which
computational environment, to generate an outcome?
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# execute any command and capture its output
# while recording all input versions too
% datalad run --input ... --output ... &lt;command&gt;
</pre></code>
</td></tr></table>
Note:
The missing link: even when everything is shared, we still don't know how to start.
README is minimum, but executable prov-records are much better.
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive capture enables portability
![](../pics/vamp_2_pushtocloud.png)<!-- .element: width="100%" -->
Precise identification of data and computational environments
combined with provenance records form a comprehensive and portable
data structure, capturing all aspects of an investigation.
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# transfer data and metadata to other sites and services
# with fine-grained access control for dataset components
% datalad push --to &lt;site-or-service&gt;
</pre></code>
</td></tr></table>
Note:
Does it fly? Can you give it to someone? Or can you take it with you to your new lab?
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Reproducibility strengthens trust
![](../pics/vamp_3_reproduce.png)<!-- .element: width="100%" -->
Outcomes of computational transformations can be validated by authorized 3rd-parties. This enables audits, promotes accountability, and streamlines automated "upgrades" of outputs
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# obtain dataset (initially only identity,
# availability, and provenance metadata)
% datalad clone &lt;url&gt;
</pre></code>
</td><td style="padding:0px">
<code><pre>
# immediately actionable provenance records
# full abstraction of input data retrieval
% datalad rerun &lt;commit|tag|range&gt;
</pre></code>
</td></tr></table>
Note:
Goal is automated reproducibility, enables assessment of robustness and benchmarking algorithmic developments
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Ultimate goal: (re-)usability
![](../pics/vamp_4_reuse.png)<!-- .element: width="100%" -->
Verifiable, portable, self-contained data structures that track all aspects of an investigation exhaustively can be (re-)used as modular components in larger contexts &mdash; propagating their traits
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# declare a dependency on another dataset and
# re-use it in a particular state in a new context
% datalad clone -d &lt;superdataset&gt; &lt;url&gt; &lt;path-in-dataset&gt;
</pre></code>
</td></tr></table>
Note:
With these in place, re-usability is a small(er) step
</script></section>
<section>
<h2>Your Questions and Usecases</h2>
</section>
<section>
<h2>Post-Workshop Contact</h2>
<ul>
<li class="fragment fade-in">Slides are CC-BY. They will stay online and will be made available as a PDF as well</li>
<li class="fragment fade-in">Contact the DataLad Team anytime via GitHub issue, Matrix chat message, or in our office hour video call</li>
<li class="fragment fade-in">Find more DataLad content and tutorials at <a href="https://handbook.datalad.org" target="_blank">handbook.datalad.org</a></li>
<br>
<li class="fragment fade-in">Join us at our first conference for distributed data management:
<a href="https://distribits.live/" target="_blank">distribits.live</a> (April 2024, registration closes October 15th)</li>
</ul>
<br><br>
<h3 class="fragment fade-in">Thanks for your attention!</h3>
</section>
<section style="text-align:left">
<h2>List of installed software on Jupyter</h2>
The JupyterHub runs on Ubuntu 22.04 via an AWS EC2 instance. The following packages were installed with different package managers:
<br><br>
<ul>
<li>apt: Git, git-annex, tree, tig, zsh, singularity</li>
<li>pip: datalad, datalad-next, datalad-container, datalad-osf, black</li>
</ul>
<br><br>
Instructions to set up and configure your own JupyterHub are publicly available at <a href="https://psychoinformatics-de.github.io/rdm-course/for_instructors/index.html" target="_blank">
psychoinformatics-de.github.io/rdm-course/for_instructors
</a>
</section>
</section>
<!--- OUTLOOK --->
<section>
<section>
<h2>Outlook</h2>
</section>
<section data-markdown data-transition="None"><script type="text/template">
## FAIRly big: Scaling up
Objective: Process the UK Biobank (imaging data)
![](../pics/biobank_website.png)<!-- .element: height="400" -->
- 76 TB in 43 million files in total
- 42,715 participants contributed personal health data
- Strict DUA
- Custom binary-only downloader
- Most data records offered as (unversioned) ZIP files
</script></section>
<section data-markdown data-transition="None"><script type="text/template">
## Challenges
- Process data such that
- Results are computationally reproducible (without the original compute infrastructure)
- There is complete linkage from results to an individual data record download
- It scales with the amount of available compute resources
- Data processing pipeline
- Compiled MATLAB blob
- 1h processing time per image, with 41k images to process
- 1.2 M output files (30 output files per input file)
- 1.2 TB total size of outputs
</script></section>
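<section data-markdown><script type="text/template">
The scale of the challenge follows from simple arithmetic on the numbers above. In the sketch below, the image count, processing time, and outputs per image come from the slide. The number of parallel jobs is a hypothetical value chosen only to illustrate the effect of parallelization.

```python
# Numbers taken from the slide above.
images = 41_000
hours_per_image = 1
outputs_per_image = 30

total_cpu_hours = images * hours_per_image
serial_years = total_cpu_hours / (24 * 365)
total_outputs = images * outputs_per_image

# Hypothetical cluster capacity, chosen only for illustration.
parallel_jobs = 1_000
wall_clock_hours = total_cpu_hours / parallel_jobs

print(f"{total_cpu_hours} CPU hours, about {serial_years:.1f} years if run serially")
print(f"{total_outputs} output files")
print(f"about {wall_clock_hours:.0f} hours with {parallel_jobs} parallel jobs")
```

Processing one image after another would take years, which is why the workflow has to scale with the available compute resources.
</script></section>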
<section data-transition="None">
<h2> FAIRly big setup</h2>
<img src="../pics/fairlybig_ukbsetup.png" width="1200" style="margin-top:-35px;margin-bottom:-30px">
<ul style="font-size:30px">
<strong>Exhaustive tracking</strong>
<li><a href="https://github.com/datalad/datalad-ukbiobank" target="_blank">datalad-ukbiobank</a>
extension downloads, transforms &amp; tracks the evolution of the complete data release
in DataLad datasets
</li>
<li>Native and BIDSified data layout (at no additional disk space usage)</li>
<li>Structured in 42k individual datasets, combined to one superdataset</li>
<li>Containerized pipeline in a software container</li>
<li>Link input data & computational pipeline as dependencies</li>
</ul>
<br><br>
<small><a href="https://www.nature.com/articles/s41597-022-01163-2" target="_blank">
Wagner, Waite, Wierzba et al. (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data.</a>
</small>
</section>
<section data-transition="None">
<h2>FAIRly big workflow</h2>
<div class="r-stack">
<img class="fragment fade-out" src="../pics/fairlybig_workflow.png" width="1200" style="margin-top:-35px;margin-bottom:-30px">
<img src="../pics/htcondor.svg" class="fragment fade-in">
</div>
<br>
<ul style="font-size:30px">
<strong>Portability</strong>
<li>Parallel processing: 1 job = 1 subject
(number of concurrent jobs capped at the capacity of the compute cluster)
</li>
<li>Each job is computed in an ephemeral (short-lived) dataset clone, and results are pushed back:
this ensures exhaustive tracking &amp;
portability during computation</li>
<li>Content-agnostic persistent (encrypted) storage (minimizing storage and inodes)</li>
<li>Common data representation in secure environments</li>
</ul>
<br><br>
<small><a href="https://www.nature.com/articles/s41597-022-01163-2" target="_blank">
Wagner, Waite, Wierzba et al. (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data.</a>
</small></section>
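<section data-markdown><script type="text/template">
The pattern "one job per subject, computed in a short-lived workspace, results pushed back" can be sketched with the Python standard library alone. This is an analogy under stated assumptions: temporary directories stand in for ephemeral dataset clones, writing a file into `store` stands in for pushing results, and `process_subject` is a hypothetical placeholder for the real pipeline.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_subject(subject: str, store: Path) -> Path:
    """Compute one subject in a throwaway workspace and push the result back."""
    with tempfile.TemporaryDirectory() as workspace:
        # Stand-in for cloning the dataset and running the pipeline.
        result = Path(workspace) / f"{subject}_result.txt"
        result.write_text(f"processed {subject}\n")
        # Stand-in for pushing results back to persistent storage.
        target = store / result.name
        target.write_text(result.read_text())
    # The workspace no longer exists here; only the pushed result remains.
    return target


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    subjects = [f"sub-{i:02d}" for i in range(1, 5)]
    # max_workers caps the number of concurrent jobs, like a cluster's capacity.
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(lambda s: process_subject(s, store), subjects))
    stored = sorted(p.name for p in store.iterdir())

print(stored)
```

Because every job starts from a fresh workspace, no job depends on leftovers from another one, which is what makes individual results recomputable elsewhere.
</script></section>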
<section data-transition="None">
<h2>FAIRly big provenance capture</h2>
<img src="../pics/fairlybig_prov.png" width="1200" style="margin-top:-35px;margin-bottom:-30px">
<br><br>
<ul style="font-size:30px">
<strong>Provenance</strong>
<li>Every single pipeline execution is tracked</li>
<li>Execution in ephemeral workspaces ensures that results are
individually reproducible without HPC access</li>
</ul>
<br><br>
<small><a href="https://www.nature.com/articles/s41597-022-01163-2" target="_blank">
Wagner, Waite, Wierzba et al. (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data.</a>
</small></section>
<section data-markdown><script type="text/template">
## FAIRly big movie
<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/UsW6xN2f2jc?start=17" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
- Two computations on clusters of different scale (small cluster, supercomputer). Full video: https://youtube.com/datalad
- Two full (re-)computations, programmatically comparable, verifiable, reproducible -- on any system with data access
</script></section>
</section>
</div>
</div>
<script src="../reveal.js/dist/reveal.js"></script>
<script src="../reveal.js/plugin/notes/notes.js"></script>
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
<script src="../custom_functions.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
hash: true,
// The "normal" size of the presentation, aspect ratio will be preserved
// when the presentation is scaled to fit different resolutions. Can be
// specified using percentage units.
width: 1280,
height: 960,
// Factor of the display size that should remain empty around the content
margin: 0.3,
// Bounds for smallest/largest possible scale to apply to content
minScale: 0.2,
maxScale: 1.0,
controls: true,
progress: true,
history: true,
center: true,
slideNumber: 'c',
pdfSeparateFragments: false,
pdfMaxPagesPerSlide: 1,
pdfPageHeightOffset: -1,
transition: 'slide', // none/fade/slide/convex/concave/zoom
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
});
</script>
</body>
</html>