<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<!-- Edit me start! -->
<title>Reproducibility with DataLad</title>
<meta name="description" content=" Reproducibility in science ">
<meta name="author" content=" Adina Wagner ">
<!-- Edit me end! -->
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
<link rel="stylesheet" href="../css/main.css">
<!-- Theme used for syntax highlighted code -->
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
</head>
<body>
<div class="reveal">
<div class="slides">
<!--...Datalad Basics...-->
<section>
<h2>Reproducibility in science</h2>
<h3>What it is and why to care, with examples from the DataLad World</h3>
<div style="margin-top:1em;text-align:center">
<table style="border: none;">
<tr>
<td style="border: none;">Adina Wagner
<br><small>
<a href="https://mas.to/@adswa" target="_blank">
<img data-src="../pics/mastodon.svg" style="height:30px;margin:0px" />
mas.to/@adswa</a></small></td>
<td style="border: none;">
<br></td>
</tr>
<tr>
<td style="border: none; vertical-align:top">
<small><a href="https://www.fz-juelich.de/en/inm/inm-7" target="_blank">Cognitive and Affective Biopsychology</a>,
<br> Institute of Neuroscience and
Medicine, Brain &amp; Behavior (INM-7)<br>
Research Center Jülich</small><br>
</td>
<td><img style="height:100px;margin-right:10px" data-src="../pics/fzj_logo.png" /></td>
</tr>
</table>
</div>
<p style="z-index: 100;position: fixed;background-color:#ede6d5;font-size:35px;box-shadow: 10px 10px 8px #888888;margin-top:0px;margin-bottom:100px;margin-left:1000px">
<img src="../pics/qr_hidarepro26.png" height="200">
</p>
<br><br><small>
Slides: <a href="https://doi.org/10.5281/zenodo.19692938" target="_blank">
DOI 10.5281/zenodo.19692938</a> (Scan the QR code) <br>
<a href="https://files.inm7.de/adina/talks/html/hida2026"
target="_blank">files.inm7.de/adina/talks/html/hida2026.html</a>
</small>
</section>
<section>
<h2>Logistics</h2>
<img style="vertical-align:center" src="../pics/qr_hida26notes.png" height="250px">
<ul>
<li>
<strong>QR Code</strong> - Crowdsourced notes, networking, & anonymous questions at <a href="https://hedgedoc.psychoinformatics.de/7X6uaPPAR2-wkskcP0W0-A#" target="_blank">
hedgedoc.psychoinformatics.de/7X6uaPPAR2-wkskcP0W0-A#</a>
</li>
<li>
JupyterHub: <a href="https://jupyter.edu.datalad.org" target="_blank">jupyter.edu.datalad.org</a>.
</li>
<li>
Collaboration Hub: <a href="https://hub.edu.datalad.org" target="_blank">hub.edu.datalad.org</a>.
</li>
<li><i>Didn't get a user name by email? Speak up!</i></li>
</ul>
</section>
<section>
<section data-transition="None">
<h2>Common problems in science</h2>
<div>
You write a paper about an algorithm, stay up
late to generate good-looking figures, but you have to tweak parameters and
display options to make it work AND look good. The next morning, you have no
idea which parameters produced which figures, and which of the figures
fits to what you report in the paper.<br>
<img height="400" src="../pics/turingway/findfiles.png">
<img height="400" src="../pics/turingway/projectstack.png"></div>
<imgcredit>Illustration adapted from Scriberia and The Turing Way</imgcredit>
</section>
<section data-transition="None">
<h2>Common problems in science</h2>
<div>
Your research project produces phenomenal results, but your laptop,
the only place that stores the source code for the results, is
stolen/breaks<br>
<img height="700" src="../pics/stolenlaptop.jpg"></div>
<imgcredit>https://co.pinterest.com/pin/551128073121451139/</imgcredit>
</section>
<section data-transition="None">
<h2>Common problems in science</h2>
<div>
A graduate student approaches their supervisor, complaining that the
supervisor's research idea does not work. After weeks of discussion,
it becomes apparent that oral communication doesn't suffice - the
student can't sufficiently explain the environment (data, algorithms,
...) they constructed, and if the supervisor can't enter and use the
student's project there's no way to find a fix.
<br>
<img height="500" src="../pics/badsupervision.gif"></div>
<imgcredit>http://phdcomics.com/comics.php?f=1693</imgcredit>
</section>
<section data-transition="None">
<h2>Common problems in science</h2>
<div>
A postdoc wrote a script during their PhD that applied a specific
method to a dataset. Now, with new data and a new project, they
try to reuse the script, but have forgotten how it works.
<br>
<img height="500" src="../pics/frustration.jpg"></div>
<imgcredit>http://phdcomics.com/comics.php?f=1693</imgcredit>
</section>
<section data-transition="None">
<h2>Common problems in science</h2>
<div>
You try to recreate results from another lab's published paper.
You base your re-implementation on everything reported in their paper,
but the results you obtain look nothing like the original.
<br>
<img height="500" src="../pics/turingway/ReadableCode.png"></div>
<imgcredit>http://phdcomics.com/comics.php?f=1693</imgcredit>
</section>
<section>
<h2>Sounds familiar?</h2>
Have you encountered any of these in your work so far?
<table>
<tr>
<td style="width:40%">
<ol>
<li>Forgot how own results were generated</li>
<li>Lost single source of data</li>
<li>Miscommunication about analysis with supervisor</li>
<li>Can't get previous code to run</li>
<li>Failure to reproduce others' work</li>
<li>Something else related to reproducibility</li>
</ol>
</td>
<td>
<iframe src="https://directpoll.com/r?XDbzPBd3ixYqg8FLTFbS8naSCoKWa6nmjIlwFeDuQdOxY0"
style="border: 0" width="1500" height="700"></iframe>
</td>
</tr>
</table>
</section>
<section>
<h2><strike>common</strike> old problems in science</h2>
<div>
All these problems were paraphrased from
<a href="https://sci-hub.se/https://link.springer.com/chapter/10.1007%2F978-1-4612-2544-7_5" target="_blank">
Buckheit & Donoho, <b>1995</b></a>
<br><br><br></div>
<img class="fragment fade-in" data-fragment-index="1" src="../pics/munafo_nathumbehav_screenshot.png" style="box-shadow: 10px 10px 8px #888888;height:400px" height="400"><br>
<small class="fragment fade-in" data-fragment-index="1">"A manifesto for reproducible science" by Munafò et al., 2017, <i>Nature Human Behaviour</i></small>
</section>
</section>
<section>
<section>
<h3>Definitions</h3>
<table>
<tr>
<td></td>
<td><b>Same data</b></td>
<td><b>New data</b></td>
</tr>
<tr>
<td><b>Same methods</b></td>
<td><p style="color:red">Reproducibility</p></td>
<td>Replication</td>
</tr>
<tr>
<td><b>New methods</b></td>
<td>Robustness</td>
<td>Generalization</td>
</tr>
</table>
<br><small>see e.g., Freese & Peterson, 2017</small><br><br>
<i>"Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results."</i> <a href="https://library.seg.org/doi/abs/10.1190/1.1822162" target="_blank"> - Claerbout & Karrenbach, <b>1992</b></a>
</section>
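<section data-markdown data-transition="none"><script type="text/template">
The table can also be read as a lookup on two yes/no questions. A minimal Python sketch; the function name `classify_study` is made up for this illustration:

```python
def classify_study(same_data: bool, same_methods: bool) -> str:
    """Map the two axes of the definitions table to the term in each cell."""
    terms = {
        (True, True): "Reproducibility",
        (False, True): "Replication",
        (True, False): "Robustness",
        (False, False): "Generalization",
    }
    return terms[(same_data, same_methods)]


# Re-running the same analysis on the same data:
print(classify_study(same_data=True, same_methods=True))
```
</script></section>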
<section data-transition="None">
<h2>The road to reproducibility</h2>
<img src="../pics/reproduciblejourney.png">
<imgcredit>CC-BY Scriberia and <a href="https://the-turing-way.netlify.app/reproducible-research/rdm.html" target="_blank">
The Turing Way</a>
</imgcredit>
</section>
<section data-transition="None">
<dl>
<dt>The building blocks of a scientific result are rarely static</dt>
<table>
<tr>
<td style="vertical-align:middle">Analysis code evolves<br>
<small>(Fix bugs, add functions,
refactor, ...)</small></td>
<td>
<img src="../pics/final.png" height="500">
<imgcredit>Based on Piled Higher and Deeper
<a href="https://phdcomics.com/comics/archive_print.php?comicid=1531" target="_blank">
1531
</a> </imgcredit></td>
</tr>
</table>
</dl>
<img class="fragment fade-in" data-fragment-index="1" src="../pics/findfiles.png" height="400">
<img class="fragment fade-in" data-fragment-index="1" src="../pics/projectstack.png" height="350">
<imgcredit class="fragment fade-in" data-fragment-index="1" >Scriberia and <a href="https://the-turing-way.netlify.app">The Turing Way </a> (CC-BY)</imgcredit>
</section>
<section data-transition="None">
<h2>Version control</h2>
<table>
<tr>
<td>
<img src="../pics/turingway/ProjectHistory.png" width="500">
<imgcredit><a href="https://the-turing-way.netlify.app/reproducible-research/vcs/vcs-data.html" target="_blank">
CC-BY Scriberia & The Turing Way</a>
</imgcredit>
</td>
<td>
<ul style="font-size:35px">
<dt>Version control</dt>
<li>keep things organized</li>
<li>keep track of changes</li>
<li>revert changes or go <br>
back to previous states</li>
<li>collect and share digital provenance</li>
<li>industry standard: Git</li>
</ul>
</td>
</tr>
</table>
<img class="fragment fade-in" data-fragment-index="4" src="../pics/git.png" height="100px">
<img class="fragment fade-in" data-fragment-index="4" src="../pics/git-paper.png">
</section>
<section data-transition="None">
<dl>
<dt>The building blocks of a scientific result are rarely static</dt>
<table>
<tr>
<td style="vertical-align:middle">Data changes <br>
<small>(errors are fixed, data is extended,<br>
naming standards change, an analysis <br>
requires only a subset of your data...)</small></td>
<td>
<div class="r-stack">
<img src="../pics/phd052810s.png" height="400">
</div>
<imgcredit>Piled Higher and Deeper
<a href="https://phdcomics.com/comics/archive_print.php?comicid=1323" target="_blank">
1323
</a> </imgcredit></td>
</tr>
</table>
</dl>
<p class="fragment fade-in" data-fragment-index="2">
Large data version control (e.g., <a href="https://git-annex.branchable.com" target="_blank">git-annex</a>,
<a href="https://datalad.org" target="_blank">DataLad</a>)</p>
<div class="r-stack">
<img class="fragment fade-in" data-fragment-index="2" src="../pics/tigdata.png">
<img class="fragment fade-in" data-fragment-index="3" src="../pics/tigdata3.png">
<img class="fragment fade-in" data-fragment-index="4" src="../pics/tigdata2.png">
</div>
</section>
<section data-transition="None">
<h2>Leaving a trace </h2>
<div class="r-stack">
<p class="fragment fade-out" data-fragment-index="1">"Shit, which version of which script produced these outputs from which version
of what data?"</p>
<p class="fragment fade-in" data-fragment-index="1">
"Shit, which buttons did I click and in which order did I use all those tools?"</p>
</div>
<div class="r-stack">
<p>
<img class="fragment fade-in-then-out" data-fragment-index="1" src="../pics/manuallabor.png">
<img class="fragment fade-out" data-fragment-index="2" src="../pics/findfiles.png" height="300">
<img class="fragment fade-out" data-fragment-index="2" src="../pics/projectstack.png" height="300">
<imgcredit>CC-BY Scriberia and <a href="https://the-turing-way.netlify.app/reproducible-research/rdm.html" target="_blank">
The Turing Way</a>
</imgcredit>
</p>
<p>
<img class="fragment fade-in" data-fragment-index="2" height="200px" src="../pics/file-management-manual-with-text.png">
<img class="fragment fade-in" data-fragment-index="3" height="200px" src="../pics/documentation.png">
<img class="fragment fade-in" data-fragment-index="4" height="200px" src="../pics/turingway/MachineReadable.png">
</p>
</div>
<div style="font-size:30px">
<p class="fragment fade-in" data-fragment-index="2">1) Create an intuitive structure, and </p>
<p class="fragment fade-in" data-fragment-index="3">2) write (plenty! of) documentation as you go, and<br></p>
<p class="fragment fade-in" data-fragment-index="4">
3) make your processes machine-readable <br><small>Tools and tricks: Perkel, 2020,
<a href="https://www.nature.com/articles/d41586-020-02462-7" target="_blank">
checklist for computational reproducibility
</a></small>
</p></div>
</section>
<section data-transition="None">
<h2>Methods documentation and provenance</h2>
Analytic flexibility leads to sizeable variations in results
<br><small>(see e.g., Carp, 2012 and Botvinik-Nezer et al., 2020 for examples from neuroimaging)</small><br>
<img src="../pics/sidney_harris_miracle.jpg" style="box-shadow: 10px 10px 8px #888888;height:500px" height="500"><br>
<ul>
<li>provide information on how data came into existence</li>
<li>change data through documented code, not manually</li>
<li>relate changes in data to changes in code</li>
</ul>
</section>
<section data-transition="None">
<h2>Reproducibility is highly technical</h2>
<img src="../pics/fragile.png" height="800">
<imgcredit>Based on <a href="https://xkcd.com/2347/" target="_blank">
xkcd.com/2347/</a> (CC BY-NC 2.5)</imgcredit>
<small><a href="https://www.youtube.com/watch?v=nTVcMDVlyOI" target="_blank">
Reproducibility Management in Neuroscience -
Specific Issues and Solutions</a>
(<a href="https://doi.org/10.5281/zenodo.4285927" target="_blank">DOI 10.5281/zenodo.4285927</a>) </small>
</section>
<section data-transition="None">
<h2>Your own reproducibility management</h2>
What tools do you use to aid with reproducible science?
</section>
</section>
<section>
<section data-transition="None">
<h2>Let's try DataLad</h2>
<dl style="font-size:37px">
<a href="https://jupyter.edu.datalad.org" target="_blank">jupyter.edu.datalad.org</a>
<dt>username:</dt>
<dd>You received it by email (your first name)</dd>
<dt>password:</dt>
<dd>Set at first login, at least 8 characters</dd>
</dl>
</section>
<section style="text-align: left;" data-transition="None">
<h3>Git identity setup</h3>
Check Git identity:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --get user.name
git config --get user.email
</code>
</pre>
<div class="fragment">
Configure Git identity:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --global user.name "Adina Wagner"
git config --global user.email "adina.wagner@t-online.de"
</code>
</pre>
</div>
<div class="fragment">
Configure DataLad to use latest features:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --global --add datalad.extensions.load next
</code>
</pre>
</div>
</section>
<section style="text-align: left;" data-transition="None">
<h3>Using DataLad in a terminal</h3>
Check the installed version:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad --version
</code>
<p id="displayArea"></p>
</pre>
<div class="fragment">
For help on using DataLad from the command line:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad --help
</code>
The help may be displayed in a pager - exit it by pressing "q"
</pre>
</div>
<div class="fragment">
For extensive info about the installed package, its dependencies, and extensions, use <code>datalad wtf</code>.
Let's find out what kind of system we're on:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad wtf -S system
</code>
</pre>
</div>
</section>
<section style="text-align: left;" data-transition="None">
<h3>Using DataLad via its Python API</h3>
Open a Python environment:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
ipython
</code>
</pre>
<div class="fragment">
Import and start using:
<pre style="margin-left: 0;">
<code data-trim class="language-python" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
import datalad.api as dl
dl.create(path='mydataset')
</code>
</pre>
</div>
<div class="fragment">
Exit the Python environment:
<pre style="margin-left: 0;">
<code data-trim class="language-python" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
exit
</code>
</pre>
</div>
</section>
</section>
<section>
<section data-transition="None">
<h3 style="text-align: left;">DataLad datasets...</h3>
<img src="../pics/comic_box4.svg" alt="">
</section>
<section data-transition="None" style="text-align: left;">
<h3>...DataLad datasets</h3>
Create a dataset (here, with the <code>yoda</code> configuration, which adds
a helpful structure and configuration for data analyses): <br>
<img height="100px" src="../pics/yoda.png">
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad create -c yoda my-analysis
</code>
</pre>
<div class="fragment">
Let's have a look inside. Navigate using <code>cd</code> (change directory):
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cd my-analysis
</code>
</pre>
</div>
<div class="fragment">
List the directory content, including hidden files, with <code>ls</code>:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
ls -la .
</code>
</pre>
</div>
</section>
</section>
<section>
<section data-transition="None">
<h3 style="text-align: left;">Version control...</h3>
<img src="../pics/comic_box5.svg" alt="">
</section>
<section data-transition="None" style="text-align: left;">
<h3>...Version control</h3>
The yoda-configuration added a README placeholder in the dataset.
Let's add Markdown text (a project title) to it:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
printf '# My example DataLad dataset\n\nContains a small data analysis for my project\n' >| README.md
</code>
</pre>
<div class="fragment">
Now we can check the <code>status</code> of the dataset:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad status
</code>
</pre>
</div>
<div class="fragment">
We can save the state with <code>save</code>:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad save -m "Adjust boilerplate README to project"
</code>
</pre>
</div>
<div class="fragment">
Let's add code for a data analysis from an external source:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
wget https://hub.datalad.org/edu/scripts/raw/branch/main/iris/classification_analysis.py -O code/classification_analysis.py
</code>
</pre>
</div>
<div class="fragment">
Save again:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad save -m "Add analysis script"
</code>
</pre>
</div>
</section>
<section data-transition="None" style="text-align: left;">
<h3>...Version control</h3>
<div class="fragment">
Now, let's check the dataset history:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git log
</code>
</pre>
</div>
<div class="fragment">
We can also make the history prettier:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
tig
</code>
(navigate with arrow keys and enter, press "q" to go back and exit the program)
</pre>
</div>
</section>
<section data-transition="None">
<h2>Local version control</h2>
<p>Procedurally, version control is easy with DataLad!</p>
<img src="../pics/local_wf.svg" height="500"> <!-- .element: class="fragment" -->
<br>
<b>Advice:</b>
<ul>
<li>Save <i>meaningful</i> units of change</li>
<li>Attach helpful commit messages</li>
</ul>
</section>
</section>
<section>
<section data-transition="None">
<h3 style="text-align: left;">Computationally reproducible execution I...</h3>
<img src="../pics/comic_box7.svg" width="65%" alt="">
<ul>
<li class="fragment fade-in-then-semi-out">which script/pipeline version</li>
<li class="fragment fade-in-then-semi-out">was run on which version of the data</li>
<li class="fragment fade-in-then-semi-out">to produce which version of the results?</li>
</ul>
</section>
<section data-transition="None" style="text-align:left;">
<h3>... Computationally reproducible execution I</h3>
<div class="fragment">
A variety of processes can modify files. A simple example: Code formatting
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">black code/classification_analysis.py</code>
</pre>
</div>
<div class="fragment">
Version control makes changes transparent:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git diff</code>
</pre>
</div>
<div class="fragment">
But it's useful to keep track beyond that. Let's discard the latest changes...
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git restore code/classification_analysis.py</code>
</pre>
</div>
<div class="fragment">
... and record precisely what we did:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad run -m "Reformat code with black" \
"black code/classification_analysis.py"</code>
</pre>
</div>
<div class="fragment">
Let's take a look:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git show</code>
</pre>
</div>
<div class="fragment">
... and repeat!
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad rerun</code>
</pre>
</div>
</section>
</section>
<section>
<section data-transition="None">
<h3 style="text-align: left;">Data consumption & transport...</h3>
<img src="../pics/comic_box6_consumption.svg" alt="">
</section>
<section data-transition="None" style="text-align: left;">
<h3>...Data consumption & transport...</h3>
You can install a dataset from a remote URL (or local path) using <code>clone</code>.
Either as a stand-alone entity:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" >
# just an example:
datalad clone \
https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
</code>
</pre>
<div class="fragment">
Or as a linked dataset, nested in another dataset in a superdataset-subdataset hierarchy:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" >
# just an example:
datalad clone -d . \
https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
</code>
</pre>
<img src="../pics/linkage_subds.png" alt="">
</div>
<ul style="font-size:30px" class="fragment">
<li>Helps with scaling (see e.g. the <a href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">Human Connectome Project dataset</a> )</li>
<li>Version control tools struggle with &gt;100k files</li>
<li>Modular units improve intuitive structure and reuse potential</li>
<li>Versioned linkage of inputs for reproducibility</li>
</ul>
</section>
<section data-transition="None" style="text-align: left;">
<h3>...Dataset nesting</h3>
Let's make a nest!
<div class="fragment">
Clone a dataset with analysis data into a specific
location ("input/") in the existing dataset,
making it a <em>sub</em>dataset:
<pre style="margin-left: 0;">
<code class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad clone --dataset . \
https://hub.datalad.org/edu/iris_data.git \
input/</code>
</pre>
</div>
<div class="fragment">
Let's see what changed in the dataset, using the <code>subdatasets</code> command:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad subdatasets
</code>
</pre>
</div>
<div class="fragment">
... and also <code>git show</code>:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git show
</code>
</pre>
</div>
</section>
<section data-transition="None" style="text-align:left;">
<div class="fragment">
We can now view the cloned dataset's file tree:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cd input
ls
</code>
</pre>
</div>
<div class="fragment">
...and also its history
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
tig
</code>
</pre>
</div>
<div class="fragment">
Let's check the dataset size (with the <code>du</code> disk-usage command):
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
du -sh
</code>
</pre>
</div>
<div class="fragment">
Let's check the <em>actual</em> dataset size:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad status --annex
</code>
</pre>
</div>
<div class="fragment">
Let's try to print the file contents into the terminal (<code>cat</code>):
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cat iris.csv
</code>
</pre>
</div>
</section>
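<section data-markdown data-transition="none"><script type="text/template">
Why might the `cat` call fail? A hedged sketch, assuming the dataset uses git-annex's default symlink-based mode (not an unlocked or adjusted mode): an annexed file is then a symbolic link into the annex, and the link dangles until the content is retrieved. The helper name `content_present` is hypothetical, and the demo builds its own throwaway link instead of touching a real dataset.

```python
import os
import tempfile


def content_present(path: str) -> bool:
    """Return True if path resolves to existing content.

    os.path.exists follows symlinks, so a dangling link
    (a file whose content is not available locally) yields False.
    """
    return os.path.exists(path)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "annex-object")  # stand-in for annexed content
    link = os.path.join(tmp, "iris.csv")
    os.symlink(target, link)

    missing = content_present(link)    # the link exists, its content does not
    with open(target, "w") as f:       # simulate retrieving the content
        f.write("sepal_length,sepal_width\n")
    available = content_present(link)

print(missing, available)
```
</script></section>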
<section data-transition="None" style="text-align: left;">
<h3>...Data consumption & transport</h3>
We can retrieve actual file content with <code>get</code>:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad get iris.csv
</code>
</pre>
<div class="fragment">
If we don't need a file locally anymore, we can <code>drop</code> its content:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad drop iris.csv</code>
</pre>
</div>
<div class="fragment">
No need to store all files locally, or archive results with
giga- or terabytes of source data:
<pre><code class="python">import datalad.api as dl

dl.get('input/sub-01')
# [really complex analysis]
dl.drop('input/sub-01')</code></pre>
If data is published anywhere, your data analysis can carry an actionable link to it,
with barely any space requirements.
</div>
</section>
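<section data-markdown data-transition="none"><script type="text/template">
The get, analyze, drop pattern can be wrapped so that the content is dropped even when the analysis fails. A minimal sketch: the name `fetched` is made up, and `get` and `drop` are passed in as plain callables (an assumption of this sketch) so it can be tried without a real dataset.

```python
from contextlib import contextmanager


@contextmanager
def fetched(path, get, drop):
    """Retrieve content for path, hand it to the caller, then drop it again."""
    get(path)
    try:
        yield path
    finally:
        # Free the disk space even if the analysis raised an error.
        drop(path)


# Stand-ins that only record what would have happened.
calls = []
with fetched("input/sub-01",
             get=lambda p: calls.append(("get", p)),
             drop=lambda p: calls.append(("drop", p))) as data:
    calls.append(("analyze", data))

print(calls)
```

With DataLad installed, `dl.get` and `dl.drop` from `datalad.api` could presumably be passed in place of the stand-ins.
</script></section>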
<section>
<h2>Git versus Git-annex</h2>
<dl>
<dt>Data in datasets is either stored in Git or git-annex</dt>
<dd>By default, everything is <i>annexed</i>, i.e., stored in a dataset annex by git-annex</dd><br>
<img height="400" src="../pics/artwork/src/publishing/publishing_gitvsannex.svg">
<br><br>
<li class="fragment fade-in-then-semi-out">With annexed data, only content identity (hash)
and location information are put into Git, rather than file content.
The annex, and transport to and from it, is managed with <b>git-annex</b></li>
</dl>
</section>
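<section data-markdown data-transition="none"><script type="text/template">
What "content identity" can look like, as a rough sketch: a short key derived from file size and checksum stands in for the content, however large the file is. The key layout imitates git-annex's SHA256E keys, but treat the exact format, and the helper name `content_key`, as assumptions of this example.

```python
import hashlib


def content_key(content: bytes, extension: str = "") -> str:
    """Build an identifier from the size and SHA-256 checksum of the content."""
    digest = hashlib.sha256(content).hexdigest()
    return f"SHA256E-s{len(content)}--{digest}{extension}"


row = b"5.1,3.5,1.4,0.2\n"
small = content_key(row, ".csv")
large = content_key(row * 100_000, ".csv")

print(small)
# The key stays short even when the content is 100,000 times larger.
print(len(large) < 100)
```

Identical content always yields the same key, and any change to the content yields a different one, which is what lets Git track a file without storing its content.
</script></section>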
<section data-transition="None">
<h2>Git versus Git-annex</h2>
<dl>
<dt>Configurations (e.g., YODA), custom <a href="http://handbook.datalad.org/en/latest/basics/101-123-config2.html" target="_blank">
rules</a>, or command parametrization determine if a file is annexed</dt>
<dd>Storing files in Git or git-annex has distinct advantages:</dd><br>
<br>
<table >
<tr style="font-size:35px">
<td><b>Git</b></td>
<td><b>git-annex</b></td>
</tr>
<tr style="font-size:30px">
<td>handles <b>small</b> files well (text, code)</td>
<td>handles <b>all</b> types and sizes of files well</td>
</tr>
<tr style="font-size:30px">
<td>file contents are in the Git history
and will be <b>shared</b> upon git/datalad push</td>
<td>file contents are in the annex. Not necessarily shared</td>
</tr>
<tr style="font-size:30px">
<td>Shared with every dataset clone</td>
<td><b>Can be kept private</b> on a per-file level when sharing the dataset</td>
</tr>
<tr style="font-size:30px">
<td>Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files </td>
<td>Useful: Large files, private files</td>
</tr>
</table>
<br><br>
<div style="text-align:center" class="fragment">YODA configures the contents of the <code>code/</code>
directory and the dataset descriptions (e.g., README files) to be in Git.
There are many other configurations, and you can also
<a href="http://handbook.datalad.org/en/latest/basics/101-124-procedures.html" target="_blank">
write your own</a>.<br>
<img height="100px" src="../pics/yoda.png">
</div>
</dl>
</section>
<section data-transition="None" style="text-align: left;">
<h3>...Computationally reproducible execution...</h3>
Try to execute the downloaded analysis script. Does it work?
<div><pre style="margin-left: 0;"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cd ..
python code/classification_analysis.py</code></pre></div>
<ul class="fragment">
<li>
Software can be difficult or impossible to install (e.g. conflicts with existing software,
or on HPC) for you or your collaborators
</li>
<li>
Different software versions/operating systems can produce different results:
<a href="https://doi.org/10.3389/fninf.2015.00012" target="_blank">Glatard et al., doi.org/10.3389/fninf.2015.00012</a>
</li>
<li class="fragment fade-in">
<strong>Software containers</strong> encapsulate a software environment and isolate it from
a surrounding operating system. Two common solutions: Docker, Singularity
</li>
</ul>
</section>
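<section data-markdown data-transition="none"><script type="text/template">
If versions and operating systems can change results, a first modest step is to write down which environment produced them. A minimal sketch using only the Python standard library (the function name `describe_environment` is made up here); a software container goes further by shipping the environment itself.

```python
import platform
import sys


def describe_environment() -> dict:
    """Collect basic facts about the interpreter and the operating system."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "executable": sys.executable,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```
</script></section>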
<section data-transition="None" style="text-align: left;">
<h3>...Computationally reproducible execution</h3>
<div class="fragment">
With the <code>datalad-container</code> extension, we can add software containers
to datasets and work with them.
Let's add a software container with Python software to run the script:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad containers-add python-env --url shub://adswa/resources:2
</code>
</pre>
</div>
<div class="fragment">
Inspect the list of registered containers:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad containers-list
</code>
</pre>
</div>
<div class="fragment">
Now, let's try out the <code>containers-run</code> command:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad containers-run -m "run classification analysis in python environment" \
--container-name python-env \
--input "input/iris.csv" \
--output "pairwise_relationships.png" \
--output "prediction_report.csv" \
"python3 code/classification_analysis.py {inputs} {outputs}"
</code>
</pre>
</div>
<div class="fragment">
What changed after the <code>containers-run</code> command has completed?
<br>
We can use <code>datalad diff</code> (based on <code>git diff</code>):
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad diff -f HEAD~1
</code>
</pre>
</div>
<div class="fragment">
We see that some files were added to the dataset!
<br>
And we have a complete provenance record as part of the git history:
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git log -n 1
</code>
</pre>
</div>
</section>
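<section data-markdown data-transition="none"><script type="text/template">
### ...Computationally reproducible execution
The provenance record in the commit message can also be read programmatically. A minimal sketch, assuming the run commit keeps its JSON record between the two marker lines that DataLad writes into the message; `show_run_record` is a hypothetical helper name, not a DataLad command:

```bash
# Print only the JSON run record of a commit (default: HEAD).
# Assumption: the record sits between DataLad's "Do not change lines" markers.
show_run_record() {
  git log -n 1 --format=%B "${1:-HEAD}" |
    sed -n '/^=== Do not change lines below ===/,/^\^\^\^ Do not change lines above/p' |
    sed '1d;$d'   # drop the two marker lines themselves
}
```

Call `show_run_record` for the latest commit, or `show_run_record HASH` for an earlier one.
</script></section>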
<section data-transition="None" style="text-align: left;">
<h3>...Computationally reproducible execution...</h3>
<ul>
<li class="fragment fade-in-then-semi-out">The <code>datalad run</code> command
can run any command in a way that links the command or script to the
results it produces and the data it was computed from</li>
<li class="fragment fade-in-then-semi-out">The <code>datalad rerun</code> command
can take this recorded provenance and re-execute the command</li>
<li class="fragment fade-in-then-semi-out">The <code>datalad containers-run</code> command
(from the <code>datalad-container</code> extension) can capture software provenance in the form of software containers, in addition to the provenance that <code>datalad run</code> captures</li>
</ul>
<br><br>
</section>
</section>
<section>
<section data-markdown data-transition="none"><script type="text/template">
## "Share data like source code"
Datasets can be cloned, pushed, and updated from and to **local** and **remote** paths, **remote hosting services**, and external **special remotes**
![](../pics/artwork/src/publishing/startingpoint.svg)
<div class="fragment">We will use Forgejo-aneksajo: <a href="https://hub.edu.datalad.org/" target="_blank">hub.edu.datalad.org</a></div>
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Objective: Publish the dataset to Forgejo
**Preparation: Obtain a token**
Go to <a href="https://hub.edu.datalad.org/user/settings" target="_blank">hub.edu.datalad.org/user/settings</a>
<div class="r-stack">
<img src="../pics/forgejo-token2.png">
<img class="fragment" src="../pics/forgejo-token3.png">
</div>
</script></section>
<section data-transition="none">
<h2>Objective: Publish the dataset to Forgejo</h2>
<div>
<ul>
<li>Credential prep:</li>
</ul>
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --global credential.helper 'store --file ~/.git-credentials'
</code>
</pre>
</div>
<div>
<ul>
<li>Create a new repository <code>my-analysis</code> in the web interface: <a href="https://hub.edu.datalad.org/repo/create" target="_blank">https://hub.edu.datalad.org/repo/create</a></li>
<li>Register a sibling / remote URL in the <code>my-analysis</code> dataset, using the URL
<a href="https://hub.edu.datalad.org/USER-NAME/my-analysis.git" target="_blank">https://hub.edu.datalad.org/USER-NAME/my-analysis.git</a>
(replace USER-NAME with your Forgejo account name):</li>
</ul>
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git remote add origin https://hub.edu.datalad.org/USER-NAME/my-analysis.git
</code>
</pre>
</div>
<div>
<ul>
<li>Push the dataset and its file contents. What gets reported in your terminal?</li>
</ul>
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad push --to origin</code>
</pre>
<small>(Supply your account name and the token as password when prompted in the terminal!)</small>
</div>
<br><br>
<h3>In the Forgejo web interface, explore your newly created repository.</h3>
</section>
<section data-transition="none">
<h2>Objective: Clone your neighbour's dataset</h2>
<div>
<ul>
<li>Clone your right neighbour's dataset (replace USER-NAME with <em>their</em> Forgejo account name).
Make sure you're not inside your own dataset.</li>
</ul>
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad clone https://hub.edu.datalad.org/USER-NAME/my-analysis.git other-analysis</code>
</pre>
</div>
<div>
<ul>
<li>Find the commit hash of their run commit and rerun their analysis:</li>
</ul>
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad rerun HASH</code>
</pre>
</div>
</section>
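<section data-markdown data-transition="none"><script type="text/template">
## Objective: Find the run commit
Run commits can be recognized by their commit message. A minimal sketch, assuming the `[DATALAD RUNCMD]` prefix that `datalad run` and `datalad containers-run` put at the start of the message; `latest_run_commit` is a hypothetical helper, not part of DataLad:

```bash
# Print the full hash of the most recent commit whose message
# contains the DataLad run marker (assumed: "DATALAD RUNCMD").
latest_run_commit() {
  git log -n 1 --format=%H --grep="DATALAD RUNCMD"
}
```

Inside `other-analysis`, `datalad rerun "$(latest_run_commit)"` would then re-execute that commit.
</script></section>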
<section data-markdown data-transition="none"><script type="text/template">
**Objective: Stay up to date**
- While `datalad push` publishes new developments, `datalad update` fetches or pulls them.
- `datalad update` <em>fetches</em>, `datalad update --how merge` <em>pulls</em> updates.
- `-s` declares the sibling to update from.
- `-r` performs a recursive update (including subdatasets).
- Try pushing and pulling an update yourself.
```
datalad update --how merge -s origin
```
<!-- .element: style="font-size:75%" -->
</script></section>
<section data-transition="None">
<h2>Share the ingredients, but also the recipe!</h2>
<img src="../pics/agoodstart3.png">
<imgcredit>CC-BY Scriberia and <a href="https://the-turing-way.netlify.app/reproducible-research/rdm.html" target="_blank">
The Turing Way</a>
</imgcredit>
</section>
</section>
<!-------Examples-------->
<section>
<section data-transition="None">
<h3>But what's in it for me? "Selfish" reasons for reproducibility</h3>
<small>"[...] science is all about more publications, more impact factor, more money and more career. More, more, more ...<br>So how does working reproducibly help me achieve more as a scientist?" - Markowetz, 2015</small><br><br>
<div>
<ul>
<li>You want to avoid the disaster of publishing "a miracle"</li>
<li>You will be faster (in the long run)
<ul>
<li>Finding and fixing errors will be faster</li>
<li>Progress on new projects will happen faster</li>
</ul>
</li>
<li>Researchers (reviewers!) will have more trust in your findings</li>
<li>Data sharing can foster collaboration (with your past self, inside and outside your institution) and lead to new projects and publications</li>
<li>You acquire (technical) skills that will likely become increasingly important for your career, either in academia or industry</li><br>
</ul></div>
<div>
<i><b>It's just useful for your everyday work and makes your life easier!</b></i><br></div>
<br><br><small>see e.g., Markowetz, 2015, <i>Genome Biology</i>; Poldrack, 2019, <i>Neuron</i></small>
</section>
<section data-transition="None">
<h2>DataLad</h2>
<img style="height:300px; margin-top: 0; margin-right:1px;vertical-align:middle;" src="../pics/comic_box3.svg" alt="">
<br>
<ul style="font-size:37px">
<li>Domain-agnostic <strong>command-line tool</strong>
(+ <strong>graphical user interface</strong>),
built on top of <a href="https://git-scm.com/" target="_blank">Git</a>
& <a href="https://git-annex.branchable.com/" target="_blank">Git-annex</a></li>
<li>Major features:</li>
<dt>Version-controlling arbitrarily large content </dt>
<dd>Version control data & software alongside code!</dd>
<dt>Transport mechanisms for sharing & obtaining data </dt>
<dd>Consume & collaborate on data (analyses) like software</dd>
<dt>(Computationally) reproducible data analysis</dt>
<dd>Track and share provenance of all digital objects</dd>
<dt>(... and <i>much</i> more) </dt>
<br>
</ul>
</section>
<section data-transition="None">
<h2>Further resources and stay in touch</h2>
<ul>
If you have questions after the workshop...
<br><br>
<ul style="font-size:35px">
<dt>Reach out to the <b>DataLad</b> team via</dt>
<li>
<a href="https://matrix.to/#/!NaMjKIhMXhSicFdxAj:matrix.org?via=matrix.waite.eu&via=matrix.org&via=inm7.de" target="_blank">
Matrix</a> (a free, decentralized chat network; no app needed).
We run a weekly Zoom office hour (Monday, 2pm Berlin time) from this room as well.
</li>
<li>
<a href="https://github.com/datalad/datalad" target="_blank">
The development repository on GitHub</a>
</li>
<br>
<dt>Reach out to the (Neuro-) user community with</dt>
<li>A question on <a href="https://neurostars.org/" target="_blank">neurostars.org</a>
with a <code>datalad</code> tag</li>
<br>
<dt>Find more user tutorials or workshop recordings</dt>
<li>On <a href="https://www.youtube.com/datalad" target="_blank">
DataLad's YouTube channel</a>
</li>
<li>
In the <a href="http://handbook.datalad.org/en/latest/" target="_blank">
DataLad Handbook </a>
</li>
<li>In the <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">DataLad RDM course</a> </li>
<li>In the <a href="http://docs.datalad.org" target="_blank">Official API documentation</a> </li>
<li> In an overview of most tutorials, talks, videos at
<a href="https://github.com/datalad/tutorials" target="_blank">github.com/datalad/tutorials</a> </li>
</ul>
</ul>
</section>
<section data-transition="None">
<h2>Acknowledgements</h2>
<table>
<tr style="vertical-align:middle">
<td style="vertical-align:middle">
<dl>
<dt style="margin-top:20px">DataLad software <br>
& ecosystem</dt>
<dd style="margin-left:5px!important">
<ul style="margin-left:5px!important">
<li>Psychoinformatics Lab, <br>
Research Center Jülich</li>
<li>Center for Open <br>
Neuroscience, <br>
Dartmouth College</li>
<li>Joey Hess (git-annex)</li>
<li><em>>100 additional contributors</em></li>
</ul>
</dd>
</dl>
</td>
<td style="vertical-align:middle">
<div style="margin-bottom:-20px;text-align:center"><strong>Funders</strong></div>
<img style="height:150px;margin-right:50px" data-src="../pics/nsf.png" />
<img style="height:150px;margin-right:50px;margin-left:50px" data-src="../pics/binc.png" />
<img style="height:150px;margin-left:50px" data-src="../pics/bmbf.png" />
<div style="margin-top:-20px">
<img style="height:80px;margin-top:-40px;margin-left:40px" data-src="../pics/fzj_logo.svg" />
<img style="height:60px;margin-left:50px;margin-bottom:25px" data-src="../pics/dfg_logo.png" />
</div>
<div style="margin-top:-20px">
<img style="height:60px;margin-right:20px" data-src="../pics/erdf.png" />
<img style="height:60px;margin-right:20px" data-src="../pics/cbbs_logo.png" />
<img style="height:60px" data-src="../pics/LSA-Logo.png" />
</div>
<div style="margin-top:40px;margin-bottom:20px;text-align:center"><strong>Collaborators</strong></div>
<div style="margin-top:-20px">
<img style="height:100px;margin:20px" data-src="../pics/hbp_logo.png" />
<img style="height:100px;margin:20px" data-src="../pics/conp_logo.png" />
<img style="height:120px;margin:10px" data-src="../pics/openneuro_logo.png" />
</div>
<div style="margin-top:-40px">
<img style="height:100px;margin:20px" data-src="../pics/ebrains-logo.png"/>
<img style="height:100px;margin:0px" data-src="../pics/gin-logo.png" />
<img style="height:120px;margin:10px" data-src="../pics/sfb1451_logo.png" />
</div>
<div style="margin-top:-40px;align:middle">
<img style="height:140px;margin:10px" data-src="../pics/brainlife_logo.png" />
<img style="height:100px;margin:0px" data-src="../pics/cbrain_logo.png" />
<img style="height:100px;margin:20px" data-src="../pics/vbc_logo.png" />
</div>
</td>
</tr>
</table>
</section>
<section data-transition="None">
<h2>Thank you for your attention!</h2>
<img src="../pics/qr_hidarepro26.png" height="400">
<br><br><small>
Slides: <a href="https://doi.org/10.5281/zenodo.19692938" target="_blank">
DOI 10.5281/zenodo.19692938</a> (Scan the QR code)
<br><br>
</small>
<table>
<tr>
</tr>
<tr style="vertical-align:middle">
<td style="vertical-align:middle">
<img src="../pics/winrepo.png">
</td>
<td style="font-size: 18px">
<br><br>
Women are <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/ejn.14397" target="_blank">
underrepresented in neuroscience</a>. You can use the <br>
<a href="https://www.winrepo.org/" target="_blank"> Repository for Women in Neuroscience</a> to find
and recommend neuroscientists for <br>
conferences, symposia or collaborations, and help make neuroscience more open & diverse.
</td>
</tr>
</table>
</section>
</section>
<section>
<section>
<h2>How does this relate to reproducibility?</h2>
</section>
<section data-transition="None">
<h2>Exhaustive tracking</h2>
<dl style="font-size:35px">
<dt>The building blocks of a scientific result are rarely static</dt>
<table>
<tr>
<td style="vertical-align:middle">Data changes <br>
<small>(errors are fixed, data is extended,<br>
naming standards change, an analysis <br>
requires only a subset of your data...)</small></td>
<td><img src="../pics/phd052810s.png" height="500">
<imgcredit>Piled Higher and Deeper
<a href="https://phdcomics.com/comics/archive_print.php?comicid=1323" target="_blank">
1323
</a> </imgcredit></td>
</tr>
</table>
</dl>
</section>
<section data-transition="None">
<h3>Exhaustive tracking</h3>
Once you track changes to data with version control tools,
you can find out <em>why</em> it changed, <em>what</em> has changed, <em>when</em> it changed,
and <em>which version</em> of your data was used at which point in time.
<div class="r-stack">
<img height="450px" class="fragment fade-out" data-fragment-index="1" src="../pics/tigdata.png">
<img height="450px" class="fragment" data-fragment-index="1" src="../pics/tigdata3.png">
<img height="450px" class="fragment" src="../pics/tigdata2.png">
</div>
</section>
<section>
<h2>Digital provenance</h2>
<ul>
<p >
= <i>"The tools and processes used to create a
digital file, the responsible entity, and when and where the process
events occurred"</i>
</p>
<li class="fragment fade-in">
Have you ever saved a PDF to your computer to read later, but forgotten
where you got it from? Or have you ever found a figure in your project,
but forgotten which analysis step produced it?
</li>
<img src="../pics/Provenance_alpha.png">
<imgcredit data-fragment-index="1" >Scriberia and <a href="https://the-turing-way.netlify.app">The Turing Way </a> (CC-BY)</imgcredit>
</ul>
</section>
<section data-transition="None">
<h3>Data transport: Security and reliability - for data</h3>
Decentralized version control for data integrates with a variety of services
to let you store data in different places - creating a resilient network for data
<img src="../pics/decentral_RDM_overview_left.png">
<small> <a href="https://doi.org/10.1515/nf-2020-0037" target="_blank">"In defense of decentralized Research Data Management", doi.org/10.1515/nf-2020-0037</a> </small>
</section>
<section data-transition="None">
<h3>Ultimate goal: Reusability</h3>
Team science on more than code:
<img src="../pics/teamscience.png">
<img class="fragment" src="../pics/datahistory.png">
</section>
<section>
<h3>DataLad use cases</h3>
<div class="r-stack">
<li data-fragment-index="1" class="fragment fade-in-then-out"> <b>Publish or consume datasets</b>
via GitHub, GitLab, OSF, the European Open Science Cloud, or similar services</li>
<li data-fragment-index="2" class="fragment fade-in-then-out">
Behind-the-scenes <b>infrastructure component for data transport and versioning</b>
(e.g., used by <a href="https://openneuro.org/" target="_blank"> OpenNeuro</a>,
<a href="https://brainlife.io/" target="_blank"> brainlife.io </a>,
the <a href="https://conp.ca/" target="_blank">Canadian Open Neuroscience Platform (CONP)</a>,
<a href="https://mcin.ca/technology/cbrain/" target="_blank"> CBRAIN</a>)</li>
<li data-fragment-index="3" class="fragment fade-in-then-out"><b>Central data management</b> and archival system</li>
<li data-fragment-index="4" class="fragment fade-in-then-out"><b>Decentralized data and metadata catalog</b></li>
<li data-fragment-index="5" class="fragment fade-in-then-out"> <b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance </li>
</div>
<div class="r-stack">
<img data-fragment-index="1" height="700" class="fragment fade-in-then-out" src="../pics/getdata_studyforrest.gif" alt="a screenrecording of cloning studyforrest data from github">
<img height="700" class="fragment fade-in-then-out" data-fragment-index="2" src="../pics/openneuro_new_2.gif" alt="a screenrecording of browsing open neuro">
<img height="700" data-fragment-index="3" class="fragment fade-in-then-out" src="../pics/centralmanagement2.gif">
<img height="1000" data-fragment-index="4" class="fragment fade-in-then-out" src="../pics/sfb-catalog.gif">
<img height="700" class="fragment fade-in" data-fragment-index="5" src="../pics/remodnavpaper_2.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
</div>
</section>
<section data-transition="None">
<h2>A common use case</h2>
<div style="margin-top:0.5em;">
<table style="border: none;table-layout: fixed;">
<tr>
<td width="60%"><img style="height:500px; margin-top: 0; margin-right:1px;vertical-align:middle;" data-src="../pics/comic_box1.svg" /></td>
<td>
<ul style="vertical-align:middle;">
<li class="fragment fade-in">
Alice is a PhD student in a research team.</li>
<li class="fragment fade-in">
She works on a fairly typical research project:
Data collection & processing.</li>
<li class="fragment fade-in">
First sample → final result = complex process</li>
</ul>
</td>
</tr>
</table>
</div><br>
<h3 class="fragment fade-in">How does Alice go about her daily job?</h3>
</section>
<section data-transition="None">
<h2>A common use case</h2>
<ul>
<li class="fragment fade-in">
In her project, Alice likes to have an automated record of:
<ul>
<li>when a given file was last changed</li>
<li>where it came from</li>
<li>what input files were used to generate a given output</li>
<li>why some things were done.</li>
</ul>
</li>
<br>
<li class="fragment fade-in">
Even if she doesn't share her work, this is essential for her future self</li>
<li class="fragment fade-in">
Her project is exploratory: Frequent changes to her analysis scripts</li>
<li class="fragment fade-in">
She enjoys the comfort of being able to return to a previously recorded state</li>
</ul>
<br><br>
<h3 class="fragment fade-in">This is: <em>local version control</em></h3>
</section>
<section data-transition="None">
<h2>A common use case</h2>
<ul>
<li class="fragment fade-in" data-fragment-index="1">
Alice's work is not confined to a single computer:
<ul>
<li>Laptop / desktop / remote server / dedicated back-up</li>
<li>Alice wants to automatically & efficiently synchronize</li>
</ul>
</li>
<br>
<li class="fragment fade-in" data-fragment-index="2">
Parts of the data are collected or analyzed by colleagues.
This requires:
<ul>
<li>distributed synchronization with centralized storage</li>
<li>preservation of origin & authorship of changes</li>
<li>effective combination of simultaneous contributions</li>
</ul>
</li>
</ul>
<br><br>
<h3 class="fragment fade-in" data-fragment-index="3">This is: <em>distributed version control</em></h3>
</section>
<section data-transition="None">
<h2>A common use case</h2>
<ul>
<li class="fragment fade-in">
Alice applies local version control for her own work, and reproducibly records it
</li>
<li class="fragment fade-in">
She also applies distributed version control when working with colleagues
and collaborators
</li>
<li class="fragment fade-in">
She often needs to work on a subset of data at any given time:
<ul>
<li>all files are kept on a server</li>
<li>a few files are rotated into and out of her laptop</li>
</ul>
</li>
<li class="fragment fade-in">
Alice wants to publish the data at the project's end:
<ul>
<li>raw data / outputs / both</li>
<li>completely or selectively</li>
</ul>
</li>
</ul>
<br><br>
<h3 class="fragment fade-in">This is: <em>data management (with DataLad 😀)</em></h3>
</section>
</section>
</div>
</div>
<script src="../reveal.js/dist/reveal.js"></script>
<script src="../reveal.js/plugin/notes/notes.js"></script>
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
<script src="../custom_functions.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
hash: true,
// The "normal" size of the presentation, aspect ratio will be preserved
// when the presentation is scaled to fit different resolutions. Can be
// specified using percentage units.
width: 1280,
height: 960,
// Factor of the display size that should remain empty around the content
margin: 0.1,
// Bounds for smallest/largest possible scale to apply to content
minScale: 0.2,
maxScale: 1.5,
controls: true,
progress: true,
history: true,
center: true,
slideNumber: 'c',
pdfSeparateFragments: false,
pdfMaxPagesPerSlide: 1,
pdfPageHeightOffset: -1,
transition: 'slide', // none/fade/slide/convex/concave/zoom
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
});
</script>
</body>
</html>