<!doctype html>
|
|
<html>
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
|
|
|
|
<!-- Edit me start! -->
|
|
<title>Reproducibility with DataLad</title>
|
|
<meta name="description" content=" Reproducibility in science ">
|
|
<meta name="author" content=" Adina Wagner ">
|
|
<!-- Edit me end! -->
|
|
|
|
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
|
|
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
|
|
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
|
|
<link rel="stylesheet" href="../css/main.css">
|
|
<!-- Theme used for syntax highlighted code -->
|
|
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
|
|
</head>
|
|
<body>
|
|
<div class="reveal">
|
|
<div class="slides">
|
|
|
|
<!--...Datalad Basics...-->
|
|
|
|
<section>
|
|
<h2>Reproducibility in science</h2>
|
|
<h3>What it is and why to care, with examples from the DataLad World</h3>
|
|
|
|
<div style="margin-top:1em;text-align:center">
|
|
<table style="border: none;">
|
|
<tr>
|
|
<td style="border: none;">Adina Wagner
|
|
<br><small>
|
|
<a href="https://mas.to/@adswa" target="_blank">
|
|
<img data-src="../pics/mastodon.svg" style="height:30px;margin:0px" />
|
|
mas.to/@adswa</a></small></td>
|
|
<td style="border: none;">
|
|
<br></td>
|
|
</tr>
|
|
<tr>
|
|
<td style="border: none; vertical-align:top">
|
|
<small><a href="https://www.fz-juelich.de/en/inm/inm-7" target="_blank">Cognitive and Affective Biopsychology</a>,
|
|
<br> Institute of Neuroscience and
|
|
Medicine, Brain & Behavior (INM-7)<br>
|
|
Research Center Jülich</small><br>
|
|
</td>
|
|
<td><img style="height:100px;margin-right:10px" data-src="../pics/fzj_logo.png" /></td>
|
|
</tr>
|
|
</table>
|
|
</div>
|
|
<p style="z-index: 100;position: fixed;background-color:#ede6d5;font-size:35px;box-shadow: 10px 10px 8px #888888;margin-top:0px;margin-bottom:100px;margin-left:1000px">
|
|
<img src="../pics/qr_hidarepro26.png" height="200">
|
|
</p>
|
|
<br><br><small>
|
|
|
|
Slides: <a href="https://doi.org/10.5281/zenodo.19692938" target="_blank">
|
|
DOI 10.5281/zenodo.19692938</a> (Scan the QR code) <br>
|
|
<a href="https://files.inm7.de/adina/talks/html/hida2026"
|
|
target="_blank">files.inm7.de/adina/talks/html/hida2026.html</a>
|
|
</small>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Logistics</h2>
|
|
<img style="vertical-align:middle" src="../pics/qr_hida26notes.png" height="250">
|
|
<ul>
|
|
<li>
|
|
<strong>QR Code</strong> - Crowdsourced notes, networking, & anonymous questions at <a href="https://hedgedoc.psychoinformatics.de/7X6uaPPAR2-wkskcP0W0-A#" target="_blank">
|
|
hedgedoc.psychoinformatics.de/7X6uaPPAR2-wkskcP0W0-A#</a>
|
|
</li>
|
|
|
|
<li>
|
|
JupyterHub: <a href="https://jupyter.edu.datalad.org" target="_blank">jupyter.edu.datalad.org</a>.
|
|
</li>
|
|
<li>
|
|
Collaboration Hub: <a href="https://hub.edu.datalad.org" target="_blank">hub.edu.datalad.org</a>.
|
|
</li>
|
|
<li><i>Didn't get a user name by email? Speak up!</i></li>
|
|
</ul>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Common problems in science</h2>
|
|
<div>
|
|
You write a paper about an algorithm, stay up
|
|
late to generate good-looking figures, but you have to tweak parameters and
|
|
display options to make it work AND look good. The next morning, you have no
|
|
idea which parameters produced which figures, and which of the figures
|
|
fits to what you report in the paper.<br>
|
|
<img height="400" src="../pics/turingway/findfiles.png">
|
|
<img height="400" src="../pics/turingway/projectstack.png"></div>
|
|
<imgcredit>Illustration adapted from Scriberia and The Turing Way</imgcredit>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Common problems in science</h2>
|
|
<div>
|
|
Your research project produces phenomenal results, but your laptop,
|
|
the only place that stores the source code for the results, is
|
|
stolen/breaks<br>
|
|
<img height="700" src="../pics/stolenlaptop.jpg"></div>
|
|
<imgcredit>https://co.pinterest.com/pin/551128073121451139/</imgcredit>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Common problems in science</h2>
|
|
<div>
|
|
A graduate student approaches their supervisor, complaining that the
|
|
supervisor's research idea does not work. After weeks of discussion,
|
|
it becomes apparent that oral communication doesn't suffice - the
|
|
student can't sufficiently explain the environment (data, algorithms,
|
|
...) they constructed, and if the supervisor can't enter and use the
|
|
student's project, there's no way to find a fix.
|
|
<br>
|
|
<img height="500" src="../pics/badsupervision.gif"></div>
|
|
<imgcredit>http://phdcomics.com/comics.php?f=1693</imgcredit>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Common problems in science</h2>
|
|
<div>
|
|
A postdoc wrote a script during their PhD that applied a specific
|
|
method to a dataset. Now, with new data and a new project, they
|
|
try to reuse the script, but have forgotten how it works.
|
|
<br>
|
|
<img height="500" src="../pics/frustration.jpg"></div>
|
|
<imgcredit>http://phdcomics.com/comics.php?f=1693</imgcredit>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Common problems in science</h2>
|
|
<div>
|
|
You try to recreate results from another lab's published paper.
|
|
You base your re-implementation on everything reported in their paper,
|
|
but the results you obtain look nothing like the original.
|
|
<br>
|
|
<img height="500" src="../pics/turingway/ReadableCode.png"></div>
|
|
<imgcredit>http://phdcomics.com/comics.php?f=1693</imgcredit>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Sounds familiar?</h2>
|
|
Have you encountered any of these in your work so far?
|
|
<table>
|
|
<tr>
|
|
<td style="width:40%">
|
|
<ol>
|
|
<li>Forgot how your own results were generated</li>
|
|
<li>Lost single source of data</li>
|
|
<li>Miscommunication about analysis with supervisor</li>
|
|
<li>Can't get previous code to run</li>
|
|
<li>Failure to reproduce others' work</li>
|
|
<li>Something else related to reproducibility</li>
|
|
</ol>
|
|
</td>
|
|
<td>
|
|
<iframe src="https://directpoll.com/r?XDbzPBd3ixYqg8FLTFbS8naSCoKWa6nmjIlwFeDuQdOxY0"
|
|
style="border: 0" width="1500" height="700"></iframe>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</section>
|
|
|
|
<section>
|
|
<h2><strike>common</strike> old problems in science</h2>
|
|
<div>
|
|
All these problems were paraphrased from
|
|
<a href="https://sci-hub.se/https://link.springer.com/chapter/10.1007%2F978-1-4612-2544-7_5" target="_blank">
|
|
Buckheit & Donoho, <b>1995</b></a>
|
|
<br><br><br></div>
|
|
<img class="fragment fade-in" data-fragment-index="1" src="../pics/munafo_nathumbehav_screenshot.png" style="box-shadow: 10px 10px 8px #888888;height:400px" height="400"><br>
|
|
<small class="fragment fade-in" data-fragment-index="1">"A manifesto for reproducible science" by Munafò et al., 2017, <i>Nature Human Behaviour</i></small>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section>
|
|
<h3>Definitions</h3>
|
|
|
|
<table>
|
|
<tr>
|
|
<td></td>
|
|
<td><b>Same data</b></td>
|
|
<td><b>New data</b></td>
|
|
</tr>
|
|
<tr>
|
|
<td><b>Same methods</b></td>
|
|
<td><p style="color:red">Reproducibility</p></td>
|
|
<td>Replication</td>
|
|
</tr>
|
|
<tr>
|
|
<td><b>New methods</b></td>
|
|
<td>Robustness</td>
|
|
<td>Generalization</td>
|
|
</tr>
|
|
</table>
|
|
<br><small>see e.g., Freese & Peterson, 2017</small><br><br>
|
|
<i>"Authors provide all the necessary data and the computer codes to run the analysis again, re-creating the results."</i> <a href="https://library.seg.org/doi/abs/10.1190/1.1822162" target="_blank"> - Claerbout & Karrenbach, <b>1992</b></a>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>The road to reproducibility</h2>
|
|
<img src="../pics/reproduciblejourney.png">
|
|
|
|
<imgcredit>CC-BY Scriberia and <a href="https://the-turing-way.netlify.app/reproducible-research/rdm.html" target="_blank">
|
|
The Turing Way</a>
|
|
</imgcredit>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<dl>
|
|
<dt>The building blocks of a scientific result are rarely static</dt>
|
|
<table>
|
|
<tr>
|
|
<td style="vertical-align:middle">Analysis code evolves<br>
|
|
<small>(Fix bugs, add functions,
|
|
refactor, ...)</small></td>
|
|
<td>
|
|
<img src="../pics/final.png" height="500">
|
|
<imgcredit>Based on Piled Higher and Deeper
|
|
<a href="https://phdcomics.com/comics/archive_print.php?comicid=1531" target="_blank">
|
|
1531
|
|
</a> </imgcredit></td>
|
|
</tr>
|
|
</table>
|
|
</dl>
|
|
<img class="fragment fade-in" data-fragment-index="1" src="../pics/findfiles.png" height="400">
|
|
<img class="fragment fade-in" data-fragment-index="1" src="../pics/projectstack.png" height="350">
|
|
<imgcredit class="fragment fade-in" data-fragment-index="1" >Scriberia and <a href="https://the-turing-way.netlify.app">The Turing Way </a> (CC-BY)</imgcredit>
|
|
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Version control</h2>
|
|
<table>
|
|
<tr>
|
|
<td>
|
|
<img src="../pics/turingway/ProjectHistory.png" width="500">
|
|
<imgcredit><a href="https://the-turing-way.netlify.app/reproducible-research/vcs/vcs-data.html" target="_blank">
|
|
CC-BY Scriberia & The Turing Way</a>
|
|
</imgcredit>
|
|
</td>
|
|
<td>
|
|
<ul style="font-size:35px">
|
|
<dt>Version control</dt>
|
|
<li>keep things organized</li>
|
|
<li>keep track of changes</li>
|
|
<li>revert changes or go <br>
|
|
back to previous states</li>
|
|
<li>collect and share digital provenance</li>
|
|
<li>industry standard: Git</li>
|
|
</ul>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<img class="fragment fade-in" data-fragment-index="4" src="../pics/git.png" height="100px">
|
|
<img class="fragment fade-in" data-fragment-index="4" src="../pics/git-paper.png">
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<dl>
|
|
<dt>The building blocks of a scientific result are rarely static</dt>
|
|
<table>
|
|
<tr>
|
|
<td style="vertical-align:middle">Data changes <br>
|
|
<small>(errors are fixed, data is extended,<br>
|
|
naming standards change, an analysis <br>
|
|
requires only a subset of your data...)</small></td>
|
|
<td>
|
|
<div class="r-stack">
|
|
<img src="../pics/phd052810s.png" height="400">
|
|
|
|
</div>
|
|
<imgcredit>Piled Higher and Deeper
|
|
<a href="https://phdcomics.com/comics/archive_print.php?comicid=1323" target="_blank">
|
|
1323
|
|
</a> </imgcredit></td>
|
|
</tr>
|
|
</table>
|
|
</dl>
|
|
<p class="fragment fade-in" data-fragment-index="2">
|
|
Large data version control (e.g., <a href="https://git-annex.branchable.com" target="_blank">git-annex</a>,
|
|
<a href="https://datalad.org" target="_blank">DataLad</a>)
|
|
<div class="r-stack">
|
|
<img class="fragment fade-in" data-fragment-index="2" src="../pics/tigdata.png">
|
|
<img class="fragment fade-in" data-fragment-index="3" src="../pics/tigdata3.png">
|
|
<img class="fragment fade-in" data-fragment-index="4" src="../pics/tigdata2.png">
|
|
</div>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Leaving a trace </h2>
|
|
<div class="r-stack">
|
|
<p class="fragment fade-out" data-fragment-index="1">"Shit, which version of which script produced these outputs from which version
|
|
of what data?"</p>
|
|
<p class="fragment fade-in" data-fragment-index="1">
|
|
"Shit, which buttons did I click and in which order did I use all those tools?"</p>
|
|
</div>
|
|
<div class="r-stack">
|
|
<p>
|
|
<img class="fragment fade-in-then-out" data-fragment-index="1" src="../pics/manuallabor.png">
|
|
<img class="fragment fade-out" data-fragment-index="2" src="../pics/findfiles.png" height="300">
|
|
<img class="fragment fade-out" data-fragment-index="2" src="../pics/projectstack.png" height="300">
|
|
<imgcredit>CC-BY Scriberia and <a href="https://the-turing-way.netlify.app/reproducible-research/rdm.html" target="_blank">
|
|
The Turing Way</a>
|
|
</imgcredit>
|
|
</p>
|
|
<p>
|
|
<img class="fragment fade-in" data-fragment-index="2" height="200px" src="../pics/file-management-manual-with-text.png">
|
|
<img class="fragment fade-in" data-fragment-index="3" height="200px" src="../pics/documentation.png">
|
|
<img class="fragment fade-in" data-fragment-index="4" height="200px" src="../pics/turingway/MachineReadable.png">
|
|
</p>
|
|
</div>
|
|
<div style="font-size:30px">
|
|
<p class="fragment fade-in" data-fragment-index="2">1) Create an intuitive structure, and </p>
|
|
<p class="fragment fade-in" data-fragment-index="3">2) write (plenty! of) documentation as you go, and<br></p>
|
|
<p class="fragment fade-in" data-fragment-index="4">
|
|
3) make your processes machine-readable <br><small>Tools and tricks: Perkel, 2020,
|
|
<a href="https://www.nature.com/articles/d41586-020-02462-7" target="_blank">
|
|
checklist for computational reproducibility
|
|
</a></small>
|
|
</p></div>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Methods documentation and provenance</h2>
|
|
Analytic flexibility leads to sizeable variations in results
|
|
<br><small>(see e.g., Carp, 2012 and Botvinik-Nezer et al., 2020 for examples from neuroimaging)</small><br>
|
|
<img src="../pics/sidney_harris_miracle.jpg" style="box-shadow: 10px 10px 8px #888888;height:500px" height="500"><br>
|
|
<ul>
|
|
<li>provide information on how data came into existence</li>
|
|
<li>change data through documented code, not manually</li>
|
|
<li>relate changes in data to changes in code</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Reproducibility is highly technical</h2>
|
|
<img src="../pics/fragile.png" height="800">
|
|
<imgcredit>Based on <a href="https://xkcd.com/2347/" target="_blank">
|
|
xkcd.com/2347/</a> (CC BY-NC 2.5)</imgcredit>
|
|
<small><a href="https://www.youtube.com/watch?v=nTVcMDVlyOI" target="_blank">
|
|
Reproducibility Management in Neuroscience -
|
|
Specific Issues and Solutions</a>
|
|
(<a href="https://doi.org/10.5281/zenodo.4285927" target="_blank">DOI 10.5281/zenodo.4285927</a>) </small>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Your own reproducibility management</h2>
|
|
|
|
What tools do you use to aid with reproducible science?
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-transition="None">
|
|
<h2>Let's try DataLad</h2>
|
|
<dl style="font-size:37px">
|
|
<a href="https://jupyter.edu.datalad.org" target="_blank">jupyter.edu.datalad.org</a>
|
|
<dt>username:</dt>
|
|
<dd>You received it by email (your first name)</dd>
|
|
<dt>password:</dt>
|
|
<dd>Set at first login, at least 8 characters</dd>
|
|
</dl>
|
|
</section>
|
|
|
|
<section style="text-align: left;" data-transition="None">
|
|
<h3>Git identity setup</h3>
|
|
Check Git identity:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
git config --get user.name
|
|
git config --get user.email
|
|
</code>
|
|
</pre>
|
|
|
|
<div class="fragment">
|
|
Configure Git identity:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
git config --global user.name "Adina Wagner"
|
|
git config --global user.email "adina.wagner@t-online.de"
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
Configure DataLad to use latest features:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
git config --global --add datalad.extensions.load next
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
</section>
|
|
|
|
<section style="text-align: left;" data-transition="None">
|
|
<h3>Using DataLad in a terminal</h3>
|
|
|
|
Check the installed version:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad --version
|
|
</code>
|
|
<p id="displayArea"></p>
|
|
</pre>
|
|
|
|
<div class="fragment">
|
|
For help on using DataLad from the command line:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad --help
|
|
</code>
|
|
The help may be displayed in a pager - exit it by pressing "q"
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
For extensive info about the installed package, its dependencies, and extensions, use <code>datalad wtf</code>.
|
|
Let's find out what kind of system we're on:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad wtf -S system
|
|
</code>
|
|
</pre>
|
|
</div>
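<div class="fragment">
From within Python, the standard library offers a rough analogue. This is only a sketch of the same idea, not what <code>datalad wtf</code> does internally; the helper name <code>describe_system</code> is made up for illustration:
<pre style="margin-left: 0;">
<code data-trim class="language-python">
import json
import platform
import sys


def describe_system():
    """Collect basic facts about the machine and the Python in use."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "executable": sys.executable,
    }


print(json.dumps(describe_system(), indent=2))
</code>
</pre>
Storing such a summary next to your results makes it easier to explain later why two runs differ.
</div>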
|
|
</section>
|
|
|
|
|
|
<section style="text-align: left;" data-transition="None">
|
|
<h3>Using DataLad via its Python API</h3>
|
|
Open a Python environment:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
ipython
|
|
</code>
|
|
</pre>
|
|
<div class="fragment">
|
|
Import and start using:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-python" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
import datalad.api as dl
|
|
dl.create(path='mydataset')
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
<div class="fragment">
|
|
Exit the Python environment:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-python" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
exit
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-transition="None">
|
|
<h3 style="text-align: left;">DataLad datasets...</h3>
|
|
<img src="../pics/comic_box4.svg" alt="">
|
|
</section>
|
|
|
|
|
|
<section data-transition="None" style="text-align: left;">
|
|
<h3>...DataLad datasets</h3>
|
|
Create a dataset (here, with the <code>yoda</code> configuration, which adds
|
|
a helpful structure and configuration for data analyses): <br>
|
|
<img height="100px" src="../pics/yoda.png">
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad create -c yoda my-analysis
|
|
</code>
|
|
</pre>
|
|
|
|
<div class="fragment">
|
|
Let's have a look inside. Navigate using <code>cd</code> (change directory):
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
cd my-analysis
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
List the directory content, including hidden files, with <code>ls</code>:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
ls -la .
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-transition="None">
|
|
<h3 style="text-align: left;">Version control...</h3>
|
|
<img src="../pics/comic_box5.svg" alt="">
|
|
</section>
|
|
|
|
|
|
<section data-transition="None" style="text-align: left;">
|
|
<h3>...Version control</h3>
|
|
The yoda configuration added a README placeholder in the dataset.
|
|
Let's add Markdown text (a project title) to it:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
printf "# My example DataLad dataset\n\nContains a small data analysis for my project\n" >| README.md
|
|
</code>
|
|
</pre>
|
|
|
|
<div class="fragment">
|
|
Now we can check the <code>status</code> of the dataset:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad status
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
We can save the state with <code>save</code>:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad save -m "Adjust boilerplate README to project"
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
Let's add code for a data analysis from an external source:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
wget https://hub.datalad.org/edu/scripts/raw/branch/main/iris/classification_analysis.py -O code/classification_analysis.py
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
Save again:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad save -m "Add analysis script"
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
</section>
|
|
|
|
<section data-transition="None" style="text-align: left;">
|
|
<h3>...Version control</h3>
|
|
<div class="fragment">
|
|
Now, let's check the dataset history:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
git log
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
We can also make the history prettier:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
tig
|
|
</code>
|
|
(navigate with arrow keys and enter, press "q" to go back and exit the program)
|
|
</pre>
|
|
</div>
|
|
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Local version control</h2>
|
|
|
|
<p>Procedurally, version control is easy with DataLad!</p>
|
|
<img src="../pics/local_wf.svg" height="500"> <!-- .element: class="fragment" -->
|
|
<br>
|
|
|
|
<b>Advice:</b>
|
|
<ul>
|
|
<li>Save <i>meaningful</i> units of change</li>
|
|
<li>Attach helpful commit messages</li>
|
|
</ul>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-transition="None">
|
|
<h3 style="text-align: left;">Computationally reproducible execution I...</h3>
|
|
<img src="../pics/comic_box7.svg" width="65%" alt="">
|
|
<ul>
|
|
<li class="fragment fade-in-then-semi-out">which script/pipeline version</li>
|
|
<li class="fragment fade-in-then-semi-out">was run on which version of the data</li>
|
|
<li class="fragment fade-in-then-semi-out">to produce which version of the results?</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None" style="text-align:left;">
|
|
<h3>... Computationally reproducible execution I</h3>
|
|
<div class="fragment">
|
|
A variety of processes can modify files. A simple example: Code formatting
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">black code/classification_analysis.py</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
Version control makes changes transparent:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git diff</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
But it's useful to keep track beyond that. Let's discard the latest changes...
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git restore code/classification_analysis.py</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
... and record precisely what we did
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad run -m "Reformat code with black" \
  "black code/classification_analysis.py"</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
Let's take a look:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git show</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
... and repeat!
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad rerun</code>
|
|
</pre>
|
|
</div>
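<div class="fragment">
What such a record buys us can be sketched in plain Python. This is a simplified illustration, not DataLad's actual run-record format: the function <code>run_and_record</code> and the field names of the record are made up for this example.
<pre style="margin-left: 0;">
<code data-trim class="language-python">
import hashlib
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_and_record(cmd, outputs):
    """Run a command and describe what was run and what it produced."""
    result = subprocess.run(cmd, check=False)
    return {
        "cmd": cmd,
        "exit": result.returncode,
        "outputs": {str(p): sha256_of(p) for p in outputs},
    }


# Example: a tiny "analysis" that writes one result file.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "result.txt"
    cmd = [sys.executable, "-c", f"open({str(out)!r}, 'w').write('42')"]
    record = run_and_record(cmd, outputs=[out])

print(json.dumps(record, indent=2))
</code>
</pre>
Re-executing the recorded command and comparing the output checksums shows whether a result was reproduced.
</div>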
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-transition="None">
|
|
<h3 style="text-align: left;">Data consumption & transport...</h3>
|
|
<img src="../pics/comic_box6_consumption.svg" alt="">
|
|
</section>
|
|
|
|
|
|
<section data-transition="None" style="text-align: left;">
|
|
<h3>...Data consumption & transport...</h3>
|
|
|
|
You can install a dataset from a remote URL (or local path) using <code>clone</code>.
|
|
Either as a stand-alone entity:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" >
|
|
# just an example:
|
|
datalad clone \
  https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
|
|
</code>
|
|
</pre>
|
|
|
|
<div class="fragment">
|
|
Or as a linked dataset, nested in another dataset in a superdataset-subdataset hierarchy:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" >
|
|
# just an example:
|
|
datalad clone -d . \
  https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
|
|
</code>
|
|
</pre>
|
|
<img src="../pics/linkage_subds.png" alt="">
|
|
</div>
|
|
<ul style="font-size:30px" class="fragment">
|
|
<li>Helps with scaling (see e.g. the <a href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">Human Connectome Project dataset</a> )</li>
|
|
<li>Version control tools struggle with &gt;100k files</li>
|
|
<li>Modular units improve intuitive structure and reuse potential</li>
|
|
<li>Versioned linkage of inputs for reproducibility</li>
|
|
</ul>
|
|
</section>
|
|
|
|
|
|
<section data-transition="None" style="text-align: left;">
|
|
<h3>...Dataset nesting</h3>
|
|
|
|
Let's make a nest!
|
|
<div class="fragment">
|
|
Clone a dataset with analysis data into a specific
|
|
location ("input/") in the existing dataset,
|
|
making it a <em>sub</em>dataset:
|
|
<pre style="margin-left: 0;">
|
|
<code class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad clone --dataset . \
  https://hub.datalad.org/edu/iris_data.git \
  input/</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
Let's see what changed in the dataset, using the <code>subdatasets</code> command:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad subdatasets
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
<div class="fragment">
|
|
... and also <code>git show</code>:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
git show
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
</section>
|
|
|
|
<section data-transition="None" style="text-align:left;">
|
|
<div class="fragment">
|
|
We can now view the cloned dataset's file tree:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
cd input
|
|
ls
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
...and also its history
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
tig
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
Let's check the dataset size (with the <code>du</code> disk-usage command):
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
du -sh
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
Let's check the <em>actual</em> dataset size:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad status --annex
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
Let's try to print the file contents into the terminal (<code>cat</code>):
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
cat iris.csv
|
|
</code>
|
|
</pre>
|
|
</div>
|
|
|
|
|
|
</section>
|
|
|
|
|
|
|
|
<section data-transition="None" style="text-align: left;">
|
|
<h3>...Data consumption & transport</h3>
|
|
|
|
We can retrieve actual file content with <code>get</code>:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad get iris.csv
|
|
</code>
|
|
</pre>
|
|
|
|
<div class="fragment">
|
|
If we don't need a file locally anymore, we can <code>drop</code> its content:
|
|
<pre style="margin-left: 0;">
|
|
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
|
|
datalad drop iris.csv</code>
|
|
</pre>
|
|
</div>
|
|
<div class="fragment">
|
|
No need to store all files locally, or archive results with
|
|
Giga-/Terabytes of source data:
|
|
<pre><code class="python">import datalad.api as dl

dl.get('input/sub-01')     # retrieve file content on demand
# [really complex analysis]
dl.drop('input/sub-01')    # free up local disk space again</code></pre>
|
|
If data is published anywhere, your data analysis can carry an actionable link to it,
|
|
with barely any space requirements.
|
|
</div>
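<div class="fragment">
The get-process-drop pattern can be wrapped so that content is dropped even if the analysis fails. A minimal sketch: <code>temporarily_available</code> is a made-up helper, and its <code>get</code> and <code>drop</code> arguments are assumed to be callables such as <code>dl.get</code> and <code>dl.drop</code>. The demo uses stand-ins that only log their calls, so it runs without DataLad:
<pre style="margin-left: 0;">
<code data-trim class="language-python">
from contextlib import contextmanager


@contextmanager
def temporarily_available(path, get, drop):
    """Fetch content for the duration of a block, then free the space."""
    get(path)
    try:
        yield path
    finally:
        # Runs even if the analysis raised an exception.
        drop(path)


# Stand-ins for dl.get and dl.drop that only log what they were asked to do.
log = []
with temporarily_available(
    "input/sub-01",
    get=lambda p: log.append(("get", p)),
    drop=lambda p: log.append(("drop", p)),
) as data_path:
    log.append(("analyze", data_path))

print(log)
</code>
</pre>
</div>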
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Git versus git-annex</h2>
|
|
<dl>
|
|
<dt>Data in datasets is stored either in Git or in git-annex</dt>
|
|
<dd>By default, everything is <i>annexed</i>, i.e., stored in a dataset annex by git-annex</dd><br>
|
|
<img height="400" src="../pics/artwork/src/publishing/publishing_gitvsannex.svg">
|
|
<br><br>
|
|
<li class="fragment fade-in-then-semi-out">With annexed data, only content identity (hash)
|
|
and location information are put into Git, rather than file content.
|
|
The annex and the transport to and from it are managed with <b>git-annex</b>.</li>
|
|
</dl>
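<div class="fragment" style="font-size:30px">
Tracking content identity instead of content can be sketched in a few lines of Python. This is an illustration only: the key layout loosely imitates git-annex's SHA256E keys, and <code>content_key</code> is a made-up helper, not part of git-annex or DataLad.
<pre style="margin-left: 0;">
<code data-trim class="language-python">
import hashlib
from pathlib import Path


def content_key(data, filename):
    """Build an identifier from size, checksum, and file extension."""
    digest = hashlib.sha256(data).hexdigest()
    extension = Path(filename).suffix
    return f"SHA256E-s{len(data)}--{digest}{extension}"


original = content_key(b"sepal_length,species\n5.1,setosa\n", "iris.csv")
copy = content_key(b"sepal_length,species\n5.1,setosa\n", "backup/iris.csv")
changed = content_key(b"sepal_length,species\n5.2,setosa\n", "iris.csv")

print(original == copy)     # identical content, identical key
print(original == changed)  # any change in content changes the key
</code>
</pre>
A short key like this is cheap to version in Git, while the content it identifies can live elsewhere.
</div>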
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Git versus git-annex</h2>
|
|
<dl>
|
|
<dt>Configurations (e.g., YODA), custom <a href="http://handbook.datalad.org/en/latest/basics/101-123-config2.html" target="_blank">
|
|
rules</a>, or command parametrization determine whether a file is annexed</dt>
|
|
<dd>Storing files in Git or git-annex has distinct advantages:</dd><br>
|
|
|
|
<br>
|
|
|
|
<table >
|
|
<tr style="font-size:35px">
|
|
<td><b>Git</b></td>
|
|
<td><b>git-annex</b></td>
|
|
</tr>
|
|
<tr style="font-size:30px">
|
|
<td>handles <b>small</b> files well (text, code)</td>
|
|
<td>handles <b>all</b> types and sizes of files well</td>
|
|
</tr>
|
|
<tr style="font-size:30px">
|
|
<td>file contents are in the Git history
|
|
and will be <b>shared</b> upon git/datalad push</td>
|
|
<td>file contents are in the annex. Not necessarily shared</td>
|
|
</tr>
|
|
<tr style="font-size:30px">
|
|
<td>Shared with every dataset clone</td>
|
|
<td><b>Can be kept private</b> on a per-file level when sharing the dataset</td>
|
|
</tr>
|
|
<tr style="font-size:30px">
|
|
<td>Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files </td>
|
|
<td>Useful: Large files, private files</td>
|
|
</tr>
|
|
</table>
|
|
<br><br>
|
|
<div style="text-align:center" class="fragment">YODA configures the contents of the <code>code/</code>
|
|
directory and the dataset descriptions (e.g., README files) to be in Git.
|
|
There are many other configurations, and you can also
|
|
<a href="http://handbook.datalad.org/en/latest/basics/101-124-procedures.html" target="_blank">
|
|
write your own</a>.<br>
|
|
<img height="100px" src="../pics/yoda.png">
|
|
</div>
|
|
</dl>
|
|
</section>
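<section data-markdown data-transition="none"><script type="text/template">
## Git versus Git-annex

The "content identity" idea from the previous slides can be sketched in a few lines of Python. This is a simplified illustration, not git-annex's implementation: the helper `annex_style_key` is a hypothetical name, and the key layout shown (`SHA256E-s<size>--<checksum><extension>`) is assumed to match one of several backends that git-annex can be configured to use.

```python
import hashlib
import tempfile
from pathlib import Path


def annex_style_key(path: Path) -> str:
    """Build a SHA256E-style key from a file's size, checksum, and extension."""
    content = path.read_bytes()
    checksum = hashlib.sha256(content).hexdigest()
    return f"SHA256E-s{len(content)}--{checksum}{path.suffix}"


# Two files with identical content (illustrative data, not the real iris.csv)
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "iris.csv"
    copy = Path(tmp) / "copy.csv"
    first.write_bytes(b"sepal_length,sepal_width\n5.1,3.5\n")
    copy.write_bytes(first.read_bytes())
    key_first = annex_style_key(first)
    key_copy = annex_style_key(copy)

print(key_first)
print(key_first == key_copy)
```

Files with identical content and extension get the same key, so Git only needs to track this short identifier rather than the file content itself.
</script></section>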
|
|
|
|
<section data-transition="None" style="text-align: left;">
|
|
<h3>...Computationally reproducible execution...</h3>
|
|
|
|
Try to execute the downloaded analysis script. Does it work?
|
|
<div><pre style="margin-left: 0;"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cd ..
python code/classification_analysis.py</code></pre></div>
|
|
|
|
<ul class="fragment">
|
|
<li>
|
|
Software can be difficult or impossible to install (e.g. conflicts with existing software,
|
|
or on HPC) for you or your collaborators
|
|
</li>
|
|
<li>
|
|
Different software versions/operating systems can produce different results:
|
|
<a href="https://doi.org/10.3389/fninf.2015.00012" target="_blank">Glatard et al., doi.org/10.3389/fninf.2015.00012</a>
|
|
</li>
|
|
<li class="fragment fade-in">
|
|
<strong>Software containers</strong> encapsulate a software environment and isolate it from
|
|
a surrounding operating system. Two common solutions: Docker, Singularity
|
|
</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None" style="text-align: left;">
|
|
<h3>...Computationally reproducible execution</h3>
|
|
|
|
<div class="fragment">
|
|
With the <code>datalad-container</code> extension, we can add software containers
|
|
to datasets and work with them.
|
|
Let's add a software container with Python software to run the script
|
|
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad containers-add python-env --url shub://adswa/resources:2
</code>
</pre>
|
|
</div>
|
|
|
|
|
|
<div class="fragment">
|
|
Inspect the list of registered containers:
|
|
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad containers-list
</code>
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
Now, let's try out the <code>containers-run</code> command:
|
|
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad containers-run -m "run classification analysis in python environment" \
  --container-name python-env \
  --input "input/iris.csv" \
  --output "pairwise_relationships.png" \
  --output "prediction_report.csv" \
  "python3 code/classification_analysis.py {inputs} {outputs}"
</code>
</pre>
|
|
</div>
|
|
<div class="fragment">
|
|
What changed after the <code>containers-run</code> command has completed?
|
|
<br>
|
|
We can use <code>datalad diff</code> (based on <code>git diff</code>):
|
|
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad diff -f HEAD~1
</code>
</pre>
|
|
</div>
|
|
|
|
<div class="fragment">
|
|
We see that some files were added to the dataset!
|
|
<br>
|
|
And we have a complete provenance record as part of the git history:
|
|
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git log -n 1
</code>
</pre>
|
|
</div>
|
|
</section>
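<section data-markdown data-transition="none"><script type="text/template">
### ...Computationally reproducible execution...

The provenance record in the Git history is machine-readable. The following is a minimal sketch of how it could be extracted, assuming the record is stored as JSON between two marker lines in the commit message. The function name `extract_run_record` and the example message are made up for illustration; a real record contains further fields, and a `containers-run` record would also reference the container.

```python
import json

# Marker lines that are assumed to surround the JSON record
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message: str) -> dict:
    """Return the JSON run record embedded between the two marker lines."""
    start = commit_message.index(START) + len(START)
    end = commit_message.index(END, start)
    return json.loads(commit_message[start:end])


# Illustrative commit message, not real output
example_message = """[DATALAD RUNCMD] run classification analysis in python environment

=== Do not change lines below ===
{
 "cmd": "python3 code/classification_analysis.py {inputs} {outputs}",
 "exit": 0,
 "inputs": ["input/iris.csv"],
 "outputs": ["pairwise_relationships.png", "prediction_report.csv"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = extract_run_record(example_message)
print(record["cmd"])
print(record["inputs"], "->", record["outputs"])
```

Because command, inputs, and outputs are stored in a structured form, a tool can read them back and re-execute the command later.
</script></section>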
|
|
|
|
|
|
<section data-transition="None" style="text-align: left;">
|
|
<h3>...Computationally reproducible execution...</h3>
|
|
<ul>
|
|
<li class="fragment fade-in-then-semi-out">The <code>datalad run</code> command
|
|
can run any command in a way that links the command or script to the
|
|
results it produces and the data it was computed from</li>
|
|
<li class="fragment fade-in-then-semi-out">The <code>datalad rerun</code> command
|
|
can take this recorded provenance and recompute the command</li>
|
|
<li class="fragment fade-in-then-semi-out">The <code>datalad containers-run</code> command
|
|
(from the extension "datalad-container") can capture software provenance in the form of software containers in addition to the provenance that datalad run captures</li>
|
|
</ul>
|
|
<br><br>
|
|
|
|
</section>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## "Share data like source code"
|
|
Datasets can be cloned, pushed, and updated from and to **local** and **remote** paths, **remote hosting services**, external **special remotes**
|
|

|
|
<div class="fragment">We will use Forgejo-aneksajo: <a href="https://hub.edu.datalad.org/" target="_blank">hub.edu.datalad.org</a></div>
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Objective: Publish the dataset to Forgejo
|
|
|
|
**Preparation: Obtain a token**
|
|
Go to <a href="https://hub.edu.datalad.org/user/settings" target="_blank">hub.edu.datalad.org/user/settings</a>
|
|
<div class="r-stack">
|
|
<img src="../pics/forgejo-token2.png">
|
|
<img class="fragment" src="../pics/forgejo-token3.png">
|
|
</div>
|
|
</script></section>
|
|
|
|
<section data-transition="none">
|
|
<h2>Objective: Publish the dataset to Forgejo</h2>
|
|
|
|
<div>
|
|
<ul>
|
|
<li>Credential prep:</li>
|
|
</ul>
|
|
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git config --global credential.helper 'store --file ~/.git-credentials'
</code>
</pre>
|
|
</div>
|
|
<div>
|
|
<ul>
|
|
<li>Create a new repository <code>my-analysis</code> in the web interface: <a href="https://hub.edu.datalad.org/repo/create" target="_blank">https://hub.edu.datalad.org/repo/create</a></li>
|
|
<li>Register a sibling / remote URL in the <code>my-analysis</code> dataset, using the URL
|
|
<a href="https://hub.edu.datalad.org/USER-NAME/my-analysis.git" target="_blank">https://hub.edu.datalad.org/USER-NAME/my-analysis.git</a>
|
|
(replace USER-NAME with your Forgejo account name):</li>
|
|
</ul>
|
|
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
git remote add origin https://hub.edu.datalad.org/USER-NAME/my-analysis.git
</code>
</pre>
|
|
</div>
|
|
|
|
<div>
|
|
<ul>
|
|
<li>Push the dataset and its file contents. What gets reported in your terminal?</li>
|
|
</ul>
|
|
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad push --to origin</code>
</pre>
|
|
<small>(Supply your account name and the token as password when prompted in the terminal!)</small>
|
|
|
|
</div>
|
|
<br><br>
|
|
<h3>In the Forgejo web interface, explore your newly created repository.</h3>
|
|
</section>
|
|
|
|
<section data-transition="none">
|
|
|
|
<h2>Objective: Clone your neighbour's dataset</h2>
|
|
<div>
|
|
<ul>
|
|
<li>Clone your right neighbour's dataset (replace USER-NAME with <em>their</em> Forgejo account name).
|
|
Make sure you're not inside your own dataset.</li>
|
|
</ul>
|
|
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad clone https://hub.edu.datalad.org/USER-NAME/my-analysis.git other-analysis</code>
</pre>
|
|
</div>
|
|
|
|
<div>
|
|
<ul>
|
|
<li>Find the commit hash of their run commit. Rerun their analyses</li>
|
|
</ul>
|
|
<pre style="margin-left: 0;">
<code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad rerun HASH</code>
</pre>
|
|
</div>
|
|
</section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
|
|
**Objective: Stay up to date**
|
|
|
|
- While "push" publishes new developments, "datalad update" fetches or pulls them.
|
|
- "datalad update" <em>fetches</em>, "datalad update --how merge" <em>pulls</em> updates.
|
|
- "-s" declares the sibling to update from.
|
|
- "-r" performs a recursive update.
|
|
- Try pushing and pulling an update yourself.
|
|
```
datalad update --how merge -s origin
```
|
|
<!-- .element: style="font-size:75%" -->
|
|
</script></section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Share the ingredients, but also the recipe!</h2>
|
|
<img src="../pics/agoodstart3.png">
|
|
|
|
<imgcredit>CC-BY Scriberia and <a href="https://the-turing-way.netlify.app/reproducible-research/rdm.html" target="_blank">
|
|
The Turing Way</a>
|
|
</imgcredit>
|
|
</section>
|
|
</section>
|
|
|
|
<!-------Examples-------->
|
|
|
|
|
|
<section>
|
|
<section data-transition="None">
|
|
<h3>But what's in it for me? "Selfish" reasons for reproducibility</h3>
|
|
<small>"[...] science is all about more publications, more impact factor, more money and more career. More, more, more ...<br>So how does working reproducibly help me achieve more as a scientist?" - Markowetz, 2015</small><br><br>
|
|
<div>
|
|
<ul>
|
|
<li>You want to avoid the disaster of publishing "a miracle"</li>
|
|
<li>You will be faster (in the long run)
|
|
<ul>
|
|
<li>Finding and fixing errors will be faster</li>
|
|
<li>Progress on new projects will happen faster</li>
|
|
</ul>
|
|
</li>
|
|
<li>Researchers (reviewers!) will have more trust in your findings</li>
|
|
<li>Data sharing can foster collaboration (with your past self, inside and outside your institution) and lead to new projects and publications</li>
|
|
<li>You acquire (technical) skills that will likely become increasingly important for your career, either in academia or industry</li><br>
|
|
</ul></div>
|
|
<div>
|
|
<i><b>It's just useful for your everyday work and makes your life easier!</b></i><br></div>
|
|
<br><br><small>see e.g., Markowetz, 2015, <i>Genome Biology</i>; Poldrack, 2019, <i>Neuron</i></small>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>DataLad</h2>
|
|
<img style="height:300px; margin-top: 0; margin-right:1px;vertical-align:middle;" src="../pics/comic_box3.svg" alt="">
|
|
<br>
|
|
<ul style="font-size:37px">
|
|
<li>Domain-agnostic <strong>command-line tool</strong>
|
|
(+ <strong>graphical user interface</strong>),
|
|
built on top of <a href="https://git-scm.com/" target="_blank">Git</a>
|
|
& <a href="https://git-annex.branchable.com/" target="_blank">Git-annex</a></li>
|
|
<li>Major features:</li>
|
|
<dt>Version-controlling arbitrarily large content </dt>
|
|
<dd>Version control data & software alongside code!</dd>
|
|
<dt>Transport mechanisms for sharing & obtaining data </dt>
|
|
<dd>Consume & collaborate on data (analyses) like software</dd>
|
|
<dt>(Computationally) reproducible data analysis</dt>
|
|
<dd>Track and share provenance of all digital objects</dd>
|
|
<dt>(... and <i>much</i> more) </dt>
|
|
<br>
|
|
</ul>
|
|
</section>
|
|
|
|
|
|
<section data-transition="None">
|
|
<h2>Further resources and stay in touch</h2>
|
|
<ul>
|
|
If you have questions after the workshop...
|
|
<br><br>
|
|
<ul style="font-size:35px">
|
|
<dt>Reach out to the <b>DataLad</b> team via</dt>
|
|
<li>
|
|
<a href="https://matrix.to/#/!NaMjKIhMXhSicFdxAj:matrix.org?via=matrix.waite.eu&via=matrix.org&via=inm7.de" target="_blank">
|
|
Matrix</a> (a free, decentralized chat platform; no app needed).
|
|
We run a weekly Zoom office hour (Monday, 2pm Berlin time) from this room as well.
|
|
</li>
|
|
<li>
|
|
<a href="https://github.com/datalad/datalad" target="_blank">
|
|
The development repository on GitHub</a>
|
|
</li>
|
|
<br>
|
|
<dt>Reach out to the (Neuro-) user community with</dt>
|
|
<li>A question on <a href="https://neurostars.org/" target="_blank">neurostars.org</a>
|
|
with a <code>datalad</code> tag</li>
|
|
<br>
|
|
<dt>Find more user tutorials or workshop recordings</dt>
|
|
<li>On <a href="https://www.youtube.com/datalad" target="_blank">
|
|
DataLad's YouTube channel</a>
|
|
</li>
|
|
<li>
|
|
In the <a href="http://handbook.datalad.org/en/latest/" target="_blank">
|
|
DataLad Handbook </a>
|
|
</li>
|
|
<li>In the <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">DataLad RDM course</a> </li>
|
|
<li>In the <a href="http://docs.datalad.org" target="_blank">Official API documentation</a> </li>
|
|
<li> In an overview of most tutorials, talks, videos at
|
|
<a href="https://github.com/datalad/tutorials" target="_blank">github.com/datalad/tutorials</a> </li>
|
|
</ul>
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Acknowledgements</h2>
|
|
<table>
|
|
<tr style="vertical-align:middle">
|
|
<td style="vertical-align:middle">
|
|
<dl>
|
|
<dt style="margin-top:20px">DataLad software <br>
|
|
& ecosystem</dt>
|
|
<dd style="margin-left:5px!important">
|
|
<ul style="margin-left:5px!important">
|
|
<li>Psychoinformatics Lab, <br>
|
|
Research Center Jülich</li>
|
|
<li>Center for Open <br>
|
|
Neuroscience, <br>
|
|
Dartmouth College</li>
|
|
<li>Joey Hess (git-annex)</li>
|
|
<li><em>>100 additional contributors</em></li>
|
|
</ul>
|
|
</dd>
|
|
</td>
|
|
<td style="vertical-align:middle">
|
|
<div style="margin-bottom:-20px;text-align:center"><strong>Funders</strong></div>
|
|
<img style="height:150px;margin-right:50px" data-src="../pics/nsf.png" />
|
|
<img style="height:150px;margin-right:50px;margin-left:50px" data-src="../pics/binc.png" />
|
|
<img style="height:150px;margin-left:50px" data-src="../pics/bmbf.png" />
|
|
<div style="margin-top:-20px">
|
|
<img style="height:80px;margin-top:-40px;margin-left:40px" data-src="../pics/fzj_logo.svg" />
|
|
<img style="height:60px;margin-left:50px;margin-bottom:25px" data-src="../pics/dfg_logo.png" />
|
|
</div>
|
|
<div style="margin-top:-20px">
|
|
<img style="height:60px;margin-right:20px" data-src="../pics/erdf.png" />
|
|
<img style="height:60px;margin-right:20px" data-src="../pics/cbbs_logo.png" />
|
|
<img style="height:60px" data-src="../pics/LSA-Logo.png" />
|
|
</div>
|
|
<div style="margin-top:40px;margin-bottom:20px;text-align:center"><strong>Collaborators</strong></div>
|
|
<div style="margin-top:-20px">
|
|
<img style="height:100px;margin:20px" data-src="../pics/hbp_logo.png" />
|
|
<img style="height:100px;margin:20px" data-src="../pics/conp_logo.png" />
|
|
<img style="height:120px;margin:10px" data-src="../pics/openneuro_logo.png" />
|
|
</div>
|
|
<div style="margin-top:-40px">
|
|
<img style="height:100px;margin:20px" data-src="../pics/ebrains-logo.png"/>
|
|
<img style="height:100px;margin:0px" data-src="../pics/gin-logo.png" />
|
|
<img style="height:120px;margin:10px" data-src="../pics/sfb1451_logo.png" />
|
|
</div>
|
|
<div style="margin-top:-40px;align:middle">
|
|
<img style="height:140px;margin:10px" data-src="../pics/brainlife_logo.png" />
|
|
<img style="height:100px;margin:0px" data-src="../pics/cbrain_logo.png" />
|
|
<img style="height:100px;margin:20px" data-src="../pics/vbc_logo.png" />
|
|
</div>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Thank you for your attention!</h2>
|
|
|
|
<img src="../pics/qr_hidarepro26.png" height="400">
|
|
<br><br><small>
|
|
|
|
Slides: <a href="https://doi.org/10.5281/zenodo.19692938" target="_blank">
|
|
DOI 10.5281/zenodo.19692938</a> (Scan the QR code)
|
|
<br><br>
|
|
</small>
|
|
<table>
|
|
<tr>
|
|
</tr>
|
|
<tr style="vertical-align:middle">
|
|
<td style="vertical-align:middle">
|
|
<img src="../pics/winrepo.png">
|
|
</td>
|
|
<td style="font-size: 18px">
|
|
<br><br>
|
|
Women neuroscientists are <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/ejn.14397" target="_blank">
|
|
underrepresented in neuroscience</a>. You can use the <br>
|
|
<a href="https://www.winrepo.org/" target="_blank"> Repository for Women in Neuroscience</a> to find
|
|
and recommend neuroscientists for <br>
|
|
conferences, symposia or collaborations, and help make neuroscience more open & diverse.
|
|
</td>
|
|
</tr>
|
|
|
|
</table>
|
|
</section>
|
|
</section>
|
|
|
|
|
|
|
|
<section>
|
|
<section>
|
|
<h2>How does this relate to reproducibility?</h2>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Exhaustive tracking</h2>
|
|
<dl style="font-size:35px">
|
|
<dt>The building blocks of a scientific result are rarely static</dt>
|
|
<table>
|
|
<tr>
|
|
<td style="vertical-align:middle">Data changes <br>
|
|
<small>(errors are fixed, data is extended,<br>
|
|
naming standards change, an analysis <br>
|
|
requires only a subset of your data...)</small></td>
|
|
<td><img src="../pics/phd052810s.png" height="500">
|
|
<imgcredit>Piled Higher and Deeper
|
|
<a href="https://phdcomics.com/comics/archive_print.php?comicid=1323" target="_blank">
|
|
1323
|
|
</a> </imgcredit></td>
|
|
</tr>
|
|
</table>
|
|
</dl>
|
|
</section>
|
|
|
|
|
|
<section data-transition="None">
|
|
<h3>Exhaustive tracking</h3>
|
|
Once you track changes to data with version control tools,
|
|
you can find out <em>why</em> it changed, <em>what</em> has changed, <em>when</em> it changed,
|
|
and <em>which version</em> of your data was used at which point in time.
|
|
<div class="r-stack">
|
|
<img height="450px" class="fragment fade-out" data-fragment-index="1" src="../pics/tigdata.png">
|
|
<img height="450px" class="fragment" data-fragment-index="1" src="../pics/tigdata3.png">
|
|
<img height="450px" class="fragment" src="../pics/tigdata2.png">
|
|
</div>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Digital provenance</h2>
|
|
<ul>
|
|
<p >
|
|
= <i>"The tools and processes used to create a
|
|
digital file, the responsible entity, and when and where the process
|
|
events occurred"</i>
|
|
</p>
|
|
<li class="fragment fade-in">
|
|
Have you ever saved a PDF to read later onto your computer, but forgot
|
|
where you got it from? Or did you ever find a figure in your project,
|
|
but forgot which analysis step produced it?
|
|
</li>
|
|
<img src="../pics/Provenance_alpha.png">
|
|
<imgcredit data-fragment-index="1" >Scriberia and <a href="https://the-turing-way.netlify.app">The Turing Way </a> (CC-BY)</imgcredit>
|
|
</ul>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h3>Data transport: Security and reliability - for data</h3>
|
|
Decentralized version control for data integrates with a variety of services
|
|
to let you store data in different places - creating a resilient network for data
|
|
<img src="../pics/decentral_RDM_overview_left.png">
|
|
<small> <a href="https://doi.org/10.1515/nf-2020-0037" target="_blank">"In defense of decentralized Research Data Management", doi.org/10.1515/nf-2020-0037</a> </small>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h3>Ultimate goal: Reusability</h3>
|
|
Team science on more than code:
|
|
<img src="../pics/teamscience.png">
|
|
<img class="fragment" src="../pics/datahistory.png">
|
|
</section>
|
|
|
|
<section>
|
|
<h3>DataLad use cases</h3>
|
|
<div class="r-stack">
|
|
<li data-fragment-index="1" class="fragment fade-in-then-out"> <b>Publish or consume datasets</b>
|
|
via GitHub, GitLab, OSF, the European Open Science Cloud, or similar services</li>
|
|
<li data-fragment-index="2" class="fragment fade-in-then-out">
|
|
Behind-the-scenes <b>infrastructure component for data transport and versioning</b>
|
|
(e.g., used by <a href="https://openneuro.org/" target="_blank"> OpenNeuro</a>,
|
|
<a href="https://brainlife.io/" target="_blank"> brainlife.io </a>,
|
|
the <a href="https://conp.ca/" target="_blank">Canadian Open Neuroscience Platform (CONP)</a>,
|
|
<a href="https://mcin.ca/technology/cbrain/" target="_blank"> CBRAIN</a>)</li>
|
|
<li data-fragment-index="3" class="fragment fade-in-then-out"><b>Central data management</b> and archival system</li>
|
|
<li data-fragment-index="4" class="fragment fade-in-then-out"><b>Decentralized data and metadata catalog</b></li>
|
|
<li data-fragment-index="5" class="fragment fade-in-then-out"> <b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance </li>
|
|
</div>
|
|
<div class="r-stack">
|
|
<img data-fragment-index="1" height="700" class="fragment fade-in-then-out" src="../pics/getdata_studyforrest.gif" alt="a screenrecording of cloning studyforrest data from github">
|
|
<img height="700" class="fragment fade-in-then-out" data-fragment-index="2" src="../pics/openneuro_new_2.gif" alt="a screenrecording of browsing open neuro">
|
|
<img height="700" data-fragment-index="3" class="fragment fade-in-then-out" src="../pics/centralmanagement2.gif">
|
|
<img height="1000" data-fragment-index="4" class="fragment fade-in-then-out" src="../pics/sfb-catalog.gif">
|
|
<img height="700" class="fragment fade-in" data-fragment-index="5" src="../pics/remodnavpaper_2.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
|
|
</div>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>A common use case</h2>
|
|
<div style="margin-top:0.5em;">
|
|
<table style="border: none;table-layout: fixed;">
|
|
<tr>
|
|
<td width="60%"><img style="height:500px; margin-top: 0; margin-right:1px;vertical-align:middle;" data-src="../pics/comic_box1.svg" /></td>
|
|
<td>
|
|
<ul style="vertical-align:middle;">
|
|
<li class="fragment fade-in">
|
|
Alice is a PhD student in a research team.</li>
|
|
<li class="fragment fade-in">
|
|
She works on a fairly typical research project:
|
|
Data collection & processing.</li>
|
|
<li class="fragment fade-in">
|
|
First sample → final result = complex process</li>
|
|
</ul>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</div><br>
|
|
<h3 class="fragment fade-in">How does Alice go about her daily job?</h3>
|
|
</section>
|
|
|
|
|
|
<section data-transition="None">
|
|
<h2>A common use case</h2>
|
|
<ul>
|
|
<li class="fragment fade-in">
|
|
In her project, Alice likes to have an automated record of:
|
|
<ul>
|
|
<li>when a given file was last changed</li>
|
|
<li>where it came from</li>
|
|
<li>what input files were used to generate a given output</li>
|
|
<li>why some things were done.</li>
|
|
</ul>
|
|
</li>
|
|
<br>
|
|
<li class="fragment fade-in">
|
|
Even if she doesn't share her work, this is essential for her future self</li>
|
|
<li class="fragment fade-in">
|
|
Her project is exploratory: Frequent changes to her analysis scripts</li>
|
|
<li class="fragment fade-in">
|
|
She enjoys the comfort of being able to return to a previously recorded state</li>
|
|
</ul>
|
|
<br><br>
|
|
<h3 class="fragment fade-in">This is: <em>local version control</em></h3>
|
|
</section>
|
|
|
|
|
|
<section data-transition="None">
|
|
<h2>A common use case</h2>
|
|
<ul>
|
|
<li class="fragment fade-in" data-fragment-index="1">
|
|
Alice's work is not confined to a single computer:
|
|
<ul>
|
|
<li>Laptop / desktop / remote server / dedicated back-up</li>
|
|
<li>Alice wants to automatically & efficiently synchronize</li>
|
|
</ul>
|
|
</li>
|
|
<br>
|
|
<li class="fragment fade-in" data-fragment-index="2">
|
|
Parts of the data are collected or analyzed by colleagues.
|
|
This requires:
|
|
<ul>
|
|
<li>distributed synchronization with centralized storage</li>
|
|
<li>preservation of origin & authorship of changes</li>
|
|
<li>effective combination of simultaneous contributions</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<br><br>
|
|
<h3 class="fragment fade-in" data-fragment-index="3">This is: <em>distributed version control</em></h3>
|
|
</section>
|
|
|
|
|
|
<section data-transition="None">
|
|
<h2>A common use case</h2>
|
|
<ul>
|
|
<li class="fragment fade-in">
|
|
Alice applies local version control for her own work, and reproducibly records it
|
|
</li>
|
|
<li class="fragment fade-in">
|
|
She also applies distributed version control when working with colleagues
|
|
and collaborators
|
|
</li>
|
|
<li class="fragment fade-in">
|
|
She often needs to work on a subset of data at any given time:
|
|
<ul>
|
|
<li>all files are kept on a server</li>
|
|
<li>a few files are rotated into and out of her laptop</li>
|
|
</ul>
|
|
</li>
|
|
<li class="fragment fade-in">
|
|
Alice wants to publish the data at the project's end:
|
|
<ul>
|
|
<li>raw data / outputs / both</li>
|
|
<li>completely or selectively</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<br><br>
|
|
<h3 class="fragment fade-in">This is: <em>data management (with DataLad 😀)</em></h3>
|
|
</section>
|
|
</section>
|
|
|
|
|
|
|
|
|
|
|
|
</div>
|
|
</div>
|
|
|
|
<script src="../reveal.js/dist/reveal.js"></script>
|
|
<script src="../reveal.js/plugin/notes/notes.js"></script>
|
|
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
|
|
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
|
|
<script src="../custom_functions.js"></script>
|
|
<script>
  // More info about initialization & config:
  // - https://revealjs.com/initialization/
  // - https://revealjs.com/config/
  Reveal.initialize({
    hash: true,
    // The "normal" size of the presentation, aspect ratio will be preserved
    // when the presentation is scaled to fit different resolutions. Can be
    // specified using percentage units.
    width: 1280,
    height: 960,
    // Factor of the display size that should remain empty around the content
    margin: 0.1,
    // Bounds for smallest/largest possible scale to apply to content
    minScale: 0.2,
    maxScale: 1.5,

    controls: true,
    progress: true,
    history: true,
    center: true,
    slideNumber: 'c',
    pdfSeparateFragments: false,
    pdfMaxPagesPerSlide: 1,
    pdfPageHeightOffset: -1,
    transition: 'slide', // none/fade/slide/convex/concave/zoom
    // Learn about plugins: https://revealjs.com/plugins/
    plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
  });
</script>
|
|
</body>
|
|
</html>
|