531 lines
20 KiB
HTML
531 lines
20 KiB
HTML
<!doctype html>
|
|
<html>
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
|
|
|
|
<!-- Edit me start! -->
|
|
<title>Computational reproducibility</title>
|
|
<meta name="description" content=" Reproducible processing with DataLad ">
|
|
<meta name="author" content=" Adina Wagner & Michael Hanke ">
|
|
<!-- Edit me end! -->
|
|
|
|
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
|
|
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
|
|
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
|
|
<link rel="stylesheet" href="../css/main.css">
|
|
<!-- Theme used for syntax highlighted code -->
|
|
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
|
|
</head>
|
|
<body>
|
|
<div class="reveal">
|
|
<div class="slides">
|
|
|
|
<!--...Datalad Basics...-->
|
|
|
|
<section>
|
|
<section>
|
|
<script src="https://cdn.logwork.com/widget/countdown.js"></script>
|
|
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
|
|
data-style="columns" data-timezone="Europe/Berlin" data-date="2022-07-21 13:30">
|
|
Up next: Computational reproducibility </a>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section>
|
|
[Overflow area to finish everything unfinished]
|
|
</section>
|
|
|
|
<section data-markdown style="font-size:30px"><script type="text/template">
|
|
## Dataset management for Reproducibility & Reusability
|
|
<small>Read more at <a href="https://psychoinformatics-de.github.io/rdm-course/04-dataset-management/index.htmlhttps://psychoinformatics-de.github.io/rdm-course/04-dataset-management/index.html" target="_blank">
|
|
psychoinformatics-de.github.io/rdm-course/04-dataset-management
|
|
</a> </small>
|
|

|
|
|
|
When setting up data analyses...
|
|
- Create MODULAR datasets: Whenever a particular collection of files could anyhow be useful in more than one context (e.g. data), put them in their own dataset, and install it as a subdataset. <!-- .element: class="fragment fade-in-then-semi-out" -->
|
|
- Keep everything STRUCTURED: Bundle all components of one analysis into one superdataset, and within this dataset, separate code, data, output, execution environments.<!-- .element: class="fragment fade-in-then-semi-out" -->
|
|
- Keep a dataset SELF-CONTAINED with relative paths in scripts to subdatasets or files.
|
|
Do not use absolute paths.<!-- .element: class="fragment fade-in-then-semi-out" -->
|
|
|
|
</script>
|
|
</section>
|
|
|
|
<section data-transition="None">
|
|
<h2>Why Modularity?</h2>
|
|
|
|
<ul style="font-size:30px">
|
|
<li>1. Reuse and access management</li>
|
|
<img src="../pics/ukb_datasets.svg" height="500px">
|
|
|
|
</li>
|
|
<li class="fragment fade-in" data-fragment-index="1">2. Scalability</li>
|
|
<pre class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
|
|
15530572 annex'd files (77.9 TB recorded total size)
|
|
nothing to save, working tree clean</code></pre>
|
|
<small class="fragment fade-in" data-fragment-index="1"><a href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">(github.com/datalad-datasets/human-connectome-project-openaccess)</a></small>
|
|
</ul>
|
|
|
|
</section>
|
|
|
|
<section style="font-size:30px" data-transition="None">
|
|
<h2>Why Modularity?</h2>
|
|
<ul>
|
|
<li>3. Transparency</li><br>
|
|
|
|
Original:
|
|
<pre><code class="sh" style="max-height:none" data-trim>
|
|
/dataset
|
|
├── sample1
|
|
│ └── a001.dat
|
|
├── sample2
|
|
│ └── a001.dat
|
|
...
|
|
</code></pre>
|
|
<div class="fragment">
|
|
Without modularity, after applied transform (preprocessing, analysis, ...):
|
|
<pre><code class="sh" style="max-height:none" data-trim>
|
|
/dataset
|
|
├── sample1
|
|
│ ├── ps34t.dat
|
|
│ └── a001.dat
|
|
├── sample2
|
|
│ ├── ps34t.dat
|
|
│ └── a001.dat
|
|
...
|
|
</code></pre>
|
|
Without expert/domain knowledge, no distinction between original and derived data
|
|
possible.
|
|
</div>
|
|
</ul>
|
|
</section>
|
|
|
|
|
|
<section style="font-size:30px" data-transition="None">
|
|
<h2>Why Modularity?</h2>
|
|
<ul>
|
|
<li>3. Transparency</li><br>
|
|
|
|
Original:
|
|
<pre><code class="sh" style="max-height:none" data-trim>
|
|
/raw_dataset
|
|
├── sample1
|
|
│ └── a001.dat
|
|
├── sample2
|
|
│ └── a001.dat
|
|
...
|
|
</code></pre>
|
|
<strong>With modularity</strong> after applied transform (preprocessing, analysis, ...)
|
|
<pre><code class="sh" style="max-height:none" data-trim>
|
|
/derived_dataset
|
|
├── sample1
|
|
│ └── ps34t.dat
|
|
├── sample2
|
|
│ └── ps34t.dat
|
|
├── ...
|
|
└── inputs
|
|
└── raw
|
|
├── sample1
|
|
│ └── a001.dat
|
|
├── sample2
|
|
│ └── a001.dat
|
|
...
|
|
</code></pre>
|
|
Clearer separation of semantics, through use of pristine version of original dataset within a
|
|
<em>new, additional</em> dataset holding the outputs.</ul>
|
|
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>A machine-learning example</h2>
|
|
|
|
Code along or try it later at <br>
|
|
<a href="http://handbook.datalad.org/en/latest/usecases/ml-analysis.html" target="_blank">
|
|
handbook.datalad.org/usecases/ml-analysis.html</a>
|
|
</a>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Analysis layout</h2>
|
|
<table>
|
|
<tr>
|
|
<td>
|
|
<ul>
|
|
<li>Prepare an input data set</li>
|
|
<li class="fragment fade-in">Configure and setup an analysis dataset</li>
|
|
<li class="fragment fade-in">Prepare data</li>
|
|
<li class="fragment fade-in">Train models and evaluate them</li>
|
|
<li class="fragment fade-in">Compare different models, repeat with updated data</li>
|
|
</ul>
|
|
</td>
|
|
<td>
|
|
<img src="../pics/imagenette.png" width="800">
|
|
<small>Imagenette dataset</small>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Prepare an input dataset</h2>
|
|
<ul>
|
|
<li>Create a stand-alone input dataset</li>
|
|
<li>Either add data and <code>datalad save</code> it, or use commands such as <code>datalad download-url</code>
|
|
or <code>datalad add-urls</code> to retrieve it from web-sources</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Configure and setup an analysis dataset</h2>
|
|
<ul>
|
|
<li>Given the purpose of an analysis dataset, configurations can make it easier to use:</li>
|
|
<ul>
|
|
<li><code>-c yoda</code> prepares a useful structure</li>
|
|
<li><code>-c text2git</code> keeps text files such as scripts in Git</li>
|
|
</ul>
|
|
<li>The input dataset is installed as a subdataset</li>
|
|
<li>Required software is containerized and added to the dataset</li>
|
|
</ul>
|
|
</section>
|
|
|
|
|
|
<section data-transition="None">
|
|
<h3>Sharing software environments: Why and how</h3>
|
|
|
|
<p style="font-size:35px"> Science has many different building blocks: Code, software, and data produce research outputs.
|
|
The more you share, the more likely can others reproduce your results <br></p>
|
|
<img height="750px" src="../pics/agoodstart.png">
|
|
</section>
|
|
|
|
|
|
|
|
<section data-transition="None">
|
|
<h3>Sharing software environments: Why and how</h3>
|
|
<ul style="font-size:35px">
|
|
<li>
|
|
Software can be difficult or impossible to install (e.g. conflicts with existing software,
|
|
or on HPC) for you or your collaborators
|
|
</li>
|
|
<li>
|
|
Different software versions/operating systems can produce different results:
|
|
<a href="https://doi.org/10.3389/fninf.2015.00012" target="_blank">Glatard et al., doi.org/10.3389/fninf.2015.00012</a>
|
|
</li>
|
|
<iframe width="1200" height="700" src="https://doi.org/10.3389/fninf.2015.00012"></iframe>
|
|
</ul>
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Software containers</h2>
|
|
<ul style="font-size:35px">
|
|
<table>
|
|
<tr>
|
|
<td>
|
|
<img src="../pics/dockerexplain.png" height="500">
|
|
</td>
|
|
<td><img height="100" src="../pics/blog_docker.png"><br>
|
|
<img height="100" src="../pics/singularitylogo.jpg"> </td>
|
|
</tr>
|
|
|
|
</table>
|
|
</img>
|
|
<li>
|
|
Put simple, a cut-down virtual machine that is a portable and shareable
|
|
bundle of software libraries and their dependencies
|
|
</li>
|
|
<li><strong>Docker</strong> runs on all operating systems, but requires "sudo" (i.e., admin) privileges</li>
|
|
<li><strong>Singularity</strong> can run on computational clusters (no "sudo") but is not (well) on non-Linux</li>
|
|
<li>Their containers are different, but interoperable - e.g., Singularity can use and build Docker Images</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>The datalad-container extension</h2>
|
|
<ul style="font-size:30px">
|
|
<li>
|
|
The <code>datalad-container</code> extension gives DataLad commands to add, track, retrieve, and
|
|
execute Docker or Singularity containers.
|
|
</li>
|
|
<pre><code>pip/conda install datalad-container</code></pre>
|
|
<li>
|
|
If this extension is installed, DataLad can register software containers as "just another file" to your
|
|
dataset, and <strong>datalad containers-run</strong> analysis inside the container, capturing software as additional
|
|
provenance
|
|
</li>
|
|
</ul>
|
|
<img class="fragment fade-in" src="../pics/containers-run.svg" height="600"> <!-- .element: class="fragment" -->
|
|
</section>
|
|
|
|
|
|
<section>
|
|
<h2>Did you know...</h2>
|
|
<ul style="font-size:30px">
|
|
Helpful resources for working with software containers:
|
|
<li>
|
|
<a href="https://github.com/jupyterhub/repo2docker" target="_blank">
|
|
repo2docker</a> can fetch a Git repository/DataLad dataset and builds
|
|
a container image from configuration files
|
|
</li>
|
|
<li>
|
|
<a href="https://github.com/ReproNim/neurodocker" target="_blank">
|
|
neurodocker</a> can generate custom Dockerfiles and Singularity recipes
|
|
for neuroimaging.
|
|
</a>
|
|
</li>
|
|
<li>
|
|
<a href="https://github.com/repronim/containers" target="_blank">
|
|
The ReproNim container collection</a>, a DataLad dataset that
|
|
includes common neuroimaging software as configured singularity containers.
|
|
</li>
|
|
<li>
|
|
<a href="https://github.com/rocker-org/rocker" target="_blank">
|
|
rocker</a> - Docker container for R users
|
|
</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Prepare data</h2>
|
|
<ul>
|
|
<li>Add a script for data preparation (labels train and validation images)</li>
|
|
<li>Execute it using <code>datalad containers-run</code></li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Train models and evaluate them</h2>
|
|
<ul>
|
|
<li>Add scripts for training and evaluation.
|
|
This dataset state can be tagged to identify it easily at a later point</li>
|
|
<li>Execute the scripts using <code>datalad containers-run</code></li>
|
|
<li>By dumping a trained model as a joblib object the trained classifier stays reusable</li>
|
|
</ul>
|
|
</section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-markdown><script type="text/template">
|
|
# And now what?
|
|
</script></section>
|
|
<section data-markdown><script type="text/template">
|
|
## When everything is tracked: A reproducible paper
|
|
|
|
<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/nhLqmF58SLQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
|
|
|
- Peer-reviewed paper published in Behavior Research Methods [[DOI 10.3758/s13428-020-01428-x](https://doi.org/10.3758/s13428-020-01428-x)]<!-- .element: style="font-size:70%" -->
|
|
|
|
- Free to reproduce at https://github.com/psychoinformatics-de/paper-remodnav more details in the DataLad handbook
|
|
http://handbook.datalad.org/r.html?reproducible-paper.
|
|
|
|
- Full video: https://youtube.com/datalad
|
|
|
|
<!-- .element: style="font-size:70%" -->
|
|
Note:
|
|
- VERY useful prior publication
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
# Anticipate change!
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Exhaustive capture enables portability
|
|
<!-- .element: width="100%" -->
|
|
Precise identification of data and computational environments, combined for provenance records form a comprehensive and portable data structure, capturing all aspects of an investigation.
|
|
|
|
**Easily take your stuff with you, whereever and whenever you move on!**
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Services
|
|
<!-- .element: height="500" style="box-shadow: 10px 10px 8px #888888" -->
|
|
|
|
- make *the* difference for advertisment, discovery, convenience
|
|
- but imply gigantic dependencies
|
|
- often impossible to "take over"
|
|
|
|
**Make sure data/metadata are self-contained<br>to facilitate/enable transition to another service**
|
|
<aside class="notes">
|
|
Note to self
|
|
</aside>
|
|
</script>
|
|
</section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
# Is it really worth the investment?
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## FAIRly big: Process the UK Biobank (imaging data)
|
|
<!-- .element: height="400" -->
|
|
|
|
- 76 TB in 43 million files in total
|
|
- 42,715 participants contributed personal health data
|
|
- Strict DUA
|
|
- Custom binary-only downloader
|
|
- Most data records offered as (unversioned) ZIP files
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Challenges
|
|
|
|
- Process data such that
|
|
- Results are computationally reproducible (without the original compute infrastructure)
|
|
- There is complete linkage from results to an individual data record download
|
|
- It scales with the amount of available compute resources
|
|
|
|
- Data processing pipeline
|
|
- Compiled MATLAB blob
|
|
- 1h processing time per image, with 41k images to process
|
|
- 1.2 M output files (30 output files per input file)
|
|
- 1.2 TB total size of outputs
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## FAIRly big setup
|
|
<!-- .element: width="1200" style="margin-top:-35px;margin-bottom:-30px" -->
|
|
|
|
- UKB DataLad extension can track the evolution of the complete data release in DataLad datasets
|
|
<!-- .element: style="font-size:80%" -->
|
|
- Full version history
|
|
<!-- .element: style="font-size:80%" -->
|
|
- Native and BIDSified data layout
|
|
<!-- .element: style="font-size:80%" -->
|
|
|
|
<note>Wagner, Waite, Wierzba, Hoffstaedter, Waite, Poldrack, Eickhoff, Hanke (2021). FAIRly big: A framework for computationally reproducible processing of large-scale data.</note>
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## FAIRly big workflow
|
|
<!-- .element: width="1200" style="margin-top:-35px;margin-bottom:-30px" -->
|
|
- Common data representation in secure environments
|
|
- Content-agnostic persistent (encrypted) storage
|
|
- All computations in freshly bootstrapped emphemeral environments, only using information from a fully self-contained DataLad dataset
|
|
<note>Wagner et al. (2021). FAIRly big: A framework for computationally reproducible processing of large-scale data.</note>
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## FAIRly big provenance capture
|
|
<!-- .element: width="1200" style="margin-top:-35px;margin-bottom:-30px" -->
|
|
|
|
- Every single pipeline execution is tracked
|
|
- Each execution individually reproducible without HPC access
|
|
<note>Wagner et al. (2021). FAIRly big: A framework for computationally reproducible processing of large-scale data.</note>
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## FAIRly big movie
|
|
|
|
<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/UsW6xN2f2jc?start=17" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
|
|
|
- Rendered exclusively from information captured by DataLad in the output dataset. Full video: https://youtube.com/datalad
|
|
- Two full (re-)computations, programmatically comparable, verifiable, reproducible -- on any system with data access
|
|
</script></section>
|
|
</section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Interoperable digital research ecosystem
|
|
|
|
<!-- .element: width="100%" style="margin-bottom:100px" -->
|
|
|
|
<note>Hanke, Pestilli, Wagner, Markiewicz, Poline & Halchenko (2021). In defense of decentralized research data
|
|
management. Neuroforum, 72, 17-25.</note>
|
|
|
|
Note:
|
|
- Freedom of infrastructure selection
|
|
- Transitions between institutions and stewards
|
|
- Facilitate diverse collaboration
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Training and guideline development
|
|
<!-- .element: width="1100" style="margin-top:-20px;margin-bottom:30px" -->
|
|
|
|
<note>Adina S. Wagner, Jean-Baptiste Poline, Michael Hanke. *A pragmatic approach to reusable research outputs*. <a href="https://doi.org/10.7490/f1000research.1118575.1">10.7490/f1000research.1118575.1</a> More hands-on details in the DataLad handbook at http://handbook.datalad.org</note>
|
|
</script></section>
|
|
|
|
</section>
|
|
|
|
<section>
|
|
<section>
|
|
<h2>After the workshop</h2>
|
|
<ul>
|
|
If you have a question after the workshop, you can reach out for help:
|
|
<br>
|
|
<ul style="font-size:30px">
|
|
<dt>Reach out to to the <b>DataLad</b> team via</dt>
|
|
<li>
|
|
<a href="https://matrix.to/#/!NaMjKIhMXhSicFdxAj:matrix.org?via=matrix.waite.eu&via=matrix.org&via=inm7.de" target="_blank">
|
|
Matrix</a> (free, decentralized communication app, no app needed).
|
|
We run a weekly Zoom office hour (Thursday, 4pm Berlin time) from this room as well.
|
|
</li>
|
|
<li>
|
|
<a href="https://github.com/datalad/datalad" target="_blank">
|
|
the development repository on GitHub</a>
|
|
</li>
|
|
<br>
|
|
<dt>Reach out to the user community with</dt>
|
|
<li>A question on <a href="https://neurostars.org/" target="_blank">neurostars.org</a>
|
|
with a <code>datalad</code> tag</li>
|
|
<br>
|
|
<dt>Find more user tutorials or workshop recordings</dt>
|
|
<li>On <a href="https://www.youtube.com/channel/datalad" target="_blank">
|
|
DataLad's YouTube channel</a>
|
|
</li>
|
|
<li>
|
|
In the <a href="http://handbook.datalad.org/en/latest/" target="_blank">
|
|
DataLad Handbook </a>
|
|
</li>
|
|
<li>In the <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">DataLad RDM course</a> </li>
|
|
<li>In the <a href="http://docs.datalad.org" target="_blank">Official API documentation</a> </li>
|
|
</ul>
|
|
</ul>
|
|
</section>
|
|
</section>
|
|
|
|
</section>
|
|
|
|
|
|
|
|
</div>
|
|
</div>
|
|
|
|
<script src="../reveal.js/dist/reveal.js"></script>
|
|
<script src="../reveal.js/plugin/notes/notes.js"></script>
|
|
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
|
|
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
|
|
<script>
|
|
// More info about initialization & config:
|
|
// - https://revealjs.com/initialization/
|
|
// - https://revealjs.com/config/
|
|
Reveal.initialize({
|
|
hash: true,
|
|
// The "normal" size of the presentation, aspect ratio will be preserved
|
|
// when the presentation is scaled to fit different resolutions. Can be
|
|
// specified using percentage units.
|
|
width: 1280,
|
|
height: 960,
|
|
// Factor of the display size that should remain empty around the content
|
|
margin: 0.3,
|
|
// Bounds for smallest/largest possible scale to apply to content
|
|
minScale: 0.2,
|
|
maxScale: 1.0,
|
|
|
|
controls: true,
|
|
progress: true,
|
|
history: true,
|
|
center: true,
|
|
slideNumber: 'c',
|
|
pdfSeparateFragments: false,
|
|
pdfMaxPagesPerSlide: 1,
|
|
pdfPageHeightOffset: -1,
|
|
transition: 'slide', // none/fade/slide/convex/concave/zoom
|
|
// Learn about plugins: https://revealjs.com/plugins/
|
|
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
|
|
});
|
|
</script>
|
|
</body>
|
|
</html>
|