<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<!-- Edit me start! -->
<title>Computational reproducibility</title>
<meta name="description" content=" Reproducible processing with DataLad ">
<meta name="author" content=" Adina Wagner & Michael Hanke ">
<!-- Edit me end! -->
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
<link rel="stylesheet" href="../css/main.css">
<!-- Theme used for syntax highlighted code -->
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
</head>
<body>
<div class="reveal">
<div class="slides">
<!--...Datalad Basics...-->
<section>
<section>
<script src="https://cdn.logwork.com/widget/countdown.js"></script>
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
data-style="columns" data-timezone="Europe/Berlin" data-date="2022-07-21 13:30">
Up next: Computational reproducibility </a>
</section>
</section>
<section>
<section>
[Overflow area to finish everything unfinished]
</section>
<section data-markdown style="font-size:30px"><script type="text/template">
## Dataset management for Reproducibility & Reusability
<small>Read more at <a href="https://psychoinformatics-de.github.io/rdm-course/04-dataset-management/index.html" target="_blank">
psychoinformatics-de.github.io/rdm-course/04-dataset-management
</a> </small>
![](../pics/dataset_modules.png)
When setting up data analyses...
- Create MODULAR datasets: Whenever a particular collection of files (e.g., data) could be useful in more than one context, put it in its own dataset and install it as a subdataset. <!-- .element: class="fragment fade-in-then-semi-out" -->
- Keep everything STRUCTURED: Bundle all components of one analysis into one superdataset, and within this dataset, separate code, data, output, and execution environments.<!-- .element: class="fragment fade-in-then-semi-out" -->
- Keep a dataset SELF-CONTAINED with relative paths in scripts to subdatasets or files.
Do not use absolute paths.<!-- .element: class="fragment fade-in-then-semi-out" -->
</script>
</section>
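<section data-markdown style="font-size:30px"><script type="text/template">
## Self-contained paths: a sketch

The SELF-CONTAINED rule can be checked mechanically. This is a minimal Python sketch, not part of DataLad: the helper `is_self_contained` and the `inputs/raw` layout are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical layout: raw data is a subdataset under inputs/raw.
RAW = PurePosixPath("inputs") / "raw"


def is_self_contained(path):
    """True if `path` is relative and never climbs out of the dataset."""
    p = PurePosixPath(path)
    return not p.is_absolute() and ".." not in p.parts


# A relative path works in every clone of the dataset ...
print(is_self_contained(RAW / "sample1" / "a001.dat"))  # True
# ... an absolute path only works on one machine.
print(is_self_contained("/home/me/dataset/inputs/raw/sample1/a001.dat"))  # False
```

A script that only uses such relative paths keeps working when it is run from the root of any clone of the dataset.
</script>
</section>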
<section data-transition="None">
<h2>Why Modularity?</h2>
<ul style="font-size:30px">
<li>1. Reuse and access management<br>
<img src="../pics/ukb_datasets.svg" height="500px">
</li>
<li class="fragment fade-in" data-fragment-index="1">2. Scalability</li>
<pre class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
15530572 annex'd files (77.9 TB recorded total size)
nothing to save, working tree clean</code></pre>
<small class="fragment fade-in" data-fragment-index="1"><a href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">(github.com/datalad-datasets/human-connectome-project-openaccess)</a></small>
</ul>
</section>
<section style="font-size:30px" data-transition="None">
<h2>Why Modularity?</h2>
<ul>
<li>3. Transparency</li><br>
Original:
<pre><code class="sh" style="max-height:none" data-trim>
/dataset
├── sample1
│ └── a001.dat
├── sample2
│ └── a001.dat
...
</code></pre>
<div class="fragment">
Without modularity, after applying a transform (preprocessing, analysis, ...):
<pre><code class="sh" style="max-height:none" data-trim>
/dataset
├── sample1
│ ├── ps34t.dat
│ └── a001.dat
├── sample2
│ ├── ps34t.dat
│ └── a001.dat
...
</code></pre>
Without expert/domain knowledge, original and derived data
cannot be told apart.
</div>
</ul>
</section>
<section style="font-size:30px" data-transition="None">
<h2>Why Modularity?</h2>
<ul>
<li>3. Transparency</li><br>
Original:
<pre><code class="sh" style="max-height:none" data-trim>
/raw_dataset
├── sample1
│ └── a001.dat
├── sample2
│ └── a001.dat
...
</code></pre>
<strong>With modularity</strong>, after applying a transform (preprocessing, analysis, ...):
<pre><code class="sh" style="max-height:none" data-trim>
/derived_dataset
├── sample1
│ └── ps34t.dat
├── sample2
│ └── ps34t.dat
├── ...
└── inputs
└── raw
├── sample1
│ └── a001.dat
├── sample2
│ └── a001.dat
...
</code></pre>
Clearer separation of semantics: a pristine version of the original dataset is used within a
<em>new, additional</em> dataset that holds the outputs.</ul>
</section>
<section>
<h2>A machine-learning example</h2>
Code along or try it later at <br>
<a href="http://handbook.datalad.org/en/latest/usecases/ml-analysis.html" target="_blank">
handbook.datalad.org/usecases/ml-analysis.html</a>
</section>
<section>
<h2>Analysis layout</h2>
<table>
<tr>
<td>
<ul>
<li>Prepare an input dataset</li>
<li class="fragment fade-in">Configure and set up an analysis dataset</li>
<li class="fragment fade-in">Prepare data</li>
<li class="fragment fade-in">Train models and evaluate them</li>
<li class="fragment fade-in">Compare different models, repeat with updated data</li>
</ul>
</td>
<td>
<img src="../pics/imagenette.png" width="800">
<small>Imagenette dataset</small>
</td>
</tr>
</table>
</section>
<section>
<h2>Prepare an input dataset</h2>
<ul>
<li>Create a stand-alone input dataset</li>
<li>Either add data and <code>datalad save</code> it, or use commands such as <code>datalad download-url</code>
or <code>datalad addurls</code> to retrieve it from web sources</li>
</ul>
</section>
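<section data-markdown style="font-size:30px"><script type="text/template">
## Prepare an input dataset: a sketch

Creating a stand-alone input dataset and downloading data into it can be sketched as two DataLad command-line calls. The Python snippet only assembles and prints them (a dry run); the dataset name `imagenette` and the URL are hypothetical placeholders.

```python
import shlex

# Hypothetical placeholder: substitute the real download location.
DATA_URL = "https://example.com/imagenette.tgz"

# (working directory, command) pairs; the dataset name is an assumption.
steps = [
    (".", ["datalad", "create", "imagenette"]),
    ("imagenette", ["datalad", "download-url", "--archive",
                    "-m", "Download Imagenette data", DATA_URL]),
]

# Dry run: print the calls instead of executing them.
for cwd, cmd in steps:
    print(f"[{cwd}] $ {shlex.join(cmd)}")
```

To execute the steps for real, pass each pair to `subprocess.run(cmd, cwd=cwd, check=True)` on a machine with DataLad installed.
</script>
</section>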
<section>
<h2>Configure and set up an analysis dataset</h2>
<ul>
<li>Given the purpose of an analysis dataset, configurations can make it easier to use:</li>
<ul>
<li><code>-c yoda</code> prepares a useful structure</li>
<li><code>-c text2git</code> keeps text files such as scripts in Git</li>
</ul>
<li>The input dataset is installed as a subdataset</li>
<li>Required software is containerized and added to the dataset</li>
</ul>
</section>
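<section data-markdown style="font-size:30px"><script type="text/template">
## Set up an analysis dataset: a sketch

The setup of the analysis dataset can be sketched as three DataLad calls. The Python snippet only prints them (a dry run); the dataset names, the container name `software`, and the image URL are hypothetical placeholders.

```python
import shlex

# Names and the image URL are hypothetical placeholders.
IMAGE_URL = "docker://example/python-ml:1"

steps = [
    # -c yoda prepares a useful structure, -c text2git keeps text in Git
    (".", ["datalad", "create", "-c", "yoda", "-c", "text2git",
           "ml-project"]),
    # install the input dataset as a subdataset of the analysis
    ("ml-project", ["datalad", "clone", "-d", ".",
                    "../imagenette", "input"]),
    # register the software container in the dataset
    ("ml-project", ["datalad", "containers-add", "software",
                    "--url", IMAGE_URL]),
]

# Dry run: print the calls instead of executing them.
for cwd, cmd in steps:
    print(f"[{cwd}] $ {shlex.join(cmd)}")
```

The last call requires the `datalad-container` extension.
</script>
</section>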
<section data-transition="None">
<h3>Sharing software environments: Why and how</h3>
<p style="font-size:35px"> Science has many different building blocks: Code, software, and data produce research outputs.
The more you share, the more likely it is that others can reproduce your results <br></p>
<img height="750px" src="../pics/agoodstart.png">
</section>
<section data-transition="None">
<h3>Sharing software environments: Why and how</h3>
<ul style="font-size:35px">
<li>
Software can be difficult or impossible to install (e.g. conflicts with existing software,
or on HPC) for you or your collaborators
</li>
<li>
Different software versions/operating systems can produce different results:
<a href="https://doi.org/10.3389/fninf.2015.00012" target="_blank">Glatard et al., doi.org/10.3389/fninf.2015.00012</a>
</li>
<iframe width="1200" height="700" src="https://doi.org/10.3389/fninf.2015.00012"></iframe>
</ul>
</section>
<section>
<h2>Software containers</h2>
<ul style="font-size:35px">
<table>
<tr>
<td>
<img src="../pics/dockerexplain.png" height="500">
</td>
<td><img height="100" src="../pics/blog_docker.png"><br>
<img height="100" src="../pics/singularitylogo.jpg"> </td>
</tr>
</table>
<li>
Put simply, a cut-down virtual machine that is a portable and shareable
bundle of software libraries and their dependencies
</li>
<li><strong>Docker</strong> runs on all operating systems, but requires "sudo" (i.e., admin) privileges</li>
<li><strong>Singularity</strong> can run on computational clusters (no "sudo") but is not (well) supported on non-Linux systems</li>
<li>Their containers are different, but interoperable - e.g., Singularity can use and build Docker images</li>
</ul>
</section>
<section>
<h2>The datalad-container extension</h2>
<ul style="font-size:30px">
<li>
The <code>datalad-container</code> extension gives DataLad commands to add, track, retrieve, and
execute Docker or Singularity containers.
</li>
<pre><code>pip/conda install datalad-container</code></pre>
<li>
If this extension is installed, DataLad can register software containers as "just another file" in your
dataset, and <strong>datalad containers-run</strong> can execute an analysis inside the container, capturing the software as additional
provenance
</li>
</ul>
<img class="fragment fade-in" src="../pics/containers-run.svg" height="600"> <!-- .element: class="fragment" -->
</section>
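<section data-markdown style="font-size:30px"><script type="text/template">
## datalad containers-run: a sketch

A `datalad containers-run` call names the container, a commit message, the inputs, the outputs, and the command to execute. The hypothetical Python helper `containers_run` below only assembles such a call and prints it (a dry run); all paths and names are placeholders.

```python
import shlex


def containers_run(name, message, inputs, outputs, command):
    """Assemble a `datalad containers-run` call as an argument list."""
    cmd = ["datalad", "containers-run", "-n", name, "-m", message]
    for path in inputs:
        cmd += ["--input", path]
    for path in outputs:
        cmd += ["--output", path]
    return cmd + [command]


cmd = containers_run(
    name="software",  # container name as registered in the dataset
    message="Prepare the data",
    inputs=["input/train", "input/val"],
    outputs=["data/train.csv", "data/val.csv"],
    command="python3 code/prepare.py",
)
print(shlex.join(cmd))
```

Declared inputs are retrieved before the command runs, and the declared outputs are saved together with a record of the container that produced them.
</script>
</section>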
<section>
<h2>Did you know...</h2>
<ul style="font-size:30px">
Helpful resources for working with software containers:
<li>
<a href="https://github.com/jupyterhub/repo2docker" target="_blank">
repo2docker</a> can fetch a Git repository/DataLad dataset and build
a container image from configuration files
</li>
<li>
<a href="https://github.com/ReproNim/neurodocker" target="_blank">
neurodocker</a> can generate custom Dockerfiles and Singularity recipes
for neuroimaging.
</li>
<li>
<a href="https://github.com/repronim/containers" target="_blank">
The ReproNim container collection</a>, a DataLad dataset that
includes common neuroimaging software as configured Singularity containers.
</li>
<li>
<a href="https://github.com/rocker-org/rocker" target="_blank">
rocker</a> - Docker containers for R users
</li>
</ul>
</section>
<section>
<h2>Prepare data</h2>
<ul>
<li>Add a script for data preparation (it labels the training and validation images)</li>
<li>Execute it using <code>datalad containers-run</code></li>
</ul>
</section>
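<section data-markdown style="font-size:30px"><script type="text/template">
## A data preparation script: a sketch

A labelling script of this kind could look as follows. This is a minimal sketch, not the script used in the handbook: it assumes one folder per class below each split, files ending in `.JPEG`, and a hypothetical function name `label_images`.

```python
import csv
from pathlib import Path


def label_images(image_dir, csv_path):
    """Write one (file, label) row per image; the label is the folder name."""
    rows = sorted(
        (img.as_posix(), img.parent.name)
        for img in Path(image_dir).glob("*/*.JPEG")
    )
    csv_path = Path(csv_path)
    # Create the output directory if it does not exist yet.
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "label"])
        writer.writerows(rows)
    return len(rows)
```

Inside the dataset it would be called once per split, for example `label_images("input/train", "data/train.csv")`, and executed through `datalad containers-run` so that inputs, outputs, and the container are recorded.
</script>
</section>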
<section>
<h2>Train models and evaluate them</h2>
<ul>
<li>Add scripts for training and evaluation.
This dataset state can be tagged to identify it easily at a later point</li>
<li>Execute the scripts using <code>datalad containers-run</code></li>
<li>By dumping a trained model as a joblib object, the trained classifier stays reusable</li>
</ul>
</section>
</section>
<section>
<section data-markdown><script type="text/template">
# And now what?
</script></section>
<section data-markdown><script type="text/template">
## When everything is tracked: A reproducible paper
<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/nhLqmF58SLQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
- Peer-reviewed paper published in Behavior Research Methods [[DOI 10.3758/s13428-020-01428-x](https://doi.org/10.3758/s13428-020-01428-x)]<!-- .element: style="font-size:70%" -->
- Free to reproduce at https://github.com/psychoinformatics-de/paper-remodnav, with more details in the DataLad handbook:
http://handbook.datalad.org/r.html?reproducible-paper
- Full video: https://youtube.com/datalad
<!-- .element: style="font-size:70%" -->
Note:
- VERY useful prior publication
</script></section>
<section data-markdown><script type="text/template">
# Anticipate change!
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive capture enables portability
![](../pics/vamp_2_pushtocloud.png)<!-- .element: width="100%" -->
Precise identification of data and computational environments, combined into provenance records, forms a comprehensive and portable data structure that captures all aspects of an investigation.
**Easily take your stuff with you, wherever and whenever you move on!**
</script></section>
<section data-markdown><script type="text/template">
## Services
![](../pics/studyforrest_on_github.png)<!-- .element: height="500" style="box-shadow: 10px 10px 8px #888888" -->
- make *the* difference for advertisement, discovery, convenience
- but imply gigantic dependencies
- often impossible to "take over"
**Make sure data/metadata are self-contained<br>to facilitate/enable transition to another service**
<aside class="notes">
Note to self
</aside>
</script>
</section>
<section data-markdown><script type="text/template">
# Is it really worth the investment?
</script></section>
<section data-markdown><script type="text/template">
## FAIRly big: Process the UK Biobank (imaging data)
![](../pics/biobank_website.png)<!-- .element: height="400" -->
- 76 TB in 43 million files in total
- 42,715 participants contributed personal health data
- Strict DUA
- Custom binary-only downloader
- Most data records offered as (unversioned) ZIP files
</script></section>
<section data-markdown><script type="text/template">
## Challenges
- Process data such that
- Results are computationally reproducible (without the original compute infrastructure)
- There is complete linkage from results to an individual data record download
- It scales with the amount of available compute resources
- Data processing pipeline
- Compiled MATLAB blob
- 1h processing time per image, with 41k images to process
- 1.2 M output files (30 output files per input file)
- 1.2 TB total size of outputs
</script></section>
<section data-markdown><script type="text/template">
## FAIRly big setup
![](../pics/fairlybig_ukbsetup.png)<!-- .element: width="1200" style="margin-top:-35px;margin-bottom:-30px" -->
- UKB DataLad extension can track the evolution of the complete data release in DataLad datasets
<!-- .element: style="font-size:80%" -->
- Full version history
<!-- .element: style="font-size:80%" -->
- Native and BIDSified data layout
<!-- .element: style="font-size:80%" -->
<note>Wagner, Waite, Wierzba, Hoffstaedter, Waite, Poldrack, Eickhoff, Hanke (2021). FAIRly big: A framework for computationally reproducible processing of large-scale data.</note>
</script></section>
<section data-markdown><script type="text/template">
## FAIRly big workflow
![](../pics/fairlybig_workflow.png)<!-- .element: width="1200" style="margin-top:-35px;margin-bottom:-30px" -->
- Common data representation in secure environments
- Content-agnostic persistent (encrypted) storage
- All computations in freshly bootstrapped ephemeral environments, only using information from a fully self-contained DataLad dataset
<note>Wagner et al. (2021). FAIRly big: A framework for computationally reproducible processing of large-scale data.</note>
</script></section>
<section data-markdown><script type="text/template">
## FAIRly big provenance capture
![](../pics/fairlybig_prov.png)<!-- .element: width="1200" style="margin-top:-35px;margin-bottom:-30px" -->
- Every single pipeline execution is tracked
- Each execution individually reproducible without HPC access
<note>Wagner et al. (2021). FAIRly big: A framework for computationally reproducible processing of large-scale data.</note>
</script></section>
<section data-markdown><script type="text/template">
## FAIRly big movie
<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/UsW6xN2f2jc?start=17" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
- Rendered exclusively from information captured by DataLad in the output dataset. Full video: https://youtube.com/datalad
- Two full (re-)computations, programmatically comparable, verifiable, reproducible -- on any system with data access
</script></section>
</section>
<section data-markdown><script type="text/template">
## Interoperable digital research ecosystem
![](../pics/decentralized_rdm.png)<!-- .element: width="100%" style="margin-bottom:100px" -->
<note>Hanke, Pestilli, Wagner, Markiewicz, Poline & Halchenko (2021). In defense of decentralized research data
management. Neuroforum, 27, 17-25.</note>
Note:
- Freedom of infrastructure selection
- Transitions between institutions and stewards
- Facilitate diverse collaboration
</script></section>
<section data-markdown><script type="text/template">
## Training and guideline development
![](../pics/vamp_poster.png)<!-- .element: width="1100" style="margin-top:-20px;margin-bottom:30px" -->
<note>Adina S. Wagner, Jean-Baptiste Poline, Michael Hanke. *A pragmatic approach to reusable research outputs*. <a href="https://doi.org/10.7490/f1000research.1118575.1">10.7490/f1000research.1118575.1</a> More hands-on details in the DataLad handbook at http://handbook.datalad.org</note>
</script></section>
<section>
<section>
<h2>After the workshop</h2>
<ul>
If you have a question after the workshop, you can reach out for help:
<br>
<ul style="font-size:30px">
<dt>Reach out to the <b>DataLad</b> team via</dt>
<li>
<a href="https://matrix.to/#/!NaMjKIhMXhSicFdxAj:matrix.org?via=matrix.waite.eu&via=matrix.org&via=inm7.de" target="_blank">
Matrix</a> (free, decentralized chat platform; no app installation needed).
We run a weekly Zoom office hour (Thursday, 4pm Berlin time) from this room as well.
</li>
<li>
<a href="https://github.com/datalad/datalad" target="_blank">
the development repository on GitHub</a>
</li>
<br>
<dt>Reach out to the user community with</dt>
<li>A question on <a href="https://neurostars.org/" target="_blank">neurostars.org</a>
with a <code>datalad</code> tag</li>
<br>
<dt>Find more user tutorials or workshop recordings</dt>
<li>On <a href="https://www.youtube.com/channel/datalad" target="_blank">
DataLad's YouTube channel</a>
</li>
<li>
In the <a href="http://handbook.datalad.org/en/latest/" target="_blank">
DataLad Handbook </a>
</li>
<li>In the <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">DataLad RDM course</a> </li>
<li>In the <a href="http://docs.datalad.org" target="_blank">Official API documentation</a> </li>
</ul>
</ul>
</section>
</section>
</div>
</div>
<script src="../reveal.js/dist/reveal.js"></script>
<script src="../reveal.js/plugin/notes/notes.js"></script>
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
hash: true,
// The "normal" size of the presentation, aspect ratio will be preserved
// when the presentation is scaled to fit different resolutions. Can be
// specified using percentage units.
width: 1280,
height: 960,
// Factor of the display size that should remain empty around the content
margin: 0.3,
// Bounds for smallest/largest possible scale to apply to content
minScale: 0.2,
maxScale: 1.0,
controls: true,
progress: true,
history: true,
center: true,
slideNumber: 'c',
pdfSeparateFragments: false,
pdfMaxPagesPerSlide: 1,
pdfPageHeightOffset: -1,
transition: 'slide', // none/fade/slide/convex/concave/zoom
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
});
</script>
</body>
</html>