datalad-course/html/mpsc-reproducibility.html

<!doctype html>
<html>
	<head>
		<meta charset="utf-8">
		<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

		<!-- Edit me start! -->
		<title>Computational reproducibility</title>
		<meta name="description" content=" Reproducible processing with DataLad ">
		<meta name="author" content=" Adina Wagner & Michael Hanke ">
		<!-- Edit me end! -->

		<link rel="stylesheet" href="../reveal.js/dist/reset.css">
		<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
		<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
        <link rel="stylesheet" href="../css/main.css">
		<!-- Theme used for syntax highlighted code -->
		<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
	</head>
	<body>
		<div class="reveal">
			<div class="slides">

  <!--...Datalad Basics...-->

  <section>
<section>
<script src="https://cdn.logwork.com/widget/countdown.js"></script>
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
   data-style="columns" data-timezone="Europe/Berlin" data-date="2022-07-21 13:30">
   Up next: Computational reproducibility </a>
</section>
</section>

<section>
    <section>
        [Overflow area to finish everything unfinished]
    </section>

    <section data-markdown style="font-size:30px"><script type="text/template">
## Dataset management for Reproducibility & Reusability
<small>Read more at <a href="https://psychoinformatics-de.github.io/rdm-course/04-dataset-management/index.htmlhttps://psychoinformatics-de.github.io/rdm-course/04-dataset-management/index.html" target="_blank">
            psychoinformatics-de.github.io/rdm-course/04-dataset-management
        </a> </small>
![](../pics/dataset_modules.png)

When setting up data analyses...
- Create MODULAR datasets: Whenever a particular collection of files could anyhow be useful in more than one context (e.g. data), put them in their own dataset, and install it as a subdataset. <!-- .element: class="fragment fade-in-then-semi-out" -->
- Keep everything STRUCTURED: Bundle all components of one analysis into one superdataset, and within this dataset, separate code, data, output, execution environments.<!-- .element: class="fragment fade-in-then-semi-out" -->
- Keep a dataset SELF-CONTAINED with relative paths in scripts to subdatasets or files.
  Do not use absolute paths.<!-- .element: class="fragment fade-in-then-semi-out" -->

    </script>
</section>

  <section data-transition="None">
      <h2>Why Modularity?</h2>

      <ul style="font-size:30px">
          <li>1. Reuse and access management</li>
              <img src="../pics/ukb_datasets.svg" height="500px">

          </li>
          <li class="fragment fade-in" data-fragment-index="1">2. Scalability</li>
          <pre  class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
  15530572 annex'd files (77.9 TB recorded total size)
  nothing to save, working tree clean</code></pre>
          <small  class="fragment fade-in" data-fragment-index="1"><a href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">(github.com/datalad-datasets/human-connectome-project-openaccess)</a></small>
      </ul>

  </section>

<section style="font-size:30px" data-transition="None">
<h2>Why Modularity?</h2>
    <ul>
        <li>3. Transparency</li><br>

Original:
<pre><code class="sh" style="max-height:none" data-trim>
/dataset
├── sample1
│   └── a001.dat
├── sample2
│   └── a001.dat
...
</code></pre>
<div class="fragment">
Without modularity, after applied transform (preprocessing, analysis, ...):
<pre><code class="sh" style="max-height:none" data-trim>
/dataset
├── sample1
│   ├── ps34t.dat
│   └── a001.dat
├── sample2
│   ├── ps34t.dat
│   └── a001.dat
...
</code></pre>
Without expert/domain knowledge, no distinction between original and derived data
    possible.
</div>
        </ul>
</section>


<section  style="font-size:30px" data-transition="None">
<h2>Why Modularity?</h2>
    <ul>
        <li>3. Transparency</li><br>

Original:
<pre><code class="sh" style="max-height:none" data-trim>
/raw_dataset
├── sample1
│   └── a001.dat
├── sample2
│   └── a001.dat
...
</code></pre>
        <strong>With modularity</strong> after applied transform (preprocessing, analysis, ...)
<pre><code class="sh" style="max-height:none" data-trim>
/derived_dataset
├── sample1
│   └── ps34t.dat
├── sample2
│   └── ps34t.dat
├── ...
└── inputs
    └── raw
        ├── sample1
        │   └── a001.dat
        ├── sample2
        │   └── a001.dat
        ...
</code></pre>
Clearer separation of semantics, through use of pristine version of original dataset within a
        <em>new, additional</em> dataset holding the outputs.</ul>

</section>


<section>
    <h2>A machine-learning example</h2>

    Code along or try it later at <br>
    <a href="http://handbook.datalad.org/en/latest/usecases/ml-analysis.html" target="_blank">
    handbook.datalad.org/usecases/ml-analysis.html</a>
</a>
</section>

<section>
    <h2>Analysis layout</h2>
    <table>
        <tr>
            <td>
                <ul>
        <li>Prepare an input data set</li>
        <li class="fragment fade-in">Configure and setup an analysis dataset</li>
        <li class="fragment fade-in">Prepare data</li>
        <li class="fragment fade-in">Train models and evaluate them</li>
        <li class="fragment fade-in">Compare different models, repeat with updated data</li>
                </ul>
            </td>
            <td>
    <img src="../pics/imagenette.png" width="800">
                <small>Imagenette dataset</small>
            </td>
        </tr>
    </table>
</section>

<section>
    <h2>Prepare an input dataset</h2>
    <ul>
        <li>Create a stand-alone input dataset</li>
        <li>Either add data and <code>datalad save</code> it, or use commands such as <code>datalad download-url</code>
    or <code>datalad add-urls</code> to retrieve it from web-sources</li>
    </ul>
</section>

<section>
    <h2>Configure and setup an analysis dataset</h2>
    <ul>
        <li>Given the purpose of an analysis dataset, configurations can make it easier to use:</li>
            <ul>
                <li><code>-c yoda</code> prepares a useful structure</li>
                <li><code>-c text2git</code> keeps text files such as scripts in Git</li>
            </ul>
        <li>The input dataset is installed as a subdataset</li>
        <li>Required software is containerized and added to the dataset</li>
    </ul>
</section>


  <section data-transition="None">
    <h3>Sharing software environments: Why and how</h3>

        <p style="font-size:35px"> Science has many different building blocks: Code, software, and data produce research outputs.
        The more you share, the more likely can others reproduce your results <br></p>
        <img height="750px" src="../pics/agoodstart.png">
  </section>


<section data-transition="None">
    <h3>Sharing software environments: Why and how</h3>
    <ul style="font-size:35px">
        <li>
            Software can be difficult or impossible to install (e.g. conflicts with existing software,
            or on HPC) for you or your collaborators
        </li>
        <li>
            Different software versions/operating systems can produce different results:
            <a href="https://doi.org/10.3389/fninf.2015.00012" target="_blank">Glatard et al., doi.org/10.3389/fninf.2015.00012</a>
        </li>
        <iframe width="1200"  height="700" src="https://doi.org/10.3389/fninf.2015.00012"></iframe>
    </ul>
</section>


  <section>
      <h2>Software containers</h2>
      <ul style="font-size:35px">
          <table>
              <tr>
                  <td>
          <img src="../pics/dockerexplain.png" height="500">
                  </td>
                  <td><img height="100" src="../pics/blog_docker.png"><br>
                  <img height="100" src="../pics/singularitylogo.jpg"> </td>
              </tr>

              </table>
          </img>
          <li>
              Put simple, a cut-down virtual machine that is a portable and shareable
              bundle of software libraries and their dependencies
          </li>
          <li><strong>Docker</strong> runs on all operating systems, but requires "sudo" (i.e., admin) privileges</li>
          <li><strong>Singularity</strong> can run on computational clusters (no "sudo") but is not (well) on non-Linux</li>
          <li>Their containers are different, but interoperable - e.g., Singularity can use and build Docker Images</li>
      </ul>
  </section>

  <section>
      <h2>The datalad-container extension</h2>
      <ul style="font-size:30px">
      <li>
          The <code>datalad-container</code> extension gives DataLad commands to add, track, retrieve, and
          execute Docker or Singularity containers.
      </li>
      <pre><code>pip/conda install datalad-container</code></pre>
          <li>
              If this extension is installed, DataLad can register software containers as "just another file" to your
              dataset, and <strong>datalad containers-run</strong> analysis inside the container, capturing software as additional
          provenance
          </li>
      </ul>
      <img class="fragment fade-in" src="../pics/containers-run.svg" height="600"> <!-- .element: class="fragment" -->
  </section>


  <section>
    <h2>Did you know...</h2>
      <ul style="font-size:30px">
          Helpful resources for working with software containers:
          <li>
              <a href="https://github.com/jupyterhub/repo2docker" target="_blank">
                  repo2docker</a> can fetch a Git repository/DataLad dataset and builds
              a container image from configuration files
          </li>
          <li>
              <a href="https://github.com/ReproNim/neurodocker" target="_blank">
                  neurodocker</a> can generate custom Dockerfiles and Singularity recipes
              for neuroimaging.
              </a>
          </li>
          <li>
              <a href="https://github.com/repronim/containers" target="_blank">
                  The ReproNim container collection</a>, a DataLad dataset that
              includes common neuroimaging software as configured singularity containers.
          </li>
          <li>
              <a href="https://github.com/rocker-org/rocker" target="_blank">
                  rocker</a> - Docker container for R users
          </li>
      </ul>
  </section>

<section>
    <h2>Prepare data</h2>
    <ul>
        <li>Add a script for data preparation (labels train and validation images)</li>
        <li>Execute it using <code>datalad containers-run</code></li>
    </ul>
</section>

<section>
    <h2>Train models and evaluate them</h2>
    <ul>
        <li>Add scripts for training and evaluation.
            This dataset state can be tagged to identify it easily at a later point</li>
        <li>Execute the scripts using <code>datalad containers-run</code></li>
        <li>By dumping a trained model as a joblib object the trained classifier stays reusable</li>
    </ul>
</section>
</section>

<section>
<section data-markdown><script type="text/template">
# And now what?
</script></section>
<section data-markdown><script type="text/template">
## When everything is tracked: A reproducible paper

<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/nhLqmF58SLQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

- Peer-reviewed paper published in Behavior Research Methods [[DOI 10.3758/s13428-020-01428-x](https://doi.org/10.3758/s13428-020-01428-x)]<!-- .element: style="font-size:70%" -->

- Free to reproduce at https://github.com/psychoinformatics-de/paper-remodnav more details in the DataLad handbook
  http://handbook.datalad.org/r.html?reproducible-paper.

- Full video: https://youtube.com/datalad

<!-- .element: style="font-size:70%" -->
Note:
- VERY useful prior publication
</script></section>

<section data-markdown><script type="text/template">
# Anticipate change!
</script></section>

<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive capture enables portability
![](../pics/vamp_2_pushtocloud.png)<!-- .element: width="100%" -->
Precise identification of data and computational environments, combined for provenance records form a comprehensive and portable data structure, capturing all aspects of an investigation.

**Easily take your stuff with you, whereever and whenever you move on!**
</script></section>

<section data-markdown><script type="text/template">
## Services
![](../pics/studyforrest_on_github.png)<!-- .element: height="500" style="box-shadow: 10px 10px 8px #888888" -->

- make *the* difference for advertisment, discovery, convenience
- but imply gigantic dependencies
- often impossible to "take over"

**Make sure data/metadata are self-contained<br>to facilitate/enable transition to another service**
<aside class="notes">
Note to self
</aside>
</script>
</section>

<section data-markdown><script type="text/template">
# Is it really worth the investment?
</script></section>

<section data-markdown><script type="text/template">
## FAIRly big: Process the UK Biobank (imaging data)
![](../pics/biobank_website.png)<!-- .element: height="400" -->

- 76 TB in 43 million files in total
- 42,715 participants contributed personal health data
- Strict DUA
- Custom binary-only downloader
- Most data records offered as (unversioned) ZIP files
</script></section>

<section data-markdown><script type="text/template">
## Challenges

- Process data such that
  - Results are computationally reproducible (without the original compute infrastructure)
  - There is complete linkage from results to an individual data record download
  - It scales with the amount of available compute resources

- Data processing pipeline
  - Compiled MATLAB blob
  - 1h processing time per image, with 41k images to process
  - 1.2 M output files (30 output files per input file)
  - 1.2 TB total size of outputs
</script></section>

<section data-markdown><script type="text/template">
## FAIRly big setup
![](../pics/fairlybig_ukbsetup.png)<!-- .element: width="1200" style="margin-top:-35px;margin-bottom:-30px" -->

- UKB DataLad extension can track the evolution of the complete data release in DataLad datasets
<!-- .element: style="font-size:80%" -->
- Full version history
<!-- .element: style="font-size:80%" -->
- Native and BIDSified data layout
<!-- .element: style="font-size:80%" -->

<note>Wagner, Waite, Wierzba, Hoffstaedter, Waite, Poldrack, Eickhoff, Hanke (2021). FAIRly big: A framework for computationally reproducible processing of large-scale data.</note>
</script></section>

<section data-markdown><script type="text/template">
## FAIRly big workflow
![](../pics/fairlybig_workflow.png)<!-- .element: width="1200" style="margin-top:-35px;margin-bottom:-30px" -->
- Common data representation in secure environments
- Content-agnostic persistent (encrypted) storage
- All computations in freshly bootstrapped emphemeral environments, only using information from a fully self-contained  DataLad dataset
<note>Wagner et al. (2021). FAIRly big: A framework for computationally reproducible processing of large-scale data.</note>
</script></section>

<section data-markdown><script type="text/template">
## FAIRly big provenance capture
![](../pics/fairlybig_prov.png)<!-- .element: width="1200" style="margin-top:-35px;margin-bottom:-30px" -->

- Every single pipeline execution is tracked
- Each execution individually reproducible without HPC access
<note>Wagner et al. (2021). FAIRly big: A framework for computationally reproducible processing of large-scale data.</note>
</script></section>

<section data-markdown><script type="text/template">
## FAIRly big movie

<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/UsW6xN2f2jc?start=17" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

- Rendered exclusively from information captured by DataLad in the output dataset. Full video: https://youtube.com/datalad
- Two full (re-)computations, programmatically comparable, verifiable, reproducible -- on any system with data access
</script></section>
</section>

<section data-markdown><script type="text/template">
## Interoperable digital research ecosystem

![](../pics/decentralized_rdm.png)<!-- .element: width="100%" style="margin-bottom:100px" -->

<note>Hanke, Pestilli, Wagner, Markiewicz, Poline & Halchenko (2021). In defense of decentralized research data
management. Neuroforum, 72, 17-25.</note>

Note:
- Freedom of infrastructure selection
- Transitions between institutions and stewards
- Facilitate diverse collaboration
</script></section>

<section data-markdown><script type="text/template">
## Training and guideline development
![](../pics/vamp_poster.png)<!-- .element: width="1100" style="margin-top:-20px;margin-bottom:30px" -->

<note>Adina S. Wagner, Jean-Baptiste Poline, Michael Hanke. *A pragmatic approach to reusable research outputs*. <a href="https://doi.org/10.7490/f1000research.1118575.1">10.7490/f1000research.1118575.1</a> More hands-on details in the DataLad handbook at http://handbook.datalad.org</note>
</script></section>

</section>

<section>
    <section>
      <h2>After the workshop</h2>
      <ul>
              If you have a question after the workshop, you can reach out for help:
          <br>
          <ul style="font-size:30px">
              <dt>Reach out to to the <b>DataLad</b> team via</dt>
            <li>
                <a href="https://matrix.to/#/!NaMjKIhMXhSicFdxAj:matrix.org?via=matrix.waite.eu&via=matrix.org&via=inm7.de" target="_blank">
                    Matrix</a> (free, decentralized communication app, no app needed).
                    We run a weekly Zoom office hour (Thursday, 4pm Berlin time) from this room as well.
            </li>
            <li>
                <a href="https://github.com/datalad/datalad" target="_blank">
                the development repository on GitHub</a>
            </li>
              <br>
              <dt>Reach out to the user community with</dt>
              <li>A question on <a href="https://neurostars.org/" target="_blank">neurostars.org</a>
              with a <code>datalad</code> tag</li>
              <br>
              <dt>Find more user tutorials or workshop recordings</dt>
              <li>On <a href="https://www.youtube.com/channel/datalad" target="_blank">
                  DataLad's YouTube channel</a>
              </li>
              <li>
                  In the <a href="http://handbook.datalad.org/en/latest/" target="_blank">
                  DataLad Handbook </a>
              </li>
              <li>In the <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">DataLad RDM course</a> </li>
              <li>In the <a href="http://docs.datalad.org" target="_blank">Official API documentation</a> </li>
          </ul>
      </ul>
  </section>
</section>

  </section>


			</div>
		</div>

		<script src="../reveal.js/dist/reveal.js"></script>
		<script src="../reveal.js/plugin/notes/notes.js"></script>
		<script src="../reveal.js/plugin/markdown/markdown.js"></script>
		<script src="../reveal.js/plugin/highlight/highlight.js"></script>
		<script>
			// More info about initialization & config:
			// - https://revealjs.com/initialization/
			// - https://revealjs.com/config/
			Reveal.initialize({
				hash: true,
				// The "normal" size of the presentation, aspect ratio will be preserved
				// when the presentation is scaled to fit different resolutions. Can be
				// specified using percentage units.
				width: 1280,
				height: 960,
				// Factor of the display size that should remain empty around the content
				margin: 0.3,
				// Bounds for smallest/largest possible scale to apply to content
				minScale: 0.2,
				maxScale: 1.0,

				controls: true,
				progress: true,
				history: true,
				center: true,
				slideNumber: 'c',
				pdfSeparateFragments: false,
				pdfMaxPagesPerSlide: 1,
				pdfPageHeightOffset: -1,
				transition: 'slide', // none/fade/slide/convex/concave/zoom
				// Learn about plugins: https://revealjs.com/plugins/
				plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
			});
		</script>
	</body>
</html>