<!doctype html>
<html>
	<head>
		<meta charset="utf-8">
		<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

		<!-- Edit me start! -->
		<title>Reproducibility with DataLad</title>
		<meta name="description" content=" Data & Reproducibility Management with DataLad ">
		<meta name="author" content=" Adina Wagner ">
		<!-- Edit me end! -->

		<link rel="stylesheet" href="../reveal.js/dist/reset.css">
		<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
		<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
        <link rel="stylesheet" href="../css/main.css">
		<!-- Theme used for syntax highlighted code -->
		<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
	</head>
	<body>
		<div class="reveal">
			<div class="slides">

  <!--...Datalad Basics...-->

  <section>


<section>
<h2>Data and Reproducibility Management with DataLad</h2>

  <div style="margin-top:1em;text-align:center">
  <table style="border: none;">
  <tr>
	<td style="border: none;">Adina Wagner
	  <br><small>
		<a href="https://mas.to/@adswa" target="_blank">
		  <img data-src="../pics/mastodon.svg" style="height:30px;margin:0px" />
		  mas.to/@adswa</a></small></td>
    <td style="border: none;">
	  <br></td>
  </tr>
  <tr>
    <td style="border: none; vertical-align:top">
        <small><a href="http://psychoinformatics.de" target="_blank">Psychoinformatics lab</a>,
          <br> Institute of Neuroscience and
          Medicine, Brain &amp; Behavior (INM-7)<br>
       Research Center Jülich</small><br>
    </td>
      <td><img style="height:100px;margin-right:10px" data-src="../pics/fzj_logo.png" /></td>
  </tr>
  </table>
  </div>
        <p style="z-index: 100;position: fixed;background-color:#ede6d5;font-size:35px;box-shadow: 10px 10px 8px #888888;margin-top:0px;margin-bottom:100px;margin-left:1000px">
        <img src="../pics/qr_hidarepro.png" height="200">
    </p>
<br><br><small>

    Slides: <a href="https://doi.org/10.5281/zenodo.10118794" target="_blank">
    DOI 10.5281/zenodo.10118794</a> (Scan the QR code) <br>
    <a href="https://files.inm7.de/adina/talks/html/helmholtz-reproducibility"
       target="_blank">files.inm7.de/adina/talks/html/helmholtz-reproducibility.html</a>
</small>


</section>


  <section>
  <h2>Logistics</h2>
  <ul style="font-size:35px">
      <li class="fragment fade-in">
          Collaborative, public notes, networking, & anonymous questions at <a href="https://etherpad.wikimedia.org/p/reproducibility-with-datalad" target="_blank">
          etherpad.wikimedia.org/p/reproducibility-with-datalad</a>
      </li>
      <br>
      <li class="fragment fade-in">
          We are using a JupyterHub at <a href="https://datalad-hub.inm7.de" target="_blank">datalad-hub.inm7.de</a>.
          Draw a username from a jar! <br>
          You can log in with a password of your choice.
      </li>
      <br>

      <li class="fragment fade-in">
          Format:
      </li>
      <ul class="fragment fade-in">
          <li>Mostly hands-on: Watch me live-code, and try out the software
              yourself in the browser. Conceptual wrap-up at the end.</li>
          <li>Ask questions any time </li>
          <li>Quick ☕-break after ~1 hour</li>
      </ul>
  </ul>
  </section>

  <section>
      <h2>Further resources and stay in touch</h2>
      <ul>
              If you have questions after the workshop...
          <br><br>
          <ul style="font-size:35px">
              <dt>Reach out to the <b>DataLad</b> team via</dt>
            <li>
                <a href="https://matrix.to/#/!NaMjKIhMXhSicFdxAj:matrix.org?via=matrix.waite.eu&via=matrix.org&via=inm7.de" target="_blank">
                    Matrix</a> (free, decentralized chat platform; no app installation needed).
                    We run a weekly Zoom office hour (Tuesday, 4pm Berlin time) from this room as well.
            </li>
            <li>
                <a href="https://github.com/datalad/datalad" target="_blank">
                The development repository on GitHub</a>
            </li>
              <br>
              <dt>Reach out to the (Neuro-) user community with</dt>
              <li>A question on <a href="https://neurostars.org/" target="_blank">neurostars.org</a>
              with a <code>datalad</code> tag</li>
              <br>
              <dt>Find more user tutorials or workshop recordings</dt>
              <li>On <a href="https://www.youtube.com/datalad" target="_blank">
                  DataLad's YouTube channel</a>
              </li>
              <li>
                  In the <a href="http://handbook.datalad.org/en/latest/" target="_blank">
                  DataLad Handbook </a>
              </li>
              <li>In the <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">DataLad RDM course</a> </li>
              <li>In the <a href="http://docs.datalad.org" target="_blank">Official API documentation</a> </li>
              <li> In an overview of most tutorials, talks, videos at
              <a href="https://github.com/datalad/tutorials" target="_blank">github.com/datalad/tutorials</a> </li>
          </ul>
      </ul>
  </section>

<section>
  <h2>Acknowledgements</h2>
  <table>
  <tr style="vertical-align:middle">
    <td style="vertical-align:middle">
      <dl>
        <dt style="margin-top:20px">DataLad software <br>
            & ecosystem</dt>
        <dd style="margin-left:5px!important">
          <ul style="margin-left:5px!important">
              <li>Psychoinformatics Lab, <br>
              Research Center Jülich</li>
              <li>Center for Open <br>
              Neuroscience, <br>
              Dartmouth College</li>
              <li>Joey Hess (git-annex)</li>
              <li><em>>100 additional contributors</em></li>
          </ul>
        </dd>
      </dl>
    </td>
    <td style="vertical-align:middle">
  <div style="margin-bottom:-20px;text-align:center"><strong>Funders</strong></div>
  <img style="height:150px;margin-right:50px" data-src="../pics/nsf.png" />
  <img style="height:150px;margin-right:50px;margin-left:50px" data-src="../pics/binc.png" />
  <img style="height:150px;margin-left:50px" data-src="../pics/bmbf.png" />
  <div style="margin-top:-20px">
  <img style="height:80px;margin-top:-40px;margin-left:40px" data-src="../pics/fzj_logo.svg" />
  <img style="height:60px;margin-left:50px;margin-bottom:25px" data-src="../pics/dfg_logo.png" />
  </div>
  <div style="margin-top:-20px">
  <img style="height:60px;margin-right:20px" data-src="../pics/erdf.png" />
  <img style="height:60px;margin-right:20px" data-src="../pics/cbbs_logo.png" />
  <img style="height:60px" data-src="../pics/LSA-Logo.png" />
  </div>
  <div style="margin-top:40px;margin-bottom:20px;text-align:center"><strong>Collaborators</strong></div>
  <div style="margin-top:-20px">
  <img style="height:100px;margin:20px" data-src="../pics/hbp_logo.png" />
  <img style="height:100px;margin:20px" data-src="../pics/conp_logo.png" />
  <img style="height:120px;margin:10px" data-src="../pics/openneuro_logo.png" />
  </div>
  <div style="margin-top:-40px">
  <img style="height:100px;margin:20px" data-src="../pics/ebrains-logo.png"/>
  <img style="height:100px;margin:0px" data-src="../pics/gin-logo.png" />
  <img style="height:120px;margin:10px" data-src="../pics/sfb1451_logo.png" />
</div>
  <div style="margin-top:-40px;align:middle">
  <img style="height:140px;margin:10px" data-src="../pics/brainlife_logo.png" />
  <img style="height:100px;margin:0px" data-src="../pics/cbrain_logo.png" />
  <img style="height:100px;margin:20px" data-src="../pics/vbc_logo.png" />
  </div>
  </td>
  </tr>
  </table>
</section>

  <section>
      <h3>DataLad usecases</h3>
      <div class="r-stack">
        <li data-fragment-index="1" class="fragment fade-in-then-out"> <b>Publish or consume datasets</b>
        via GitHub, GitLab, OSF, the European Open Science Cloud, or similar services</li>
        <li data-fragment-index="2" class="fragment fade-in-then-out">
        Behind-the-scenes <b>infrastructure component for data transport and versioning</b>
        (e.g., used by <a href="https://openneuro.org/" target="_blank"> OpenNeuro</a>,
        <a href="https://brainlife.io/" target="_blank"> brainlife.io </a>,
        the <a href="https://conp.ca/" target="_blank">Canadian Open Neuroscience Platform (CONP)</a>,
        <a href="https://mcin.ca/technology/cbrain/" target="_blank"> CBRAIN</a>)</li>
        <li data-fragment-index="3" class="fragment fade-in-then-out"><b>Central data management</b> and archival system</li>
        <li data-fragment-index="4" class="fragment fade-in-then-out"><b>Decentralized data and metadata catalog</b></li>
        <li data-fragment-index="5" class="fragment fade-in-then-out"> <b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance </li>
      </div>
      <div class="r-stack">
        <img data-fragment-index="1" height="700" class="fragment fade-in-then-out" src="../pics/getdata_studyforrest.gif" alt="a screenrecording of cloning studyforrest data from github">
        <img height="700" class="fragment fade-in-then-out" data-fragment-index="2" src="../pics/openneuro_new_2.gif" alt="a screenrecording of browsing open neuro">
        <img height="700" data-fragment-index="3" class="fragment fade-in-then-out" src="../pics/centralmanagement2.gif">
        <img height="1000" data-fragment-index="4" class="fragment fade-in-then-out" src="../pics/sfb-catalog.gif">
        <img height="700" class="fragment fade-in" data-fragment-index="5" src="../pics/remodnavpaper_2.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
      </div>
  </section>
</section>


<!-------Examples-------->

<section>

<section data-transition="None">
    <h2>A common usecase</h2>
    <div style="margin-top:0.5em;">
        <table style="border: none;table-layout: fixed;">
            <tr>
                <td width="60%"><img style="height:500px; margin-top: 0; margin-right:1px;vertical-align:middle;" data-src="../pics/comic_box1.svg" /></td>
                <td>
                    <ul style="vertical-align:middle;">
                        <li class="fragment fade-in">
                            Alice is a PhD student in a research team.</li>
                        <li class="fragment fade-in">
                            She works on a fairly typical research project:
                            Data collection & processing.</li>
                        <li class="fragment fade-in">
                            First sample → final result = complex process</li>
                    </ul>
                </td>
            </tr>
        </table>
    </div><br>
    <h3 class="fragment fade-in">How does Alice go about her daily job?</h3>
</section>


<section data-transition="None">
    <h2>A common usecase</h2>
    <ul>
        <li class="fragment fade-in">
            In her project, Alice likes to have an automated record of:
            <ul>
                <li>when a given file was last changed</li>
                <li>where it came from</li>
                <li>what input files were used to generate a given output</li>
                <li>why some things were done.</li>
            </ul>
        </li>
        <br>
        <li class="fragment fade-in">
            Even if she doesn't share her work, this is essential for her future self</li>
        <li class="fragment fade-in">
            Her project is exploratory: Frequent changes to her analysis scripts</li>
        <li class="fragment fade-in">
            She enjoys the comfort of being able to return to a previously recorded state</li>
    </ul>
    <br><br>
    <h3 class="fragment fade-in">This is: <em>local version control</em></h3>
</section>


<section data-transition="None">
    <h2>A common usecase</h2>
    <ul>
        <li class="fragment fade-in" data-fragment-index="1">
            Alice's work is not confined to a single computer:
            <ul>
                <li>Laptop / desktop / remote server / dedicated back-up</li>
                <li>Alice wants to synchronize her work across them automatically & efficiently</li>
            </ul>
        </li>
        <br>
        <li class="fragment fade-in" data-fragment-index="2">
            Parts of the data are collected or analyzed by colleagues.
            This requires:
            <ul>
                <li>distributed synchronization with centralized storage</li>
                <li>preservation of origin & authorship of changes</li>
                <li>effective combination of simultaneous contributions</li>
            </ul>
        </li>
    </ul>
    <br><br>
    <h3 class="fragment fade-in" data-fragment-index="3">This is: <em>distributed version control</em></h3>
</section>


<section data-transition="None">
    <h2>A common usecase</h2>
    <ul>
        <li class="fragment fade-in">
            Alice applies local version control for her own work, and reproducibly records it
        </li>
        <li class="fragment fade-in">
            She also applies distributed version control when working with colleagues
            and collaborators
        </li>
        <li class="fragment fade-in">
            She often needs to work on a subset of data at any given time:
            <ul>
                <li>all files are kept on a server</li>
                <li>a few files are rotated into and out of her laptop</li>
            </ul>
        </li>
        <li class="fragment fade-in">
            Alice wants to publish the data at the project's end:
            <ul>
                <li>raw data / outputs / both</li>
                <li>completely or selectively</li>
            </ul>
        </li>
    </ul>
    <br><br>
    <h3 class="fragment fade-in">This is: <em>data management (with DataLad 😀)</em></h3>
</section>
</section>

<section>
<section>
    <h2>DataLad</h2>
            <img style="height:300px; margin-top: 0; margin-right:1px;vertical-align:middle;" src="../pics/comic_box3.svg" alt="">
    <br>
                    <ul style="font-size:37px">
                        <li>Domain-agnostic <strong>command-line tool</strong>
                            (+ <strong>graphical user interface</strong>),
            built on top of <a href="https://git-scm.com/" target="_blank">Git</a>
            & <a href="https://git-annex.branchable.com/" target="_blank">git-annex</a></li>
        <li>Major features:</li>
        <dt>Version-controlling arbitrarily large content </dt>
        <dd>Version control data & software alongside code!</dd>
        <dt>Transport mechanisms for sharing & obtaining data </dt>
        <dd>Consume & collaborate on data (analyses) like software</dd>
        <dt>(Computationally) reproducible data analysis</dt>
        <dd>Track and share provenance of all digital objects</dd>
        <dt>(... and <i>much</i> more) </dt>
        <br>
    </ul>


</section>


<section>
    <h2>Let's try it out</h2>
    <img src="../pics/jupyterhub-login.png">
    <dl style="font-size:37px">
        <a href="https://datalad-hub.inm7.de" target="_blank">datalad-hub.inm7.de</a>
    <dt>username:</dt>
    <dd>The spice or herb you drew as a user name</dd>
    <dt>password:</dt>
    <dd>Set at first login, at least 8 characters</dd>
        </dl>
    <p class="fragment fade-in"><strong>Important!</strong> The Hub is a shared resource. Don't fill it up :)</p>
</section>

<section style="text-align: left;">
    <h3>Git identity setup</h3>
    Check Git identity:
    <pre style="margin-left: 0;">
        <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
            git config --get user.name
            git config --get user.email
        </code>
    </pre>

    <div class="fragment">
        Configure Git identity:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                git config --global user.name "Adina Wagner"
                git config --global user.email "adina.wagner@t-online.de"
            </code>
        </pre>
    </div>

        <div class="fragment">
        Configure DataLad to use the latest features (this loads the <code>datalad-next</code> extension):
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                git config --global --add datalad.extensions.load next
            </code>
        </pre>
    </div>

</section>

<section style="text-align: left;">
    <h3>Using DataLad in a terminal</h3>

    Check the installed version:
    <pre style="margin-left: 0;">
        <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
            datalad --version
        </code>
        <p id="displayArea"></p>
    </pre>

    <div class="fragment">
        For help on using DataLad from the command line:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad --help
            </code>
            The help may be displayed in a pager - exit it by pressing "q"
        </pre>
    </div>

    <div class="fragment">
        For extensive info about the installed package, its dependencies, and extensions, use <code>datalad wtf</code>.
        Let's find out what kind of system we're on:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad wtf -S system
            </code>
        </pre>
    </div>
</section>


<section style="text-align: left;">
    <h3>Using DataLad via its Python API</h3>
    Open a Python environment:
    <pre style="margin-left: 0;">
        <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
            ipython
        </code>
    </pre>
    <div class="fragment">
        Import and start using:
        <pre style="margin-left: 0;">
            <code data-trim class="language-python" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                import datalad.api as dl
                dl.create(path='mydataset')
            </code>
        </pre>
    </div>
    <div class="fragment">
        Exit the Python environment:
        <pre style="margin-left: 0;">
            <code data-trim class="language-python" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                exit
            </code>
        </pre>
    </div>
</section>
</section>

<section>
<section>
    <h3 style="text-align: left;">DataLad datasets...</h3>
    <img src="../pics/comic_box4.svg" alt="">
</section>


<section style="text-align: left;">
    <h3>...DataLad datasets</h3>
    Create a dataset (here, with the <code>yoda</code> configuration, which adds
    a helpful structure and configuration for data analyses): <br>
    <img height="100px" src="../pics/yoda.png">
    <pre style="margin-left: 0;">
        <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
            datalad create -c yoda my-analysis
        </code>
    </pre>

    <div class="fragment">
        Let's have a look inside. Navigate using <code>cd</code> (change directory):
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                cd my-analysis
            </code>
        </pre>
    </div>

    <div class="fragment">
        List the directory content, including hidden files, with <code>ls</code>:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                ls -la .
            </code>
        </pre>
    </div>
</section>
</section>

<section>
<section>
    <h3 style="text-align: left;">Version control...</h3>
    <img src="../pics/comic_box5.svg" alt="">
</section>


<section style="text-align: left;">
    <h3>...Version control</h3>
    The yoda-configuration added a README placeholder in the dataset.
    Let's add Markdown text (a project title) to it:
    <pre style="margin-left: 0;">
        <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
            echo "# My example DataLad dataset" > README.md
        </code>
    </pre>

    <div class="fragment">
        Now we can check the <code>status</code> of the dataset:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad status
            </code>
        </pre>
    </div>

    <div class="fragment">
        We can save the state with <code>save</code>
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad save -m "Add project title into the README"
            </code>
        </pre>
    </div>

    <div class="fragment">
        Further modifications:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                echo "Contains a small data analysis for my project" >> README.md
            </code>
        </pre>
    </div>

    <div class="fragment">
        You can also check what has changed:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                git diff
            </code>
        </pre>
    </div>

    <div class="fragment">
        Save again:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad save -m "Add information on the dataset contents to the README"
            </code>
        </pre>
    </div>
</section>

<section  style="text-align: left;">
    <h3>...Version control</h3>
        <div class="fragment">
        Now, let's check the dataset history:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                git log
            </code>
        </pre>
    </div>

    <div class="fragment">
        We can also make the history prettier:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                tig
            </code>
            (navigate with arrow keys and enter, press "q" to go back and exit the program)
        </pre>
    </div>

    <div class="fragment">
        Convenience functions make downloads easier. Let's add code for a data analysis from an external source:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad download-url -m "Add an analysis script" \
  -O code/classification_analysis.py \
  https://raw.githubusercontent.com/datalad-handbook/resources/master/classification_analysis.py
            </code>
        </pre>
    </div>

    <div class="fragment">
        Check out the file's history:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git log code/classification_analysis.py</code>
        </pre>
    </div>
</section>

  <section>
      <h2>Local version control</h2>

      <p>Procedurally, version control is easy with DataLad!</p>
      <img class="fragment fade-in" src="../pics/local_wf.svg" height="500">
      <br>

      <b class="fragment fade-in">Advice:</b>
      <ul>
        <li class="fragment fade-in">Save <i>meaningful</i> units of change</li>
        <li class="fragment fade-in">Attach helpful commit messages</li>
      </ul>
  </section>
</section>

<section>
<section>
    <h3 style="text-align: left;">Computationally reproducible execution I...</h3>
    <img src="../pics/comic_box7.svg" width="65%" alt="">
    <ul>
        <li class="fragment fade-in-then-semi-out">which script/pipeline version</li>
        <li class="fragment fade-in-then-semi-out">was run on which version of the data</li>
        <li class="fragment fade-in-then-semi-out">to produce which version of the results?</li>
    </ul>
</section>

<section style="text-align:left;">
    <h3>... Computationally reproducible execution I</h3>
    <div class="fragment">
        A variety of processes can modify files. A simple example: Code formatting
            <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">black code/classification_analysis.py</code>
        </pre>
    </div>

    <div class="fragment">
        Version control makes changes transparent:
            <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git diff</code>
        </pre>
    </div>

    <div class="fragment">
        But it's useful to keep track beyond that. Let's discard the latest changes...
            <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git restore code/classification_analysis.py</code>
        </pre>
    </div>

    <div class="fragment">
        ... and record precisely what we did
            <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad run -m "Reformat code with black" \
 "black code/classification_analysis.py"</code>
        </pre>
    </div>

    <div class="fragment">
        Let's take a look:
            <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">git show</code>
        </pre>
    </div>

    <div class="fragment">
        ... and repeat!
            <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad rerun</code>
        </pre>
    </div>
</section>
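<section data-markdown style="text-align: left;">
<textarea data-template>
### ... Computationally reproducible execution I: a sketch

The idea behind recording a command together with its effects can be sketched in plain Python. This is a toy illustration under stated assumptions, not how DataLad implements `run`: the helpers `file_hashes` and `run_and_record` and the layout of the record are hypothetical, and the "formatter" is a stand-in that only strips trailing whitespace.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def file_hashes(root):
    """Map each file under root to the SHA-256 checksum of its content."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(root).rglob("*"))
        if p.is_file()
    }


def run_and_record(cmd, root):
    """Run cmd inside root and record which files it changed (hypothetical helper)."""
    before = file_hashes(root)
    result = subprocess.run(cmd, cwd=root, check=True)
    after = file_hashes(root)
    changed = sorted(f for f in after if before.get(f) != after[f])
    return {"cmd": cmd, "exit": result.returncode, "changed": changed}


# Toy example: a stand-in "formatter" that strips trailing whitespace.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "analysis.py"
    script.write_text("x = 1   \n")
    formatter = (
        "import pathlib; p = pathlib.Path('analysis.py'); "
        "p.write_text(p.read_text().rstrip() + '\\n')"
    )
    record = run_and_record([sys.executable, "-c", formatter], tmp)
    print(record["changed"])
```

Because the record holds the exact command, it can be executed again later and the resulting checksums compared with the recorded ones.
</textarea>
</section>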
</section>

<section>
<section>
    <h3 style="text-align: left;">Data consumption & transport...</h3>
    <img src="../pics/comic_box6_consumption.svg" alt="">
</section>


<section style="text-align: left;">
    <h3>...Data consumption & transport...</h3>

    You can install a dataset from a remote URL (or a local path) using <code>clone</code>.
    Either as a stand-alone entity:
    <pre style="margin-left: 0;">
        <code data-trim class="language-bash" >
            # just an example:
            datalad clone \
            https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
        </code>
    </pre>

    <div class="fragment">
        Or as a linked dataset, nested in another dataset in a superdataset-subdataset hierarchy:
    <pre style="margin-left: 0;">
        <code data-trim class="language-bash" >
            # just an example:
            datalad clone -d . \
            https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
        </code>
    </pre>
    <img src="../pics/linkage_subds.png" alt="">
    </div>
    <ul style="font-size:30px" class="fragment">
        <li>Helps with scaling (see e.g. the <a href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">Human Connectome Project dataset</a> )</li>
        <li>Version control tools struggle with >100k files</li>
        <li>Modular units improve intuitive structure and reuse potential</li>
        <li>Versioned linkage of inputs for reproducibility</li>
    </ul>
</section>


<section style="text-align: left;">
    <h3>...Dataset nesting</h3>

    Let's make a nest!
    <div class="fragment">
        Clone a dataset with analysis data into a specific
        location ("input/") in the existing dataset,
        making it a <em>sub</em>dataset:
        <pre style="margin-left: 0;">
            <code class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">datalad clone --dataset . \
   https://github.com/datalad-handbook/iris_data.git \
   input/</code>
        </pre>
    </div>

    <div class="fragment">
        Let's see what changed in the dataset, using the <code>subdatasets</code> command:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad subdatasets
            </code>
        </pre>
    </div>
    <div class="fragment">
        ... and also <code>git show</code>:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                git show
            </code>
        </pre>
    </div>
</section>
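<section data-markdown style="text-align: left;">
<textarea data-template>
### ...Dataset nesting: a sketch

The link created by the clone above relies on Git's submodule mechanism: a `.gitmodules` file in the superdataset names the subdataset's path and source URL, and the superdataset records the exact subdataset commit. A minimal sketch of reading such a file follows. The sample text is illustrative only (a file written by DataLad may carry additional keys), and `read_subdatasets` is a hypothetical helper, not a DataLad function.

```python
import configparser

# Illustrative .gitmodules content for the clone above (assumed layout;
# real files may contain additional DataLad-specific keys).
GITMODULES = """[submodule "input"]
    path = input
    url = https://github.com/datalad-handbook/iris_data.git
"""


def read_subdatasets(text):
    """Return {path: url} for every submodule section in .gitmodules text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        parser[section]["path"]: parser[section]["url"]
        for section in parser.sections()
        if section.startswith("submodule ")
    }


print(read_subdatasets(GITMODULES))
```

In practice, `datalad subdatasets` reports this information directly; the sketch only shows that the linkage is plain, versioned text.
</textarea>
</section>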

<section style="text-align:left;">
    <div class="fragment">
        We can now view the cloned dataset's file tree:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                cd input
                ls
            </code>
        </pre>
    </div>

    <div class="fragment">
        ...and also its history
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                tig
            </code>
        </pre>
    </div>

    <div class="fragment">
        Let's check the dataset size (with the <code>du</code> disk-usage command):
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                du -sh
            </code>
        </pre>
    </div>

    <div class="fragment">
        Let's check the <em>actual</em> dataset size:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad status --annex
            </code>
        </pre>
    </div>

    <div class="fragment">
        Let's try to print the file contents into the terminal (<code>cat</code>):
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                cat iris.csv
            </code>
        </pre>
    </div>


</section>


<section style="text-align: left;">
    <h3>...Data consumption & transport</h3>

    We can retrieve actual file content with <code>get</code>:
    <pre style="margin-left: 0;">
        <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
            datalad get iris.csv
        </code>
    </pre>

    <div class="fragment">
        If we don't need a file locally anymore, we can <code>drop</code> its content:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad drop iris.csv</code>
        </pre>
    </div>
    <div class="fragment">
        No need to store all files locally, or archive results with
        giga- or terabytes of source data:
        <pre><code class="python">import datalad.api as dl

dl.get('input/sub-01')   # retrieve file content on demand
# [really complex analysis]
dl.drop('input/sub-01')  # free the disk space again</code></pre>
        If data is published anywhere, your data analysis can carry an actionable link to it,
        with barely any space requirements.
    </div>
</section>


  <section>
      <h2>Git versus Git-annex</h2>
      <dl>
          <dt>Data in datasets is either stored in Git or git-annex</dt>
          <dd>By default, everything is <i>annexed</i>, i.e., stored in a dataset annex by git-annex</dd><br>
      <img height="500" src="../pics/artwork/src/publishing/publishing_gitvsannex.svg">
          <br><br>
          <li class="fragment fade-in-then-semi-out">With annexed data, only content identity (hash)
              and location information are put into Git, rather than file content.
              The annex, and transport to and from it, are managed with <b>git-annex</b></li>
      </dl>
  </section>
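  <section data-markdown style="font-size:30px"><script type="text/template">
## Content identity: a sketch

To make "content identity" more concrete, here is a minimal sketch of how a hash-based key can be derived from file content. It assumes a simplified key layout of the form `BACKEND-sSIZE--CHECKSUM.ext`; the function name `annex_style_key` and the sample data are hypothetical, and real keys are always computed by git-annex itself.

```python
import hashlib
from pathlib import PurePosixPath


def annex_style_key(content: bytes, filename: str) -> str:
    """Build a simplified MD5E-style key: backend, size, checksum, extension."""
    checksum = hashlib.md5(content).hexdigest()
    extension = PurePosixPath(filename).suffix  # e.g. ".csv"
    return f"MD5E-s{len(content)}--{checksum}{extension}"


# Hypothetical file content, not the real iris.csv
data = b"sepal_length,sepal_width\n5.1,3.5\n"
key = annex_style_key(data, "iris.csv")
print(key)
```

The key depends only on the content and the file extension, so identical content yields the same key, while any change to the content yields a different one.
</script></section>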

  <section>
      <h2>Git versus Git-annex</h2>
      <dl>
          <dt>Configurations (e.g., YODA), custom <a href="http://handbook.datalad.org/en/latest/basics/101-123-config2.html" target="_blank">
              rules</a>, or command parametrization determine whether a file is annexed</dt>
          <dd>Storing files in Git or git-annex has distinct advantages:</dd><br>

          <br>

          <table >
              <tr style="font-size:35px">
                  <td><b>Git</b></td>
                  <td><b>git-annex</b></td>
              </tr>
              <tr style="font-size:30px">
                  <td>handles <b>small</b> files well (text, code)</td>
                  <td>handles <b>all</b> types and sizes of files well</td>
              </tr>
              <tr style="font-size:30px">
                  <td>file contents are in the Git history
                      and will be <b>shared</b> upon git/datalad push</td>
                  <td>file contents are in the annex and are <b>not necessarily shared</b></td>
              </tr>
              <tr style="font-size:30px">
                  <td>Shared with every dataset clone</td>
                  <td><b>Can be kept private</b> on a per-file level when sharing the dataset</td>
              </tr>
              <tr style="font-size:30px">
                  <td>Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files </td>
                  <td>Useful: Large files, private files</td>
              </tr>
          </table>
          <br><br>
          <div style="text-align:center" class="fragment">YODA configures the contents of the <code>code/</code>
          directory and the dataset descriptions (e.g., README files) to be in Git.
          There are many other configurations, and you can also
              <a href="http://handbook.datalad.org/en/latest/basics/101-124-procedures.html" target="_blank">
                  write your own</a>.<br>
              <img  height="100px" src="../pics/yoda.png">
          </div>
      </dl>
  </section>
</section>

<section>
<section style="text-align: left;">
    <h3>...Computationally reproducible execution...</h3>

    Try to execute the downloaded analysis script. Does it work?
            <div><pre style="margin-left: 0;"><code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
cd ..
python code/classification_analysis.py</code></pre></div>

    <ul class="fragment">
        <li>
            Software can be difficult or impossible to install (e.g. conflicts with existing software,
            or on HPC) for you or your collaborators
        </li>
        <li>
            Different software versions/operating systems can produce different results:
            <a href="https://doi.org/10.3389/fninf.2015.00012" target="_blank">Glatard et al., doi.org/10.3389/fninf.2015.00012</a>
        </li>
        <li class="fragment fade-in">
            <strong>Software containers</strong> encapsulate a software environment and isolate it from
              the surrounding operating system. Two common solutions: Docker, Singularity
          </li>
       </ul>
</section>

<section style="text-align: left;">
    <h3>...Computationally reproducible execution...</h3>
    <ul>
        <li class="fragment fade-in-then-semi-out">The <code>datalad run</code> command
            can run any command in a way that links the command or script to the
            results it produces and the data it was computed from</li>
        <li class="fragment fade-in-then-semi-out">The <code>datalad rerun</code> command
            can take this recorded provenance and recompute the command</li>
        <li class="fragment fade-in-then-semi-out">The <code>datalad containers-run</code> command
            (from the extension "datalad-container") can capture software provenance in the form of software containers in addition to the provenance that <code>datalad run</code> captures</li>
    </ul>
    <br><br>

</section>


<section style="text-align: left;">
    <h3>...Computationally reproducible execution</h3>

    <div class="fragment">
        With the <code>datalad-container</code> extension, we can add software containers
        to datasets and work with them.
        Let's add a software container with Python software to run the script:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
               datalad containers-add python-env --url shub://adswa/resources:2
            </code>
        </pre>
    </div>


<div class="fragment">
        Inspect the list of registered containers:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad containers-list
            </code>
        </pre>
    </div>

    <div class="fragment">
    Now, let's try out the <code>containers-run</code> command:
    <pre style="margin-left: 0;">
        <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
datalad containers-run -m "run classification analysis in python environment" \
  --container-name python-env \
  --input "input/iris.csv" \
  --output "pairwise_relationships.png" \
  --output "prediction_report.csv" \
  "python3 code/classification_analysis.py {inputs} {outputs}"
        </code>
    </pre>
    </div>
    <div class="fragment">
        What changed after the <code>containers-run</code> command has completed?
        <br>
        We can use <code>datalad diff</code> (based on <code>git diff</code>):
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad diff -f HEAD~1
            </code>
        </pre>
    </div>

    <div class="fragment">
        We see that some files were added to the dataset!
        <br>
        And we have a complete provenance record as part of the git history:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                git log -n 1
            </code>
        </pre>
    </div>
</section>


<section>
    <h3 style="text-align: left;">Publishing datasets...</h3>
    <div style="margin-top:1em;">
        <table style="border: none;">
            <tr>
                <td><img style="width: 800px; margin-right:1px;margin-bottom:10px;vertical-align:middle;" data-src="../pics/comic_box6_publishing.svg" /></td>
                <td><img style="width: 1000px; margin-right:1px;margin-bottom:10px;vertical-align:middle;" data-src="../pics/comic_box9.svg" /></td>
            </tr>
        </table>
    </div>
    <br>
    <div class="fragment">We will use GIN: <a href="https://gin.g-node.org/" target="_blank">gin.g-node.org</a></div>
    <img class="fragment" src="../pics/artwork/src/publishing/startingpoint.svg">
</section>

<section>
    <h3 style="text-align: left;">Publishing datasets...</h3>
        <ul>
        <li>Create a GIN user account and log in:
            <a href="https://gin.g-node.org/user/sign_up" target="_blank">gin.g-node.org/user/sign_up</a> </li>
        <li>
            <a href="https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent?platform=linux" target="_blank">
                Create</a> an SSH key </li>
    <div>
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                ssh-keygen -t ed25519 -C "your-email"
                eval "$(ssh-agent -s)"
                ssh-add ~/.ssh/id_ed25519
            </code>
        </pre>
    </div>
        <li> <a href="https://handbook.datalad.org/en/latest/basics/101-139-gin.html#prerequisites" target="_blank">
                Upload</a> the SSH key to GIN</li>
    <div>
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                cat ~/.ssh/id_ed25519.pub
            </code>
        </pre>
    </div>
            <img src="../pics/screenshot-gin3.png" height="400">
        <li>Publish your dataset!</li>
    </ul>

</section>


<section style="text-align: left;">
    <h3>...Publishing datasets</h3>

    DataLad has convenience functions to create <code>sibling</code> repositories
    on various infrastructure and third-party services (GitHub, GitLab, OSF, WebDAV-based services, DataVerse, ...),
    to which data can then be published with <code>push</code>.
    <pre style="margin-left: 0;">
        <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
            datalad create-sibling-gin example-analysis --access-protocol ssh
        </code>
    </pre>

    <div class="fragment">
        You can verify the dataset's siblings with the <code>siblings</code> command:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad siblings
            </code>
        </pre>
    </div>

    <div class="fragment">
        And we can push our complete dataset (Git repository and annex) to GIN:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad push --to gin
            </code>
        </pre>
    </div>
    <img class="fragment" src="../pics/in_case_of_fire.png" style="border:20px; margin:0px; float:center; width:500px;"/>
</section>


<section style="text-align: left;">
    <h3>Using published data...</h3>

    Let's see what the analysis feels like to others:
    <br><br>
    <pre style="margin-left: 0;">
        <code class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">cd ../
datalad clone \
   https://gin.g-node.org/adswa/example-analysis \
   myclone</code>
    </pre>

    <div class="fragment">
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                cd myclone
            </code>
        </pre>
    </div>

    <div class="fragment">
        Get results:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad get prediction_report.csv
            </code>
        </pre>
    </div>
    <div class="fragment">
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad drop prediction_report.csv
            </code>
        </pre>
    </div>

    <div class="fragment">
        Or recompute results:
        <pre style="margin-left: 0;">
            <code data-trim class="language-bash" onmousemove="showHover(event)" onmousedown="clickCopy(event)" onmouseleave="leaveElement(event)">
                datalad rerun
            </code>
        </pre>
    </div>
</section>
</section>


<section>
    <section>
        <h2>How does this relate to reproducibility?</h2>
    </section>

<section data-transition="None">
    <h2>Exhaustive tracking</h2>
    <dl style="font-size:35px">
        <dt>The building blocks of a scientific result are rarely static</dt>
            <table>
                <tr>
                    <td style="vertical-align:middle">Data changes <br>
                        <small>(errors are fixed, data is extended,<br>
                            naming standards change, an analysis <br>
                                requires only a subset of your data...)</small></td>
                    <td><img src="../pics/phd052810s.png" height="500">
                    <imgcredit>Piled Higher and Deeper
                        <a href="https://phdcomics.com/comics/archive_print.php?comicid=1323" target="_blank">
                            1323
                        </a> </imgcredit></td>
                </tr>
            </table>
    </dl>
</section>


<section data-transition="None">
    <h2>Exhaustive tracking</h2>
    "Shit, which version of which script produced these outputs from which version
    of what data... and which software version?"<br>
    <img src="../pics/manuallabor.png">
    <img src="../pics/findfiles.png" height="400">
    <img src="../pics/projectstack.png" height="350">
    <imgcredit>CC-BY Scriberia and <a href="https://the-turing-way.netlify.app/reproducible-research/rdm.html" target="_blank">
        The Turing Way</a>
    </imgcredit>
</section>


<section data-transition="None">
    <h3>Exhaustive tracking</h3>
    Once you track changes to data with version control tools,
    you can find out <em>why</em> it changed, <em>what</em> has changed, <em>when</em> it changed,
    and <em>which version</em> of your data was used at which point in time.
    <div class="r-stack">
        <img height="450px" class="fragment fade-out" data-fragment-index="1" src="../pics/tigdata.png">
        <img height="450px" class="fragment" data-fragment-index="1" src="../pics/tigdata3.png">
        <img height="450px" class="fragment" src="../pics/tigdata2.png">
    </div>
</section>

  <section>
      <h2>Digital provenance</h2>
      <ul>
          <p >
              = <i>"The tools and processes used to create a
              digital file, the responsible entity, and when and where the process
              events occurred"</i>
          </p>
          <li class="fragment fade-in">
              Have you ever saved a PDF to read later onto your computer, but forgot
              where you got it from? Or did you ever find a figure in your project,
              but forgot which analysis step produced it?
          </li>
          <img src="../pics/Provenance_alpha.png">
          <imgcredit data-fragment-index="1" >Scriberia and <a href="https://the-turing-way.netlify.app">The Turing Way </a> (CC-BY)</imgcredit>
      </ul>
  </section>

    <section data-transition="None">
        <h3>Data transport: Security and reliability - for data</h3>
        Decentralized version control for data integrates with a variety of services
        to let you store data in different places, creating a resilient network for data
        <img src="../pics/decentral_RDM_overview_left.png">
        <small> <a href="https://doi.org/10.1515/nf-2020-0037" target="_blank">"In defense of decentralized Research Data Management", doi.org/10.1515/nf-2020-0037</a> </small>
    </section>

    <section data-transition="None">
        <h3>Ultimate goal: Reusability</h3>
        Team science on more than code:
        <img src="../pics/teamscience.png">
        <img class="fragment" src="../pics/datahistory.png">
    </section>
</section>

<section>
    <section>
        <h3>The YODA principles</h3>
    </section>

    <section>
  <h2>DataLad Datasets for data analysis</h2>

  <ul style="font-size:30px">
      <li>A DataLad dataset can have <i>any</i> structure, and use as many or few
          features of a dataset as required.</li>

      <li>However, for <b>data analyses</b> it is beneficial to make
          use of DataLad features and structure datasets according to the <b>YODA principles</b>:</li>
  </ul>

  <img style="" data-src="../pics/yoda.png" height="200">
  <dl style="font-size:30px">
      <dt>P1: One thing, one dataset</dt>
      <dt>P2: Record where you got it from, and where it is now</dt>
      <dt>P3: Record what you did to it, and with what</dt>
  </dl><br><br><br>
                      <small>Find out more about the YODA principles in
          <a href="http://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">
              the handbook</a>, and more about structuring datasets at
          <a href="https://psychoinformatics-de.github.io/rdm-course/02-structuring-data/index.html#example-structure-yoda-principles" target="_blank">
              psychoinformatics-de.github.io/rdm-course/02-structuring-data</a>
                         </small>
    </section>

    <section data-markdown style="font-size:30px">
## P1: One thing, one dataset
![](../pics/dataset_modules.png)

- Create **modular** datasets: Whenever a particular collection of files could be useful in more
  than one context (e.g. data), put them in their own dataset, and install it as
  a subdataset.
- Keep everything **structured**: Bundle all components of one analysis into one superdataset, and
  within this dataset, separate code, data, output, execution environments.
- Keep a dataset **self-contained**, with relative paths in scripts to subdatasets or files.
  Do not use absolute paths.
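
One way to keep scripts self-contained is to resolve every file relative to the dataset root. The following is a minimal sketch under that assumption; the helper name `dataset_path` and the file `input/iris.csv` are illustrative only and not part of DataLad.

```python
from pathlib import Path


def dataset_path(dataset_root, relative):
    """Resolve a path inside the dataset; refuse absolute paths."""
    relative = Path(relative)
    if relative.is_absolute():
        raise ValueError(f"use a path relative to the dataset root: {relative}")
    return Path(dataset_root) / relative


# In a script stored at code/analysis.py, the dataset root could be derived
# with Path(__file__).resolve().parent.parent; "." is used here for brevity.
root = Path(".")
print(dataset_path(root, "input/iris.csv"))
```

Because nothing in the script depends on where the dataset lives on disk, a clone on another machine can run it unchanged.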

</section>

<section style="font-size:30px" data-transition="None">
<h2>Why Modularity?</h2>
    <ul>
        <li>1. Reuse and access management</li>
        <li>2. Scalability</li>
        <li>3. Transparency</li><br>

Original:
<pre><code class="sh" style="max-height:none" data-trim>
/dataset
├── sample1
│   └── a001.dat
├── sample2
│   └── a001.dat
...
</code></pre>
<div class="fragment">
Without modularity, after applying a transform (preprocessing, analysis, ...):
<pre><code class="sh" style="max-height:none" data-trim>
/dataset
├── sample1
│   ├── ps34t.dat
│   └── a001.dat
├── sample2
│   ├── ps34t.dat
│   └── a001.dat
...
</code></pre>
Without expert/domain knowledge, no distinction between original and derived data
    is possible.
</div>
        </ul>
</section>


<section  style="font-size:30px" data-transition="None">
<h2>Why Modularity?</h2>
    <ul>
        <li>3. Transparency</li><br>

Original:
<pre><code class="sh" style="max-height:none" data-trim>
/raw_dataset
├── sample1
│   └── a001.dat
├── sample2
│   └── a001.dat
...
</code></pre>
        <strong>With modularity</strong>, after applying a transform (preprocessing, analysis, ...):
<pre><code class="sh" style="max-height:none" data-trim>
/derived_dataset
├── sample1
│   └── ps34t.dat
├── sample2
│   └── ps34t.dat
├── ...
└── inputs
    └── raw
        ├── sample1
        │   └── a001.dat
        ├── sample2
        │   └── a001.dat
        ...
</code></pre>
Clearer separation of semantics, through use of a pristine version of the original dataset within a
        <em>new, additional</em> dataset holding the outputs.</ul>
</section>


<section style="font-size:30px" data-transition="None" data-markdown><script type="text/template">
## When to modularize?

- Target audience is different
  - public vs. private
  - domain specific vs. domain general

- Pace of evolution is different
  - "factual" raw data vs. choices of (pre-)processing
  - completed acquisition vs. ongoing study

- Size impacts I/O and logistics
  - Git can struggle with 1M+ files
  - filesystems (licensing) can struggle with large numbers of inodes
  - More info: [Go Big or Go Home chapter](http://handbook.datalad.org/en/latest/beyond_basics/basics-scaling.html)

- Legal/Access constraints
  - personal vs. anonymized data

<aside class="notes">
Note to self
</aside>
</script>
</section>

<section style="font-size:30px" data-markdown data-transition="None">
## P2: Record where you got it from, and where it is now
![](../pics/data_origin.png)

- **Link** individual datasets to declare data-dependencies (e.g. as subdatasets).
- **Record data's origin** with appropriate commands, for example
  to record access URLs for individual files obtained from (unstructured) sources "in the cloud".
- Share and **publish** datasets for collaboration or back-up.

</section>


<section data-transition="None" style="font-size:30px">
<h2>Dataset linkage</h2>
<img data-src="../pics/dataset_linkage.png">
<pre><code class="bash" style="font-size:115%;max-height:none">$ datalad clone --dataset . http://example.com/ds inputs/rawdata
</code></pre>

<pre><code class="diff" style="max-height:none">$ git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..c3370ba
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "inputs/rawdata"]
+       path = inputs/rawdata
+       url = http://example.com/ds
diff --git a/inputs/rawdata b/inputs/rawdata
new file mode 160000
index 0000000..fabf852
--- /dev/null
+++ b/inputs/rawdata
@@ -0,0 +1 @@
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
</code></pre>
Each (sub)dataset is a separate, but jointly version-controlled entity.
    If none of its data is retrieved, subdatasets are an extremely <strong>lightweight</strong> data dependency
    and yet <strong>actionable</strong> (<strong>datalad get</strong> retrieves contents on demand)
    <aside class="notes">weighs just a few bytes</aside>
</section>

    <section data-markdown style="font-size:30px">
## P3: Record what you did to it, and with what
![](../pics/dataset_linkage_provenance.png)

- Collect and store **provenance** of all contents of a dataset that you create
- "Which script produced which output?", "From which data?", "In which **software environment**?"
  ... Record it in an ideally machine-readable way with **datalad (containers-)run**
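
To illustrate what "machine-readable" can mean here, the sketch below pulls a JSON run record out of a commit message. The commit message is a made-up example; its marker lines and field names are assumptions meant to resemble what `datalad run` writes, and `extract_run_record` is a hypothetical helper, not a DataLad function.

```python
import json

# Illustrative commit message; the marker lines and field names are assumed
# to mirror what datalad run writes, the values are made up.
commit_message = """[DATALAD RUNCMD] run classification analysis

=== Do not change lines below ===
{
 "cmd": "python3 code/classification_analysis.py",
 "inputs": ["input/iris.csv"],
 "outputs": ["prediction_report.csv"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""


def extract_run_record(message):
    """Return the JSON run record embedded between the two marker lines."""
    start = "=== Do not change lines below ==="
    end = "^^^ Do not change lines above ^^^"
    body = message.split(start, 1)[1].split(end, 1)[0]
    return json.loads(body)


record = extract_run_record(commit_message)
print(record["cmd"], record["inputs"], record["outputs"])
```

A record in this form is what allows a later recomputation to be automated, because the command, its inputs, and its outputs can be read by a program rather than reconstructed from memory.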

</section>
</section>

<section>
    <section>
        <h3>Take home messages</h3>
        <dl>
            <dt class="fragment fade-in-then-semi-out" data-fragment-index="1">Data deserves version control</dt>
            <dd class="fragment fade-in-then-semi-out" data-fragment-index="1">
                It changes and evolves just like code, and exhaustive tracking lays a foundation for reproducibility</dd>
            <dt class="fragment fade-in-then-semi-out" data-fragment-index="2">
                Reproducible science relies on good data management
            </dt>
            <dd class="fragment fade-in-then-semi-out" data-fragment-index="2">
                But effort pays off: Increased transparency, better reproducibility, easier accessibility,
                efficiency through automation and collaboration, streamlined procedures for synchronizing and updating your work, ...</dd>
            <dt  class="fragment fade-in-then-semi-out" data-fragment-index="3">DataLad can help with some things</dt>
            <dd  class="fragment fade-in-then-semi-out" data-fragment-index="3">
                Have access to more data than you have disk space</dd>
            <dd  class="fragment fade-in-then-semi-out" data-fragment-index="3">
                Who needs short-term memory when you can have automatic provenance capture?
            </dd>
            <dd  class="fragment fade-in-then-semi-out" data-fragment-index="3">
                Link versioned data to your analysis at no disk-space cost</dd>
            <dd  class="fragment fade-in-then-semi-out" data-fragment-index="3">...</dd>
        </dl>
    </section>
</section>

<section>

<section>
   <h3>Scalability</h3>
</section>

<section data-markdown data-transition="None"><script type="text/template">
## FAIRly big: Scaling up

Objective: Process the UK Biobank (imaging data)
![](../pics/biobank_website.png)<!-- .element: height="400" -->

- 76 TB in 43 million files in total
- 42,715 participants contributed personal health data
- Strict DUA
- Custom binary-only downloader
- Most data records offered as (unversioned) ZIP files
</script></section>

<section data-markdown data-transition="None"><script type="text/template">
## Challenges

- Process data such that
  - Results are computationally reproducible (without the original compute infrastructure)
  - There is complete linkage from results to an individual data record download
  - It scales with the amount of available compute resources

- Data processing pipeline
  - Compiled MATLAB blob
  - 1h processing time per image, with 41k images to process
  - 1.2 M output files (30 output files per input file)
  - 1.2 TB total size of outputs
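
The scale of the problem can be checked with a few lines of arithmetic. This is a rough sketch using the rounded figures from this slide; the number of parallel jobs is a hypothetical value chosen only for illustration.

```python
# Figures taken from the slide; the number of parallel jobs is a made-up example.
images = 41_000
hours_per_image = 1
outputs_per_image = 30

total_cpu_hours = images * hours_per_image
serial_years = total_cpu_hours / (24 * 365)
output_files = images * outputs_per_image

parallel_jobs = 1_000  # hypothetical cluster capacity
wall_clock_hours = total_cpu_hours / parallel_jobs

print(f"{total_cpu_hours} CPU hours, about {serial_years:.1f} years if run serially")
print(f"{output_files} output files")
print(f"{wall_clock_hours:.0f} hours wall-clock time with {parallel_jobs} parallel jobs")
```

Run serially, the pipeline would take several years, which is why the workflow has to scale with the available compute resources.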
</script></section>

<section data-transition="None">
    <h2> FAIRly big setup</h2>
<img src="../pics/fairlybig_ukbsetup.png" width="1200" style="margin-top:-35px;margin-bottom:-30px">

    <ul style="font-size:30px">
        <strong>Exhaustive tracking</strong>
        <li><a href="https://github.com/datalad/datalad-ukbiobank" target="_blank">datalad-ukbiobank</a>
extension downloads, transforms & tracks the evolution of the complete data release
            in DataLad datasets
</li>
        <li>Native and BIDSified data layout (at no additional disk space usage)</li>
        <li>Structured in 42k individual datasets, combined to one superdataset</li>
        <li>Pipeline packaged in a software container</li>
        <li>Link input data & computational pipeline as dependencies</li>
    </ul>
<br><br>
<small><a href="https://www.nature.com/articles/s41597-022-01163-2" target="_blank">
    Wagner, Waite, Wierzba et al. (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data.</a>
</small>
</section>

<section  data-transition="None">
    <h2>FAIRly big workflow</h2>
    <div class="r-stack">
<img class="fragment fade-out" src="../pics/fairlybig_workflow.png" width="1200" style="margin-top:-35px;margin-bottom:-30px">
<img src="../pics/htcondor.svg" class="fragment fade-in">
    </div>
        <br>
    <ul style="font-size:30px">
        <strong>Portability</strong>
    <li>Parallel processing: 1 job = 1 subject
        (number of concurrent jobs capped at the capacity of the compute cluster)
    </li>
    <li>Each job is computed in an ephemeral (short-lived) dataset clone, results are pushed back:
        Ensure exhaustive tracking &
        portability during computation</li>
    <li>Content-agnostic persistent (encrypted) storage (minimizing storage and inodes)</li>
    <li>Common data representation in secure environments</li>
</ul>
    <br><br>
<small><a href="https://www.nature.com/articles/s41597-022-01163-2" target="_blank">
    Wagner, Waite, Wierzba et al. (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data.</a>
</small></section>


<section data-transition="None">
    <h2>FAIRly big provenance capture</h2>
<img src="../pics/fairlybig_prov.png" width="1200" style="margin-top:-35px;margin-bottom:-30px">
<br><br>
    <ul style="font-size:30px">
        <strong>Provenance</strong>
    <li>Every single pipeline execution is tracked</li>
    <li>Execution in ephemeral workspaces ensures that results are
        individually reproducible without HPC access</li>
</ul>
<br><br>
<small><a href="https://www.nature.com/articles/s41597-022-01163-2" target="_blank">
    Wagner, Waite, Wierzba et al. (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data.</a>
</small></section>

<section data-markdown><script type="text/template">
## FAIRly big movie

<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/UsW6xN2f2jc?start=17" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

- Two computations on clusters of different scale (small cluster, supercomputer). Full video: https://youtube.com/datalad
- Two full (re-)computations, programmatically comparable, verifiable, reproducible -- on any system with data access
</script></section>
</section>

<section>
<section>
    <h2>Thank you for your attention!</h2>

    <img src="../pics/qr_hidarepro.png" height="400">
<br><br><small>

        Slides: <a href="https://doi.org/10.5281/zenodo.10118794" target="_blank">
    DOI 10.5281/zenodo.10118794</a> (Scan the QR code)
    <br><br>
    </small>
    <table>
        <tr>
        </tr>
        <tr style="vertical-align:middle">
         <td style="vertical-align:middle">
             <img src="../pics/winrepo.png">
         </td>
            <td style="font-size: 18px">
                <br><br>
                Women neuroscientists are <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/ejn.14397" target="_blank">
                underrepresented in neuroscience</a>. You can use the <br>
                <a href="https://www.winrepo.org/" target="_blank"> Repository for Women in Neuroscience</a> to find
                and recommend neuroscientists for <br>
                conferences, symposia or collaborations, and help make neuroscience more open & diverse.
            </td>
        </tr>

    </table>
</section>

</section>


<section>
    <section>
        <h3>Command summaries</h3>
    </section>

    <section>
      <h3>Summary - Local version control</h3>

  <dl>
        <dt class="fragment fade-in"><code>datalad create</code> creates an empty dataset.</dt>
      <dd class="fragment fade-in">Configurations (<b>-c yoda</b>, <b>-c text2git</b>)
          add useful structure and/or configurations.</dd>
        <br>
        <dt class="fragment fade-in">A dataset has a <i>history</i> to track files and their modifications. </dt><dd class="fragment fade-in">Explore it with Git (<b>git log</b>) or external tools (e.g., <b>tig</b>).</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad save</code> records the dataset or file state to the history. </dt><dd class="fragment fade-in">Concise <b>commit messages</b> should summarize the change for future you and others.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad download-url</code> obtains web content and records its origin. </dt><dd class="fragment fade-in">It even takes care of saving the change.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad status</code> reports the current state of the dataset.</dt>
      <dd class="fragment fade-in">A clean dataset status (no modifications, no untracked files) is good practice.</dd>
      </dl>
  </section>
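  <section data-markdown><script type="text/template">
### Sketch - Local version control

The commands from the previous summary slide could be combined roughly as follows. This is a minimal sketch, not a definitive workflow: the dataset name `my-dataset`, the file `notes.txt`, and the `FILE_URL` placeholder are assumptions made for illustration.

```shell
# Work in a scratch directory so nothing existing is touched
SCRATCH="$(mktemp -d)"
cd "$SCRATCH"

# Create an empty dataset; text2git keeps text files in Git instead of the annex
datalad create -c text2git my-dataset
cd my-dataset

# Add a file and record the new state with a concise commit message
echo "first note" > notes.txt
datalad save -m "Add first note"

# Check for modified or untracked content, then explore the history
datalad status
git log --oneline

# With a real link in FILE_URL (placeholder), download-url fetches and saves in one step:
# datalad download-url -m "Add external file" "$FILE_URL"
```
</script></section>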

  <section>
      <h3>Summary - Dataset consumption &amp; nesting</h3>

      <ul>
        <dt class="fragment fade-in"><code>datalad clone</code> installs a dataset.</dt><dd class="fragment fade-in"> It can be installed “on its own”:
        Specify the source (URL, path, ...) of the dataset and, optionally, a <b>path</b> to install it to.</dd>
        <br>
        <dt class="fragment fade-in">Datasets can be installed as subdatasets within an existing dataset. </dt> <dd class="fragment fade-in"> The <b>--dataset/-d</b> option needs a path to the root of the superdataset.</dd>
        <br>
        <dt class="fragment fade-in">Only small files and metadata about file availability are present locally after an install. </dt>
          <dd class="fragment fade-in">Use <code>datalad get</code> to retrieve the
              content of annexed files on demand.</dd>
        <br>
        <dt class="fragment fade-in">Datasets preserve their history.</dt> <dd class="fragment fade-in">The superdataset records only the <i>version state</i> of the subdataset.</dd>

      </ul>
  </section>
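  <section data-markdown><script type="text/template">
### Sketch - Dataset consumption and nesting

A minimal sketch of the points on the previous summary slide, assuming a local stand-in dataset instead of a remote one so that no network access is needed. All names (`source-ds`, `standalone-clone`, `super-ds`, `inputs/data`, `data.txt`) are hypothetical.

```shell
# Work in a scratch directory so nothing existing is touched
SCRATCH="$(mktemp -d)"
cd "$SCRATCH"

# Stand-in source dataset with one annexed file (default configuration)
datalad create source-ds
echo "some data" > source-ds/data.txt
datalad save -d source-ds -m "Add data file"

# Install the dataset "on its own": source first, optional target path second
datalad clone source-ds standalone-clone

# Install it as a subdataset: -d points to the root of the superdataset
datalad create super-ds
cd super-ds
datalad clone -d . ../source-ds inputs/data

# Annexed file content is retrieved on demand
datalad get inputs/data/data.txt

# List the registered subdataset and its state
datalad subdatasets
```
</script></section>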


  <section>
      <h3>Summary - Reproducible execution</h3>

      <ul>
        <dt class="fragment fade-in"><code>datalad run</code> records a command and
            its impact on the dataset.</dt>
          <dd class="fragment fade-in">All dataset modifications are saved - use it
              in a clean dataset.</dd>
        <br>
        <dt class="fragment fade-in">Data/directories specified as <code>--input</code>
            are retrieved prior to command execution.</dt>
          <dd class="fragment fade-in"> Use one flag per input.</dd>
        <br>
        <dt class="fragment fade-in">Data/directories specified as <code>--output</code>
            will be unlocked for modifications prior to a rerun of the command. </dt>
          <dd class="fragment fade-in">It's optional to specify, but helpful for recomputations.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad containers-run</code> can be used
            to capture the software environment as provenance.</dt>
          <dd class="fragment fade-in">It ensures computations are run in the desired software setup.
              Supports Docker and Singularity containers.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad rerun</code> can automatically re-execute run-records later.</dt>
          <dd class="fragment fade-in">They can be identified with any commit-ish (hash, tag, range, ...)</dd>

      </ul>
  </section>
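  <section data-markdown><script type="text/template">
### Sketch - Reproducible execution

A minimal sketch of recording and re-executing a command, as summarized on the previous slide. The dataset name `analysis`, the files `raw.txt` and `upper.txt`, and the `tr` command are assumptions chosen for illustration; `CONTAINER_NAME` is a placeholder for a container registered beforehand.

```shell
# Work in a scratch directory so nothing existing is touched
SCRATCH="$(mktemp -d)"
cd "$SCRATCH"
datalad create analysis
cd analysis

# Start from a clean dataset: save the input first
echo "hello datalad" > raw.txt
datalad save -m "Add raw input"

# Record the command together with its input and output (one flag per file)
datalad run -m "Convert text to upper case" \
  --input raw.txt \
  --output upper.txt \
  "tr a-z A-Z < {inputs} > {outputs}"

# Re-execute the run record stored in the most recent commit
datalad rerun HEAD

# With the datalad-container extension installed, the same call pattern
# can also capture the software environment:
# datalad containers-run -n CONTAINER_NAME --input raw.txt --output upper.txt "tr a-z A-Z < {inputs} > {outputs}"
```
</script></section>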


</section>


			</div>
		</div>

		<script src="../reveal.js/dist/reveal.js"></script>
		<script src="../reveal.js/plugin/notes/notes.js"></script>
		<script src="../reveal.js/plugin/markdown/markdown.js"></script>
		<script src="../reveal.js/plugin/highlight/highlight.js"></script>
        <script src="../custom_functions.js"></script>
		<script>
			// More info about initialization & config:
			// - https://revealjs.com/initialization/
			// - https://revealjs.com/config/
			Reveal.initialize({
				hash: true,
				// The "normal" size of the presentation, aspect ratio will be preserved
				// when the presentation is scaled to fit different resolutions. Can be
				// specified using percentage units.
				width: 1280,
				height: 960,
				// Factor of the display size that should remain empty around the content
				margin: 0.3,
				// Bounds for smallest/largest possible scale to apply to content
				minScale: 0.2,
				maxScale: 1.0,

				controls: true,
				progress: true,
				history: true,
				center: true,
				slideNumber: 'c',
				pdfSeparateFragments: false,
				pdfMaxPagesPerSlide: 1,
				pdfPageHeightOffset: -1,
				transition: 'slide', // none/fade/slide/convex/concave/zoom
				// Learn about plugins: https://revealjs.com/plugins/
				plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
			});
		</script>
	</body>
</html>