<!doctype html>
<html>
	<head>
		<meta charset="utf-8">
		<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

		<!-- Edit me start! -->
		<title>This is where your title goes</title>
		<meta name="description" content=" This is where you put a short description ">
		<meta name="author" content=" Your Name ">
		<!-- Edit me end! -->

		<link rel="stylesheet" href="../reveal.js/dist/reset.css">
		<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
		<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">

		<!-- Theme used for syntax highlighted code -->
		<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
	</head>
	<body>
		<div class="reveal">
			<div class="slides">

  <!--...Datalad Basics...-->

  <section>
<section>
<script src="https://cdn.logwork.com/widget/countdown.js"></script>
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
   data-style="columns" data-timezone="Europe/Berlin" data-date="2022-04-21 09:45">
   "Motivation & Basics of version control" starts in </a>
</section>
</section>

<section>
  <section>
      <h2>Participation modes </h2>
      <iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
              style="border: 0" width="800" height="800"></iframe>
  </section>


  <section>
    <h2>Prerequisites: Installation and Configuration</h2>
          <ul style="font-size:30px">
          <li  data-fragment-index="1" class="fragment fade-in">Your installed version of DataLad should be 0.16.1</li>
          <pre class="fragment fade-in" data-fragment-index="1"><code data-fragment-index="1" class="fragment fade-in">datalad --version
0.16.1</code></pre>

          <li data-fragment-index="2" class="fragment fade-in">DataLad relies on Git to create a revision history with detailed information on
              what was changed, when, and how. Therefore, you should tell Git who you are and
              configure a Git identity (name and email)</li>
          <pre data-fragment-index="2" class="fragment fade-in"><code data-fragment-index="2" class="fragment fade-in bash">$ git config --list
user.name=Adina Wagner
user.email=adina.wagner@t-online.de
[...]
</code></pre>
      <li data-fragment-index="3" class="fragment fade-in">Set a Git identity using
          <pre data-fragment-index="3" class="fragment fade-in"><code data-fragment-index="3" class="fragment fade-in">$ git config --global user.name "Adina Wagner"
$ git config --global user.email "adina.wagner@t-online.de"</code></pre>
          Find installation and configuration
      instructions at <a href="http://handbook.datalad.org/en/latest/intro/installation.html" target="_blank">
              handbook.datalad.org</a> </li></ul>
  </section>

  <section data-transition="None">
      <h2>Using DataLad</h2>
      <ul>
          <div>
          <li>DataLad can be used from the command line</li>
          <pre><code>datalad create mydataset</code></pre></div>
          <div>
          <li>... or with its Python API</li>
          <pre><code class="python">import datalad.api as dl
dl.create(path="mydataset")</code></pre></div>
          <div class="fragment fade-in">
          <li>... and other programming languages can use it via system call</li>
          <pre><code class="r"># in R
> system("datalad create mydataset")
</code></pre></div>
      </ul>
  </section>

  <section data-transition="None">
      <h2>Using DataLad</h2>
      <ul style="font-size:30px">
          <li class="fragment fade-in">Every DataLad command consists of a main
              command followed by a sub-command. The main and the sub-command can have options.
              <img src="../pics/command-structure.png">
          </li>
          <li class="fragment fade-in"> Example (main command, subcommand, several subcommand options):
              <pre><code>$ datalad save -m "Saving changes" --recursive </code></pre>
          </li>
          <li class="fragment fade-in">Use <em>--help</em> to find out more about any (sub)command
              and its options, including detailed description and examples (<em>q</em> to close). Use <em>-h</em> to get a short
          overview of all options
          <pre><code>$ datalad save -h
      Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
                    [-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
                    [--version]
                    [PATH ...]

Use '--help' to get more comprehensive information.
          </code></pre></li>
      </ul>
  </section>
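  <section data-markdown><script type="text/template">
## Using DataLad: assembling a command

A minimal sketch of the command structure (main command, subcommand, options), assuming you want to build a call from Python before handing it to a system call. The helper name `datalad_command` is hypothetical and not part of DataLad; `--message` and `--recursive` are the long forms of `-m` and `-r`.

```python
import shlex


def datalad_command(subcommand, *arguments, **options):
    """Assemble the argument list for a `datalad <subcommand>` call.

    Hypothetical helper for illustration only; it builds the list
    but does not execute anything.
    """
    argv = ["datalad", subcommand]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)                # switch such as --recursive
        elif value is False or value is None:
            continue                         # option not requested
        else:
            argv.extend([flag, str(value)])  # option that takes a value
    argv.extend(arguments)                   # positional arguments (paths) last
    return argv


argv = datalad_command("save", message="Saving changes", recursive=True)
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` would execute the command, provided DataLad is installed.
  </script></section>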


  <section>
      <h2>DataLad Datasets</h2>

      <ul>
          <li>DataLad's core data structure</li>
          <ul>
              <li>Dataset = A directory managed by DataLad</li>
              <li>Any directory on your computer can be managed by DataLad.</li>
              <li class="fragment fade-in">Datasets can be <i>created</i> (from scratch) or <i>installed</i></li>
              <li class="fragment fade-in">Datasets can be nested: <i>linked subdirectories</i></li>
          </ul>
          <li class="fragment fade-in">Let's start by creating a dataset:</li>
          <div class="fragment fade-in"><pre><code>$ datalad create -c text2git my-dataset</code></pre></div>
      </ul>
      <a class="fragment fade-in" style="font-size:25px" href="https://psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#getting-started-create-an-empty-dataset" target="_blank">
          Code: psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#getting-started-create-an-empty-dataset
      </a>
  <aside class="notes">
      <li>anything can be managed: CV, website, music library, phd</li>
      <li>show this on the manuscript repo: history, looks/feels</li>
  </aside>
  </section>

<section data-transition="None">
    <h2>DataLad Datasets</h2>
    A DataLad dataset is a joint Git + git-annex repository
    <img src="../pics/slides/pics/datalad_sandwhich_tuned/sandwhich03.svg">
</section>
</section>

<section>
  <section data-transition="None">
      <h3>What is version control?</h3>
      <img height="400" src="../pics/turingway/VersionControl.svg">
      <img height="400" src="../pics/turingway/ProjectHistory.svg">
      <imgcredit>Illustration adapted from Scriberia and The Turing Way</imgcredit>
      <ul>
          <li class="fragment fade-in">keep things organized</li>
          <li class="fragment fade-in">keep track of changes</li>
          <li class="fragment fade-in">revert changes or go back to previous states</li>
      </ul>
  </section>

  <section data-transition="None">
      <h2>Why version control?</h2>
      <img src="../pics/final.png" style="box-shadow: 10px 10px 8px #888888; height: 600px" height="600"><br>
  </section>

  <section>
      <h2>Version Control</h2>

      <ul>
          <li>DataLad knows two things: Datasets and files</li>
          <img class="fragment fade-in" data-fragment-index="1" style="box-shadow: 5px 5px 3px #888888" src="../pics/artwork/src/dataset.svg" height="330"> <img style="box-shadow: 5px 5px 3px #888888" height="330" class="fragment fade-in" data-fragment-index="2" src="../pics/artwork/src/local_wf.svg">
       </ul><br>
      <li class="fragment fade-in">
          Every file you put into a dataset can be easily version-controlled,
          regardless of size, with the same command: <em>datalad save</em> </li>
  </section>


  <section>
      <h2>Local version control</h2>

      <p>Procedurally, version control is easy with DataLad!</p>
      <img class="fragment fade-in" src="../pics/local_wf.svg" height="500"> <!-- .element: class="fragment" -->
      <br>

      <b class="fragment fade-in">Advice:</b>
      <ul>
        <li class="fragment fade-in">Save <i>meaningful</i> units of change</li>
        <li class="fragment fade-in">Attach helpful commit messages</li>
      </ul>
  </section>

  <section data-markdown><script type="text/template" >

  ### This means: You can also version control data! <!-- .element: class="fragment" -->

  <pre><code class="bash" style="max-height:none">$ datalad save \
     -m "Adding raw data from neuroimaging study 1" \
     sub-*
  add(ok): sub-1/anat/T1w.json (file)
  add(ok): sub-1/anat/T1w.nii.gz (file)
  add(ok): sub-1/anat/T2w.json (file)
  add(ok): sub-1/anat/T2w.nii.gz (file)
  add(ok): sub-1/func/sub-1-run-1_bold.json (file)
  add(ok): sub-1/func/sub-1-run-1_bold.nii.gz (file)
  add(ok): sub-10/anat/T1w.json (file)
  add(ok): sub-10/anat/T1w.nii.gz (file)
  add(ok): sub-10/anat/T2w.json (file)
  add(ok): sub-10/anat/T2w.nii.gz (file)
    [110 similar messages have been suppressed]
  save(ok): . (dataset)
  action summary:
    add (ok: 120)
    save (ok: 1)
  </code></pre>  <!-- .element: class="fragment" -->

  </script>
  </section>

  <section data-markdown><script type="text/template" >
  ## Version Control
  * Your dataset can be a complete research log, capturing everything that was done, when, by whom, and how
  ![](../pics/researchlog.png)
  * Interact with the history:
    * reset your dataset (or subset of it) to a previous state,
    * throw out changes or bring them back,
    * find out what was done when, how, why, and by whom
    * Identify precise versions: Use data in the most recent version, or the one from 2018, or...
    * ...
  </script>
  </section>


  <section>
      <h2>Preview: Start to record provenance</h2>
      <ul>
          <li>
              Have you ever saved a PDF onto your computer to read later, but forgot
              where you got it from?
          </li>
          <li class="fragment fade-in">
              Digital Provenance = <i>"The tools and processes used to create a
              digital file, the responsible entity, and when and where the process
              events occurred"</i>
          </li>
          <li class="fragment fade-in">
              The history of a dataset already contains provenance, but there is more
              to record - for example: Where does a file come from?
              <code>datalad download-url</code> is helpful
          </li>
      </ul>
  </section>

    <section>
      <h3>Summary - Local version control</h3>

  <dl>
        <dt class="fragment fade-in"><code>datalad create</code> creates an empty dataset.</dt> <dd class="fragment fade-in">Configurations (<b>-c yoda</b>, <b>-c text2git</b>) are useful (details soon).</dd>
        <br>
        <dt class="fragment fade-in">A dataset has a <i>history</i> to track files and their modifications. </dt><dd class="fragment fade-in">Explore it with Git (<b>git log</b>) or external tools (e.g., <b>tig</b>).</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad save</code> records the dataset or file state to the history. </dt><dd class="fragment fade-in">Concise <b>commit messages</b> should summarize the change for future you and others.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad download-url</code> obtains web content and records its origin. </dt><dd class="fragment fade-in">It even takes care of saving the change.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad status</code> reports the current state of the dataset.</dt>
      <dd class="fragment fade-in">A clean dataset status (no modifications, no untracked files) is good practice.</dd>
      </dl>
  </section>


  <section>
      <h2>Questions!</h2>
      <small>Awkward silence can be bridged with awkward MC questions :) </small>
          <iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
              style="border: 0" width="930" height="900"></iframe>
  </section>
  </section>

<section>
    <section>
        <h2>Teaser: Time-travelling</h2>
        <small>Comprehensive walk-through: <a href="http://handbook.datalad.org/en/latest/basics/101-137-history.html" target="_blank">
            handbook.datalad.org/basics/101-137-history.html
        </a></small>
        <ul style="font-size:30px">
            <li>Mistakes are not forever anymore: Past changes can transparently be undone</li>
            <li>Become a time-bender: Travel back in time or rewrite history</li>
            <li class="fragment fade-in">Prerequisite: Understand Git IDs and "refs"</li>
            <ul>
                <li class="fragment fade-in">Commit hash/Commit SHA: A 40-character string identifying each commit</li>
                <li class="fragment fade-in">Branch names, e.g., <em>main</em></li>
                <li class="fragment fade-in">Tags, e.g., <em>v.0.1</em></li>
                <li class="fragment fade-in">A pointer to the checked-out (current) commit on the current branch, <em>HEAD</em></li>
            </ul>
        </ul>
        <img class="fragment fade-in" src="../pics/commit-ref.png"><br>
              <a class="fragment fade-in" style="font-size:25px" href="https://psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#breaking-things-and-repairing-them" target="_blank">
          Code: psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#breaking-things-and-repairing-them
      </a>
    </section>
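    <section data-markdown><script type="text/template">
## Telling hashes from names

A small sketch, assuming the 40-character hexadecimal format described on the previous slide. The function name `is_full_commit_hash` is made up for this example, and the check is purely syntactic: it does not ask Git whether the commit exists.

```python
import re

# A full commit hash: exactly 40 lowercase hexadecimal characters.
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def is_full_commit_hash(ref):
    """Return True if `ref` looks like a full 40-character commit hash."""
    return FULL_SHA1.fullmatch(ref) is not None


for ref in ["9fbc0c18133aa07b215d81b808b0a83bf01b1984", "main", "v.0.1", "HEAD"]:
    kind = "commit hash" if is_full_commit_hash(ref) else "symbolic name"
    print(f"{ref}: {kind}")
```

Branch names, tags, and `HEAD` are symbolic names that Git resolves to such a hash.
    </script></section>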

    <section>
        <h2>Summary: Interacting with Git's history (teaser)</h2>
  <dl>
      <dt class="fragment fade-in">Interactions with Git's history require Git commands, but are immensely powerful</dt><dd class="fragment fade-in">More in <a href="http://handbook.datalad.org/en/latest/basics/101-137-history.html" target="_blank">
            handbook.datalad.org/basics/101-137-history.html
        </a></dd>
       <br>
      <dt class="fragment fade-in"><code>git restore</code> is a dangerous (!), but sometimes useful command:</dt>
      <dd class="fragment fade-in"> It removes unsaved modifications to restore files to a past, saved state. What it has removed cannot be brought back!</dd>
        <br>
        <dt class="fragment fade-in"><code>git revert [hash]</code> transparently undoes a past commit</dt><dd class="fragment fade-in">It will create a new entry in the revision history about this.</dd>
        <br>
        <dt class="fragment fade-in">Commands that will be introduced later:</dt>
        <dd class="fragment fade-in"><code>git checkout</code> lets you time-travel.</dd>
        <dt class="fragment fade-in">Commands that are out of scope but useful to know:</dt>
        <dd class="fragment fade-in"><code>git rebase</code> changes and <code>git reset</code> rewinds history without creating a commit about it (see Handbook chapter for examples).</dd>
        <dt class="fragment fade-in">A life-saver that is not well-known: <code>git reflog</code></dt><dd class="fragment fade-in">A time-limited log of every past action. It can help undo every mistake except those made with <code>git restore</code> and <code>git clean</code>.</dd>
       </dl>
    </section>

  <section>
      <h2>Questions!</h2>
      <small>Awkward silence can be bridged with awkward MC questions :) </small>
          <iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
              style="border: 0" width="930" height="900"></iframe>
  </section>
</section>

  <section>
<section>
<h2>A look underneath the hood</h2>
    <h4>(In-depth explanations of how and why things work, with plenty of teasers to additional features)</h4>
</section>

         <section data-transition="None" style="vertical-align:top">
        <h3>There are two version control tools at work - why?</h3>
        <p class="fragment fade-in">Git does not handle large files well.
            <div class="r-stack">
            <img class="fragment" src="../pics/gitsnapshot.png">
        </div>
        </p>
    </section>

    <section data-transition="None">
        <h3>There are two version control tools at work - why?</h3>
        <p>Git does not handle large files well.
            <img src="../pics/gitsnapshot2.png">
        </p>
        <p class="fragment fade-in">
        And repository hosting services refuse to handle large files:
        <img src="../pics/pushing_large_files_to_Git.png"></p>
        <p style="z-index: 100;position: fixed; font-size:35px;margin-top:-450px;margin-bottom:300px;margin-left:1000px">
            <img class="fragment" src="../pics/horrofied.png" height="380px"></p>
        <p class="fragment fade-in">git-annex to the rescue! Let's take a look at how it works</p>
    </section>

  <section data-markdown><script type="text/template" >
  ## Consuming datasets
  * A dataset can be created from scratch/existing directories:
  <pre><code class="bash" style="max-height:none">$ datalad create mydataset
  [INFO] Creating a new annex repo at /home/adina/mydataset
  create(ok): /home/adina/mydataset (dataset)
  </code></pre>
  * but datasets can also be installed from paths or from URLs:
  <pre><code class="bash" style="max-height:none">$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess HCP
  install(ok): /tmp/HCP (dataset)
  </code></pre>
            <small>Hint: Did you know that you can get the <a href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank"> Human Connectome Project Open Access Data </a> as a Dataset?</small>
  </script>
 </section>

  <section data-transition="None">
      <h2>Consuming datasets</h2>

    <ul>
      <li class="fragment fade-in">Here's how to get a dataset:</li>
        <img class="fragment fade-in" src="../pics/clonedata.gif" height="700">

    </ul>
  </section>
  <section data-transition="None">
      <h2>Consuming datasets</h2>

    <ul>
      <li>Here's how a dataset looks after installation:</li>
        <img class="fragment fade-in" src="../pics/getdata.gif" height="700">

    </ul>
  </section>

  <section data-transition="None">
      <h2>Plenty of data, but little disk-usage</h2>
      <ul>
          <li class="fragment fade-in-then-semi-out">Cloned datasets are lean.
              "Meta data" (file names, availability) are present, but <b>no file content</b>:</li>
  <pre class="fragment fade-in"><code>$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
  install(ok): /tmp/studyforrest-data-phase2 (dataset)
  $ cd studyforrest-data-phase2 && du -sh
  18M	.</code></pre>

  <li class="fragment fade-in-then-semi-out"> files' contents can be retrieved on demand:</li>
      </ul>
  <pre class="fragment fade-in"><code>$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
  get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]</code></pre>

  <li class="fragment fade-in">Have access to more data on your computer than you have disk space:</li>
  <pre class="fragment fade-in"><code># eNKI dataset (1.5TB, 34k files):
$ du -sh
1.5G	.
# HCP dataset (~200TB, >15 million files)
$ du -sh
48G	. </code></pre>
  </section>

  <section data-markdown data-transition="None"> <script type="text/template">
  ## Plenty of data, but little disk-usage

  Drop file content that is not needed:<!-- .element: class="fragment fade-in" -->
  <pre class="fragment fade-in-then-semi-out"><code>$ datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
  drop(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file)</code></pre>
  When files are dropped, only "meta data" stays behind, and they can be re-obtained on demand.<!-- .element: class="fragment fade-in" -->
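<pre><code class="python">import datalad.api as dl
dl.get('input/sub-01')    # retrieve the file content needed for the analysis
# [really complex analysis]
dl.drop('input/sub-01')   # free the disk space again once done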
</code></pre><!-- .element: class="fragment fade-in" -->
  </script></section>

  <section>
      <h2>Git versus Git-annex</h2>
      <dl>
          <dt>Data in datasets is either stored in Git or git-annex</dt>
          <dd>By default, everything is <i>annexed</i>, i.e., stored in a dataset annex by git-annex</dd><br>

          <br>
                    <small>
          <table>
              <tr>
                  <td><b>Git</b></td>
                  <td><b>git-annex</b></td>
              </tr>
              <tr>
                  <td>handles <b>small</b> files well (text, code)</td>
                  <td>handles <b>all</b> types and sizes of files well</td>
              </tr>
              <tr>
                  <td>file contents are in the Git history
                      and will be <b>shared</b> upon git/datalad push</td>
                  <td>file contents are in the annex. Not necessarily shared</td>
              </tr>
              <tr>
                  <td>Shared with every dataset clone</td>
                  <td><b>Can be kept private</b> on a per-file level when sharing the dataset</td>
              </tr>
              <tr>
                  <td>Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files </td>
                  <td>Useful: Large files, private files</td>
              </tr>
          </table>
              </small>
          <br><br>
      </dl>
  </section>

  <section>
      <h2>Git versus Git-annex</h2>
      <small>Useful background information for demo later. Read
          <a href="http://handbook.datalad.org/en/latest/basics/101-115-symlinks.html" target="_blank">
          this handbook chapter</a> for details
      </small><br>
      Git and Git-annex handle files differently: annexed files are stored in an annex.
      File content is hashed & only content-identity is committed to Git.
      <ul>
        <table>
            <tr>
                <td>
                    <li>Files stored in Git are modifiable, files stored in Git-annex are content-locked</li>
                </td>
                <td width="60%">
                    <img src="../pics/git_vs_gitannex.svg" height="500">
                </td>
            </tr>
                  </table>

         <li>Annexed contents are not available right after cloning,
             only content identity and availability information (as they are stored in Git).
             Everything that is annexed needs to be retrieved with <code>datalad get</code> from wherever it is stored.
         </li>
      </ul>
  </section>

  <section>
      <h2>Git versus Git-annex</h2>
      <img height="500" src="../pics/artwork/src/publishing/publishing_gitvsannex.svg">
  </section>

  <section>
      <h2>Git versus Git-annex</h2>
      <ul>
          When sharing datasets with someone without access to the same computational
          infrastructure, annexed data is not necessarily stored together with the rest
          of the dataset (more in the <b>session on publishing</b>).
      </ul>
      <img src="../pics/services_connected.png" height="500">
      <ul>
          Transport logistics exist to interface with all major storage providers.
          If the one you use isn't supported, let us know!
      </ul>
  </section>


  <section>
      <h2>Git versus Git-annex</h2>
      <ul>
          Users can decide which files are annexed:
          <br><br>
          <li><b>Pre-made run-procedures</b>, provided by DataLad (e.g., <code>text2git</code>, <code>yoda</code>)
              or created and shared by users
              (<a href="http://handbook.datalad.org/en/latest/basics/101-124-procedures.html" target="_blank">Tutorial</a>) </li>
          <li>Self-made configurations in <code>.gitattributes</code> (e.g., based on file type,
              file/path name, size, ...; <a href="http://handbook.datalad.org/en/latest/basics/101-123-config2.html#gitattributes" target="_blank">
                  rules and examples
              </a> )</li>
          <li>Per-command basis (e.g., via <code>datalad save --to-git</code>)</li>
      </ul>
  </section>


  <section>
      <h2><em>text2git</em>: Text versus binary files</h2>
      <iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
              style="border: 0" width="930" height="900"></iframe>
      <small>An overview of text- versus binary files and implications for version control is in
      <a href="https://psychoinformatics-de.github.io/rdm-course/02-structuring-data/index.html#file-types-text-vs-binary" target="_blank">
          psychoinformatics-de.github.io/rdm-course/02-structuring-data/index.html#file-types-text-vs-binary
      </a> </small>
  </section>


  <section data-transition="None">
      <h2>Disk-space aware workflows</h2>
      <ul>
          <li class="fragment fade-in-then-semi-out"> Clone the input data:</li>
                  <pre class="fragment fade-in"><code>$ datalad clone git@github.com:datalad-datasets/machinelearning-books.git
install(ok): /tmp/machinelearning-books (dataset)
$ cd machinelearning-books && du -sh
348K	.</code></pre>
          <pre class="fragment fade-in"><code>$ ls
A.Shashua-Introduction_to_Machine_Learning.pdf
B.Efron_T.Hastie-Computer_Age_Statistical_Inference.pdf
C.E.Rasmussen_C.K.I.Williams-Gaussian_Processes_for_Machine_Learning.pdf
D.Barber-Bayesian_Reasoning_and_Machine_Learning.pdf
[...]</code></pre>
          <li  class="fragment fade-in-then-semi-out"> Retrieve annexed file's contents on demand:</li>
  <pre class="fragment fade-in"><code>$ datalad get A.Shashua-Introduction_to_Machine_Learning.pdf
  get(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [from web...]</code></pre>
  <li  class="fragment fade-in-then-semi-out"> Drop annexed file's contents when done:</li>

  <pre class="fragment fade-in-then-semi-out"><code>$ datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf
  drop(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]</code></pre>
      </ul>
  <aside class="notes">
  Idea behind datalad: Enable a similar level of tooling and culture for the distribution and version control of data as it is present for open source software development
  </aside>
  </section>

  <section>
      <h2>Distributed availability</h2>
      <ul  style="font-size:30px">
      <li class="fragment fade-in" data-fragment-index="1">git-annex conceptualizes file availability information as a decentralized network.
          A file can exist in multiple different locations. <em>git annex whereis</em>
          tells you which are known:</li>
          <pre class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">$ git annex whereis inputs/images/chinstrap_02.jpg
whereis inputs/images/chinstrap_02.jpg (1 copy)
	00000000-0000-0000-0000-000000000001 -- web
	c1bfc615-8c2b-4921-ab33-2918c0cbfc18 -- adina@muninn:/tmp/my-dataset [here]

  web: https://unsplash.com/photos/8PxCm4HsPX8/download?force=true
ok
</code></pre>
          <li class="fragment fade-in" data-fragment-index="2">
              If a file has no other known storage locations, <em>drop</em> will warn
          </li>
          <ul style="font-size:25px">
          <li class="fragment fade-in" data-fragment-index="3">Here is a file with a registered remote location (the web)</li>
          <pre class="fragment fade-in" data-fragment-index="3"><code class="fragment fade-in" data-fragment-index="3">$ datalad drop inputs/images/chinstrap_02.jpg
drop(ok): /home/my-dataset/inputs/images/chinstrap_02.jpg (file)
$ datalad get inputs/images/chinstrap_02.jpg
get(ok): inputs/images/chinstrap_02.jpg (file)
</code></pre>
          <li class="fragment fade-in" data-fragment-index="3">Here is a file without a registered remote location (the web)
          </li>
          <pre class="fragment fade-in" data-fragment-index="3"><code class="fragment fade-in" data-fragment-index="3">$ datalad drop inputs/images/chinstrap_01.jpg
drop(error): inputs/images/chinstrap_01.jpg (file)
             [unsafe; Could only verify the existence of 0 out of 1 necessary copy;
             (Use --reckless availability to override this check, or adjust numcopies.)]</code></pre>
</ul>
      <li class="fragment fade-in" data-fragment-index="4">Delineation and advantages of decentralized versus centralized RDM: <a href="https://doi.org/10.1515/nf-2020-0037" target="_blank">
          In defense of decentralized research data management</a></li>

       </ul>
  </section>
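  <section data-markdown><script type="text/template">
## Distributed availability: the drop check

A simplified model of the safety check behind the error message on the previous slide, not git-annex's actual implementation. It assumes that a drop is allowed only if at least `numcopies` other copies could be verified, unless the check is overridden. The function and parameter names are made up for this sketch; `reckless` stands in for `--reckless availability`.

```python
def drop_is_safe(verified_other_copies, numcopies=1, reckless=False):
    """Decide whether dropping the local copy of a file is allowed.

    verified_other_copies: copies whose existence elsewhere was verified
    numcopies: how many copies must remain available
    reckless: skip the availability check entirely
    """
    if reckless:
        return True
    return verified_other_copies >= numcopies


# File with a registered web source that could be verified: drop succeeds.
print(drop_is_safe(verified_other_copies=1))
# File that exists only here: 0 out of 1 necessary copy, so drop is refused.
print(drop_is_safe(verified_other_copies=0))
```

With `reckless=True` the second call would also be allowed, at the risk of losing the only copy.
  </script></section>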

  <section>
      <h2>Data protection</h2>
      Why are annexed contents write-protected? (part 1) <br><br>
        <ul style="font-size:30px">
            <li>Where the filesystem allows it, annexed files are symlinks:
            <pre><code>$ ls -l inputs/images/chinstrap_01.jpg
lrwxrwxrwx 1 adina adina 132 Apr  5 20:53 inputs/images/chinstrap_01.jpg -> ../../.git/annex/objects/1z/
xP/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg/MD5E-s725496--2e043a5654cec96aadad554fda2a8b26.jpg
</code></pre><small>(PS: especially useful in datasets with many identical files) </small></li>
            <li>The symlink reveals git-annex's internal data organization based on identity hash:
            <pre><code>$ md5sum inputs/images/chinstrap_01.jpg
2e043a5654cec96aadad554fda2a8b26  inputs/images/chinstrap_01.jpg
</code></pre></li>
            <li class="fragment fade-in">git-annex write-protects files to keep this symlink functional -
                Changing file contents without git-annex knowing would make the hash change and the symlink point to nothing</li>
            <li class="fragment fade-in">To (temporarily) remove the write-protection one can <em>unlock</em> the file</li>
        </ul>
  </section>
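  <section data-markdown><script type="text/template">
## Data protection: where the file name comes from

A minimal sketch of how a key like `MD5E-s725496--2e04...26.jpg` relates to file content, assuming the pattern visible in the symlink above: size in bytes, MD5 checksum, file extension. The function name `md5e_key` is hypothetical, and the extension handling is simplified to the last suffix only.

```python
import hashlib
from pathlib import PurePosixPath


def md5e_key(content, filename):
    """Build an MD5E-style key from file content and file name."""
    digest = hashlib.md5(content).hexdigest()    # content identity
    extension = PurePosixPath(filename).suffix   # simplified: last suffix only
    return f"MD5E-s{len(content)}--{digest}{extension}"


key = md5e_key(b"hello", "greeting.txt")
print(key)
# Any change to the content yields a different key:
print(md5e_key(b"hello!", "greeting.txt") == key)
```

Because the key depends on the content, modifying a file in place would leave the symlink pointing at a name that no longer matches.
  </script></section>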

  <section data-transition="fade">
      <h2>Detour & Teaser: Reproducible data analysis</h2>
       Your past self is the worst collaborator:
      <img src="../pics/ownlegacycode_phd.png" height="500">
    <imgcredit>Full comic at <a href="http://phdcomics.com/comics.php?f=1689">http://phdcomics.com/comics.php?f=1979</a></imgcredit>
      <small>Code: <a href="https://psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#data-processing" target="_blank">
          psychoinformatics-de.github.io/rdm-course/01-content-tracking-with-datalad/index.html#data-processing</a> </small>
  </section>


  <section data-transition="None">
      <h2>Reproducible execution & provenance capture</h2>

      <p style="font-size:30px"><em>datalad run</em> wraps a command execution and records its impact on a dataset.
         <img class="fragment fade-in" src="../pics/run_prov_0.svg"></p>
</section>

  <section data-transition="None">
      <h2>Reproducible execution & provenance capture</h2>

      <p style="font-size:30px"><em>datalad run</em> wraps a command execution and records its impact on a dataset.
          <pre style="max-height:none"><code style="max-height:none">commit 9fbc0c18133aa07b215d81b808b0a83bf01b1984 (HEAD -> main)
Author: Adina Wagner [adina.wagner@t-online.de]
Date:   Mon Apr 18 12:31:47 2022 +0200

    [DATALAD RUNCMD] Convert the second image to greyscale

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "python code/greyscale.py inputs/images/chinstrap_02.jpg outputs/im>
     "dsid": "418420aa-7ab7-4832-a8f0-21107ff8cc74",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

diff --git a/outputs/images_greyscale/chinstrap_02_grey.jpg b/outputs/images_gr>
new file mode 120000
index 0000000..5febc72
--- /dev/null
+++ b/outputs/images_greyscale/chinstrap_02_grey.jpg
@@ -0,0 +1 @@
+../../.git/annex/objects/19/mp/MD5E-s758168--8e840502b762b2e7a286fb5770f1ea69.>
\ No newline at end of file
</code></pre>
            <p style="font-size:30px">The resulting commit's hash (or any other identifier) can be used
      to automatically re-execute a computation (more on this tomorrow)</p> <!-- .element: class="fragment" -->
  </section>
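
  <section data-markdown>
    <textarea data-template>
## Reading a run record

The run record is plain JSON inside the commit message, so other tools can read it.
A minimal Python sketch, assuming the message layout shown on the previous slide.
`extract_run_record` is a hypothetical helper (not part of DataLad), and the `cmd`
value uses placeholders because the original command is truncated on the slide.

```python
import json

# Marker lines that surround the JSON record in the commit message on the slide.
START = "=== Do not change lines below ==="
END = "^^^ Do not change lines above ^^^"


def extract_run_record(commit_message: str) -> dict:
    """Hypothetical helper: return the JSON run record embedded in a message."""
    body = commit_message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)


# Example message modelled on the slide; INPUT and OUTPUT are placeholders.
message = """[DATALAD RUNCMD] Convert the second image to greyscale

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python code/greyscale.py INPUT OUTPUT",
 "dsid": "418420aa-7ab7-4832-a8f0-21107ff8cc74",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

record = extract_run_record(message)
print(record["cmd"], "exited with", record["exit"])
```

Empty `inputs` and `outputs` lists, as in this record, mean that the command was run without declaring them.
    </textarea>
  </section>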


  <section data-transition="None">
      <h2>Data protection</h2>
      Why are annexed contents write-protected? (part 2) <br><br>
        <ul style="font-size:30px">
            <li>When you try to modify an annexed file without unlocking it, you will see
                "Permission denied" errors:
            <pre><code>Traceback (most recent call last):
  File "/home/bob/Documents/rdm-warmup/example-dataset/code/greyscale.py", line 20, in &lt;module&gt;
    grey.save(args.output_file)
  File "/home/bob/Documents/rdm-temporary/venv/lib/python3.9/site-packages/PIL/Image.py", line 2232, in save
    fp = builtins.open(filename, "w+b")
PermissionError: [Errno 13] Permission denied: 'outputs/images_greyscale/chinstrap_02_grey.jpg'
</code></pre></li>
            <li class="fragment fade-in">Use <em>datalad unlock</em> to make the file modifiable.
                Under the hood (on file systems that support symlinks), this replaces the symlink with a regular, writable file:
            <pre><code>$ datalad unlock outputs/images_greyscale/chinstrap_02_grey.jpg
$ ls -l outputs/images_greyscale/chinstrap_02_grey.jpg
-rw-r--r-- 1 adina adina 758168 Apr 18 12:31 outputs/images_greyscale/chinstrap_02_grey.jpg</code></pre></li>
            <li class="fragment fade-in"><em>datalad save</em> locks the file again.
                Locking and unlocking ensure that git-annex always finds the right version of a file.</li>
        </ul>
  </section>
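
  <section data-markdown>
    <textarea data-template>
## Data protection: a sketch

The locked and unlocked states can be imitated in a few lines of Python.
This is a simplified illustration, not git-annex's actual implementation. It
assumes a POSIX file system with symlink support, and the `objects` directory
and `content-key` file name are hypothetical stand-ins.

```python
import os
import shutil
import stat
import tempfile

workdir = tempfile.mkdtemp()
store = os.path.join(workdir, "objects")       # stand-in for the annex object store
os.makedirs(store)

# Locked: the content is stored read-only, the file name is only a symlink to it.
content = os.path.join(store, "content-key")   # hypothetical key name
with open(content, "wb") as f:
    f.write(b"image bytes")
os.chmod(content, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

name = os.path.join(workdir, "chinstrap_02_grey.jpg")
os.symlink(content, name)
locked = os.path.islink(name)

# Unlocked: the symlink is replaced by a regular, writable copy of the content.
os.remove(name)
shutil.copyfile(content, name)
os.chmod(name, 0o644)
unlocked = not os.path.islink(name)
writable = bool(os.stat(name).st_mode & stat.S_IWUSR)

print(locked, unlocked, writable)
shutil.rmtree(workdir)                         # clean up the temporary directory
```

Writing through the symlink in the locked state would hit the read-only content file, which is where the "Permission denied" error comes from.
    </textarea>
  </section>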

  <section data-transition="None">
      <h2>Reproducible execution & provenance capture</h2>

      <p style="font-size:30px"><em>datalad run</em> wraps a command execution and records its impact on a dataset.
          <br><strong>In addition, it can take care of data retrieval and unlocking</strong></p>
      <img class="fragment fade-in" src="../pics/run_prov.svg" height="600"> <!-- .element: class="fragment" -->
  </section>

  <section>
      <h2>datalad rerun</h2>
      <ul style="font-size:30px">
          <li>
              <code>datalad rerun</code> spares you and others from having to remember,
              or forensically reconstruct, how an analysis was performed.
          </li>
          <li>
              But it is also a digital and machine-readable provenance record.
          </li>
          <li>
              Important: The more precisely the run command is specified, the better the
              provenance record.
          </li>
          <li>
              Note: run and rerun only create an entry in the history if the command execution
              leads to a change.
          </li>
          <br><br>
          <li class="fragment fade-in">Task: Use <code>datalad rerun</code> to rerun the script execution.
              Find out whether the output changed.</li>
      </ul>
  </section>
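
  <section data-markdown>
    <textarea data-template>
## Why an identical rerun leaves no trace

Annexed files are tracked by a key derived from their content. The symlink target
on the run-record slide shows one starting with `MD5E-s758168--`. A sketch under
that assumption: `annex_style_key` is a hypothetical helper that only mimics the
key pattern and is not how git-annex is implemented.

```python
import hashlib


def annex_style_key(data: bytes, extension: str) -> str:
    """Hypothetical helper: mimic the MD5E-s(size)--(md5)(ext) key pattern."""
    digest = hashlib.md5(data).hexdigest()
    return f"MD5E-s{len(data)}--{digest}{extension}"


first_run = annex_style_key(b"grey pixels", ".jpg")
same_rerun = annex_style_key(b"grey pixels", ".jpg")
changed_rerun = annex_style_key(b"other pixels", ".jpg")

print("identical output, same key:", first_run == same_rerun)
print("changed output, same key:", first_run == changed_rerun)
```

If a rerun produces byte-identical output, the key and therefore the symlink stay the same, so there is nothing new to save.
    </textarea>
  </section>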

  <section>
      <h3>Summary - Underneath the hood</h3>

      <ul style="font-size:30px">
        <dt class="fragment fade-in">Files are either kept in Git or in git-annex.</dt>
          <dd class="fragment fade-in"><em>datalad save</em> is used for both, but configurations (e.g., <em>text2git</em>), dataset rules
              (e.g., in a <em>.gitattributes</em> file), or flags change the default behavior
              of annexing everything.</dd>
          <br>
        <dt class="fragment fade-in">Annexed files behave differently from files kept in Git:</dt>
          <dd class="fragment fade-in">They can be retrieved and dropped from local or remote locations, they are write-protected,
              and their content is unknown to Git (and thus easy to keep private).</dd>
        <br>
          <dt class="fragment fade-in"><em>datalad clone</em> installs datasets from URLs or local or remote paths</dt>
          <dd class="fragment fade-in">The contents of annexed files can be retrieved or dropped on demand; the contents of
              files stored in Git are available right away.</dd>
        <br>
        <dt class="fragment fade-in"><em>datalad unlock</em> makes annexed files modifiable, <em>datalad save</em>
        locks them again.</dt>
          <dd class="fragment fade-in">(It is generally easier to get accidentally saved files out of the annex than out of Git -
              see <a href="http://handbook.datalad.org/en/latest/basics/101-136-filesystem.html" target="_blank">handbook.datalad.org/basics/101-136-filesystem.html</a> for examples) </dd>
        <br>
        <dt class="fragment fade-in"><em>datalad run</em> records the impact of any command execution in
            a dataset. </dt>
          <dd class="fragment fade-in">Data/directories specified as <code>--input</code>
            are retrieved prior to command execution; data/directories specified as <code>--output</code> are unlocked.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad rerun</code> can automatically re-execute run-records later.</dt>
          <dd class="fragment fade-in">They can be identified with any commit-ish (hash, tag, range, ...)</dd>

      </ul>
  </section>


  <section>
      <h2>Questions!</h2>
      <small>Awkward silence can be bridged with awkward MC questions :) </small>
          <iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
              style="border: 0" width="930" height="900"></iframe>
  </section>
  </section>

  <section>
      <section>
          <h2>Before we continue...</h2>
                <small>Let your energy level define how we progress:</small><br>
          <iframe src="https://www.directpoll.com/r?XDbzPBd3ixYqg8huKIwKuJ7aj5lQw7fByQ4HgMgN"
              style="border: 0" width="930" height="900"></iframe>
      </section>
  </section>


			</div>
		</div>

		<script src="../reveal.js/dist/reveal.js"></script>
		<script src="../reveal.js/plugin/notes/notes.js"></script>
		<script src="../reveal.js/plugin/markdown/markdown.js"></script>
		<script src="../reveal.js/plugin/highlight/highlight.js"></script>
		<script>
			// More info about initialization & config:
			// - https://revealjs.com/initialization/
			// - https://revealjs.com/config/
			Reveal.initialize({
				hash: true,
				// The "normal" size of the presentation, aspect ratio will be preserved
				// when the presentation is scaled to fit different resolutions. Can be
				// specified using percentage units.
				width: 1280,
				height: 960,
				// Factor of the display size that should remain empty around the content
				margin: 0.3,
				// Bounds for smallest/largest possible scale to apply to content
				minScale: 0.2,
				maxScale: 1.0,

				controls: true,
				progress: true,
				history: true,
				center: true,
				slideNumber: 'c',
				pdfSeparateFragments: false,
				pdfMaxPagesPerSlide: 1,
				pdfPageHeightOffset: -1,
				transition: 'slide', // none/fade/slide/convex/concave/zoom
				// Learn about plugins: https://revealjs.com/plugins/
				plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
			});
		</script>
	</body>
</html>