datalad-course/html/MPI_Berlin_02.html

<!doctype html>
<html>
	<head>
		<meta charset="utf-8">
		<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

		<!-- Edit me start! -->
		<title>DataLad Basics</title>
		<meta name="description" content=" This is where you put a short description ">
		<meta name="author" content=" Your Name ">
		<!-- Edit me end! -->

		<link rel="stylesheet" href="../reveal.js/dist/reset.css">
		<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
		<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">

		<!-- Theme used for syntax highlighted code -->
		<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
	</head>
	<body>
		<div class="reveal">
			<div class="slides">


  <!--...Datalad Basics...-->

  <section>
<section>
<script src="https://cdn.logwork.com/widget/countdown.js"></script>
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
   data-style="columns" data-timezone="Europe/Berlin" data-date="2020-11-18 10:50">
    DataLad Basics Session starts in</a>
</section>
  <section>
      <h2>DataLad Basics</h2>

      Code to follow along:
      <a href="http://handbook.datalad.org/r.html?MPIBerlin" target="_blank">
          handbook.datalad.org/r.html?MPIBerlin
      </a>
  </section>

  <section>
    <h2>Prerequisites: Installation and Configuration</h2>
          <div class="fragment fade-in">
          <li>Your installed version of DataLad should be recent</li>
          <pre><code>datalad --version
0.13.5</code></pre></div>
          <div class="fragment fade-in">
          <li>You should have a configured Git identity</li>
          <pre><code class="bash">$ git config --list
user.name=Adina Wagner
user.email=adina.wagner@t-online.de
[...]
</code></pre></div>
      <div class="fragment fade-in">Otherwise, find installation and configuration
      instructions at <a href="http://handbook.datalad.org/en/latest/intro/installation.html" target="_blank">
              handbook.datalad.org</a>.</div>
  </section>

  <section>
      <h2>Using DataLad</h2>

      <ul>
          <div>
          <li>DataLad can be used from the command line</li>
          <pre><code>datalad create mydataset</code></pre></div>
          <div>
          <li>... or with its Python API</li>
          <pre><code class="python">import datalad.api as dl
dl.create(path="mydataset")</code></pre></div>
          <div class="fragment fade-in">
          <li>... and other programming languages can use it via system calls</li>
          <pre><code class="r"># in R
> system("datalad create mydataset")
</code></pre></div>
      </ul>
  </section>
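  <section data-markdown><script type="text/template">
### Sketch: system calls from Python

The system-call route shown for R works from any language. A minimal sketch in Python, assuming `datalad` is on the `PATH`. The helper name `datalad_call` and its `dry_run` switch are hypothetical and not part of DataLad:

```python
import subprocess


def datalad_call(*args, dry_run=False):
    """Run `datalad <args>` as a system call (hypothetical helper)."""
    cmd = ["datalad", *args]
    if not dry_run:
        # check=True raises CalledProcessError if datalad exits non-zero
        subprocess.run(cmd, check=True)
    return cmd


# Same call as `datalad create mydataset` on the command line.
# dry_run=True only assembles the command and executes nothing.
print(datalad_call("create", "mydataset", dry_run=True))
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.
  </script></section>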


  <section>
      <h2>DataLad Datasets</h2>

      <ul>
          <li>DataLad's core data structure</li>
          <ul>
              <li>Dataset = A directory managed by DataLad</li>
              <li>Any directory on your computer can be managed by DataLad.</li>
              <li class="fragment fade-in">Datasets can be <i>created</i> (from scratch) or <i>installed</i></li>
              <li class="fragment fade-in">Datasets can be nested: <i>linked subdirectories</i></li>
          </ul>
          <li class="fragment fade-in">Let's start by creating a dataset</li>
      </ul>

  <aside class="notes">
      <li>anything can be managed: CV, website, music library, phd</li>
      <li>show this on the manuscript repo: history, looks/feels</li>
  </aside>
  </section>

<section>
    <h2>DataLad Datasets</h2>
    A DataLad dataset is a joint Git + git-annex repository
    <img src="../pics/slides/pics/datalad_sandwhich_tuned/sandwhich03.svg">
</section>

  <section>
      <h2>Why version control?</h2>
      <img src="../pics/final.png" style="box-shadow: 10px 10px 8px #888888; height: 600px" height="600"><br>
      <ul>
          <li class="fragment fade-in">keep things organized</li>
          <li class="fragment fade-in">keep track of changes</li>
          <li class="fragment fade-in">revert changes or go back to previous states</li>
      </ul>
  <aside class="notes">
  <li>Not only manuscripts, but also data!</li>
  </aside>
  </section>

  <section>
      <h2>Version Control</h2>

      <ul>
          <li>DataLad knows two things: Datasets and files</li>
          <img class="fragment fade-in" data-fragment-index="1" style="box-shadow: 5px 5px 3px #888888" src="../pics/artwork/src/dataset.svg" height="330"> <img style="box-shadow: 5px 5px 3px #888888" height="330" class="fragment fade-in" data-fragment-index="2" src="../pics/artwork/src/local_wf.svg">
       </ul><br>
      <li class="fragment fade-in">
          Every file you put into a dataset can be easily version-controlled,
          regardless of size, with the same command. </li>
  </section>


  <section>
      <h2>Local version control</h2>

      <p>Procedurally, version control is easy with DataLad!</p>
      <img class="fragment fade-in" src="../pics/local_wf.svg" height="500"> <!-- .element: class="fragment" -->
      <br>

      <b class="fragment fade-in">Advice:</b>
      <ul>
        <li class="fragment fade-in">Save <i>meaningful</i> units of change</li>
        <li class="fragment fade-in">Attach helpful commit messages</li>
      </ul>
  </section>

  <section data-markdown><script type="text/template" >

  ### This means: You can also version control data! <!-- .element: class="fragment" -->

  <pre><code class="bash" style="max-height:none">$ datalad save \
     -m "Adding raw data from neuroimaging study 1" \
     sub-*
  add(ok): sub-1/anat/T1w.json (file)
  add(ok): sub-1/anat/T1w.nii.gz (file)
  add(ok): sub-1/anat/T2w.json (file)
  add(ok): sub-1/anat/T2w.nii.gz (file)
  add(ok): sub-1/func/sub-1-run-1_bold.json (file)
  add(ok): sub-1/func/sub-1-run-1_bold.nii.gz (file)
  add(ok): sub-10/anat/T1w.json (file)
  add(ok): sub-10/anat/T1w.nii.gz (file)
  add(ok): sub-10/anat/T2w.json (file)
  add(ok): sub-10/anat/T2w.nii.gz (file)
    [110 similar messages have been suppressed]
  save(ok): . (dataset)
  action summary:
    add (ok: 120)
    save (ok: 1)
  </code></pre>  <!-- .element: class="fragment" -->

  </script>
  </section>

  <section data-markdown><script type="text/template" >
  ## Version Control
  * Your dataset can be a complete research log, capturing everything that was done, when, by whom, and how
  ![](../pics/researchlog.png)
  * Interact with the history:
    * reset your dataset (or a subset of it) to a previous state,
    * throw out changes or bring them back,
    * find out what was done when, how, why, and by whom,
    * identify precise versions: use data in the most recent version, or the one from 2018, or...
    * ...
  </script>
  </section>

  <section>
      <h2>Start to record provenance</h2>
      <ul>
          <li>
              Have you ever saved a PDF onto your computer to read later, but forgot
              where you got it from?
          </li>
          <li class="fragment fade-in">
              Digital Provenance = <i>"The tools and processes used to create a
              digital file, the responsible entity, and when and where the process
              events occurred"</i>
          </li>
          <li class="fragment fade-in">
              The history of a dataset already contains provenance, but there is more
              to record, for example: Where does a file come from?
              <code>datalad download-url</code> helps here, because it records a file's origin
          </li>
      </ul>
  </section>

    <section>
      <h3>Summary - Local version control</h3>

  <dl>
        <dt class="fragment fade-in"><code>datalad create</code> creates an empty dataset.</dt> <dd class="fragment fade-in">Configurations (<b>-c yoda</b>, <b>-c text2git</b>) are useful (details soon).</dd>
        <br>
        <dt class="fragment fade-in">A dataset has a <i>history</i> to track files and their modifications. </dt><dd class="fragment fade-in">Explore it with Git (<b>git log</b>) or external tools (e.g., <b>tig</b>).</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad save</code> records the dataset or file state to the history. </dt><dd class="fragment fade-in">Concise <b>commit messages</b> should summarize the change for future you and others.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad download-url</code> obtains web content and records its origin. </dt><dd class="fragment fade-in">It even takes care of saving the change.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad status</code> reports the current state of the dataset.</dt>
      <dd class="fragment fade-in">A clean dataset status (no modifications, no untracked files) is good practice.</dd>
      </dl>
  </section>


  <section data-markdown><script type="text/template">
  ## From here <span class="fragment" data-fragment-index="1" style="margin-left:350px">to this:</span>
  ![](../pics/finaldoc_comic.gif)<!-- .element: height="780" style="box-shadow: 10px 10px 8px #888888" -->
  ![](../pics/gitflow.png)<!-- .element: class="fragment" data-fragment-index="1" height="780" style="box-shadow: 10px 10px 8px #888888" -->

  <p class="fragment" data-fragment-index="2">BUT: Version control is only one aspect of data management</p>

  </script>
  </section>


  <section>
      <h2>Questions!</h2>
          <iframe src="https://directpoll.com/r?XDbzPBd3ixYqg8p6wRBqfe5tLIzeHqInMYLnBb2kAc"
              style="border: 0" width="930" height="900"></iframe>
  </section>
  </section>

  <section>

  <section data-markdown><script type="text/template" >
  ## Consuming datasets
  * A dataset can be created from scratch/existing directories:
  <pre><code class="bash" style="max-height:none">$ datalad create mydataset
  [INFO   ] Creating a new annex repo at /home/adina/mydataset
  create(ok): /home/adina/mydataset (dataset)
  </code></pre>
  * but datasets can also be installed from paths or from URLs:
  <pre><code class="bash" style="max-height:none">$ datalad clone https://github.com/datalad-datasets/human-connectome-project-openaccess HCP
  install(ok): /tmp/HCP (dataset)
  </code></pre>
  </script>
  </section>
  <section>
      <h2>Consuming datasets</h2>

    <ul>
      <li class="fragment fade-in">Here's how a dataset looks after installation:</li>
        <img class="fragment fade-in" src="../pics/getdata.gif" height="700">
      <li class="fragment fade-in">Datasets are lightweight: Upon installation, only small
      files and metadata about file availability are retrieved.</li>
    </ul>
  </section>

  <section>
      <h2>Plenty of data, but little disk-usage</h2>
      <ul>
          <li class="fragment fade-in-then-semi-out">Cloned datasets are lean.
              "Metadata" (file names, availability) is present, but <b>no file content</b>:</li>
  <pre class="fragment fade-in"><code>$ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-phase2.git
  install(ok): /tmp/studyforrest-data-phase2 (dataset)
  $ cd studyforrest-data-phase2 && du -sh
  18M	.</code></pre>

  <li class="fragment fade-in-then-semi-out">File contents can be retrieved on demand:</li>
      </ul>
  <pre class="fragment fade-in"><code>$ datalad get sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
  get(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [from mddatasrc...]</code></pre>

  <li class="fragment fade-in">Have access to more data on your computer than you have disk space:</li>
  <pre class="fragment fade-in"><code># eNKI dataset (1.5TB, 34k files):
  $ du -sh
  1.5G	.
  # HCP dataset (80TB, 15 million files):
  $ du -sh
  48G	.
  </code></pre>
  </section>

  <section data-markdown> <script type="text/template">
  ## Plenty of data, but little disk-usage

  Drop file content that is not needed:<!-- .element: class="fragment fade-in" -->
  <pre class="fragment fade-in-then-semi-out"><code>$ datalad drop sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz
  drop(ok): /tmp/studyforrest-data-phase2/sub-01/ses-movie/func/sub-01_ses-movie_task-movie_run-1_bold.nii.gz (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]</code></pre>
  When files are dropped, only "metadata" stays behind, and they can be re-obtained on demand.
  This allows disk-space aware computations: <!-- .element: class="fragment fade-in" -->


  Install your input data <!-- .element: class="fragment fade-in" -->
    *➡ get the data you need* <!-- .element: class="fragment fade-in" -->
    *➡ compute your results* <!-- .element: class="fragment fade-in" -->
    *➡ drop input data (and potentially all automatically re-computable results)* <!-- .element: class="fragment fade-in" -->
<pre><code class="python">import datalad.api as dl
dl.get('input/sub-01')    # retrieve the input data
# [really complex analysis]
dl.drop('input/sub-01')   # free the disk space again
</code></pre><!-- .element: class="fragment fade-in" -->
  </script></section>
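  <section data-markdown><script type="text/template">
### Sketch: get, compute, drop

The get, compute, drop cycle can be wrapped so that input data is dropped even when the analysis fails. A minimal sketch: `fetched` is a hypothetical helper, and the two lambdas are stand-ins for `dl.get` and `dl.drop` so that the example runs without DataLad installed:

```python
from contextlib import contextmanager


@contextmanager
def fetched(path, get, drop):
    """Get `path`, hand control to the analysis, then drop `path` again."""
    get(path)
    try:
        yield path
    finally:
        # Runs even if the analysis raises, so disk space is always freed.
        drop(path)


# Stand-ins that only record what would have been called:
log = []
with fetched("input/sub-01",
             get=lambda p: log.append(("get", p)),
             drop=lambda p: log.append(("drop", p))):
    log.append(("analysis", "input/sub-01"))
print(log)
```

With DataLad installed, `get=dl.get` and `drop=dl.drop` (from `import datalad.api as dl`) would take the place of the stand-ins.
  </script></section>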

  <section>
      <h2>Git versus Git-annex</h2>
      <dl>
          <dt>Data in datasets is either stored in Git or git-annex</dt>
          <dd>By default, everything is <i>annexed</i>, i.e., stored in a dataset annex by git-annex</dd><br>

          <br>
                    <small>
          <table>
              <tr>
                  <td><b>Git</b></td>
                  <td><b>git-annex</b></td>
              </tr>
              <tr>
                  <td>handles <b>small</b> files well (text, code)</td>
                  <td>handles <b>all</b> types and sizes of files well</td>
              </tr>
              <tr>
                  <td>file contents are in the Git history
                      and will be <b>shared</b> upon git/datalad push</td>
                  <td>file contents are in the annex. Not necessarily shared</td>
              </tr>
              <tr>
                  <td>Shared with every dataset clone</td>
                  <td><b>Can be kept private</b> on a per-file level when sharing the dataset</td>
              </tr>
              <tr>
                  <td>Useful: Small, non-binary, frequently modified, need-to-be-accessible (DUA, README) files </td>
                  <td>Useful: Large files, private files</td>
              </tr>
          </table>
              </small>
          <br><br>
          <li class="fragment fade-in-then-semi-out">With annexed data, only content identity (hash)
              and location information is put into Git, rather than file content.
              The annex, and transport to and from it, is managed with <b>git-annex</b>.</li>
      </dl>
  </section>

  <section>
      <h2>Git versus Git-annex</h2>
      <img height="500" src="../pics/artwork/src/publishing/publishing_gitvsannex.svg">
  </section>

  <section>
      <h2>Git versus Git-annex</h2>
      <small>Useful background information for demo later. Read
          <a href="http://handbook.datalad.org/en/latest/basics/101-115-symlinks.html" target="_blank">
          this handbook chapter</a> for details.
      </small><br>
      Git and git-annex handle files differently: annexed files are stored in an annex.
      File content is hashed, and only the content identity is committed to Git.
      <ul>
        <table>
            <tr>
                <td>
                    <li>Files stored in Git are modifiable, files stored in Git-annex are content-locked</li>
                </td>
                <td width="60%">
                    <img src="../pics/git_vs_gitannex.svg" height="500">
                </td>
            </tr>
                  </table>

         <li>Annexed contents are not available right after cloning,
             only content identity and availability information (as they are stored in Git).
             Everything that is annexed needs to be retrieved with <code>datalad get</code> from wherever it is stored.
         </li>
      </ul>
  </section>
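  <section data-markdown><script type="text/template">
### Sketch: content identity

Content identity means that the name under which content is stored is derived from the content itself. The sketch below imitates the shape of the keys visible in these slides (`MD5E-s<size>--<md5><extension>`). It is an approximation for illustration: git-annex's real key generation has additional rules, and `md5e_style_key` is a hypothetical name:

```python
import hashlib
import os


def md5e_style_key(filename, content):
    """Build a key shaped like `MD5E-s<size>--<md5><extension>`."""
    digest = hashlib.md5(content).hexdigest()
    extension = os.path.splitext(filename)[1]
    return f"MD5E-s{len(content)}--{digest}{extension}"


# Identical content gives the identical key, whatever the file is called:
print(md5e_style_key("a.txt", b"hello\n"))
print(md5e_style_key("copy_of_a.txt", b"hello\n"))
# Changed content gives a different key:
print(md5e_style_key("a.txt", b"hello!\n"))
```

Git only has to track a small pointer to such a key, which is why a clone stays small regardless of file size.
  </script></section>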


  <section>
      <h2>Git versus Git-annex</h2>
      <ul>
          When sharing datasets with someone without access to the same computational
          infrastructure, annexed data is not necessarily stored together with the rest
          of the dataset (more in the <b>session on publishing</b>).
      </ul>
      <img src="../pics/services_connected.png" height="500">
      <ul>
          Transport logistics exist to interface with all major storage providers.
          If the one you use isn't supported, let us know!
      </ul>
  </section>


  <section>
      <h2>Git versus Git-annex</h2>
      <ul>
          Users can decide which files are annexed:
          <br><br>
          <li><b>Pre-made run-procedures</b>, provided by DataLad (e.g., <code>text2git</code>, <code>yoda</code>)
              or created and shared by users
              (<a href="http://handbook.datalad.org/en/latest/basics/101-124-procedures.html" target="_blank">Tutorial</a>) </li>
          <li>Self-made configurations in <code>.gitattributes</code> (e.g., based on file type,
              file/path name, size, ...; <a href="http://handbook.datalad.org/en/latest/basics/101-123-config2.html#gitattributes" target="_blank">
                  rules and examples
              </a> )</li>
          <li>Per-command basis (e.g., via <code>datalad save --to-git</code>)</li>
      </ul>
  </section>


  <section data-transition="None">
      <h2>Transport logistics</h2>
      <ul>
          <li class="fragment fade-in-then-semi-out">Disk-space aware workflows: Cloned datasets are lean (only Git):</li>
                  <pre class="fragment fade-in"><code>$ datalad clone git@github.com:datalad-datasets/machinelearning-books.git
  install(ok): /tmp/machinelearning-books (dataset)
  $ cd machinelearning-books && du -sh
  348K	.</code></pre>
          <pre class="fragment fade-in"><code>$ ls
  A.Shashua-Introduction_to_Machine_Learning.pdf
  B.Efron_T.Hastie-Computer_Age_Statistical_Inference.pdf
  C.E.Rasmussen_C.K.I.Williams-Gaussian_Processes_for_Machine_Learning.pdf
  D.Barber-Bayesian_Reasoning_and_Machine_Learning.pdf
  [...]</code></pre>
          <li  class="fragment fade-in-then-semi-out">Annexed file contents can
          be retrieved &amp; dropped on demand:</li>
      </ul>
  <pre class="fragment fade-in"><code>$ datalad get A.Shashua-Introduction_to_Machine_Learning.pdf
  get(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [from web...]</code></pre>
  <pre class="fragment fade-in-then-semi-out"><code>$ datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf
  drop(ok): /tmp/machinelearning-books/A.Shashua-Introduction_to_Machine_Learning.pdf (file) [checking https://arxiv.org/pdf/0904.3664v1.pdf...]</code></pre>

  <aside class="notes">
  Idea behind datalad: Enable a similar level of tooling and culture for the distribution and version control of data as it is present for open source software development
  </aside>
  </section>

  <section>
      <h2>git-annex protects your files</h2>
      <ul>
          <li>
              If git-annex does not know any other storage location for a file it will <br>
              warn you and refuse to drop content (can be configured)
          </li>
          <li class="fragment fade-in" data-fragment-index="1">Here is a file with a registered remote location (the web)</li>
      </ul>
          <pre class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">$ datalad drop .easteregg
drop(ok): /demo/myanalysis/.easteregg (file) [checking https://imgs.xkcd.com/comics/fuck_grapefruit.png...]
</code></pre>
      <ul>
          <li class="fragment fade-in" data-fragment-index="2">Here is a file without a registered remote location</li>
      </ul>
          <pre class="fragment fade-in" data-fragment-index="2"><code class="fragment fade-in" data-fragment-index="2">$ datalad drop compiling.png
[WARNING] Running drop resulted in stderr output: git-annex: drop: 1 failed
[ERROR  ] unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.) [drop(/demo/myanalysis/compiling.png)]
drop(error): /demo/myanalysis/compiling.png (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.)]</code></pre>
      <ul>
      <li class="fragment fade-in">If a different location for file content is known,
          <code>datalad get</code> can retrieve file content after dropping</li>
      </ul>
  </section>
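  <section data-markdown><script type="text/template">
### Sketch: the safety check before a drop

The refusal above follows a simple rule: content is only dropped if enough other copies can be verified. A toy model of that rule, not git-annex's implementation. The function names are hypothetical, and the error text only paraphrases the message shown in the log:

```python
def can_drop(other_verified_copies, numcopies=1):
    """Toy safety check: enough copies must be verified elsewhere."""
    return other_verified_copies >= numcopies


def drop(filename, other_verified_copies, numcopies=1):
    if not can_drop(other_verified_copies, numcopies):
        raise RuntimeError(
            "unsafe; could only verify the existence of "
            f"{other_verified_copies} out of {numcopies} necessary copies"
        )
    return f"drop(ok): {filename}"


print(drop(".easteregg", other_verified_copies=1))   # also known on the web
try:
    drop("compiling.png", other_verified_copies=0)   # the only copy is local
except RuntimeError as err:
    print(err)
```
  </script></section>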


  <section data-transition="None">
      <h2>Dataset nesting</h2>

      <ul>
          <li>Seamless nesting mechanisms:
                  <img height="330"  src="../pics/artwork/src/linkage_subds.svg">
          <br></li>
          <li class="fragment fade-in" data-fragment-index="2">Overcomes scaling issues with large numbers of files</li>
          <pre  class="fragment fade-in" data-fragment-index="2"><code>adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
  15530572 annex'd files (77.9 TB recorded total size)
  nothing to save, working tree clean</code></pre>
          <small><a class="fragment fade-in" data-fragment-index="2" href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">(github.com/datalad-datasets/human-connectome-project-openaccess)</a></small>
          <li class="fragment fade-in">
              Modularizes research components for transparency, reuse, and access
              management (more on this in the <b>section on reproducible science</b>)
          </li>
      </ul>


      <aside class="notes">
          Two advantages:
          <ul>
              <li>Scalable, size-independent version control</li>
              <li>Modularization of research components to increase transparency
                  and aid component reuse, as individual components can be flexibly
              puzzled together into new research objects, while being uniquely identified and versioned</li>
          </ul>

          At this point: Fixed data management, laid a foundation for updating data
      </aside>
  </section>


  <section>
      <h2>Dataset nesting</h2>
      <img src="../pics/linkage.svg" height="500">
  </section>

<section>
<h2>DataLad: Dataset linkage</h2>
<img data-src="../pics/linkage.svg" height="300">
<pre><code class="bash" style="font-size:115%;max-height:none">$ datalad clone --dataset . http://example.com/importantds inputs/rawdata
</code></pre>

<pre><code class="diff" style="max-height:none">$ git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..c3370ba
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "inputs/rawdata"]
+       path = inputs/rawdata
+       url = http://example.com/importantds
diff --git a/inputs/rawdata b/inputs/rawdata
new file mode 160000
index 0000000..fabf852
--- /dev/null
+++ b/inputs/rawdata
@@ -0,0 +1 @@
+Subproject commit fabf8521130a13986bd6493cb33a70e580ce8572
</code></pre>
    <aside class="notes">weighs just a few bytes</aside>
</section>
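  <section data-markdown><script type="text/template">
### Sketch: reading the subdataset record

The link between superdataset and subdataset is plain text in `.gitmodules`, so it can be inspected with a few lines of code. A minimal sketch that parses the entry from the diff above. `parse_gitmodules` is a hypothetical helper and handles only this simple layout:

```python
import re

GITMODULES = '''\
[submodule "inputs/rawdata"]
    path = inputs/rawdata
    url = http://example.com/importantds
'''


def parse_gitmodules(text):
    """Return {submodule name: {key: value}} from .gitmodules text."""
    modules, current = {}, None
    for line in text.splitlines():
        header = re.match(r'\[submodule "(.+)"\]', line.strip())
        if header:
            current = modules.setdefault(header.group(1), {})
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return modules


print(parse_gitmodules(GITMODULES))
```

The exact version of the subdataset is not stored here; it is the "Subproject commit" recorded in the second half of the diff.
  </script></section>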

  <section>
      <h3>Summary - Dataset consumption & nesting</h3>

      <dl>
        <dt class="fragment fade-in"><code>datalad clone</code> installs a dataset.</dt><dd class="fragment fade-in"> It can be installed “on its own”:
        Specify the source (URL, path, ...) of the dataset, and an optional <b>path</b> for it to be installed to.</dd>
        <br>
        <dt class="fragment fade-in">Datasets can be installed as subdatasets within an existing dataset. </dt> <dd class="fragment fade-in"> The <b>--dataset/-d</b> option needs a path to the root of the superdataset.</dd>
        <br>
        <dt class="fragment fade-in">Only small files and metadata about file availability are present locally after an install. </dt>
          <dd class="fragment fade-in"><code>datalad get</code> retrieves the actual content of
              annexed files on demand.</dd>
        <br>
        <dt class="fragment fade-in">Datasets preserve their history.</dt> <dd class="fragment fade-in">The superdataset records only the <i>version state</i> of the subdataset.</dd>

      </dl>
  </section>

  <section>
      <h2>Questions!</h2>
          <iframe src="https://directpoll.com/r?XDbzPBd3ixYqg8p6wRBqfe5tLIzeHqInMYLnBb2kAc"
              style="border: 0" width="930" height="900"></iframe>
  </section>
  </section>

  <section>
  <section data-transition="fade">
      <h2>Reproducible data analysis</h2>
       <p>Your past self is the worst collaborator:
      <img src="../pics/ownlegacycode_phd.png" height="500">
    <imgcredit>Full comic at <a href="http://phdcomics.com/comics.php?f=1689">http://phdcomics.com/comics.php?f=1979</a></imgcredit>
      </p>
  </section>

  <section>
      <h2>Basic organizational principles for datasets</h2>
      <dl>
          <dt>Keep everything clean and modular</dt>
          <li>An analysis is a superdataset, its components are subdatasets, and its structure modular</li>
          <table>
              <tr>
                  <td><img src="../pics/dataset_modules.png" height="400"></td>
                  <td><pre><code class="bash" style="max-height:none">├── code/
  │   ├── tests/
  │   └── myscript.py
  ├── docs
  │   ├── build/
  │   └── source/
  ├── envs
  │   └── Singularity
  ├── inputs/
  │   └─── data/
  │       ├── dataset1/
  │       │   └── datafile_a
  │       └── dataset2/
  │           └── datafile_a
  ├── outputs/
  │   └── important_results/
  │       └── figures/
  └── README.md</code></pre></td>
              </tr>
          </table>

      </dl>
      <ul>
      <li>Do not touch/modify raw data: save any results/computations <i>outside</i> of input datasets</li>
      <li>Keep a superdataset self-contained: Scripts reference subdatasets or files with <i>relative paths</i></li>
      </ul>
  </section>

  <section>
      <h2>Basic organizational principles for datasets</h2>
      <dl>
          <dt>Record where you got it from, where it is now, and what you do to it</dt>
          <li>Link datasets (as subdatasets), record data origin</li>
          <li>Collect and store provenance of all contents of a dataset that you create</li>
              <table style="vertical-align:middle">
                  <tr><td><img src="../pics/dataset_linkage_provenance.png"></td></tr>
              </table>
          <dl>
              <dt>Document everything:</dt>
              <li>Which script produced which output? From which data? In which software environment? ... </li>
          </dl>
      </dl>
      <note>Find out more about organizational principles in
          <a href="" target="_blank">the YODA principles</a>!</note>
  </section>

  <section>
      <h2>A classification analysis on the iris flower dataset</h2>
      <img src="../pics/iris-machinelearning.png" height="300">
      <img src="../pics/iris_cluster.png" height="450">
  </section>

  <section>
      <h2>Reproducible execution & provenance capture</h2>

      <p>datalad run</p>
      <img class="fragment fade-in" src="../pics/run_prov.svg" height="600"> <!-- .element: class="fragment" -->
  </section>

  <section>
      <h2>Computational reproducibility</h2>
      <ul>
          <li>Code may fail (to reproduce) if run with different software</li>
          <li>Datasets can store (and share) software environments (Docker or Singularity containers)
          and reproducibly execute code inside of the software container, capturing software as additional
          provenance</li>
          <li>DataLad extension: <code>datalad-container</code></li>
      </ul>

      <p>datalad containers-run</p>
      <img class="fragment fade-in" src="../pics/containers-run.svg" height="600"> <!-- .element: class="fragment" -->
  </section>

  <section>
      <h3>Summary - Reproducible execution</h3>

      <dl>
        <dt class="fragment fade-in"><code>datalad run</code> records a command and
            its impact on the dataset.</dt>
          <dd class="fragment fade-in">All dataset modifications are saved, so use it
              in a clean dataset.</dd>
        <br>
        <dt class="fragment fade-in">Data/directories specified as <code>--input</code>
            are retrieved prior to command execution.</dt>
          <dd class="fragment fade-in"> Use one flag per input.</dd>
        <br>
        <dt class="fragment fade-in">Data/directories specified as <code>--output</code>
            will be unlocked for modifications prior to a rerun of the command. </dt>
          <dd class="fragment fade-in">It's optional to specify, but helpful for recomputations.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad containers-run</code> can be used
            to capture the software environment as provenance.</dt>
          <dd class="fragment fade-in">It ensures computations are run in the desired software setup.
              Supports Docker and Singularity containers.</dd>
        <br>
        <dt class="fragment fade-in"><code>datalad rerun</code> can automatically re-execute run-records later.</dt>
          <dd class="fragment fade-in">They can be identified with any commit-ish (hash, tag, range, ...)</dd>

      </dl>
  </section>
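  <section data-markdown><script type="text/template">
### Sketch: one flag per input

"Use one flag per input" is easy to get wrong when a command line is assembled by a script. A minimal sketch that builds such a `datalad run` call. `datalad_run_argv` is a hypothetical helper, and the script name and paths are made up for illustration:

```python
def datalad_run_argv(command, message, inputs=(), outputs=()):
    """Assemble a `datalad run` command line, one flag per input/output."""
    argv = ["datalad", "run", "-m", message]
    for path in inputs:
        argv += ["--input", path]
    for path in outputs:
        argv += ["--output", path]
    argv.append(command)
    return argv


# Hypothetical script and paths, for illustration only:
print(datalad_run_argv(
    "python code/myscript.py",
    message="Run the analysis",
    inputs=["inputs/data/dataset1", "inputs/data/dataset2"],
    outputs=["outputs/important_results"],
))
```

The resulting list can be handed to `subprocess.run` unchanged; nothing is executed by the sketch itself.
  </script></section>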

  <section>
      <h2>datalad rerun</h2>
      <ul>
          <li>
              <code>datalad rerun</code> spares others and yourself the short- or
              long-term memory task, or the forensic work, of figuring
              out how you performed an analysis
          </li>
          <li>
              But it is also a digital and machine-readable provenance record
          </li>
          <li>
              Important: The better the run command is specified, the better the
              provenance record
          </li>
          <li>
              Note: run and rerun only create an entry in the history if the command execution
              leads to a change.
          </li>
      </ul>
  </section>

  <section>
      <h2>Questions!</h2>
          <iframe src="https://directpoll.com/r?XDbzPBd3ixYqg8p6wRBqfe5tLIzeHqInMYLnBb2kAc"
              style="border: 0" width="930" height="900"></iframe>
  </section>

  <section>
      <h2>Unlocking things</h2>
      <ul>
          <li><code>datalad run</code> "unlocks" everything specified as <code>--output</code></li>
          <li class="fragment fade-in" data-fragment-index="1">Outside of <code>datalad run</code>, you can use <code>datalad unlock</code></li>
          <li class="fragment fade-in" data-fragment-index="1">This makes annex'ed files <i>writeable</i>:</li>
      </ul>
          <pre class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">$ ls -l myfile
lrwxrwxrwx 1 adina adina  108 Nov 17 07:08 myfile -> .git/annex/objects/22/Gw/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af

# unlocking
$ datalad unlock myfile
unlock(ok): myfile (file)
$ ls -l myfile
-rw-r--r-- 1 adina adina    7 Nov 17 07:08 myfile  # not a symlink anymore!
</code></pre>
      <ul>
          <li class="fragment fade-in" data-fragment-index="2"><code>datalad save</code> "locks" the file again</li>
      </ul>
          <pre class="fragment fade-in" data-fragment-index="2"><code class="fragment fade-in" data-fragment-index="2">$ datalad save
add(ok): myfile (file)
action summary:
  add (ok: 1)
  save (notneeded: 1)

$ ls -l myfile
lrwxrwxrwx 1 adina adina 108 Nov 17 07:08 myfile -> .git/annex/objects/22/Gw/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af/MD5E-s7--f447b20a7fcbf53a5d5be013ea0b15af</code></pre>
  <div class="fragment fade-in" data-fragment-index="3">Some tools (e.g., MATLAB) don't like
  symlinks. Unlocking the files, or running MATLAB with <code>datalad run</code>, helps!</div>
  </section>
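  <section data-markdown><script type="text/template">
### Sketch: what unlocking changes

The `ls -l` output above shows the essence of unlocking: a symlink to read-only content becomes a regular, writable file. A toy model of that step in a temporary directory, assuming a platform with symlink support. This is not how git-annex implements it, and `unlock` here is a hypothetical function, not the DataLad command:

```python
import os
import shutil
import stat
import tempfile


def unlock(path):
    """Replace the symlink at `path` with a writable copy of its target."""
    target = os.path.realpath(path)
    os.remove(path)                      # removes the link, not the target
    shutil.copyfile(target, path)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


workdir = tempfile.mkdtemp()
obj = os.path.join(workdir, "object")    # stands in for .git/annex/objects/...
with open(obj, "w") as f:
    f.write("content")
os.chmod(obj, stat.S_IRUSR)              # annexed content is read-only
myfile = os.path.join(workdir, "myfile")
os.symlink(obj, myfile)

print(os.path.islink(myfile))            # locked state: a symlink
unlock(myfile)
print(os.path.islink(myfile))            # unlocked state: a regular file
with open(myfile, "a") as f:             # the copy can now be modified
    f.write(" + edits")
```

The stored object is left untouched, which is why the previous version stays recoverable after the modified file is saved.
  </script></section>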


  <section>
      <h2>Removing datasets</h2>
      <ul>
          <li>
              As mentioned before, annexed data is write-protected.
              So when you try to <code>rm -rf</code> a dataset, this happens:
          </li>
      </ul>
<pre class="fragment fade-in" data-fragment-index="1"><code class="fragment fade-in" data-fragment-index="1">$ rm -rf mydataset
rm: cannot remove 'mydataset/.git/annex/objects/70/GM/MD5E-s27246--8b7ea027f6db1cda7af496e97d4eb7c9.png/MD5E-s27246--8b7ea027f6db1cda7af496e97d4eb7c9.png': Permission denied
rm: cannot remove 'mydataset/.git/annex/objects/70/GM/MD5E-s35756--af496e97d4eb7c98b7ea027f6db1cda7.png/MD5E-s27246--af496e97d4eb7c98b7ea027f6db1cda7.png': Permission denied
[...]
</code></pre>
      😱

      <li class="fragment fade-in">
          (If you ever do this by accident, you need to restore write permissions recursively
          for all files and directories.)
          <pre><code>$ chmod -R +w mydataset
$ rm -rf mydataset              # success!
</code></pre>
      </li>
  </section>

  <section>
      <h2>Removing datasets</h2>
      <li>
          The correct way to remove a dataset is using <code>datalad remove</code>:
      </li>
      <pre><code>$ datalad remove -d ds001241
remove(ok): . (dataset)
action summary:
  drop (notneeded: 1)
  remove (ok: 1)
</code></pre>
      <li class="fragment fade-in" data-fragment-index="2">
          If a dataset contains files for which no other remote copy is known, you'll
          get a warning:
      </li>
      <pre class="fragment fade-in"  data-fragment-index="2"><code class="fragment fade-in"  data-fragment-index="2">$ datalad remove -d mydataset
[WARNING] Running drop resulted in stderr output: git-annex: drop: 1 failed

[ERROR  ] unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.) [drop(/tmp/mydataset/interdisciplinary.png)]
drop(error): interdisciplinary.png (file) [unsafe; Could only verify the existence of 0 out of 1 necessary copies; Rather than dropping this file, try using: git annex move; (Use --force to override this check, or adjust numcopies.)]
[WARNING] could not drop some content in /tmp/mydataset ['/tmp/mydataset/interdisciplinary.png'] [drop(/tmp/mydataset)]
drop(impossible): . (directory) [could not drop some content in /tmp/mydataset ['/tmp/mydataset/interdisciplinary.png']]
action summary:
  drop (error: 1, impossible: 1)</code></pre>
      <li class="fragment fade-in" data-fragment-index="3">
          In that case, use <code>--nocheck</code> to force removal:
      </li>
      <pre class="fragment fade-in"  data-fragment-index="3"><code class="fragment fade-in"  data-fragment-index="2">$ datalad remove -d mydataset --nocheck
remove(ok): . (dataset)
</code></pre>
  </section>

  <section>
      <h2>Removing datasets</h2>
      <li>
          If a dataset contains subdatasets, <code>datalad remove</code> will also fail with an error:
      </li>
      <pre class="fragment fade-in"  data-fragment-index="1"><code class="fragment fade-in"  data-fragment-index="1">$ datalad remove -d myds
drop(ok): README.md (file) [locking gin...]
drop(ok): . (directory)
[ERROR  ] to be uninstalled dataset Dataset(/tmp/myds) has present subdatasets, forgot --recursive? [remove(/tmp/myds)]
remove(error): . (dataset) [to be uninstalled dataset Dataset(/tmp/myds) has present subdatasets, forgot --recursive?]
action summary:
  drop (ok: 3)
  remove (error: 1)</code></pre>
      <li class="fragment fade-in" data-fragment-index="2">
          In that case, use <code>--recursive</code> to remove all subdatasets, too:
      </li>
      <pre class="fragment fade-in"  data-fragment-index="2"><code class="fragment fade-in"  data-fragment-index="2">$ datalad remove -d myds --recursive
uninstall(ok): input (dataset)
remove(ok): . (dataset)
action summary:
  drop (notneeded: 2)
  remove (ok: 1)
  uninstall (ok: 1)
</code></pre>
      <li class="fragment fade-in">
          A complete overview of file system operations is in
          <a href="http://handbook.datalad.org/en/latest/basics/101-136-filesystem.html" target="_blank">
              handbook.datalad.org/en/latest/basics/101-136-filesystem.html
          </a>
      </li>
  </section>
  </section>


<section>
<section>
    <h2>A machine-learning example</h2>
</section>

<section>
    <h2>Analysis layout</h2>
    <table>
        <tr>
            <td>
                <ul>
        <li>Prepare an input dataset</li>
        <li class="fragment fade-in">Configure and set up an analysis dataset</li>
        <li class="fragment fade-in">Prepare data</li>
        <li class="fragment fade-in">Train models and evaluate them</li>
        <li class="fragment fade-in">Compare different models, repeat with updated data</li>
                </ul>
            </td>
            <td>
    <img src="../pics/imagenette.png" width="800">
                <small>Imagenette dataset</small>
            </td>
        </tr>
    </table>
</section>

<section>
    <h2>Prepare an input dataset</h2>
    <ul>
        <li>Create a stand-alone input dataset</li>
        <li>Either add data and <code>datalad save</code> it, or use commands such as <code>datalad download-url</code>
    or <code>datalad addurls</code> to retrieve it from web sources</li>
    </ul>
</section>
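<section data-markdown>
<textarea data-template>
## Prepare an input dataset (sketch)

A minimal sketch of the steps on the previous slide, wrapped in a shell function so that nothing runs until it is called. The function name, the dataset name, and the URL are placeholders for illustration and are not part of the course material.

```shell
# Sketch: create a stand-alone input dataset and retrieve a file from the web.
# prepare_input_dataset is a hypothetical helper, not a DataLad command.
prepare_input_dataset() {
    name=$1   # directory of the new dataset (placeholder)
    url=$2    # web source to download from (placeholder)
    datalad create "$name" || return 1
    # work inside the new dataset; the subshell keeps the caller's directory
    (
        cd "$name" || exit 1
        # download the file and save it to the dataset in one step
        datalad download-url -m "Add input data from the web" "$url"
    )
}
```

Call it with a dataset name and a download URL of your choice.
</textarea>
</section>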

<section>
    <h2>Configure and set up an analysis dataset</h2>
    <ul>
        <li>Given the purpose of an analysis dataset, configurations can make it easier to use:</li>
            <ul>
                <li><code>-c yoda</code> prepares a useful structure</li>
                <li><code>-c text2git</code> keeps text files such as scripts in Git</li>
            </ul>
        <li>The input dataset is installed as a subdataset</li>
        <li>Required software is containerized and added to the dataset</li>
    </ul>
</section>
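<section data-markdown>
<textarea data-template>
## Configure and set up an analysis dataset (sketch)

One way the three steps on the previous slide could look on the command line. This is a sketch under assumptions: the function name, the subdataset path `input`, the container name `software`, and all arguments are placeholders, and `containers-add` requires the datalad-container extension.

```shell
# Sketch: create an analysis dataset, link the input data, register a container.
# setup_analysis_dataset is a hypothetical helper, not a DataLad command.
setup_analysis_dataset() {
    name=$1        # directory of the new analysis dataset (placeholder)
    input=$2       # path or URL of the input dataset (placeholder)
    container=$3   # URL of a container image (placeholder)
    # apply the yoda and text2git configurations at creation time
    datalad create -c yoda -c text2git "$name" || return 1
    (
        cd "$name" || exit 1
        # install the input dataset as a subdataset of the analysis dataset
        datalad clone -d . "$input" input &&
        # register the container under the name "software"
        datalad containers-add software --url "$container"
    )
}
```

Because the input is registered as a subdataset, the analysis dataset records exactly which version of the data it uses.
</textarea>
</section>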

<section>
    <h2>Prepare data</h2>
    <ul>
        <li>Add a script for data preparation (labels training and validation images)</li>
        <li>Execute it using <code>datalad containers-run</code></li>
    </ul>
</section>
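<section data-markdown>
<textarea data-template>
## Prepare data (sketch)

A sketch of how the preparation step could be captured in a run-record. The script path, the output file names, and the container name `software` are assumptions made for illustration; the function is meant to be called from the root of the analysis dataset.

```shell
# Sketch: save the preparation script, then execute it inside the container.
# prepare_data is a hypothetical helper; all paths are placeholders.
prepare_data() {
    datalad save -m "Add data preparation script" code/prepare.py &&
    # --input content is retrieved before the run, --output files are unlocked
    datalad containers-run -n software \
        -m "Prepare training and validation labels" \
        --input input \
        --output data/train.csv --output data/test.csv \
        "python3 code/prepare.py"
}
```

The resulting commit stores the command, its inputs and outputs, and the container, so the step can be repeated later.
</textarea>
</section>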

<section>
    <h2>Train models and evaluate them</h2>
    <ul>
        <li>Add scripts for training and evaluation.
            This dataset state can be tagged to identify it easily at a later point</li>
        <li>Execute the scripts using <code>datalad containers-run</code></li>
        <li>Dumping the trained model as a joblib object keeps the classifier reusable</li>
    </ul>
</section>
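<section data-markdown>
<textarea data-template>
## Train models and evaluate them (sketch)

A sketch of the tagging, training, and evaluation steps from the previous slide. Script names, file names, and the container name `software` are placeholders chosen for illustration; the tag is passed in by the caller.

```shell
# Sketch: tag the dataset state, then train and evaluate inside the container.
# train_and_evaluate is a hypothetical helper; all paths are placeholders.
train_and_evaluate() {
    tag=$1   # label for this dataset state (placeholder)
    # save the scripts and tag the resulting state in one step
    datalad save -m "Add training and evaluation scripts" \
        --version-tag "$tag" code/ &&
    datalad containers-run -n software -m "Train the classifier" \
        --input data/train.csv --output model.joblib \
        "python3 code/train.py" &&
    datalad containers-run -n software -m "Evaluate the classifier" \
        --input data/test.csv --input model.joblib --output accuracy.json \
        "python3 code/evaluate.py"
}
```

Each step ends up as its own run-record, so training and evaluation can be re-executed separately.
</textarea>
</section>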

<section>
    <h2>Tips and tricks for ML applications</h2>
    <ul>
        <dt class="fragment fade-in">Standalone input datasets keep input data extendable and reusable</dt>
        <dd class="fragment fade-in">Subdatasets can be registered in precise versions, and updated to the newest state</dd>
        <br>
        <dt class="fragment fade-in">Software containers aid greatly with reproducibility</dt>
        <dd class="fragment fade-in">The correct software environment is preserved and can be shared</dd>
        <br>
        <dt class="fragment fade-in">Re-executable run-records can capture all provenance</dt>
        <dd class="fragment fade-in">This can also capture command-line parametrization</dd>
        <br>
        <dt class="fragment fade-in">Git workflows can be helpful elements in ML workflows</dt>
        <dd class="fragment fade-in">DataLad is not a workflow manager, but by checking out tags
            or branches one can switch quickly and easily between the results of different models</dd>

    </ul>
</section>
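<section data-markdown>
<textarea data-template>
## Switching between results with tags (sketch)

The last tip can be tried with plain Git in a throwaway repository. This is only an illustration: the tag names, the file name, and the accuracy values are made up, and the `commit` helper exists just to avoid relying on a configured Git identity.

```shell
# Sketch: two tagged states of a results file, then switching between them.
repo=$(mktemp -d)
cd "$repo"
git init -q
# hypothetical helper: commit with a throwaway identity
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

echo '{"accuracy": 0.75}' > accuracy.json
git add accuracy.json && commit "Results of the first model"
git tag first-model

echo '{"accuracy": 0.80}' > accuracy.json
git add accuracy.json && commit "Results of the second model"
git tag second-model

git checkout -q first-model    # back to the results of the first model
cat accuracy.json              # prints {"accuracy": 0.75}
```

In a DataLad dataset the same checkout also switches annexed files, because their symlinks are part of the tagged state.
</textarea>
</section>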
</section>

<section>
    <section data-transition="None">
    <h2>Why use DataLad?</h2>
    <ul>
        <li class="fragment fade-in">Mistakes are not forever anymore: Easy version control, regardless of file size</li>
        <li class="fragment fade-in">Who needs short-term memory when you can have run-records?</li>
        <li class="fragment fade-in">Disk-usage magic: Have access to more data than your hard drive has space for</li>
        <li class="fragment fade-in">Collaboration and updating mechanisms: Alice shares her data with Bob. Alice fixes a mistake and pushes the fix.
        Bob says "datalad update" and gets her changes. And vice-versa.</li>
        <li class="fragment fade-in">Transparency: Shared datasets keep their history. No need to track down a former student:
        ask their project what was done.</li>
    </ul>
</section>
</section>


			</div>
		</div>

		<script src="../reveal.js/dist/reveal.js"></script>
		<script src="../reveal.js/plugin/notes/notes.js"></script>
		<script src="../reveal.js/plugin/markdown/markdown.js"></script>
		<script src="../reveal.js/plugin/highlight/highlight.js"></script>
		<script>
			// More info about initialization & config:
			// - https://revealjs.com/initialization/
			// - https://revealjs.com/config/
			Reveal.initialize({
				hash: true,
				// The "normal" size of the presentation, aspect ratio will be preserved
				// when the presentation is scaled to fit different resolutions. Can be
				// specified using percentage units.
				width: 1280,
				height: 960,
				// Factor of the display size that should remain empty around the content
				margin: 0.3,
				// Bounds for smallest/largest possible scale to apply to content
				minScale: 0.2,
				maxScale: 1.0,

				controls: true,
				progress: true,
				history: true,
				center: true,
				slideNumber: 'c',
				pdfSeparateFragments: false,
				pdfMaxPagesPerSlide: 1,
				pdfPageHeightOffset: -1,
				transition: 'slide', // none/fade/slide/convex/concave/zoom
				// Learn about plugins: https://revealjs.com/plugins/
				plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
			});
		</script>
	</body>
</html>