<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<!-- Edit me start! -->
<title>Welcome</title>
<meta name="description" content=" Workshop introduction ">
<meta name="author" content=" Adina Wagner, Michael Hanke ">
<!-- Edit me end! -->
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
<link rel="stylesheet" href="../css/main.css">
<!-- Theme used for syntax highlighted code -->
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
</head>
<body>
<div class="reveal">
<div class="slides">
<section>
<section>
<script src="https://cdn.logwork.com/widget/countdown.js"></script>
<a href="https://logwork.com/countdown-2zu8" class="countdown-timer"
data-style="columns" data-timezone="Europe/Berlin" data-date="2022-07-20 13:30">
Introduction starts in</a>
Have a ☕!
</section>
<section>
<h2>Research data management<br />👩‍💻👨‍💻<br />with DataLad</h2>
<div style="margin-top:1em;text-align:center">
<table style="border: none;">
<tr>
<td>
Adina Wagner<br><small><a href="https://twitter.com/AdinaKrik" target="_blank">
<img data-src="../pics/twitter.png" style="height:30px;margin:0px" />@AdinaKrik</a></small>
</td>
<td>
Michael Hanke<br><small><a href="https://twitter.com/eknahm" target="_blank">
<img data-src="../pics/twitter.png" style="height:30px;margin:0px" />@eknahm</a></small>
</td>
</tr>
<tr>
<td>
<img style="height:70px;margin-right:10px" data-src="../pics/fzj_logo.svg" /><br>
</td>
<td>
<small><a href="http://psychoinformatics.de" target="_blank">Psychoinformatics lab</a>,
<br> Institute of Neuroscience and Medicine (INM-7)<br>
Research Center Jülich</small><br>
</td>
</tr>
</table>
</div>
<br><br><small>
Slides: <a href="https://github.com/datalad-handbook/course/blob/master/talks/PDFs" target="_blank">
https://github.com/datalad-handbook/course/</a></small>
</section>
</section>
<!--...WORKSHOP INTRODUCTION...-->
<section>
<section>
<h2>Welcome!</h2>
Approximate workshop schedule<br><br>
<dl style="font-size:30px">
<dt>
Session 1 (now, 13.30-15.00)
</dt>
<dd>
Logistics & Intro 🧑‍🏫, <br>
Hands-on Terminal Basics 💻, <br>
Demo of core functionality 🧑‍🏫💻
</dd>
<br>
<dt>
Session 2 (today, 16.00-18.00)
</dt>
<dd>
Hands-on DataLad Basics & Exercises 💻
</dd>
<br>
<dt>
Session 3 (tomorrow, 11.00-12.30)
</dt>
<dd>
Sharing and Collaboration 🧑‍🏫, <br>
Hands-on Data publication 💻
</dd>
<br>
<dt>
Session 4 (tomorrow, 13.30-15.00)
</dt>
<dd>
Computational reproducibility 🧑‍🏫💻, <br>
Outro 🧑‍🏫, <br>
Final QA ❔
</dd>
</dl>
</section>
<section>
<h2>Logistics and links</h2>
<ul style="font-size:30px">
<li>
You can download these slides at <a href="https://doi.org/10.5281/zenodo.6827086" target="_blank">
https://doi.org/10.5281/zenodo.6827086</a> (scan the QR code), and you can find their sources at
<a href="https://github.com/datalad-handbook/datalad-course/" target="_blank">
github.com/datalad-handbook/datalad-course </a> <br>
</li>
<li class="fragment fade-in">
Some of today's code-along workshop contents are at
<a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">
psychoinformatics-de.github.io/rdm-course</a>
</li>
<li class="fragment fade-in">
The workshop will be interactive. If you do not have the software installed
on your own system, you can access a Jupyterhub from your browser at
<a href="https://datalad-hub.inm7.de/" target="_blank">datalad-hub.inm7.de</a>
<strong>(WIFI is bad, Jupyterhub is the better choice)</strong>
</li>
<li class="fragment fade-in">
You can log in to the Jupyterhub with a pre-set username (take one out of the
jar) and a self-set password. Remember the password for tomorrow!
</li>
<li class="fragment fade-in">
A <a href="https://pip.pypa.io/en/stable/user_guide/#requirements-files" target="_blank">
requirements.txt</a> file on Zenodo details the software
environment we set up on the Jupyterhub
</li>
</ul>
<img src="../pics/QRcode_mpsc.png" height="250px" align="middle">
</section>
<section>
<h2>Interactivity</h2><br><br>
<ul style="font-size:30px">
<li class="fragment fade-in">
The workshop centers around
<a href="http://handbook.datalad.org/r.html?about" target="_blank">
<strong>DataLad</strong></a>
(version 0.16 and up) for real-world <strong>research data management </strong>use cases
</li>
<li class="fragment fade-in">
There are no stupid questions; ask anything any time <br>
</li>
<li class="fragment fade-in">
Something doesn't look right on your system?
Stick a post-it to your screen. We'll take a look together
</li>
<li class="fragment fade-in">
We're available outside of sessions, too. Chat about your
use cases or questions over a coffee or meal
</li>
</ul>
<table>
<tr>
<td style="vertical-align:top; font-size:35px">
<br><br>
<li class="fragment fade-in">
4 sessions = time for more than a <br>
standard introduction. <br></li>
<li class="fragment fade-in">
Materials are available <br>
online & persistent, we can<br>
be flexible & spontaneous <br>
if specific topics interest you
</li>
</td>
<td>
<img class="fragment fade-in" src="../pics/splits.jpg" width="600px">
</td>
</tr>
</table>
</section>
<section>
<h2>After the workshop</h2>
<ul>
If you have a question after the workshop, you can reach out for help:
<br>
<ul style="font-size:30px">
<dt>Reach out to the <b>DataLad</b> team via</dt>
<li>
<a href="https://matrix.to/#/!NaMjKIhMXhSicFdxAj:matrix.org?via=matrix.waite.eu&via=matrix.org&via=inm7.de" target="_blank">
Matrix</a> (free, decentralized communication platform, no app needed).
We run a weekly Zoom office hour (Thursday, 4pm Berlin time) from this room as well.
</li>
<li>
<a href="https://github.com/datalad/datalad" target="_blank">
the development repository on GitHub</a>
</li>
<br>
<dt>Reach out to the user community with</dt>
<li>A question on <a href="https://neurostars.org/" target="_blank">neurostars.org</a>
with a <code>datalad</code> tag</li>
<br>
<dt>Find more user tutorials or workshop recordings</dt>
<li>On <a href="https://www.youtube.com/channel/datalad" target="_blank">
DataLad's YouTube channel</a>
</li>
<li>
In the <a href="http://handbook.datalad.org/en/latest/" target="_blank">
DataLad Handbook </a>
</li>
<li>In the <a href="https://psychoinformatics-de.github.io/rdm-course/" target="_blank">DataLad RDM course</a> </li>
<li>In the <a href="http://docs.datalad.org" target="_blank">Official API documentation</a> </li>
</ul>
</ul>
</section>
<section>
<h2>Audience response system</h2>
Use your phone to scan the QR code, or open the link in a new browser window <br>
<iframe src="https://directpoll.com/r?XDbzPBdEt8j1rJlVwV5I4m6c9z8nJU2YLnRe3j3k"
style="border: 0" width="900" height="800"></iframe>
</section>
<section>
<h2>On a scale of rubber ducks...</h2>
<img src="../pics/rubberduckscale.png" height="600"><iframe src="https://directpoll.com/r?XDbzPBdEt8j1rJlVwV5I4m6c9z8nJU2YLnRe3j3k"
style="border: 0" width="400" height="600"></iframe>
</section>
</section>
<!--..Research data management in general..-->
<section>
<section>
<h2>Research data management</h2>
</section>
<section data-transition="None">
<h2>Common problems in science</h2>
<div class="fragment fade-in" data-fragment-index="1">
You write a paper & stay up late to generate good-looking figures,
but you have to tweak many parameters and display options.
The next morning, you have no idea which parameters produced which
figures, or which figures match what you report in the paper.<br>
<img height="400" src="../pics/turingway/findfiles.png">
<img height="400" src="../pics/turingway/projectstack.png"></div>
<imgcredit>Illustration adapted from Scriberia and The Turing Way</imgcredit>
</section>
<section data-transition="None">
<h2>Common problems in science</h2>
<div class="fragment fade-in" data-fragment-index="1">
Your research project produces phenomenal results, but your
laptop, the only place that stores the source code for the
results, is stolen or breaks<br>
<img height="700" src="../pics/stolenlaptop.jpg"></div>
<imgcredit>https://co.pinterest.com/pin/551128073121451139/</imgcredit>
</section>
<section data-transition="None">
<h2>Common problems in science</h2>
<div class="fragment fade-in" data-fragment-index="1">
A graduate student complains that a research idea does not work.
Their supervisor can't figure out what the student did and how,
and the student can't sufficiently explain their approach
(data, algorithms, software).
Weeks of discussion and miscommunication ensue because the
supervisor can't explore or use the student's project first-hand.
<br>
<img height="500" src="../pics/badsupervision.gif"></div>
<imgcredit>http://phdcomics.com/comics.php?f=1693</imgcredit>
</section>
<section data-transition="None">
<h2>Common problems in science</h2>
<div class="fragment fade-in" data-fragment-index="1">
You wrote a script during your PhD that applied a specific
method to a dataset. Now, with new data and a new project, you
try to reuse the script, but you have forgotten how it works.
<br>
<img height="500" src="../pics/frustration.jpg"></div>
<imgcredit>http://phdcomics.com/comics.php?f=1693</imgcredit>
</section>
<section data-transition="None">
<h2>Common problems in science</h2>
<div class="fragment fade-in" data-fragment-index="1">
You try to recreate results from another lab's published paper.
You base your re-implementation on everything reported in their paper,
but the results you obtain look nothing like the original.
<br>
<img height="500" src="../pics/turingway/ReadableCode.png"></div>
<imgcredit>http://phdcomics.com/comics.php?f=1693</imgcredit>
</section>
<section>
<h2><strike>common</strike> old problems in science</h2>
<div class="fragment fade-in" data-fragment-index="1">
All these problems were paraphrased from
<a href="https://sci-hub.se/https://link.springer.com/chapter/10.1007%2F978-1-4612-2544-7_5" target="_blank">
Buckheit & Donoho, <b>1995</b></a>
<br></div>
<div class="fragment fade-in">Let's do better!</div>
</section>
</section>
<!--...WHAT IS DATALAD...-->
<section>
<section data-transition="fade">
<div><table>
<tr><dl>
<img src="../pics/datalad_logo_wide.svg" height="150"><br>
<b><a href="https://www.datalad.org/" target="_blank"> DataLad</a>
can help <br> with small or large-scale <br> data management </b>
<dt></dt>
</dl></tr>
<tr><dl class="fragment fade-in">Free, <br> open source, <br> command line tool & Python API </dl></tr>
</table>
</div>
<ul style="vertical-align:middle">
<br>
<dt></dt>
</ul>
</section>
<section>
<h2> <img src="../pics/datalad_logo_wide.svg"></h2>
<ul>
<li>A command-line tool, available for all major operating systems
(Linux, macOS/OSX, Windows), MIT-licensed</li>
<li>Built on top of <a href="https://git-scm.com/" target="_blank">Git</a>
and <a href="https://git-annex.branchable.com/" target="_blank">Git-annex</a></li>
<dt><li>Allows...</li></dt>
<dt>... version-controlling arbitrarily large content </dt>
<dd>version control data and software alongside your code!</dd>
<dt>... transport mechanisms for sharing and obtaining data </dt>
<dd>consume and collaborate on data (analyses) like software</dd>
<dt>... (computationally) reproducible data analysis</dt>
<dd>Track and share provenance of all digital objects</dd>
<dt>... and <i>much</i> more </dt>
<li>Completely domain-agnostic</li>
<br>
</ul>
</section>
<section>
<h2>Acknowledgements</h2>
<table>
<tr style="vertical-align:top">
<td style="vertical-align:top">
<dl>
<dt>Software</dt>
<dd style="margin-left:5px!important">
<ul style="margin-left:5px!important">
<li>Joey Hess (git-annex)</li>
<li>The DataLad team &
contributors</li>
</ul>
</dd>
<dt style="margin-top:20px">Illustrations </dt>
<dd style="margin-left:5px!important">
<ul style="margin-left:5px!important">
<li>The Turing Way <br>
project & Scriberia</li>
<img src="../pics/bannerthanks.svg">
</ul>
</dd>
</dl>
</td>
<td style="vertical-align:top">
<div style="margin-bottom:-20px;text-align:center"><strong>Funders</strong></div>
<img style="height:150px;margin-right:50px" data-src="../pics/nsf_2020.png" />
<img style="height:150px;margin-right:50px;margin-left:50px" data-src="../pics/binc.png" />
<img style="height:150px;margin-left:50px" data-src="../pics/bmbf_2020.png" />
<img style="height:80px;margin-top:-40px;margin-left:auto;margin-right:auto;width:100%" data-src="../pics/fzj_logo.svg" />
<div style="margin-top:-20px">
<img style="height:60px;margin-right:20px" data-src="../pics/erdf.png" />
<img style="height:60px;margin-right:20px" data-src="../pics/cbbs_logo.png" />
<img style="height:60px" data-src="../pics/LSA-Logo.png" />
</div>
<div style="margin-top:40px;margin-bottom:20px;text-align:center"><strong>Collaborators</strong></div>
<div style="margin-top:-20px">
<img style="height:100px;margin:20px" data-src="../pics/hbp_logo.png" />
<img style="height:100px;margin:20px" data-src="../pics/conp_logo.png" />
<img style="height:100px;margin:20px" data-src="../pics/vbc_logo.png" />
</div>
<div style="margin-top:-40px">
<img style="height:120px;margin:20px" data-src="../pics/openneuro_logo.png" />
<img style="height:120px;margin:20px" data-src="../pics/cbrain_logo.png" />
<img style="height:140px;margin:20px" data-src="../pics/brainlife_logo.png" />
</div>
</td>
</tr>
</table>
</section>
<section data-transition="None">
<h3>
Examples of what DataLad can be used for:
</h3>
<ul>
<li class="fragment fade-in-then-semi-out">
Behind-the-scenes <b>infrastructure component for data transport and versioning</b>
(e.g., used by <a href="https://openneuro.org/" target="_blank"> OpenNeuro</a>,
<a href="https://brainlife.io/" target="_blank"> brainlife.io </a>,
the <a href="https://conp.ca/" target="_blank">Canadian Open Neuroscience Platform (CONP)</a>,
<a href="https://mcin.ca/technology/cbrain/" target="_blank"> CBRAIN</a>)</li>
<img height="800" class="fragment fade-in" src="../pics/openneuro2.gif" alt="a screen recording of browsing OpenNeuro">
</ul>
</section>
<section data-transition="None">
<h3>
Examples of what DataLad can be used for:
</h3>
<ul>
<li class="fragment fade-in-then-semi-out"> <b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance </li>
<img height="800" class="fragment fade-in" src="../pics/shareresearch2.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
</ul>
</section>
<section data-transition="None">
<h3>
Examples of what DataLad can be used for:
</h3>
<ul>
<li> <b>Creating and sharing reproducible, open science</b>: Sharing data, software, code, and provenance </li>
<img height="800" class="fragment fade-in" src="../pics/openscience.gif" alt="a screenrecording of cloning REMODNAV paper dataset from github">
</ul>
</section>
<section data-transition="None">
<h3>
Examples of what DataLad can be used for:
</h3>
<ul>
<li class="fragment fade-in-then-semi-out"><b>Central data management</b> and archival system</li>
<img height="850" class="fragment fade-in" src="../pics/centralmanagement.gif">
</ul>
</section>
<section data-transition="None">
<h3>
Examples of what DataLad can be used for:
</h3>
<ul>
<li class="fragment fade-in-then-semi-out"><b>Scalable computing framework</b> for reproducible science</li>
<img height="350" class="fragment fade-in" src="../pics/fairly-big.png">
<img height="500" class="fragment fade-in" src="../pics/ukb_datasets.svg">
</ul>
</section>
</section>
<section>
<section data-transition="None">
<h2>Prerequisites: Terminal</h2>
<ul>
<div>
<li>DataLad can be used from the command line</li>
<pre><code>datalad create mydataset</code></pre></div>
<div class="fragment fade-in">
<li>... or with its Python API</li>
<pre><code class="python">import datalad.api as dl
dl.create(path="mydataset")</code></pre></div>
<div class="fragment fade-in">
<li>... and other programming languages can use it via system calls</li>
<pre><code class="r"># in R
> system("datalad create mydataset")
</code></pre></div>
<br><br>
</ul>
</section>
<section data-transition="None">
<h2>Prerequisites: Terminal</h2>
<iframe src="https://directpoll.com/r?XDbzPBdEt8j1rJlVwV5I4m6c9z8nJU2YLnRe3j3k"
style="border: 0" width="900" height="800"></iframe>
<p><a href="https://datalad-hub.inm7.de" target="_blank">
datalad-hub.inm7.de</a></p>
<p><a href="https://www.mathcs.emory.edu/~valerie/courses/fall10/155/resources/unix_cheatsheet.html" target="_blank">
Unix terminal cheatsheet (incl. Windows equivalents)</a></p>
</section>
<section data-transition="None">
<h2>Prerequisites: Installation and Configuration</h2>
<ul style="font-size:30px">
<li data-fragment-index="1" class="fragment fade-in">Your installed version of DataLad should be 0.17.2</li>
<pre class="fragment fade-in" data-fragment-index="1"><code data-fragment-index="1" class="fragment fade-in">datalad --version
0.17.2</code></pre>
<table>
<li data-fragment-index="2" class="fragment fade-in">DataLad relies on Git to create a revision history with detailed information on
what was changed, when, and how. Therefore, you should tell Git who you are and
configure a Git identity (name and email). Find out if an identity is set
by running either of:</li>
<tr>
<td>
<pre data-fragment-index="2" class="fragment fade-in"><code data-fragment-index="2" class="fragment fade-in bash">$ git config --get user.name
Adina Wagner
$ git config --get user.email
adina.wagner@t-online.de
</code></pre>
</td>
<td>
<pre data-fragment-index="2" class="fragment fade-in"><code data-fragment-index="2" class="fragment fade-in bash">$ datalad configuration get user.name user.email
Adina Wagner
adina.wagner@t-online.de
</code></pre>
</td>
</tr>
</table>
<li data-fragment-index="3" class="fragment fade-in">Set a Git identity using either of
<table>
<tr>
<td>
<pre data-fragment-index="3" class="fragment fade-in"><code data-fragment-index="3" class="fragment fade-in">$ git config --global \
user.name "Adina Wagner"
$ git config --global \
user.email "adina.wagner@t-online.de"</code></pre>
</td>
<td>
<pre data-fragment-index="3" class="fragment fade-in"><code data-fragment-index="3" class="fragment fade-in">$ datalad configuration --scope global \
set user.name="Adina Wagner"
$ datalad configuration --scope global \
set user.email="adina.wagner@t-online.de"</code></pre>
</td>
</tr>
</table>
<li data-fragment-index="4" class="fragment fade-in">Allow brand-new DataLad functionality:
<pre><code>datalad configuration --scope global set datalad.extensions.load=next</code></pre> </li>
<small>Find installation and configuration
instructions at <a href="http://handbook.datalad.org/en/latest/intro/installation.html" target="_blank">
handbook.datalad.org</a></small>
</ul>
</section>
<section data-transition="None">
<h2>Prerequisites: Using DataLad</h2>
<ul style="font-size:30px">
<li class="fragment fade-in">Every DataLad command consists of a main
command followed by a sub-command. The main and the sub-command can have options.
<img height="280px" src="../pics/command-structure.png">
</li>
<li class="fragment fade-in"> Example (main command, subcommand, several subcommand options):
<pre><code>$ datalad save -m "Saving changes" --recursive </code></pre>
</li>
<li class="fragment fade-in">Use <em>--help</em> to find out more about any (sub)command
and its options, including detailed description and examples (<em>q</em> to close). Use <em>-h</em> to get a short
overview of all options
<pre><code>$ datalad save -h
Usage: datalad save [-h] [-m MESSAGE] [-d DATASET] [-t ID] [-r] [-R LEVELS]
[-u] [-F MESSAGE_FILE] [--to-git] [-J NJOBS] [--amend]
[--version]
[PATH ...]
Use '--help' to get more comprehensive information.
</code></pre></li>
</ul>
</section>
</section>
<section>
<section data-markdown><script type="text/template">
If everything is important...
...track everything!
</script></section>
<section data-markdown><script type="text/template">
![](../pics/datalad_logo_wide.svg)<!-- .element: height="600" -->
http://datalad.org<!-- .element: style="margin-left:800px" -->
<aside class="notes">
But let's not talk about it, and only talk about features and example implementations in DataLad
</aside>
</script>
</section>
<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive tracking of research components
![](../pics/vamp_0_start.png)<!-- .element: width="100%" -->
Well-structured datasets (using community standards), and portable computational environments &mdash; and their evolution &mdash; are the precondition for reproducibility
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# turn any directory into a dataset
# with version control
% datalad create &lt;directory&gt;
</pre></code>
</td><td style="padding:0px">
<code><pre>
# save a new state of a dataset with
# file content of any size
% datalad save
</pre></code>
</td></tr></table>
Note:
- link to prev. statements on description standards
- your community could be really small (your lab); when data are precious, resources
will be spent to understand them, but information must be captured to make this possible
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Capture computational provenance
![](../pics/vamp_1_provcapture.png)<!-- .element: width="100%" -->
Which data was needed at which version, as input into which code, running with what parameterization in which
computational environment, to generate an outcome?
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# execute any command and capture its output
# while recording all input versions too
% datalad run --input ... --output ... &lt;command&gt;
</pre></code>
</td></tr></table>
Note:
The missing link: even when everything is shared, we still don't know how to start.
README is minimum, but executable prov-records are much better.
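
The question on this slide (which inputs, which code, which outputs) maps onto the arguments of `datalad run`. Below is a minimal sketch of how such a call could be composed from Python before handing it to a system call. The helper name `build_run_command` and the file paths are hypothetical and only serve as illustration; nothing is executed.

```python
import shlex


def build_run_command(command, inputs=(), outputs=(), message=None):
    """Compose a `datalad run` call as an argument list (nothing is executed)."""
    argv = ["datalad", "run"]
    if message:
        # -m sets the commit message of the resulting provenance record
        argv += ["-m", message]
    for path in inputs:
        # each --input is retrieved before the command starts
        argv += ["--input", path]
    for path in outputs:
        # each --output is unlocked so the command can (over)write it
        argv += ["--output", path]
    # the command itself is passed as a single quoted string
    argv.append(command)
    return argv


# Hypothetical example paths, for illustration only
argv = build_run_command(
    "python code/plot.py",
    inputs=["data/raw.csv"],
    outputs=["figures/fig1.png"],
    message="Create figure 1",
)
print(shlex.join(argv))
```

The printed line can be pasted into a terminal inside a dataset, or the list can be passed to `subprocess.run(argv, check=True)`.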
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Exhaustive capture enables portability
![](../pics/vamp_2_pushtocloud.png)<!-- .element: width="100%" -->
Precise identification of data and computational environments, combined with provenance records, forms a comprehensive and portable data structure that captures all aspects of an investigation.
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# transfer data and metadata to other sites and services
# with fine-grained access control for dataset components
% datalad push --to &lt;site-or-service&gt;
</pre></code>
</td></tr></table>
Note:
Does it fly? Can you give it to someone? Or can you take it with you to your new lab?
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Reproducibility strengthens trust
![](../pics/vamp_3_reproduce.png)<!-- .element: width="100%" -->
Outcomes of computational transformations can be validated by authorized third parties. This enables audits, promotes accountability, and streamlines automated "upgrades" of outputs
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# obtain dataset (initially only identity,
# availability, and provenance metadata)
% datalad clone &lt;url&gt;
</pre></code>
</td><td style="padding:0px">
<code><pre>
# immediately actionable provenance records
# full abstraction of input data retrieval
% datalad rerun &lt;commit|tag|range&gt;
</pre></code>
</td></tr></table>
Note:
Goal is automated reproducibility, enables assessment of robustness and benchmarking algorithmic developments
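
The two commands on this slide can be chained into a small reproduction script. The sketch below assumes a hypothetical dataset URL and target path, and the function name `reproduce` is made up for illustration. By default it only returns the commands; with `execute=True` it would call the `datalad` command-line tool, which must then be installed.

```python
import subprocess


def reproduce(url, path, revision, execute=False):
    """Return (and optionally run) the steps to clone a dataset and rerun a record."""
    steps = [
        # 1. obtain the dataset; file content is retrieved only when needed
        ["datalad", "clone", url, path],
        # 2. re-execute the provenance record stored at the given revision
        ["datalad", "rerun", "--dataset", path, revision],
    ]
    if execute:
        for argv in steps:
            # check=True stops at the first failing step
            subprocess.run(argv, check=True)
    return steps


# Placeholder URL and path, for illustration only
steps = reproduce("https://example.com/study.git", "study", "HEAD")
for argv in steps:
    print(" ".join(argv))
```

Keeping the dry-run default makes it easy to inspect the commands before anything is downloaded or recomputed.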
</script></section>
<section data-markdown data-transition="none"><script type="text/template">
## Ultimate goal: (re-)usability
![](../pics/vamp_4_reuse.png)<!-- .element: width="100%" -->
Verifiable, portable, self-contained data structures that track all aspects of an investigation exhaustively can be (re-)used as modular components in larger contexts &mdash; propagating their traits
<table width=100% style="padding:0px">
<tr><td style="padding:0px">
<code><pre>
# declare a dependency on another dataset and
# re-use it at a particular state in a new context
% datalad clone -d &lt;superdataset&gt; &lt;url&gt; &lt;path-in-dataset&gt;
</pre></code>
</td></tr></table>
Note:
With these in place, re-usability is a small(er) step
</script></section>
<section data-markdown><script type="text/template">
## DataLad: Manage (co-)evolution of digital objects
![](../pics/yoda_decentralized_publishing.png)<!-- .element: width="900" style="margin-bottom:-70px;margin-top:-20px" -->
Consume, create, curate, analyze, publish, and query data with full provenance capture and "universal" metadata support.
<p style="font-size:70%;margin-top:-20px">
DataLad is free and open source (MIT-licensed). http://datalad.org
</p>
<note>
Halchenko, Meyer, Poldrack, ... & Hanke, M. (2021).
DataLad: distributed system for joint management of code, data, and their relationship.
Journal of Open Source Software, 6(63), 3262.
</note>
Note:
- following illustrations contain concrete implementation with datalad
- Software developed to address the needs of long-term maintenance and collaboration on the studyforrest dataset
</script></section>
<section data-markdown><script type="text/template">
## Let's try...
</script></section>
</section>
<section>
<section>
<h1>Backup</h1>
</section>
<section>
<h2>Core concepts & features</h2>
</section>
<section>
<h2>Everything happens in DataLad datasets</h2>
<img src="../pics/artwork/src/dataset.svg" width="600"> <br>
</section>
<section>
<h2>Dataset = Git/git-annex repository</h2>
<ul>
<li>content agnostic</li>
<li>no custom data structures</li>
<li>complete decentralization</li>
<li>Looks and feels like a directory on your computer:</li>
</ul>
<br>
<br>
<img src="../pics/remodnav-ds-nautilus.png" width="500"> <img src="../pics/remodnav-ds-terminal.png" width="500">
<small>File viewer and terminal view of a DataLad dataset</small>
</section>
<section>
<h2>Version control arbitrarily large files</h2>
<img src="../pics/artwork/src/local_wf.svg" width="600"> <br>
<p class="fragment fade-in">Stay flexible:</p>
<ul>
<li class="fragment fade-in">Non-complex DataLad core API (easy for data management novices)</li>
<li class="fragment fade-in">Pure Git or git-annex commands (for regular Git or git-annex users, or to use specific functionality)</li>
</ul>
</section>
<section>
<h2>Use a dataset's history</h2>
<img src="../pics/researchlog.png">
<ul>
<li class="fragment fade-in"> reset your dataset (or subset of it) to a previous state, </li>
<li class="fragment fade-in"> revert changes or bring them back, </li>
<li class="fragment fade-in"> find out what was done when, how, why, and by whom </li>
<li class="fragment fade-in"> Identify precise versions: Use data in the most recent version, or the one from 2018, or... </li>
</ul>
</section>
<section>
<h2>Consume and collaborate</h2>
<img src="../pics/artwork/src/collaboration.svg" width="900"> <br>
</section>
<section>
<h2>Machine-readable, re-executable provenance</h2>
<img src="../pics/artwork/src/reproducible_execution.svg" width="900"> <br>
</section>
<section>
<h2>Seamless nesting and dataset linkage</h2>
<img src="../pics/artwork/src/linkage_subds.svg" width="900"> <br>
<!-- <ul>
<li class="fragment fade-in" data-fragment-index="2">Overcomes scaling issues with large amounts of files</li>
<pre class="fragment fade-in" data-fragment-index="2"><code>adina@bulk1 in /ds/hcp/super on git:master❱ datalad status --annex -r
15530572 annex'd files (77.9 TB recorded total size)
nothing to save, working tree clean</code></pre>
<small><a class="fragment fade-in" data-fragment-index="2" href="https://github.com/datalad-datasets/human-connectome-project-openaccess" target="_blank">(github.com/datalad-datasets/human-connectome-project-openaccess)</a></small>
<li class="fragment fade-in">Modularizes research components for transparency, reuse, and access management</li>
</ul>
-->
</section>
<section>
<h2>Third party integrations</h2>
<img src="../pics/artwork/src/thirdparty.svg" width="900"> <br>
<small>Apart from <b>local computing infrastructure</b> (from private laptops to computational clusters),
datasets can be hosted in major <b>third party repository hosting and cloud storage</b> services.
More info: Chapter on <a href="http://handbook.datalad.org/en/latest/basics/basics-thirdparty.html" target="_blank">
Third party infrastructure</a>.</small>
</section>
</section>
</div>
</div>
<script src="../reveal.js/dist/reveal.js"></script>
<script src="../reveal.js/plugin/notes/notes.js"></script>
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
hash: true,
// The "normal" size of the presentation, aspect ratio will be preserved
// when the presentation is scaled to fit different resolutions. Can be
// specified using percentage units.
width: 1280,
height: 960,
// Factor of the display size that should remain empty around the content
margin: 0.2,
// Bounds for smallest/largest possible scale to apply to content
minScale: 0.2,
maxScale: 1.0,
controls: true,
progress: true,
history: true,
center: true,
slideNumber: 'c',
pdfSeparateFragments: false,
pdfMaxPagesPerSlide: 1,
pdfPageHeightOffset: -1,
transition: 'slide', // none/fade/slide/convex/concave/zoom
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
});
</script>
</body>
</html>