<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<!-- Edit me start! -->
<title>Data management: Session 01</title>
<meta name="description" content=" This is where you put a short description ">
<meta name="author" content=" Your Name ">
<!-- Edit me end! -->
<link rel="stylesheet" href="../reveal.js/dist/reset.css">
<link rel="stylesheet" href="../reveal.js/dist/reveal.css">
<link rel="stylesheet" href="../reveal.js/dist/theme/beige.css">
<!-- Theme used for syntax highlighted code -->
<link rel="stylesheet" href="../reveal.js/plugin/highlight/monokai.css">
</head>
<body>
<div class="reveal">
<div class="slides">
<section>
<h1 style="text-transform:none">Data management</h1> <br>
<h3 style="text-transform:none">Session 01</h3>
</section>
<section data-markdown><script type="text/template">
## Agenda ![](../pics/agenda.svg) <!-- .element: height="60" -->
- **Prerequisites & technicalities**
- Setting up a Git identity
- GitLab/GitHub accounts
- How-to: Issues on GitLab, the handbook's repository, Read the Docs
- **Session 01**
- What is a DataLad dataset?
- YODA principles for dataset organization
- **Hands-on 01:** Data in a DataLad dataset
<imgcredit>agenda by Aneeque Ahmed from the Noun Project</imgcredit>
<aside class="notes">
- just giving an overview what will happen today
</aside>
</script>
</section>
<section>
<section data-markdown><script type="text/template">
## Git identity ![](../pics/identity.svg)<!-- .element: height="60" -->
- Git: free & open source version control software
- Git is already installed on the cluster
- Git identity: The name and e-mail address associated with what you "save".
- Configuration with `git config` command in a terminal:
<pre><code class="bash" style="max-height:none">$ git config --global user.name "Adina Wagner"
$ git config --global user.email adina.wagner@example.com
</code></pre>
- **Hands-on**: Log into `brainbfast` and configure your Git identity, if it isn't set up yet.
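
To check whether an identity is already configured, a sketch like the following may help. The helper name `check_git_identity` is made up for this example; it only reads the global settings and changes nothing.

```shell
# Report whether a global Git identity is configured.
# (check_git_identity is a hypothetical helper, not a Git command.)
check_git_identity() {
    for key in user.name user.email; do
        # --get exits non-zero when the key is not set
        if value=$(git config --global --get "$key"); then
            echo "$key = $value"
        else
            echo "$key is not set"
        fi
    done
}

check_git_identity
```

If either line reports "is not set", run the matching `git config --global` command shown above.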
<imgcredit>identity by Maxim Basinski from the Noun Project</imgcredit>
<aside class="notes">
- everyone needs to set up a Git identity on the cluster
</aside>
</script>
</section>
<section data-markdown><script type="text/template">
## GitHub & GitLab
- Two different web-based Git repository managers with similar features. Both are used to host Git repositories and to ease collaboration.
**GitHub** ![](../pics/GitHub.png)<!-- .element: height="60" -->
- <a href="https://Github.com">GitHub.com</a>: Most popular, proprietary, extensive functionality on free plans
- Core concepts: Repositories, organizations.
**GitLab** ![](../pics/GitLab_Logo.svg)<!-- .element: height="60" -->
- Similar to GitHub, but open source. The FZJ hosts many different GitLab instances.
- JuGit (<a href="https://jugit.fz-juelich.de/">jugit.fz-juelich.de</a>): The GitLab instance we recommend.
- Core concepts: Groups, subgroups, projects.
<aside class="notes">
- Key differences between the two. TODO: talk to Alex about permissions
</aside>
</script>
</section>
<section data-markdown><script type="text/template">
## Handbook repositories ![](../pics/logo.svg)<!-- .element: height="100" -->
- The DataLad Handbook: User-oriented, introductory course on DataLad and the
basis for the data management course.
- Source code on
- GitHub (<a href="https://github.com/datalad-handbook/book">github.com/datalad-handbook/book</a>) and
- GitLab (<a href="https://jugit.fz-juelich.de/inm7/training/datalad-handbook" target="_blank">jugit.fz-juelich.de/inm7/training/datalad-handbook</a>)
- **File issues** if you have questions or requests!
- Contribute by **pull requesting** changes, additions, and fixes!
<aside class="notes">
-
</aside>
</script>
</section>
<section data-markdown><script type="text/template">
## Filing issues on GitLab ![](../pics/logo.svg)<!-- .element: height="100" -->
- File issues if you have DataLad-related or course-related questions in the repository hosted on **GitLab**
- **Hands-on**: File an issue right now!
- go to <a href="https://jugit.fz-juelich.de/" target="_blank">jugit.fz-juelich.de</a>
- find the handbook project
- file an issue with any content
<aside class="notes">
-
</aside>
</script>
</section>
<section data-markdown><script type="text/template">
## Read the Docs ![](../pics/logo.svg)<!-- .element: height="100" -->
- The book is rendered with <a href="http://www.sphinx-doc.org/en/master/" target="_blank">Sphinx</a>
and hosted on <a href="https://readthedocs.org/" target="_blank">Read the Docs</a>
- Read the Docs supports HTML, eReader, and PDF formats
- Rendered version: <a href="http://handbook.datalad.org" target="_blank">handbook.datalad.org</a>
- There is also an **INM-7-specific** version with extra sections on internal workflows
- **Hands-on**: Access public and INM-7-specific versions of the handbook in HTML and PDF format
<aside class="notes">
- talk about contributors, and PRs
</aside>
</script>
</section>
</section>
<section>
<section>
<h2>DataLad Datasets</h2>
<p align="left">
DataLad datasets are DataLad's core data structure. Datasets have many features:
</p>
<dl>
<dt>Version controlled content, regardless of size</dt><dd>Relying
on the tools Git and Git-annex working in the background.</dd>
<dt>Provenance tracking</dt><dd>Record and find out how data came into
existence (including the software environment), and reproduce entire analyses.</dd>
<dt>Easy collaboration</dt><dd>Install others' datasets, share datasets,
publish datasets with third-party services.</dd>
<dt>Staying up to date</dt><dd> Datasets can know their copies or origins.
This allows you to <b>update</b> datasets from their sources with a single command.</dd>
<dt>Modularity & Nesting</dt><dd>Individual datasets are independent,
versioned components that can be <i>nested</i> as <i>subdatasets</i> in
<i>superdatasets</i>. Subdatasets have a stand-alone version history, and their
<i>version state</i> is recorded in the superdataset. </dd>
</dl>
</section>
<section>
<h2>DataLad Datasets</h2>
<p align="left">
DataLad datasets look like any other directory on your computer, and subdatasets
look like subdirectories. DataLad, Git-annex, and Git work in the background
(e.g., <code>.datalad/</code>, <code>.git/</code>, ...).
</p>
<img style="" data-src="../pics/virtual_dirtree.png" height="600">
<p align="left">
You can <b>create & populate</b>
a dataset from scratch, or <b>install</b> existing datasets from collaborators
or open sources.
</p>
</section>
<section>
<h2>DataLad Datasets for data analysis</h2>
<ul>
<li>A DataLad dataset can have <i>any</i> structure, and use as many or few
features of a dataset as required.</li>
<li>However, for <b>data analyses</b> it is beneficial to make
use of DataLad features and structure datasets according to the <b>YODA principles</b>:</li>
</ul>
<img style="" data-src="../pics/yoda.png" height="400">
<dl>
<dt>P1: One thing, one dataset</dt>
<dt>P2: Record where you got it from, and where it is now</dt>
<dt>P3: Record what you did to it, and with what</dt>
</dl>
</section>
<section data-markdown>
## P1: One thing, one dataset
![](../pics/dataset_modules.png)
- Bundle all components of one analysis into one superdataset.
- Whenever a particular collection of files could be useful in more
than one context (e.g., data), put them in their own dataset, and install it as
a subdataset.
- Keep everything clean and modular: Within an analysis, separate code, data, output, execution environments.
</section>
<section data-markdown>
## P2: Record where you got it from, and where it is now
![](../pics/data_origin.png)
- Link individual datasets to declare data-dependencies (e.g. as subdatasets).
- Record data's origin with appropriate commands, for example
to record access URLs for individual files obtained from (unstructured) sources "in the cloud".
- Keep a dataset self-contained with relative paths in scripts to subdatasets or files.
- Share and publish datasets to collaborate.
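
Why relative paths keep a dataset self-contained can be seen in a small plain-shell sketch. It uses no DataLad commands, and the directory and file names (`myanalysis`, `participants.txt`, `list_participants.sh`) are made up for illustration.

```shell
# Hypothetical mini-dataset: one script in code/, one input file in data/.
ds=$(mktemp -d)/myanalysis
mkdir -p "$ds/code" "$ds/data"
echo "sub-01" > "$ds/data/participants.txt"

# The script refers to its input relative to the dataset root.
printf '%s\n' '#!/bin/sh' 'cat data/participants.txt' > "$ds/code/list_participants.sh"

# Run the script from the dataset root.
(cd "$ds" && sh code/list_participants.sh)

# Copy the whole dataset elsewhere: the relative path still resolves.
copy=$(mktemp -d)/myanalysis_copy
cp -R "$ds" "$copy"
(cd "$copy" && sh code/list_participants.sh)
```

An absolute path to the input file would still point at the original location after the copy, so the copied dataset would depend on files outside itself.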
</section>
<section data-markdown>
## P3: Record what you did to it, and with what
![](../pics/dataset_linkage_provenance.png)
- Collect and store provenance of all contents of a dataset that you create
(more on this in later sessions).
</section>
</section>
<section>
<section data-markdown><script type="text/template">
## Hands-on exercise
**Objective**: How would you get data into a dataset?
- Use <a href="https://github.com/datalad/example-dicom-functional" target="_blank">github.com/datalad/example-dicom-functional</a>
as test data. Download branch `1block` as a **ZIP archive**.
- Log into `brainbfast`, get the data on `brainbfast`, and try to get
this data into a DataLad dataset with a sensible structure suitable for data analysis.
- This exercise is meant for **exploration**:
- use `datalad --help`, the handbook, or the documentation at
<a href="http://docs.datalad.org/en/latest" target="_blank">docs.datalad.org</a>
to find out about available commands to solve this task,
- use tools of your choice to download/extract data,
- try to set up an appropriate dataset structure.
</script>
</section>
<section>
<h2>Hands-on solution</h2>
<ul>
<li>turn the extracted ZIP archive directory into a DataLad dataset:</li>
</ul>
<pre><code class="bash" style="max-height:none">$ cd example-dicom-functional-1block
$ datalad create -f
[INFO ] Creating a new annex repo at [...]/example-dicom-functional-1block
create(ok): [...]example-dicom-functional-1block (dataset)
$ datalad save -m "add dicoms from functional acquisition" .
add(ok): LICENSE (file)
add(ok): dicoms/MR.1.3.46.670589.11.38317.5.0.4476.2014042516045740754 (file) [...]
</code></pre>
<ul>
<li>create a dataset for a data analysis (independent from the data directory)</li>
</ul>
<pre><code class="bash" style="max-height:none">$ cd ../
$ datalad create -c yoda myanalysis
[INFO ] Creating a new annex repo at [...]/myanalysis
[INFO ] Running procedure cfg_yoda
[INFO ] == Command start (output follows) =====
[INFO ] == Command exit (modification check follows) =====
create(ok): [...]/myanalysis (dataset)
</code></pre>
<ul>
<li>create a data directory and install the dicom dataset as a subdataset</li>
</ul>
<pre><code class="bash" style="max-height:none">$ cd myanalysis
$ mkdir data
$ datalad install -d . -s ../example-dicom-functional-1block data/dicoms
[INFO ] Cloning ../example-dicom-functional-1block into '[...]/myanalysis/data/dicoms'
install(ok): data/dicoms (dataset)
action summary:
add (ok: 2)
install (ok: 1)
save (ok: 1)
</code></pre>
<b>Hands-on</b>: Explore this dataset
<aside class="notes">
- wget https://github.com/datalad/example-dicom-functional/archive/1block.zip
- unzip archive
- force-create ds out of it; save
- install from path
explore: content lock; stuff saved in Git (code) versus stuff saved in annex; get
</aside>
</section>
<section data-markdown><script type="text/template">
## Further reading
You will find the topics of this session in more detail in the following chapters of the handbook:
- **The basics on datasets:**
- Chapter <a href="http://handbook.datalad.org/en/latest/index.html" target="_blank">DataLad Datasets</a> in the handbook.
- **Best practices for data analyses in datasets (YODA):**
- The section <a href="http://handbook.datalad.org/en/latest/basics/101-123-yoda.html" target="_blank">YODA principles</a> in the handbook.
- **A preview into automatically reproducible analyses in datasets:**
- Usecase <a href="http://handbook.datalad.org/en/latest/usecases/reproducible_neuroimaging_analysis.html" target="_blank">"An automatically reproducible neuroimaging analysis of public data"</a> in the handbook.
</script>
</section>
</section>
<section>
<section data-markdown><script type="text/template">
## Outline: What comes next?
- DataLad is installed on the cluster: try it out further, and ask questions on GitLab.
- Sessions will start with open question time about the previous exercise, and end with an
exercise for the upcoming session.
- Upcoming topics: Reproducible analysis, collaboration, INM-7 specific workflows
on data retrieval & JSC.
- **Which date is suitable?** > <a href="https://doodle.com/poll/gavpbrtvzqtmy6up" target="_blank">Doodle poll</a> <
</script>
</section>
<section data-markdown>
# Questions?
</section>
</section>
</div>
</div>
<script src="../reveal.js/dist/reveal.js"></script>
<script src="../reveal.js/plugin/notes/notes.js"></script>
<script src="../reveal.js/plugin/markdown/markdown.js"></script>
<script src="../reveal.js/plugin/highlight/highlight.js"></script>
<script>
// More info about initialization & config:
// - https://revealjs.com/initialization/
// - https://revealjs.com/config/
Reveal.initialize({
hash: true,
// The "normal" size of the presentation, aspect ratio will be preserved
// when the presentation is scaled to fit different resolutions. Can be
// specified using percentage units.
width: 1280,
height: 960,
// Factor of the display size that should remain empty around the content
margin: 0.3,
// Bounds for smallest/largest possible scale to apply to content
minScale: 0.2,
maxScale: 1.0,
controls: true,
progress: true,
history: true,
center: true,
slideNumber: 'c',
pdfSeparateFragments: false,
pdfMaxPagesPerSlide: 1,
pdfPageHeightOffset: -1,
transition: 'slide', // none/fade/slide/convex/concave/zoom
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealMarkdown, RevealHighlight, RevealNotes ]
});
</script>
</body>
</html>