datalad-handbook/docs/basics/101-127-yoda.rst

.. index:: ! YODA principles
.. _2-001:
.. _yoda:

YODA: Best practices for data analyses in a dataset
---------------------------------------------------

The last requirement for the midterm projects reads "needs to comply to the
YODA principles".
"What are the YODA principles?" you ask, as you have never heard of this
before.
"The topic of today's lecture: Organizational principles of data
analyses in DataLad datasets. This lecture will show you the basic
principles behind creating, sharing, and publishing reproducible,
understandable, and open data analysis projects with DataLad.", you
hear in return.

The starting point...
^^^^^^^^^^^^^^^^^^^^^

Data analyses projects are very common, both in science and industry.
But it can be very difficult to produce a reproducible, let alone
*comprehensible* data analysis project.
Many data analysis projects do not start out with
a stringent organization, or fail to keep the structural organization of a
directory intact as the project develops. Often, this can be due to a lack of
version-control. In these cases, a project will quickly end up
with many
`almost-identical scripts suffixed with "_version_xyz" <https://phdcomics.com/comics/archive.php?comicid=1531>`_,
or a chaotic results structure split between various directories with names
such as ``results/``, ``results_August19/``, ``results_revision/`` and
``now_with_nicer_plots/``. Something like this is a very
common shape a data science project may take after a while:

.. code-block:: console

    ├── code/
    │   ├── code_final/
    │   │   ├── final_2/
    │   │   │   ├── main_script_fixed.py
    │   │   │   └──takethisscriptformostthingsnow.py
    │   │   ├── utils_new.py
    │   │   ├── main_script.py
    │   │   ├── utils_new.py
    │   │   ├── utils_2.py
    │   │   └── main_analysis_newparameters.py
    │   └── main_script_DONTUSE.py
    ├── data/
    │   ├── data_updated/
    │   │   └── dataset1/
    │   │       └── datafile_a
    │   ├── dataset1/
    │   │    └── datafile_a
    │   ├── outputs/
    │   │   ├── figures/
    │   │   │   ├── figures_new.py
    │   │   │   └── figures_final_forreal.py
    │   │   ├── important_results/
    │   │   ├── random_results_file.tsv
    │   │   ├── results_for_paper/
    │   │   ├── results_for_paper_revised/
    │   │   └── results_new_data/
    │   ├── random_results_file.tsv
    │   ├── random_results_file_v2.tsv

    [...]

All data analysis endeavors in directories like this *can* work, for a while,
if there is a person who knows the project well, and works on it all the time.
But it inevitably will get messy once anyone tries to collaborate on a project
like this, or simply goes on a two-week vacation and forgets whether
the function in ``main_analysis_newparameters.py`` or the one in
``takethisscriptformostthingsnow.py`` was the one that created a particular figure.

But even if a project has an intuitive structure, and *is* version
controlled, in many cases an analysis script will stop working, or maybe worse,
will produce different results, because the software and tools used to
conduct the analysis in the first place got an update. This update may have
come with software changes that made functions stop working, or work differently
than before.
In the same vein, recomputing an analysis project on a different machine than
the one the analysis was developed on can fail if the necessary
software in the required versions is not installed or available on this new machine.
The analysis might depend on software that runs on a Linux machine, but the project
was shared with a Windows user. The environment during analysis development used
Python 2, but the new system has only Python 3 installed. Or one of the dependent
libraries needs to be in version X, but is installed as version Y.

The YODA principles are a clear set of organizational standards for
datasets used for data analysis projects that aim to overcome issues like the
ones outlined above. The name stands for
"YODAs Organigram on Data Analysis" [#f1]_. The principles outlined
in YODA set simple rules for directory names and structures, best-practices for
version-controlling dataset elements and analyses, facilitate
usage of tools to improve the reproducibility and accountability
of data analysis projects, and make collaboration easier.
They are summarized in three basic principles, that translate to both
dataset structures and best practices regarding the analysis:

- :ref:`P1`

- :ref:`P2`

- :ref:`P3`

As you will see, complying to these principles is easy if you
use DataLad. Let's go through them one by one:

.. _P1:

P1: One thing, one dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^

Whenever a particular collection of files could be useful in more
than one context, make them a standalone, modular component.
In the broadest sense, this means to structure your study elements (data, code,
computational environments, results, ...) in dedicated directories. For example:


- Store **input data** for an analysis in a dedicated ``inputs/`` directory.
  Keep different formats or processing-stages of your input data as individual,
  modular components:  Do not mix raw data, data that is already structured
  following community guidelines of the given field, or preprocessed data, but create
  one data component for each of them. And if your analysis
  relies on two or more data collections, these collections should each be an
  individual component, not combined into one.

- Store scripts or **code** used for the analysis of data in a dedicated ``code/``
  directory, outside of the data component of the dataset.

- Collect **results** of an analysis in a dedicated place, outside of the ``inputs/`` directory, and
  leave the input data of an analysis untouched by your computations. (Ideally, don't create an "``outputs``" directory for this - when your dataset becomes an input for another, the resulting file paths would be something like ``/inputs/outputs/somefile.dat``)

- Include a place for complete **execution environments**, such as
  `singularity images <https://singularity.lbl.gov>`_ or
  `docker containers <https://www.docker.com/get-started>`_ [#f2]_, in
  the form of an ``envs/`` directory, if relevant for your analysis.

- And if you conduct multiple different analyses, create a dedicated
  project for each analysis, instead of conflating them.

This, for example, would be a directory structure from the root of a
superdataset of a very comprehensive data analysis project complying to the YODA principles:

.. code-block:: console

    ├── ci/                         # continuous integration configuration
    │   └── .travis.yml
    ├── code/                       # your code
    │   ├── tests/                  # unit tests to test your code
    │   │   └── test_myscript.py
    │   └── myscript.py
    ├── docs                        # documentation about the project
    │   ├── build/
    │   └── source/
    ├── envs                        # computational environments
    │   └── Singularity
    ├── inputs/                     # dedicated inputs/, will not be changed by an analysis
    │   └─── data/
    │       ├── dataset1/           # one stand-alone data component
    │       │   └── datafile_a
    │       └── dataset2/
    │           └── datafile_a
    ├── important_results/          # outputs away from the input data
    │   └── figures/
    ├── CHANGELOG.md                # notes for fellow humans about your project
    ├── HOWTO.md
    └── README.md

You can get a few non-DataLad related advice for structuring your directories in the :ref:`on best practices for analysis organization <fom-yodaproject>`.

.. index::
   pair: recommendation; dataset content organization
.. find-out-more:: More best practices for organizing contents in directories
   :name: fom-yodaproject
   :float:

   The exemplary YODA directory structure is very comprehensive, and displays many best-practices for
   reproducible data science. For example,

   #. Within ``code/``, it is best practice to add **tests** for the code.
      These tests can be run to check whether the code still works.

   #. It is even better to further use automated computing such as
      `continuous integration (CI) systems <https://en.wikipedia.org/wiki/Continuous_integration>`_,
      to test the functionality of your functions and scripts automatically.
      If relevant, the setup for continuous integration frameworks (such as
      `Appveyor <https://www.appveyor.com>`_) lives outside of ``code/``,
      in a dedicated ``ci/`` directory.

   #. Include **documents for fellow humans**: Notes in a README.md or a HOWTO.md,
      or even proper documentation (for example, using  in a dedicated ``docs/`` directory.
      Within these documents, include all relevant metadata for your analysis. If you are
      conducting a scientific study, this might be authorship, funding,
      change log, etc.

   If writing tests for analysis scripts or using continuous integration
   is a new idea for you, but you want to learn more, check out
   `this chapter on testing <https://the-turing-way.netlify.app/reproducible-research/testing>`_.

There are many advantages to this modular way of organizing contents.
Having input data as independent components that are not altered (only
consumed) by an analysis does not conflate the data for
an analysis with the results or the code, thus assisting understanding
the project for anyone unfamiliar with it.
But more than just structure, this organization aids modular reuse or
publication of the individual components, for example data. In a
YODA-compliant dataset, any processing stage of a data component can
be reused in a new project or published and shared. The same is true
for a whole analysis dataset. At one point you might also write a
scientific paper about your analysis in a paper project, and the
whole analysis project can easily become a modular component in a paper
project, to make sharing paper, code, data, and results easy.
The use case :ref:`usecase_reproducible_paper` contains a step-by-step instruction on
how to build and share such a reproducible paper, if you want to learn
more.


.. figure:: ../artwork/src/img/dataset_modules.svg
   :width: 100%
   :name: dataset_modules
   :alt: Modular structure of a data analysis project

   Data are modular components that can be reused easily.

The directory tree above and :numref:`dataset_modules` highlight different aspects
of this principle. The directory tree illustrates the structure of
the individual pieces on the file system from the point of view of
a single top-level dataset with a particular purpose. For example, it
could be an analysis dataset created by a statistician for a scientific
project, and it could be shared between collaborators or
with others during development of the project. In this
superdataset, code is created that operates on input data to
compute outputs, and the code and outputs are captured,
version-controlled, and linked to the input data. Each input data in turn
is a (potentially nested) subdataset, but this is not visible
in the directory hierarchy.
:numref:`dataset_modules`, in comparison, emphasizes a process view on a project and
the nested structure of input subdataset:
You can see how the preprocessed data that serves as an input for
the analysis datasets evolves from raw data to
standardized data organization to its preprocessed state. Within
the ``data/`` directory of the file system hierarchy displayed
above one would find data datasets with their previous version as
a subdataset, and this is repeated recursively until one reaches
the raw data as it was originally collected at one point. A finished
analysis project in turn can be used as a component (subdataset) in
a paper project, such that the paper is a fully reproducible research
object that shares code, analysis results, and data, as well as the
history of all of these components.

Principle 1, therefore, encourages to structure data analysis
projects in a clear and modular fashion that makes use of nested
DataLad datasets, yielding comprehensible structures and reusable
components. Having each component version-controlled --
regardless of size --  will aid keeping directories clean and
organized, instead of piling up different versions of code, data,
or results.

.. _P2:

P2: Record where you got it from, and where it is now
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is good to have data, but it is even better if you and anyone you
collaborate or share the project or its components with can find
out where the data came from, or how it
is dependent on or linked to other data. Therefore, this principle
aims to attach this information, the data's :term:`provenance`, to the components of
your data analysis project.

Luckily, this is a no-brainer with DataLad, because the core data structure
of DataLad, the dataset, and many of the DataLad commands already covered
up to now fulfill this principle.

If data components of a project are DataLad datasets, they can
be included in an analysis superdataset as subdatasets. Thanks to
:dlcmd:`clone`, information on the source of these subdatasets
is stored in the history of the analysis superdataset, and they can even be
updated from those sources if the original data dataset gets extended or changed.
If you are including a file, for example, code from GitHub,
the :dlcmd:`download-url` command, introduced in section :ref:`populate`,
will record the source of it safely in the dataset's history. And if you add anything to your dataset,
from simple incremental coding progress in your analysis scripts up to
files that a colleague sent you via email, a plain :dlcmd:`save`
with a helpful commit message goes a very long way to fulfill this principle
on its own already.

One core aspect of this principle is *linking* between reusable data
resource units (i.e., DataLad subdatasets containing pure data). You will
be happy to hear that this is achieved by simply installing datasets
as subdatasets, as :numref:`fig-subds` shows.
This part of this principle will therefore be absolutely obvious to you
because you already know how to install and nest datasets within datasets.
"I might just overcome my impostor syndrome if I experience such advanced
reproducible analysis concepts as being obvious", you think with a grin.

.. _fig-subds:

.. figure:: ../artwork/src/img/data_origin.svg
   :width: 50%
   :alt: Datasets are installed as subdatasets

   Schematic illustration of two standalone data datasets installed as subdatasets
   into an analysis project.

But more than linking datasets in a superdataset, linkage also needs to
be established between components of your dataset. Scripts inside of
your ``code/`` directory should point to data not as :term:`absolute path`\s
that would only work on your system, but instead as :term:`relative path`\s
that will work in any shared copy of your dataset. The next section
demonstrates a YODA data analysis project and will show concrete examples of this.

Lastly, this principle also includes *moving*, *sharing*, and *publishing* your
datasets or its components.
It is usually costly to collect data, and economically unfeasible [#f4]_ to keep
it locked in a drawer (or similarly out of reach behind complexities of
data retrieval or difficulties in understanding the data structure).
But conducting several projects on the same dataset yourself, sharing it with
collaborators, or publishing it is easy if the project is a DataLad dataset
that can be installed and retrieved on demand, and is kept clean from
everything that is not part of the data according to principle 1.
Conducting transparent open science is easier if you can link code, data,
and results within a dataset, and share everything together. In conjunction
with principle 1, this means that you can distribute your analysis projects
(or parts of it) in a comprehensible form, exemplified in :numref:`fig-yodads`.

.. _fig-yodads:

.. figure:: ../artwork/src/img/decentralized_publishing.svg
   :figwidth: 100%
   :alt: A full data analysis workflow complying with YODA principles

   In a dataset that complies to the YODA principles, modular components
   (data, analysis results, papers) can be shared or published easily.

Principle 2, therefore, facilitates transparent linkage of datasets and their
components to other components, their original sources, or shared copies.
With the DataLad tools you learned to master up to this point,
you have all the necessary skills to comply to it already.

.. _P3:

P3: Record what you did to it, and with what
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This last principle is about capturing *how exactly the content of
every file came to be* that was not obtained from elsewhere. For example,
this relates to results generated from inputs by scripts or commands.
The section :ref:`run` already outlined the problem of associating
a result with an input and a script. It can be difficult to link a
figure from your data analysis project with an input data file or a
script, even if you created this figure yourself.
The :dlcmd:`run` command however mitigates these difficulties,
and captures the provenance of any output generated with a
``datalad run`` call in the history of the dataset. Thus, by using
:dlcmd:`run` in analysis projects, your dataset knows
which result was generated when, by which author, from which inputs,
and by means of which command.

With another DataLad command one can even go one step further:
The command :dlcmd:`containers-run` - it will be introduced in
section :ref:`containersrun` - performs a command execution within
a configured containerized environment. Thus, not only inputs,
outputs, command, time, and author, but also the *software environment*
are captured as provenance of a dataset component such as a results file,
and, importantly, can be shared together with the dataset in the
form of a software container.

Tip: Make use of ``datalad run``'s ``--dry-run`` option to craft your run-command, as outlined in :ref:`dryrun`!

With this last principle, your dataset collects and stores provenance
of all the contents you created in the wake of your analysis project.
This established trust in your results, and enables others to understand
where files derive from.

.. _yodaproc:

The YODA procedure
^^^^^^^^^^^^^^^^^^

There is one tool that can make starting a yoda-compliant data analysis
easier: DataLad's ``yoda`` procedure. Just as the ``text2git`` procedure
from section :ref:`createds`, the ``yoda`` procedure can be included in a
:dlcmd:`create` command and will apply useful configurations
to your dataset:

.. code-block:: console

   $ datalad create -c yoda "my_analysis"

   [INFO   ] Creating a new annex repo at /home/me/repos/testing/my_analysis
   create(ok): /home/me/repos/testing/my_analysis (dataset)
   [INFO   ] Running procedure cfg_yoda
   [INFO   ] == Command start (output follows) =====
   [INFO   ] == Command exit (modification check follows) =====

Let's take a look at what configurations and changes come with this procedure:

.. code-block:: console

   $ tree -a

   .
   ├── .gitattributes
   ├── CHANGELOG.md
   ├── code
   │   ├── .gitattributes
   │   └── README.md
   └── README.md

Let's take a closer look into the ``.gitattributes`` files:

.. code-block:: console

   $ less .gitattributes

   **/.git* annex.largefiles=nothing
   CHANGELOG.md annex.largefiles=nothing
   README.md annex.largefiles=nothing

   $ less code/.gitattributes

   * annex.largefiles=nothing

Summarizing these two glimpses into the dataset, this configuration has

#. included a code directory in your dataset
#. included three files for human consumption (``README.md``, ``CHANGELOG.md``)
#. configured everything in the ``code/`` directory to be tracked by Git, not git-annex [#f5]_
#. and configured ``README.md`` and ``CHANGELOG.md`` in the root of the dataset to be
   tracked by Git.

Your next data analysis project can thus get a head start with useful configurations
and the start of a comprehensible directory structure by applying the ``yoda`` procedure.

Sources
^^^^^^^
This section is based on a comprehensive
`poster <https://f1000research.com/posters/7-1965>`_ and publicly
available `slides <https://github.com/myyoda/talk-principles>`_ about the
YODA principles.


.. rubric:: Footnotes

.. [#f1] "Why does the acronym contain itself?" you ask confused.
         "That's because it's a `recursive acronym <https://en.wikipedia.org/wiki/Recursive_acronym>`_,
         where the first letter stands recursively for the whole acronym." you get in response.
         "This is a reference to the recursiveness within a DataLad dataset -- all principles
         apply recursively to all the subdatasets a dataset has."
         "And what does all of this have to do with Yoda?" you ask mildly amused.
         "Oh, well. That's just because the DataLad team is full of geeks."

.. [#f2] If you want to learn more about Docker and Singularity, or general information
         about containerized computational environments for reproducible data science,
         check out `this section <https://the-turing-way.netlify.app/reproducible-research/renv/renv-containers.html>`_
         in the wonderful book `The Turing Way <https://the-turing-way.netlify.app>`_,
         a comprehensive guide to reproducible data science, or read about it in
         section :ref:`containersrun`.

.. [#f4] Substitute unfeasible with *wasteful*, *impractical*, or simply *stupid* if preferred.

.. [#f5] To re-read how ``.gitattributes`` work, go back to section :ref:`config`, and to remind yourself
         about how this worked for the ``text2git`` configuration, go back to section :ref:`text2git`.