466 lines
22 KiB
ReStructuredText
466 lines
22 KiB
ReStructuredText
.. index:: ! YODA principles
|
||
.. _2-001:
|
||
.. _yoda:
|
||
|
||
YODA: Best practices for data analyses in a dataset
|
||
---------------------------------------------------
|
||
|
||
The last requirement for the midterm projects reads "needs to comply to the
|
||
YODA principles".
|
||
"What are the YODA principles?" you ask, as you have never heard of this
|
||
before.
|
||
"The topic of today's lecture: Organizational principles of data
|
||
analyses in DataLad datasets. This lecture will show you the basic
|
||
principles behind creating, sharing, and publishing reproducible,
|
||
understandable, and open data analysis projects with DataLad.", you
|
||
hear in return.
|
||
|
||
The starting point...
|
||
^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Data analyses projects are very common, both in science and industry.
|
||
But it can be very difficult to produce a reproducible, let alone
|
||
*comprehensible* data analysis project.
|
||
Many data analysis projects do not start out with
|
||
a stringent organization, or fail to keep the structural organization of a
|
||
directory intact as the project develops. Often, this can be due to a lack of
|
||
version-control. In these cases, a project will quickly end up
|
||
with many
|
||
`almost-identical scripts suffixed with "_version_xyz" <https://phdcomics.com/comics/archive.php?comicid=1531>`_,
|
||
or a chaotic results structure split between various directories with names
|
||
such as ``results/``, ``results_August19/``, ``results_revision/`` and
|
||
``now_with_nicer_plots/``. Something like this is a very
|
||
common shape a data science project may take after a while:
|
||
|
||
.. code-block:: console
|
||
|
||
├── code/
|
||
│ ├── code_final/
|
||
│ │ ├── final_2/
|
||
│ │ │ ├── main_script_fixed.py
|
||
│ │ │ └──takethisscriptformostthingsnow.py
|
||
│ │ ├── utils_new.py
|
||
│ │ ├── main_script.py
|
||
│ │ ├── utils_new.py
|
||
│ │ ├── utils_2.py
|
||
│ │ └── main_analysis_newparameters.py
|
||
│ └── main_script_DONTUSE.py
|
||
├── data/
|
||
│ ├── data_updated/
|
||
│ │ └── dataset1/
|
||
│ │ └── datafile_a
|
||
│ ├── dataset1/
|
||
│ │ └── datafile_a
|
||
│ ├── outputs/
|
||
│ │ ├── figures/
|
||
│ │ │ ├── figures_new.py
|
||
│ │ │ └── figures_final_forreal.py
|
||
│ │ ├── important_results/
|
||
│ │ ├── random_results_file.tsv
|
||
│ │ ├── results_for_paper/
|
||
│ │ ├── results_for_paper_revised/
|
||
│ │ └── results_new_data/
|
||
│ ├── random_results_file.tsv
|
||
│ ├── random_results_file_v2.tsv
|
||
|
||
[...]
|
||
|
||
All data analysis endeavors in directories like this *can* work, for a while,
|
||
if there is a person who knows the project well, and works on it all the time.
|
||
But it inevitably will get messy once anyone tries to collaborate on a project
|
||
like this, or simply goes on a two-week vacation and forgets whether
|
||
the function in ``main_analysis_newparameters.py`` or the one in
|
||
``takethisscriptformostthingsnow.py`` was the one that created a particular figure.
|
||
|
||
But even if a project has an intuitive structure, and *is* version
|
||
controlled, in many cases an analysis script will stop working, or maybe worse,
|
||
will produce different results, because the software and tools used to
|
||
conduct the analysis in the first place got an update. This update may have
|
||
come with software changes that made functions stop working, or work differently
|
||
than before.
|
||
In the same vein, recomputing an analysis project on a different machine than
|
||
the one the analysis was developed on can fail if the necessary
|
||
software in the required versions is not installed or available on this new machine.
|
||
The analysis might depend on software that runs on a Linux machine, but the project
|
||
was shared with a Windows user. The environment during analysis development used
|
||
Python 2, but the new system has only Python 3 installed. Or one of the dependent
|
||
libraries needs to be in version X, but is installed as version Y.
|
||
|
||
The YODA principles are a clear set of organizational standards for
|
||
datasets used for data analysis projects that aim to overcome issues like the
|
||
ones outlined above. The name stands for
|
||
"YODAs Organigram on Data Analysis" [#f1]_. The principles outlined
|
||
in YODA set simple rules for directory names and structures, best-practices for
|
||
version-controlling dataset elements and analyses, facilitate
|
||
usage of tools to improve the reproducibility and accountability
|
||
of data analysis projects, and make collaboration easier.
|
||
They are summarized in three basic principles, that translate to both
|
||
dataset structures and best practices regarding the analysis:
|
||
|
||
- :ref:`P1`
|
||
|
||
- :ref:`P2`
|
||
|
||
- :ref:`P3`
|
||
|
||
As you will see, complying to these principles is easy if you
|
||
use DataLad. Let's go through them one by one:
|
||
|
||
.. _P1:
|
||
|
||
P1: One thing, one dataset
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
Whenever a particular collection of files could be useful in more
|
||
than one context, make them a standalone, modular component.
|
||
In the broadest sense, this means to structure your study elements (data, code,
|
||
computational environments, results, ...) in dedicated directories. For example:
|
||
|
||
|
||
- Store **input data** for an analysis in a dedicated ``inputs/`` directory.
|
||
Keep different formats or processing-stages of your input data as individual,
|
||
modular components: Do not mix raw data, data that is already structured
|
||
following community guidelines of the given field, or preprocessed data, but create
|
||
one data component for each of them. And if your analysis
|
||
relies on two or more data collections, these collections should each be an
|
||
individual component, not combined into one.
|
||
|
||
- Store scripts or **code** used for the analysis of data in a dedicated ``code/``
|
||
directory, outside of the data component of the dataset.
|
||
|
||
- Collect **results** of an analysis in a dedicated place, outside of the ``inputs/`` directory, and
|
||
leave the input data of an analysis untouched by your computations. (Ideally, don't create an "``outputs``" directory for this - when your dataset becomes an input for another, the resulting file paths would be something like ``/inputs/outputs/somefile.dat``)
|
||
|
||
- Include a place for complete **execution environments**, such as
|
||
`singularity images <https://singularity.lbl.gov>`_ or
|
||
`docker containers <https://www.docker.com/get-started>`_ [#f2]_, in
|
||
the form of an ``envs/`` directory, if relevant for your analysis.
|
||
|
||
- And if you conduct multiple different analyses, create a dedicated
|
||
project for each analysis, instead of conflating them.
|
||
|
||
This, for example, would be a directory structure from the root of a
|
||
superdataset of a very comprehensive data analysis project complying to the YODA principles:
|
||
|
||
.. code-block:: console
|
||
|
||
├── ci/ # continuous integration configuration
|
||
│ └── .travis.yml
|
||
├── code/ # your code
|
||
│ ├── tests/ # unit tests to test your code
|
||
│ │ └── test_myscript.py
|
||
│ └── myscript.py
|
||
├── docs # documentation about the project
|
||
│ ├── build/
|
||
│ └── source/
|
||
├── envs # computational environments
|
||
│ └── Singularity
|
||
├── inputs/ # dedicated inputs/, will not be changed by an analysis
|
||
│ └─── data/
|
||
│ ├── dataset1/ # one stand-alone data component
|
||
│ │ └── datafile_a
|
||
│ └── dataset2/
|
||
│ └── datafile_a
|
||
├── important_results/ # outputs away from the input data
|
||
│ └── figures/
|
||
├── CHANGELOG.md # notes for fellow humans about your project
|
||
├── HOWTO.md
|
||
└── README.md
|
||
|
||
You can get a few non-DataLad related advice for structuring your directories in the :ref:`on best practices for analysis organization <fom-yodaproject>`.
|
||
|
||
.. index::
|
||
pair: recommendation; dataset content organization
|
||
.. find-out-more:: More best practices for organizing contents in directories
|
||
:name: fom-yodaproject
|
||
:float:
|
||
|
||
The exemplary YODA directory structure is very comprehensive, and displays many best-practices for
|
||
reproducible data science. For example,
|
||
|
||
#. Within ``code/``, it is best practice to add **tests** for the code.
|
||
These tests can be run to check whether the code still works.
|
||
|
||
#. It is even better to further use automated computing such as
|
||
`continuous integration (CI) systems <https://en.wikipedia.org/wiki/Continuous_integration>`_,
|
||
to test the functionality of your functions and scripts automatically.
|
||
If relevant, the setup for continuous integration frameworks (such as
|
||
`Appveyor <https://www.appveyor.com>`_) lives outside of ``code/``,
|
||
in a dedicated ``ci/`` directory.
|
||
|
||
#. Include **documents for fellow humans**: Notes in a README.md or a HOWTO.md,
|
||
or even proper documentation (for example, using in a dedicated ``docs/`` directory.
|
||
Within these documents, include all relevant metadata for your analysis. If you are
|
||
conducting a scientific study, this might be authorship, funding,
|
||
change log, etc.
|
||
|
||
If writing tests for analysis scripts or using continuous integration
|
||
is a new idea for you, but you want to learn more, check out
|
||
`this chapter on testing <https://the-turing-way.netlify.app/reproducible-research/testing>`_.
|
||
|
||
There are many advantages to this modular way of organizing contents.
|
||
Having input data as independent components that are not altered (only
|
||
consumed) by an analysis does not conflate the data for
|
||
an analysis with the results or the code, thus assisting understanding
|
||
the project for anyone unfamiliar with it.
|
||
But more than just structure, this organization aids modular reuse or
|
||
publication of the individual components, for example data. In a
|
||
YODA-compliant dataset, any processing stage of a data component can
|
||
be reused in a new project or published and shared. The same is true
|
||
for a whole analysis dataset. At one point you might also write a
|
||
scientific paper about your analysis in a paper project, and the
|
||
whole analysis project can easily become a modular component in a paper
|
||
project, to make sharing paper, code, data, and results easy.
|
||
The use case :ref:`usecase_reproducible_paper` contains a step-by-step instruction on
|
||
how to build and share such a reproducible paper, if you want to learn
|
||
more.
|
||
|
||
|
||
|
||
.. figure:: ../artwork/src/img/dataset_modules.svg
|
||
:width: 100%
|
||
:name: dataset_modules
|
||
:alt: Modular structure of a data analysis project
|
||
|
||
Data are modular components that can be reused easily.
|
||
|
||
The directory tree above and :numref:`dataset_modules` highlight different aspects
|
||
of this principle. The directory tree illustrates the structure of
|
||
the individual pieces on the file system from the point of view of
|
||
a single top-level dataset with a particular purpose. For example, it
|
||
could be an analysis dataset created by a statistician for a scientific
|
||
project, and it could be shared between collaborators or
|
||
with others during development of the project. In this
|
||
superdataset, code is created that operates on input data to
|
||
compute outputs, and the code and outputs are captured,
|
||
version-controlled, and linked to the input data. Each input data in turn
|
||
is a (potentially nested) subdataset, but this is not visible
|
||
in the directory hierarchy.
|
||
:numref:`dataset_modules`, in comparison, emphasizes a process view on a project and
|
||
the nested structure of input subdataset:
|
||
You can see how the preprocessed data that serves as an input for
|
||
the analysis datasets evolves from raw data to
|
||
standardized data organization to its preprocessed state. Within
|
||
the ``data/`` directory of the file system hierarchy displayed
|
||
above one would find data datasets with their previous version as
|
||
a subdataset, and this is repeated recursively until one reaches
|
||
the raw data as it was originally collected at one point. A finished
|
||
analysis project in turn can be used as a component (subdataset) in
|
||
a paper project, such that the paper is a fully reproducible research
|
||
object that shares code, analysis results, and data, as well as the
|
||
history of all of these components.
|
||
|
||
Principle 1, therefore, encourages to structure data analysis
|
||
projects in a clear and modular fashion that makes use of nested
|
||
DataLad datasets, yielding comprehensible structures and reusable
|
||
components. Having each component version-controlled --
|
||
regardless of size -- will aid keeping directories clean and
|
||
organized, instead of piling up different versions of code, data,
|
||
or results.
|
||
|
||
.. _P2:
|
||
|
||
P2: Record where you got it from, and where it is now
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
It is good to have data, but it is even better if you and anyone you
|
||
collaborate or share the project or its components with can find
|
||
out where the data came from, or how it
|
||
is dependent on or linked to other data. Therefore, this principle
|
||
aims to attach this information, the data's :term:`provenance`, to the components of
|
||
your data analysis project.
|
||
|
||
Luckily, this is a no-brainer with DataLad, because the core data structure
|
||
of DataLad, the dataset, and many of the DataLad commands already covered
|
||
up to now fulfill this principle.
|
||
|
||
If data components of a project are DataLad datasets, they can
|
||
be included in an analysis superdataset as subdatasets. Thanks to
|
||
:dlcmd:`clone`, information on the source of these subdatasets
|
||
is stored in the history of the analysis superdataset, and they can even be
|
||
updated from those sources if the original data dataset gets extended or changed.
|
||
If you are including a file, for example, code from GitHub,
|
||
the :dlcmd:`download-url` command, introduced in section :ref:`populate`,
|
||
will record the source of it safely in the dataset's history. And if you add anything to your dataset,
|
||
from simple incremental coding progress in your analysis scripts up to
|
||
files that a colleague sent you via email, a plain :dlcmd:`save`
|
||
with a helpful commit message goes a very long way to fulfill this principle
|
||
on its own already.
|
||
|
||
One core aspect of this principle is *linking* between reusable data
|
||
resource units (i.e., DataLad subdatasets containing pure data). You will
|
||
be happy to hear that this is achieved by simply installing datasets
|
||
as subdatasets, as :numref:`fig-subds` shows.
|
||
This part of this principle will therefore be absolutely obvious to you
|
||
because you already know how to install and nest datasets within datasets.
|
||
"I might just overcome my impostor syndrome if I experience such advanced
|
||
reproducible analysis concepts as being obvious", you think with a grin.
|
||
|
||
.. _fig-subds:
|
||
|
||
.. figure:: ../artwork/src/img/data_origin.svg
|
||
:width: 50%
|
||
:alt: Datasets are installed as subdatasets
|
||
|
||
Schematic illustration of two standalone data datasets installed as subdatasets
|
||
into an analysis project.
|
||
|
||
But more than linking datasets in a superdataset, linkage also needs to
|
||
be established between components of your dataset. Scripts inside of
|
||
your ``code/`` directory should point to data not as :term:`absolute path`\s
|
||
that would only work on your system, but instead as :term:`relative path`\s
|
||
that will work in any shared copy of your dataset. The next section
|
||
demonstrates a YODA data analysis project and will show concrete examples of this.
|
||
|
||
Lastly, this principle also includes *moving*, *sharing*, and *publishing* your
|
||
datasets or its components.
|
||
It is usually costly to collect data, and economically unfeasible [#f4]_ to keep
|
||
it locked in a drawer (or similarly out of reach behind complexities of
|
||
data retrieval or difficulties in understanding the data structure).
|
||
But conducting several projects on the same dataset yourself, sharing it with
|
||
collaborators, or publishing it is easy if the project is a DataLad dataset
|
||
that can be installed and retrieved on demand, and is kept clean from
|
||
everything that is not part of the data according to principle 1.
|
||
Conducting transparent open science is easier if you can link code, data,
|
||
and results within a dataset, and share everything together. In conjunction
|
||
with principle 1, this means that you can distribute your analysis projects
|
||
(or parts of it) in a comprehensible form, exemplified in :numref:`fig-yodads`.
|
||
|
||
.. _fig-yodads:
|
||
|
||
.. figure:: ../artwork/src/img/decentralized_publishing.svg
|
||
:figwidth: 100%
|
||
:alt: A full data analysis workflow complying with YODA principles
|
||
|
||
In a dataset that complies to the YODA principles, modular components
|
||
(data, analysis results, papers) can be shared or published easily.
|
||
|
||
Principle 2, therefore, facilitates transparent linkage of datasets and their
|
||
components to other components, their original sources, or shared copies.
|
||
With the DataLad tools you learned to master up to this point,
|
||
you have all the necessary skills to comply to it already.
|
||
|
||
.. _P3:
|
||
|
||
P3: Record what you did to it, and with what
|
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||
|
||
This last principle is about capturing *how exactly the content of
|
||
every file came to be* that was not obtained from elsewhere. For example,
|
||
this relates to results generated from inputs by scripts or commands.
|
||
The section :ref:`run` already outlined the problem of associating
|
||
a result with an input and a script. It can be difficult to link a
|
||
figure from your data analysis project with an input data file or a
|
||
script, even if you created this figure yourself.
|
||
The :dlcmd:`run` command however mitigates these difficulties,
|
||
and captures the provenance of any output generated with a
|
||
``datalad run`` call in the history of the dataset. Thus, by using
|
||
:dlcmd:`run` in analysis projects, your dataset knows
|
||
which result was generated when, by which author, from which inputs,
|
||
and by means of which command.
|
||
|
||
With another DataLad command one can even go one step further:
|
||
The command :dlcmd:`containers-run` - it will be introduced in
|
||
section :ref:`containersrun` - performs a command execution within
|
||
a configured containerized environment. Thus, not only inputs,
|
||
outputs, command, time, and author, but also the *software environment*
|
||
are captured as provenance of a dataset component such as a results file,
|
||
and, importantly, can be shared together with the dataset in the
|
||
form of a software container.
|
||
|
||
Tip: Make use of ``datalad run``'s ``--dry-run`` option to craft your run-command, as outlined in :ref:`dryrun`!
|
||
|
||
With this last principle, your dataset collects and stores provenance
|
||
of all the contents you created in the wake of your analysis project.
|
||
This established trust in your results, and enables others to understand
|
||
where files derive from.
|
||
|
||
.. _yodaproc:
|
||
|
||
The YODA procedure
|
||
^^^^^^^^^^^^^^^^^^
|
||
|
||
There is one tool that can make starting a yoda-compliant data analysis
|
||
easier: DataLad's ``yoda`` procedure. Just as the ``text2git`` procedure
|
||
from section :ref:`createds`, the ``yoda`` procedure can be included in a
|
||
:dlcmd:`create` command and will apply useful configurations
|
||
to your dataset:
|
||
|
||
.. code-block:: console
|
||
|
||
$ datalad create -c yoda "my_analysis"
|
||
|
||
[INFO ] Creating a new annex repo at /home/me/repos/testing/my_analysis
|
||
create(ok): /home/me/repos/testing/my_analysis (dataset)
|
||
[INFO ] Running procedure cfg_yoda
|
||
[INFO ] == Command start (output follows) =====
|
||
[INFO ] == Command exit (modification check follows) =====
|
||
|
||
Let's take a look at what configurations and changes come with this procedure:
|
||
|
||
.. code-block:: console
|
||
|
||
$ tree -a
|
||
|
||
.
|
||
├── .gitattributes
|
||
├── CHANGELOG.md
|
||
├── code
|
||
│ ├── .gitattributes
|
||
│ └── README.md
|
||
└── README.md
|
||
|
||
Let's take a closer look into the ``.gitattributes`` files:
|
||
|
||
.. code-block:: console
|
||
|
||
$ less .gitattributes
|
||
|
||
**/.git* annex.largefiles=nothing
|
||
CHANGELOG.md annex.largefiles=nothing
|
||
README.md annex.largefiles=nothing
|
||
|
||
$ less code/.gitattributes
|
||
|
||
* annex.largefiles=nothing
|
||
|
||
Summarizing these two glimpses into the dataset, this configuration has
|
||
|
||
#. included a code directory in your dataset
|
||
#. included three files for human consumption (``README.md``, ``CHANGELOG.md``)
|
||
#. configured everything in the ``code/`` directory to be tracked by Git, not git-annex [#f5]_
|
||
#. and configured ``README.md`` and ``CHANGELOG.md`` in the root of the dataset to be
|
||
tracked by Git.
|
||
|
||
Your next data analysis project can thus get a head start with useful configurations
|
||
and the start of a comprehensible directory structure by applying the ``yoda`` procedure.
|
||
|
||
Sources
|
||
^^^^^^^
|
||
This section is based on a comprehensive
|
||
`poster <https://f1000research.com/posters/7-1965>`_ and publicly
|
||
available `slides <https://github.com/myyoda/talk-principles>`_ about the
|
||
YODA principles.
|
||
|
||
|
||
.. rubric:: Footnotes
|
||
|
||
.. [#f1] "Why does the acronym contain itself?" you ask confused.
|
||
"That's because it's a `recursive acronym <https://en.wikipedia.org/wiki/Recursive_acronym>`_,
|
||
where the first letter stands recursively for the whole acronym." you get in response.
|
||
"This is a reference to the recursiveness within a DataLad dataset -- all principles
|
||
apply recursively to all the subdatasets a dataset has."
|
||
"And what does all of this have to do with Yoda?" you ask mildly amused.
|
||
"Oh, well. That's just because the DataLad team is full of geeks."
|
||
|
||
.. [#f2] If you want to learn more about Docker and Singularity, or general information
|
||
about containerized computational environments for reproducible data science,
|
||
check out `this section <https://the-turing-way.netlify.app/reproducible-research/renv/renv-containers.html>`_
|
||
in the wonderful book `The Turing Way <https://the-turing-way.netlify.app>`_,
|
||
a comprehensive guide to reproducible data science, or read about it in
|
||
section :ref:`containersrun`.
|
||
|
||
.. [#f4] Substitute unfeasible with *wasteful*, *impractical*, or simply *stupid* if preferred.
|
||
|
||
.. [#f5] To re-read how ``.gitattributes`` work, go back to section :ref:`config`, and to remind yourself
|
||
about how this worked for the ``text2git`` configuration, go back to section :ref:`text2git`.
|