datalad-handbook/docs/basics/101-128-summary_yoda.rst

68 lines
3.3 KiB
ReStructuredText

.. _summary_yoda:
Summary
-------
The YODA principles are a small set of guidelines that can make a huge
difference towards reproducibility, comprehensibility, and transparency
in a data analysis project. By applying them in your own midterm analysis
project, you have experienced their immediate benefits.
You also noticed that these standards are not complex -- quite the opposite,
they are very intuitive.
They structure essential components of a data analysis project --
data, code, potentially computational environments, and lastly also the results --
in a modular and practical way, and use basic principles and commands
of DataLad you are already familiar with.
There are many advantages to this organization of contents.
- Having input data as independent dataset(s) that are not influenced (only
consumed) by an analysis allows for a modular reuse of pure data datasets,
and does not conflate the data of an analysis with the results or the code.
You have experienced this with the ``iris_data`` subdataset.
- Keeping code within an independent, version-controlled directory, but as a part
of the analysis dataset, makes sharing code easy and transparent, and helps
to keep directories neat and organized. Moreover,
with the data as subdatasets, data and code can be automatically shared together.
By complying to this principle, you were able to submit both code and data
in a single superdataset.
- Keeping an analysis dataset fully self-contained with relative instead of
absolute paths in scripts is critical to ensure that an analysis reproduces
easily on a different computer.
- DataLad's Python API makes all of DataLad's functionality available in
Python, either as standalone functions that are exposed via ``datalad.api``,
or as methods of the ``Dataset`` class.
This provides an alternative to the command line, but it also opens up the
possibility of performing DataLad commands directly inside of scripts.
- Including the computational environment into an analysis dataset encapsulates
software and software versions, and thus prevents re-computation failures
(or sudden differences in the results) once
software is updated, and software conflicts arising on different machines
than the one the analysis was originally conducted on. You have not yet
experienced how to do this first-hand, but you will in a later section.
- Having all of these components as part of a DataLad dataset allows version
controlling all pieces within the analysis regardless of their size, and
generates provenance for everything, especially if you make use of the tools
that DataLad provides. This way, anyone can understand and even reproduce
your analysis without much knowledge about your project.
- The yoda procedure is a good starting point to build your next data analysis
project up on.
Now what can I do with it?
^^^^^^^^^^^^^^^^^^^^^^^^^^
Using tools that DataLad provides you are able to make the most out of
your data analysis project. The YODA principles are a guide to accompany
you on your path to reproducibility and provenance-tracking.
What should have become clear in this section is that you are already
equipped with enough DataLad tools and knowledge that complying to these
standards felt completely natural and effortless in your midterm analysis
project.