.. index:: ! Usecase; reproducible paper

.. _usecase_reproducible_paper:

Writing a reproducible paper
----------------------------

This use case demonstrates how to use nested DataLad datasets to create a fully
reproducible paper by linking

#. (different) DataLad dataset sources with
#. the code needed to compute results and
#. LaTeX files to compile the resulting paper.

The different components each exist in individual DataLad datasets and are
aggregated into a single :term:`DataLad superdataset` complying with the YODA principles
for data analysis projects [#f1]_. The resulting superdataset can be publicly
shared, data can be obtained effortlessly on demand by anyone who has the superdataset,
and results and paper can be generated and recomputed anywhere on demand.

A template to start your own reproducible paper with the same setup can be found `on GitHub <https://github.com/datalad-handbook/repro-paper-sketch>`__.

The Challenge
^^^^^^^^^^^^^

Over the past year, Steve worked on the implementation of an algorithm as a software package.
For testing purposes, he used one of his own data collections, and later also included a publicly shared
data collection. After completion, he continued to work on validation analyses to
prove the functionality and usefulness of his software. Next to a directory in which he developed
his code, and directories with data he tested his code on, he now also has other directories
with different data sources used for validation analyses.
"This cannot take too long!" Steve thinks optimistically when he finally sits down to write up a paper.

His scripts run his algorithm on the different data collections and create derivatives of his raw data,
pretty figures, and impressive tables.
Just after he hand-copies and checks the last decimal of the final result in the very
last table of his manuscript, he realizes that the script specified the wrong parameter
values, and all of the results need to be recomputed - and obviously updated in his manuscript.
When writing the discussion, he finds a paper that reports an error in the publicly shared
data collection he uses. After many more days of updating tables and fixing data columns
by hand, he finally submits the paper. Trying to stand by his values of
open and reproducible science, he struggles to bundle all scripts, algorithm code, and data
he used in a shareable form, and frankly, with all the extra time this manuscript has taken
him so far, he lacks motivation and time. In the end, he writes a three-page-long README
file in his GitHub code repository, includes his email address for data requests, and
secretly hopes that no one will want to recompute his results, because by now even he
himself has forgotten which script ran on which dataset and what data was fixed in which way,
or whether he was careful enough to copy all of the results correctly. In the review process,
reviewer 2 demands a new color scheme for the figures his software produces,
which requires updates to his software package, and more recomputations.

The DataLad Approach
^^^^^^^^^^^^^^^^^^^^

Steve sets up a DataLad dataset and calls it ``algorithm-paper``. In this
dataset, he creates several subdirectories to collate everything that is relevant for
the manuscript: data, code, and a manuscript backbone without results.
``code/`` contains a Python script that he uses for validation analyses, and,
prior to computing results, the script
attempts to retrieve the data via DataLad's Python API should the files not yet be available locally.
``data/`` contains a separate DataLad subdataset for every dataset he uses. An
``algorithm/`` directory is a DataLad dataset containing a clone of his software repository,
and within it, in the directory ``tests/data/``, are additional DataLad subdatasets that
contain the data he used for testing.
Lastly, the DataLad superdataset contains a ``LaTeX`` ``.tex`` file with the text of the manuscript.
When everything is set up, a single command line call triggers (optional) data retrieval
from the datasets' GitHub repositories, computation of
results and figures, automatic embedding of the freshly computed results and figures into his
manuscript, and PDF compilation.

When he notices the error in his script, his manuscript is recompiled and updated
with a single command line call, and when he learns about the data error,
he updates the respective DataLad dataset
to the fixed state while preserving the history of the data repository.

He makes his superdataset a public repository on GitHub, and anyone who clones it can obtain the
data automatically and recompute and recompile the full manuscript with all results.
Steve never had more confidence in his research results and proudly submits his manuscript.
During review, the color scheme update in his algorithm's source code is integrated with a simple
update of the ``algorithm/`` subdataset, and upon command-line invocation his manuscript updates
itself with the new figures.

.. importantnote:: Take a look at the real manuscript dataset

   The actual manuscript this use case is based on can be found
   `here <https://github.com/psychoinformatics-de/paper-remodnav>`_:
   https://github.com/psychoinformatics-de/paper-remodnav. :dlcmd:`clone`
   the repository and follow the few instructions in the README to experience the
   DataLad approach described above.

   There is also a slimmed-down template that uses the analysis demonstrated in :ref:`yoda_project` and packages it up into a reproducible paper using the same tools: `github.com/datalad-handbook/repro-paper-sketch <https://github.com/datalad-handbook/repro-paper-sketch>`_.

Step-by-Step
^^^^^^^^^^^^

:dlcmd:`create` a DataLad dataset. In this example, it is named "algorithm-paper",
and :dlcmd:`create` uses the ``yoda`` procedure [#f1]_ to apply useful configurations
for a data analysis project:

.. code-block:: bash

   $ datalad create -c yoda algorithm-paper

   [INFO   ] Creating a new annex repo at /home/adina/repos/testing/algorithm-paper
   create(ok): /home/adina/repos/testing/algorithm-paper (dataset)

This newly created directory already has a ``code/`` directory that will be tracked with Git,
as well as ``README.md`` and ``CHANGELOG.md`` files,
thanks to the ``yoda`` procedure applied above. Additionally, create a subdirectory ``data/`` within
the dataset. This project thus already has a comprehensible structure:

.. code-block:: bash

   $ cd algorithm-paper
   $ mkdir data

   # you can check out the directory structure with the tree command

   $ tree
   algorithm-paper
   ├── CHANGELOG.md
   ├── code
   │   └── README.md
   ├── data
   └── README.md

All of your analysis scripts should live in the ``code/`` directory, and all input data should
live in the ``data/`` directory.

To populate the DataLad dataset, add all the
data collections you want to perform analyses on as individual DataLad subdatasets within
``data/``.
In this example, all data collections are already DataLad datasets or Git repositories hosted on GitHub.
:dlcmd:`clone` therefore installs them, with ``-d ../``
registering them as subdatasets of the superdataset [#f2]_.

.. code-block:: bash

   $ cd data
   # clone existing Git repositories with data (the URL specifies the source, in this case, GitHub repositories)
   # -d points to the root of the superdataset
   $ datalad clone -d ../ https://github.com/psychoinformatics-de/studyforrest-data-phase2.git raw_eyegaze

   [INFO   ] Cloning https://github.com/psychoinformatics-de/studyforrest-data-phase2.git [1 other candidates] into '/home/adina/repos/testing/algorithm-paper/data/raw_eyegaze'
   install(ok): /home/adina/repos/testing/algorithm-paper/data/raw_eyegaze (dataset)

   $ datalad clone -d ../ git@github.com:psychoinformatics-de/studyforrest-data-eyemovementlabels.git

   [INFO   ] Cloning git@github.com:psychoinformatics-de/studyforrest-data-eyemovementlabels.git into '/home/adina/repos/testing/algorithm-paper/data/studyforrest-data-eyemovementlabels'
   Cloning (compressing objects):  45% 1.80k/4.00k [00:01<00:01, 1.29k objects/s
   [...]

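The same operation is also available through DataLad's Python API. The following is a
minimal sketch, not part of the original setup, that mirrors the first clone call above
when run from the root of the superdataset:

.. code-block:: python

   # a sketch of the equivalent Python API call, run from the
   # root of the superdataset instead of from within data/
   from datalad.api import clone

   clone(source='https://github.com/psychoinformatics-de/studyforrest-data-phase2.git',
         path='data/raw_eyegaze',
         dataset='.')
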
Any script needed for the analysis should live inside ``code/``. During script writing, save any changes
you want to record in your history with :dlcmd:`save`.

The eventual outcome of this work is a GitHub repository that anyone can use to get the data
and recompute all results
by running the script after cloning and setting up the necessary software.
This requires minor preparation:

* The final analysis should be able to run on anyone's file system.
  It is therefore important to reference data files within the scripts in ``code/`` as
  :term:`relative path`\s instead of hard-coding :term:`absolute path`\s.

* After cloning the ``algorithm-paper`` repository, data files are not yet present
  locally. To spare users the work of a manual :dlcmd:`get`, you can have your
  script take care of data retrieval via DataLad's Python API.

These two preparations can be seen in this excerpt from the Python script:

.. code-block:: python

   # import DataLad's API
   from datalad.api import get
   # imports needed by this excerpt (shown here for completeness)
   import os.path as op
   from glob import glob

   # note that the datapath is relative
   datapath = op.join('data',
                      'studyforrest-data-eyemovementlabels',
                      'sub*',
                      '*run-2*.tsv')
   data = sorted(glob(datapath))

   # this will get the data if it is not yet retrieved
   get(dataset='.', path=data)

Lastly, :dlcmd:`clone` the software repository as a subdataset in the
root of the superdataset [#f3]_.

.. code-block:: bash

   # in the root of ``algorithm-paper`` run
   $ datalad clone -d . git@github.com:psychoinformatics-de/remodnav.git

This repository also has subdatasets in which the datasets used for testing live (``tests/data/``):

.. code-block:: bash

   $ tree
   [...]

   ├── remodnav
   │   ├── clf.py
   │   ├── __init__.py
   │   ├── __main__.py
   │   └── tests
   │       ├── data
   │       │   ├── anderson_etal
   │       │   └── studyforrest

At this stage, a public ``algorithm-paper`` repository shares code and data, and changes to any
dataset can easily be handled by updating the respective subdataset.
This already is a big leap towards open and reproducible science. Thanks to DataLad, code,
data, and the history of all code and data are easily shared - with exact versions of all
components, bound together in a single, fully tracked research object.
By making use of DataLad's Python API and :term:`relative path`\s in scripts,
data retrieval is automated, and scripts can run on any other computer.

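To illustrate what this means for consumers, here is a minimal sketch of the workflow
after publication, using the template repository mentioned at the top of this use case
as a stand-in for the published superdataset:

.. code-block:: python

   # a sketch of the consumer side: obtain the superdataset with code,
   # manuscript backbone, and registered subdatasets ...
   from datalad.api import clone

   paper = clone(source='https://github.com/datalad-handbook/repro-paper-sketch.git',
                 path='repro-paper-sketch')

   # ... and retrieve all registered content on demand; subdatasets are
   # installed along the way
   paper.get('.', recursive=True)
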
Automation with existing tools
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To go beyond that and include freshly computed results in a manuscript on the fly does not
require DataLad anymore, only some understanding of Python, ``LaTeX``, and Makefiles. As with most things,
it's a surprisingly simple challenge if one has just seen how to do it once.
This last section will therefore outline how to compile the results into a PDF manuscript and
automate this process.
In principle, the challenge boils down to:

#. have the script output results (only requires ``print()`` statements)

#. capture these results automatically (done with a single line of Unix commands)

#. embed the captured results in the PDF (done with one line in the ``.tex`` file and
   some clever referencing)

#. automate as much as possible to keep it as simple as possible (done with a Makefile)

That does not sound too bad, does it?

Let's start by revealing how this magic trick works. Everything relies on printing
the results in the form of user-defined ``LaTeX`` definitions (using the ``\newcommand``
command), referencing those definitions in your manuscript where the
results should end up, and binding the ``\newcommand``\s as ``\input{}`` to your ``.tex``
file. But let's get there in small steps.

First, if you want to read up on ``\newcommand``, please see
`its documentation <https://en.wikibooks.org/wiki/LaTeX/Macros>`_.
The command syntax looks like this:

``\newcommand{\name}[num]{definition}``

What we want to do, expressed in the most human-readable form, is this:

``\newcommand{\TableOneCellOneRowOne}{0.67}``

where ``0.67`` would be a single result computed by your script.
This requires ``print()`` statements that look like this in their simplest
form (excerpt from the script):

.. code-block:: python

   print('\\newcommand{\\maxmclf}{{%.2f}}' % max_mclf)

where ``max_mclf`` is a variable that stores the value of one computation.

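If many such definitions are needed, a small helper keeps the print statements readable.
The helper below is not part of the actual manuscript code, just a sketch of the same
formatting, with a made-up value for illustration:

.. code-block:: python

   # a sketch of a reusable helper (not part of the actual manuscript code)
   def latex_def(name, value, fmt='%.2f'):
       """Format a single result as a LaTeX \\newcommand definition."""
       return '\\newcommand{\\%s}{%s}' % (name, fmt % value)

   # made-up value for illustration
   max_mclf = 0.8712
   print(latex_def('maxmclf', max_mclf))
   # prints: \newcommand{\maxmclf}{0.87}
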
Tables and references to results within the ``.tex`` files then do not contain the
specific value ``0.67`` (this value would change if the data or parameters change),
but ``\maxmclf`` (and similar, unique names for other results).
For full tables, one can come up with naming schemes that make it easy
to fill tables with unique names with minimal work, for example like this (excerpt):

.. code-block:: tex

   \begin{table}[tbp]
     \caption{Cohen's Kappa reliability between human coders (MN, RA),
       and \remodnav\ (AL) with each of the human coders.
       }
     \label{tab:kappa}
     \begin{tabular*}{0.5\textwidth}{c @{\extracolsep{\fill}}llll}
       \textbf {Fixations} & & \\
       \hline\noalign{\smallskip}
       Comparison & Images & Dots \\
       \noalign{\smallskip}\hline\noalign{\smallskip}
       MN versus RA & \kappaRAMNimgFix & \kappaRAMNdotsFix \\
       AL versus RA & \kappaALRAimgFix & \kappaALRAdotsFix \\
       AL versus MN & \kappaALMNimgFix & \kappaALMNdotsFix \\
       \noalign{\smallskip}
       \textbf{Saccades} & & \\
       \hline\noalign{\smallskip}
       Comparison & Images & Dots \\
       \noalign{\smallskip}\hline\noalign{\smallskip}
       MN versus RA & \kappaRAMNimgSac & \kappaRAMNdotsSac \\
       AL versus RA & \kappaALRAimgSac & \kappaALRAdotsSac \\
       AL versus MN & \kappaALMNimgSac & \kappaALMNdotsSac \\
       \noalign{\smallskip}
       % [..] more content omitted
     \end{tabular*}
   \end{table}

Without diving into the context of the paper, this table contains results for
three comparisons ("MN versus RA", "AL versus RA", "AL versus MN"), for three
event types (Fixations, Saccades, and post-saccadic oscillations (PSO)), and three different
stimulus types (Images, Dots, and Videos). The latter event and stimulus types are omitted for
better readability of the ``.tex`` excerpt. Here is how this table looks in the manuscript
(cropped to match the ``.tex`` snippet):

.. figure:: ../artwork/src/img/remodnav.png

It might appear tedious to write scripts that output results for such tables with individual names.
However, the ``print()`` statements that fill those tables can utilize Python's string formatting methods
and loops to keep the code within a few lines for a full table, such as:

.. code-block:: python

   # iterate over stimulus categories
   for stim in ['img', 'dots', 'video']:
       # iterate over event categories
       for ev in ['Fix', 'Sac', 'PSO']:

           [...]

           # create the combinations
           for rating, comb in [('RAMN', [RA_res_flat, MN_res_flat]),
                                ('ALRA', [RA_res_flat, AL_res_flat]),
                                ('ALMN', [MN_res_flat, AL_res_flat])]:
               kappa = cohen_kappa_score(comb[0], comb[1])
               label = 'kappa{}{}{}'.format(rating, stim, ev)
               # print the result
               print('\\newcommand{\\%s}{%s}' % (label, '%.2f' % kappa))

Running the Python script will hence print plenty of LaTeX commands to your screen (try it out
in the actual manuscript, if you want!). This was step number 1 of 4.

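If you cannot run the actual manuscript code, the following self-contained sketch mimics
the loop above with made-up numbers, so you can see the generated definitions without any
input data (``random.uniform`` merely stands in for the real ``cohen_kappa_score`` computation):

.. code-block:: python

   # self-contained illustration of the naming scheme with made-up values
   import random

   random.seed(0)  # reproducible made-up values

   for stim in ['img', 'dots', 'video']:
       for ev in ['Fix', 'Sac', 'PSO']:
           for rating in ['RAMN', 'ALRA', 'ALMN']:
               kappa = random.uniform(0.5, 1.0)  # placeholder result
               label = 'kappa{}{}{}'.format(rating, stim, ev)
               print('\\newcommand{\\%s}{%s}' % (label, '%.2f' % kappa))
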
.. find-out-more:: How about figures?

   To include figures, they just need to be saved into a dedicated location (for example,
   a directory ``img/``) and included into the ``.tex`` file with standard ``LaTeX`` syntax.
   Larger figures with subfigures can be created by combining several figures:

   .. code-block:: tex

      \begin{figure*}[tbp]
        \includegraphics[trim=0 8mm 3mm 0,clip,width=.5\textwidth]{img/mainseq_lab}
        \includegraphics[trim=8mm 8mm 0 0,clip,width=.5\textwidth-3.3mm]{img/mainseq_sub_lab} \\
        \includegraphics[trim=0 0 3mm 0,clip,width=.5\textwidth]{img/mainseq_mri}
        \includegraphics[trim=8mm 0 0 0,clip,width=.5\textwidth-3.3mm]{img/mainseq_sub_mri}

        \caption{Main sequence of eye movement events during one 15 minute sequence of
          the movie (segment 2) for lab (top), and MRI participants (bottom). Data
          across all participants per dataset is shown on the left, and data for a single
          exemplary participant on the right.}

        \label{fig:overallComp}
      \end{figure*}

   This figure looks like this in the manuscript:

   ..
      the image can't become a figure because it can't be used in LaTeX's minipage environment

   .. image:: ../artwork/src/img/remodnav2.png

For steps 2 and 3, the print statements need to be captured and bound to the ``.tex`` file.
The `tee <https://en.wikipedia.org/wiki/Tee_(command)>`_ command can write all of the output to
a file (called ``results_def.tex``):

.. code-block:: bash

   code/mk_figuresnstats.py -s | tee results_def.tex

This will redirect every print statement the script wrote to the terminal into a file called
``results_def.tex``. This file will hence be full of ``\newcommand`` definitions that contain
the results of the computations.

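As an optional sanity check, one could list the names that ended up in ``results_def.tex``;
this snippet is merely illustrative and not part of the original workflow:

.. code-block:: python

   # list all command names defined in results_def.tex
   import re

   with open('results_def.tex') as f:
       names = re.findall(r'\\newcommand\{\\(\w+)\}', f.read())
   print('%d definitions:' % len(names), ', '.join(sorted(names)))
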
For step 3, one can include this file as an input source into the ``.tex`` file with

.. code-block:: tex

   \begin{document}
   \input{results_def.tex}

Upon compilation of the ``.tex`` file into a PDF, the results of the
computations captured with ``\newcommand`` definitions are inserted into the respective parts
of the manuscript.

.. index:: ! Make

The last step is to automate this procedure. So far, the script would need to be executed
with a command line call, and the PDF compilation would require another command line call.
One way to automate this process is with `Makefiles <https://en.wikipedia.org/wiki/Make_(software)>`_.
``make`` is a decades-old tool known to many, and it bears the important advantage that it will
deliver results with a single ``make`` call, regardless of what actually needs to be done --
whether that is executing a Python script, running bash commands, rendering figures, or all of this.
Here is the one used for the manuscript:

.. code-block:: make
   :linenos:

   all: main.pdf
   main.pdf: main.tex tools.bib EyeGaze.bib results_def.tex figures
   	latexmk -pdf -g $<

   results_def.tex: code/mk_figuresnstats.py
   	bash -c 'set -o pipefail; code/mk_figuresnstats.py -s | tee results_def.tex'

   figures: figures-stamp

   figures-stamp: code/mk_figuresnstats.py
   	code/mk_figuresnstats.py -f -r -m
   	$(MAKE) -C img
   	touch $@

   clean:
   	rm -f main.bbl main.aux main.blg main.log main.out main.pdf main.tdo main.fls main.fdb_latexmk example.eps img/*eps-converted-to.pdf texput.log results_def.tex figures-stamp
   	$(MAKE) -C img clean

One can read a Makefile as a recipe:

- Line 1: "The overall target should be ``main.pdf`` (the final PDF of
  the manuscript)."

- Lines 2-3: "To make the target ``main.pdf``, the following files are required:
  ``main.tex`` (the manuscript's ``.tex`` file), ``tools.bib`` & ``EyeGaze.bib`` (bibliography files), ``results_def.tex``
  (the results definitions), and ``figures`` (a target not covered here, which renders figures
  with Inkscape prior to including them in the manuscript). If all of these files are present,
  the target ``main.pdf`` can be made by running the command ``latexmk -pdf -g``."

- Lines 5-6: "To make the target ``results_def.tex``, the script ``code/mk_figuresnstats.py`` is
  required. If the file is present, the target ``results_def.tex`` can be made by running the
  command ``bash -c 'set -o pipefail; code/mk_figuresnstats.py -s | tee results_def.tex'``."

This triggers the execution of the script, collection of results in ``results_def.tex``, and PDF
compilation upon typing ``make``.
The last three lines define that a ``make clean`` removes all computed files, and also all
images.

Finally, by wrapping ``make`` in a :dlcmd:`run` command, the computation of results
and compilation of the manuscript with all generated output can be written to the history of
the superdataset. ``datalad run make`` will thus capture all provenance for the results
and the final PDF.

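The same provenance capture is also available from Python. Here is a minimal sketch,
assuming it is executed in the root of the superdataset:

.. code-block:: python

   # a sketch of provenance capture via DataLad's Python API,
   # executed in the root of the superdataset
   from datalad.api import run

   run(cmd='make',
       dataset='.',
       message='Recompute results and recompile the manuscript')
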
Thus, by using DataLad and its Python API, a few clever Unix and ``LaTeX`` tricks,
and Makefiles, anyone can create a reproducible paper. This saves time, increases your own
trust in the results, and helps to make a more convincing case with your research.
If you have not yet, but are curious, check out the
`manuscript this use case is based on <https://github.com/psychoinformatics-de/paper-remodnav>`_.
Any questions can be asked by `opening an issue <https://github.com/psychoinformatics-de/paper-remodnav/issues/new>`_.

.. rubric:: Footnotes

.. [#f1] You can read up on the YODA principles again in section :ref:`yoda`.

.. [#f2] You can read up on cloning datasets as subdatasets again in section :ref:`installds`.

.. [#f3] Note that the software repository may just as well be cloned into ``data/``.
|