Dockerizing the analysis #20

Closed
opened 2023-10-11 09:13:56 +00:00 by mih · 7 comments
mih commented 2023-10-11 09:13:56 +00:00 (Migrated from github.com)

TL;DR Versions did not matter, time did not matter, but it matters who compiles numpy

Five years after we did this analysis, I am trying to put together a Docker-based environment. I can successfully build the stats and figures in a wide variety of configurations. However, there are small differences.

I am collecting some notes here, trying to narrow down a setup that reproduces the stats exactly:

Debian buster PY3.7

FROM debian:buster-slim
RUN apt-get update -qq -y --allow-releaseinfo-change
RUN apt-get install -q --no-install-recommends -y inkscape latexmk texlive-latex-extra python3-pip make
RUN apt-get install -q --no-install-recommends -y build-essential python3-dev cython3 python3-setuptools python3-wheel
RUN apt-get install -q --no-install-recommends -y python3-numpy
RUN apt-get install -q --no-install-recommends -y python3-scipy
RUN apt-get install -q --no-install-recommends -y python3-pandas
RUN apt-get install -q --no-install-recommends -y python3-sklearn
RUN apt-get install -q --no-install-recommends -y python3-statsmodels
RUN apt-get clean
RUN python3 -m pip install -v --no-build-isolation --prefer-binary seaborn==0.10.1 scikit-learn matplotlib==3.4.3

pip freeze gives

cycler==0.11.0
Cython==0.29.2
decorator==4.3.0
joblib==0.13.0
kiwisolver==1.4.5
matplotlib==3.4.3
numpy==1.16.2
pandas==0.23.3+dfsg
patsy==0.5.0+dev
Pillow==8.3.2
pyparsing==3.1.1
python-dateutil==2.7.3
pytz==2019.1
scikit-learn==0.20.2
scipy==1.1.0
seaborn==0.10.1
six==1.12.0
statsmodels==0.8.0
typing-extensions==4.7.1

after running the analysis, the following diff occurs

modified:   img/confusion_MN_AL.svg
modified:   img/confusion_RA_AL.svg
modified:   img/remodnav_lab.svg
modified:   img/remodnav_mri.svg
modified:   results_def.tex
\newcommand{\videoMNALMclfWOP}{7.9}\newcommand{\videoMNALMclfWOP}{8.1}
\newcommand{\videoMNALFIXcod}{36}\newcommand{\videoMNALFIXcod}{37}
\newcommand{\dotsRAALMclfWOP}{10.8}\newcommand{\dotsRAALMclfWOP}{10.9}
\newcommand{\videoRAALMCLF}{28.5}\newcommand{\videoRAALMCLF}{28.6}
\newcommand{\maxmclf}{10.8}\newcommand{\maxmclf}{10.9}
\newcommand{\FIXvideomnRE}{147}\newcommand{\FIXvideomnRE}{146}
\newcommand{\FIXvideonoRE}{144}\newcommand{\FIXvideonoRE}{145}
\newcommand{\PURvideomnRE}{314}\newcommand{\PURvideomnRE}{313}
\newcommand{\rankFIXvideoIHMM}{6}\newcommand{\rankFIXvideoIHMM}{5}
\newcommand{\rankFIXvideoRE}{5}\newcommand{\rankFIXvideoRE}{6}
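
For anyone who wants to compare two versions of `results_def.tex` without reading the raw diff, here is a minimal sketch. It assumes the file consists of `\newcommand{\name}{value}` definitions like the ones above; the helper names (`parse_defs`, `changed_defs`) are made up for illustration and are not part of the repository. The two sample strings reuse value pairs from the listing above, without claiming which side is the committed one.

```python
import re

# Matches one LaTeX definition such as \newcommand{\maxmclf}{10.8}
NEWCOMMAND = re.compile(r"\\newcommand\{\\(\w+)\}\{([^}]*)\}")


def parse_defs(text):
    """Map each macro name to its value (kept as a string)."""
    return dict(NEWCOMMAND.findall(text))


def changed_defs(text_a, text_b):
    """Return {name: (value_a, value_b)} for macros whose values differ."""
    defs_a = parse_defs(text_a)
    defs_b = parse_defs(text_b)
    return {
        name: (value, defs_b[name])
        for name, value in defs_a.items()
        if name in defs_b and defs_b[name] != value
    }


# Two tiny stand-ins for two versions of results_def.tex
version_a = "\\newcommand{\\maxmclf}{10.8}\n\\newcommand{\\FIXvideomnRE}{147}\n"
version_b = "\\newcommand{\\maxmclf}{10.9}\n\\newcommand{\\FIXvideomnRE}{146}\n"

for name, (a, b) in changed_defs(version_a, version_b).items():
    print(f"{name}: {a} vs {b}")
```

Values are compared as strings on purpose, so a change in rounding (for example `8.10` vs `8.1`) would also show up.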

Debian bullseye PY3.9

FROM debian:bullseye-slim
RUN apt-get update -qq -y --allow-releaseinfo-change
RUN apt-get install -q --no-install-recommends -y inkscape latexmk texlive-latex-extra python3-pip make
RUN apt-get install -q --no-install-recommends -y build-essential python3-dev cython3 python3-setuptools python3-wheel
RUN apt-get install -q --no-install-recommends -y python3-numpy
RUN apt-get install -q --no-install-recommends -y python3-scipy
RUN apt-get install -q --no-install-recommends -y python3-pandas
RUN apt-get install -q --no-install-recommends -y python3-sklearn
RUN apt-get install -q --no-install-recommends -y python3-statsmodels
RUN apt-get clean
RUN python3 -m pip install -v --no-build-isolation --prefer-binary seaborn==0.10.1 scikit-learn matplotlib==3.4.3

pip freeze gives

cycler==0.12.1
Cython==0.29.21
decorator==4.4.2
joblib==0.17.0
kiwisolver==1.4.5
matplotlib==3.4.3
numpy==1.19.5
packaging==23.2
pandas==1.1.5
patsy==0.5.3
Pillow==10.0.1
pyparsing==3.1.1
python-dateutil==2.8.1
pytz==2021.1
-e remodnav==1.0
scikit-learn==0.23.2
scipy==1.6.0
seaborn==0.10.1
six==1.16.0
statsmodels==0.14.0

The diff of the statistical scores is identical to the one from the buster container. The same SVGs are modified, and their contents also look identical.

Ubuntu focal PY3.8

FROM ubuntu:focal
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update -qq -y --allow-releaseinfo-change
RUN apt-get install -q --no-install-recommends -y inkscape latexmk texlive-latex-extra python3-pip make
RUN apt-get install -q --no-install-recommends -y build-essential python3-dev cython3 python3-setuptools python3-wheel
RUN apt-get install -q --no-install-recommends -y python3-numpy
RUN apt-get install -q --no-install-recommends -y python3-scipy
RUN apt-get install -q --no-install-recommends -y python3-pandas
RUN apt-get install -q --no-install-recommends -y python3-sklearn
RUN apt-get install -q --no-install-recommends -y python3-statsmodels
RUN apt-get clean
RUN python3 -m pip install -v --no-build-isolation --prefer-binary seaborn==0.10.1 scikit-learn matplotlib==3.4.3

pip freeze gives

cycler==0.12.1
Cython==0.29.14
decorator==4.4.2
joblib==0.14.0
kiwisolver==1.4.5
matplotlib==3.4.3
numpy==1.17.4
pandas==0.25.3
patsy==0.5.1
Pillow==10.0.1
pyparsing==3.1.1
python-dateutil==2.7.3
pytz==2019.3
-e remodnav==1.0
scikit-learn==0.22.2.post1
scipy==1.3.3
seaborn==0.10.1
six==1.14.0
statsmodels==0.11.1

The diff of the statistical scores is identical to the one from the bullseye and buster containers. The same SVGs are modified, and their contents also look identical.

Conclusions

  • matplotlib must be pinned for reproducibility
    • it writes its version into the SVGs
  • a wide variety of environments and software versions produce exactly the same results as each other. But none of them reproduces the results committed in the repository.
adswa commented 2023-10-11 10:34:27 +00:00 (Migrated from github.com)

reproduced the same diff as you with the python3.7 Docker image

mih commented 2023-10-11 10:59:57 +00:00 (Migrated from github.com)

For comparison: trying with a virtualenv, using the latest versions that are still API-compatible with the code.

$ virtualenv --python="$(which python3)" ${HOME}/env/remodnav-repro
$ . ~/env/remodnav-repro/bin/activate
$ python -m pip install numpy scipy pandas==1.5.3 seaborn scikit-learn matplotlib==3.4.3

The previously pinned seaborn 0.10.1 is incompatible with numpy 1.26, and had to be unpinned.

...
  File "/home/mih/env/remodnav-repro/lib/python3.11/site-packages/numpy/__init__.py", line 324, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. Did you mean: 'bool_'?

Pandas had to be pinned to the last 1.x release. Pandas 2.1.1 incompatibility:

ValueError: Multi-dimensional indexing (e.g. `obj[:, None]`) is no longer supported. Convert to a numpy array before indexing instead.

pip freeze gives

cycler==0.12.1
joblib==1.3.2
kiwisolver==1.4.5
matplotlib==3.4.3
numpy==1.26.0
pandas==1.5.3
Pillow==10.0.1
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
scikit-learn==1.3.1
scipy==1.11.3
seaborn==0.13.0
six==1.16.0
threadpoolctl==3.2.0

This REPRODUCES all stats exactly!!!

The remaining diff is in the SVGs

img/confusion_MN_AL.svg  | 24 ++++++++++++------------
img/confusion_MN_RA.svg  | 24 ++++++++++++------------
img/confusion_RA_AL.svg  | 24 ++++++++++++------------
img/hist_saccade_lab.svg |  8 ++++----
mih commented 2023-10-11 11:20:35 +00:00 (Migrated from github.com)

Trying to drill down on the SVG diff. I had the hunch that pinning the seaborn version is probably a more important aspect than being able to upgrade numpy. And indeed:

python -m pip install numpy==1.23.2 scipy pandas==1.5.3 seaborn==0.10.1 scikit-learn matplotlib==3.4.3

gives an environment that fully reproduces the stats, and the full remaining diff is:

diff --git a/img/hist_saccade_lab.svg b/img/hist_saccade_lab.svg
index 6ef426c..02bb011 100644
--- a/img/hist_saccade_lab.svg
+++ b/img/hist_saccade_lab.svg
@@ -199,16 +199,16 @@ z
    <g id="patch_23">
     <path clip-path="url(#p2f9441ee18)" d="M 157.6125 118.304175 
 L 163.1925 118.304175 
-L 163.1925 117.311604 
-L 157.6125 117.311604 
+L 163.1925 117.460043 
+L 157.6125 117.460043 
 z
 " style="fill:#808080;"/>
    </g>
    <g id="patch_24">
     <path clip-path="url(#p2f9441ee18)" d="M 163.1925 118.304175 
 L 168.7725 118.304175 
-L 168.7725 117.560194 
-L 163.1925 117.560194 
+L 168.7725 117.411756 
+L 163.1925 117.411756 
 z
 " style="fill:#808080;"/>
    </g>

Visually, this is the part of the figure that is different:

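![image](https://github.com/psychoinformatics-de/paper-remodnav/assets/136479/f9cce235-6588-465b-886c-355e5876d8c3)

![image](https://github.com/psychoinformatics-de/paper-remodnav/assets/136479/179db333-b062-4357-b67f-bdd766339a91)

Closeups of the two versions of the figure at the difference (it is the height of the bar in the middle).

![image](https://github.com/psychoinformatics-de/paper-remodnav/assets/136479/b8eab672-f2b7-4e72-90d5-6dc9b986d8ad)

![image](https://github.com/psychoinformatics-de/paper-remodnav/assets/136479/648a80b2-fe43-42f8-a33f-60c5b03b33dd)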
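
To put numbers on that bar-height difference, here is a small arithmetic sketch using only the coordinates from the diff above. It assumes the usual SVG convention that y grows downwards, so a bar's height is the baseline y minus the y of its top edge. The `-` lines of the diff are taken as the committed figure and the `+` lines as the regenerated one.

```python
# y coordinate shared by the bottom edge of both bars (from the diff above)
BASELINE_Y = 118.304175

# patch id -> (top edge y in committed SVG, top edge y in regenerated SVG)
TOP_EDGES = {
    "patch_23": (117.311604, 117.460043),
    "patch_24": (117.560194, 117.411756),
}


def bar_height(top_y, baseline_y=BASELINE_Y):
    """Height of a bar in SVG units; y grows downwards in SVG."""
    return baseline_y - top_y


# Change in height per bar: regenerated minus committed
changes = {
    patch: bar_height(new_top) - bar_height(old_top)
    for patch, (old_top, new_top) in TOP_EDGES.items()
}

for patch, delta in changes.items():
    print(f"{patch}: height change {delta:+.6f} SVG units")
print(f"sum of changes: {sum(changes.values()):+.6f}")
```

One bar shrinks by roughly 0.148 SVG units and its neighbor grows by almost exactly the same amount. That would be consistent with a small number of samples landing in the adjacent bin, but this is only an observation about the numbers, not a diagnosis of the cause.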

mih commented 2023-10-11 11:27:57 +00:00 (Migrated from github.com)

Here is the relevant code that is responsible for this plot:

            fig = plt.figure(figsize=(3,2))
            plt.hist(ev_df['duration'].values,
                    bins='doane',
                    range=x_lim,
                    color='gray')
                    #log=True)
            plt.xlabel('{} duration in s'.format(label))
            plt.xlim(x_lim)
            plt.ylim(y_lim)
            plt.savefig(
                op.join(
                    'img',
                    'hist_{}_{}.svg'.format(
                        label,
                        ds_name)),
                transparent=True,
                bbox_inches="tight",
                metadata={'Date': None})

It is plain matplotlib. We know the matplotlib version that was originally used; it is included in the file's RDF metadata:

    <dc:creator>
     <cc:Agent>
      <dc:title>Matplotlib v3.4.3, https://matplotlib.org/</dc:title>
     </cc:Agent>
    </dc:creator>

We have that exact version installed, but this obviously does not mean that we have the exact same binary running. Still weird to have this be the only difference.
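
Checking that metadata across all figures by hand gets tedious, so here is a minimal sketch that pulls the creator string out of an SVG with the standard library. The function name `svg_creator_titles` is made up for illustration. The embedded sample is a trimmed stand-in built around the snippet above (a real Matplotlib SVG has more wrapping elements), and the Dublin Core namespace URI is assumed to be the standard one.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace in ElementTree's {uri}tag notation (assumed standard URI)
DC = "{http://purl.org/dc/elements/1.1/}"


def svg_creator_titles(svg_text):
    """Return the text of every dc:title nested inside a dc:creator element."""
    root = ET.fromstring(svg_text)
    titles = []
    for creator in root.iter(DC + "creator"):
        for title in creator.iter(DC + "title"):
            if title.text:
                titles.append(title.text.strip())
    return titles


# Trimmed stand-in for a Matplotlib-written SVG
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:cc="http://creativecommons.org/ns#"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
 <metadata>
    <dc:creator>
     <cc:Agent>
      <dc:title>Matplotlib v3.4.3, https://matplotlib.org/</dc:title>
     </cc:Agent>
    </dc:creator>
 </metadata>
</svg>"""

print(svg_creator_titles(SAMPLE_SVG))
```

For a file on disk, passing `open(path, encoding="utf-8").read()` to the function should work the same way. As noted above, a matching version string only says which matplotlib release wrote the file, not which binaries were underneath it.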

mih commented 2023-10-11 12:17:47 +00:00 (Migrated from github.com)

With this success, I am back in Docker land. Clearly the virtualenv has an impact. So let's try to put a (superfluous) virtualenv inside the docker container.

Knowing that we can use much more recent software, I am basing this on Debian bookworm and using the versions from the previous non-Docker exploration:

FROM debian:bookworm-slim
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update -qq -y --allow-releaseinfo-change
RUN apt-get install -q --no-install-recommends -y make
RUN apt-get install -q --no-install-recommends -y build-essential python3-dev
RUN apt-get install -q --no-install-recommends -y python3-virtualenv python3-wheel
RUN apt-get clean
RUN virtualenv --python="$(which python3)" /env/remodnav-repro
RUN sh -c ". /env/remodnav-repro/bin/activate; python -m pip install numpy==1.23.2 scipy pandas==1.5.3 seaborn==0.10.1 scikit-learn matplotlib==3.4.3 statsmodels"
RUN chmod -R ugo+rw /env/remodnav-repro
RUN rm -rf /root/.local /root/.cache /var/lib/apt/lists/deb.debian.org*
RUN find /env -type d -name __pycache__ -exec rm -rf {} \; -prune
RUN apt-get purge -y build-essential python3-dev
RUN apt-get clean

And indeed! It also arrives at the minimal diff shown in https://github.com/psychoinformatics-de/paper-remodnav/issues/20#issuecomment-1757462683

The image compresses down to 625MB.

mih commented 2023-10-11 13:36:51 +00:00 (Migrated from github.com)

I can now confirm that the presence or absence of a virtualenv is irrelevant (as it should be). Here is another configuration that achieves the diff from https://github.com/psychoinformatics-de/paper-remodnav/issues/20#issuecomment-1757462683 without any virtualenv:

FROM debian:bookworm-slim
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update -qq -y --allow-releaseinfo-change
RUN apt-get install -q --no-install-recommends -y make python3-dev cython3 python3-setuptools python3-wheel
RUN apt-get install -q --no-install-recommends -y build-essential python3-dev
RUN apt-get install -q --no-install-recommends -y python3-virtualenv python3-wheel
RUN apt-get install -q --no-install-recommends -y python3-scipy
RUN apt-get install -q --no-install-recommends -y python3-sklearn
RUN apt-get install -q --no-install-recommends -y python3-statsmodels
RUN apt-get install -q --no-install-recommends -y python3-kiwisolver
RUN apt-get install -q --no-install-recommends -y python3-pyparsing
RUN apt-get install -q --no-install-recommends -y python3-pil
RUN apt-get install -q --no-install-recommends -y python3-pip
RUN apt-get clean
RUN python3 -m pip install --break-system-packages numpy==1.23.2 pandas==1.5.3 seaborn==0.10.1 matplotlib==3.4.3
RUN rm -rf /root/.local /root/.cache /var/lib/apt/lists/deb.debian.org*
RUN apt-get purge -y build-essential
RUN apt-get autoremove -y
RUN apt-get clean
mih commented 2023-10-11 13:53:10 +00:00 (Migrated from github.com)

Probably the final post in this saga: The trigger for the reproducibility issue is who compiles numpy?

It depends on whether I am using a pip-compiled installation or one downloaded from Debian.

Each of these gives reproducible results on its own, across a wide range of versions. But the results differ noticeably between the two ways of building numpy.

Below is a complete Dockerfile for anyone interested in digging deeper. The key line is the specification of the numpy version. Whenever it is different from the numpy version provided by the respective Debian release (and it does not matter which one), pip will compile it, and it will reproduce the results published many years ago. So change

numpy==1.24.3

a version that is not in Debian bookworm to

numpy==1.24.2

a version that is in Debian bookworm, and the results will not reproduce. Make it 1.24.1 and they will reproduce again, because it will also be compiled locally.

Even when I set up a system like it would have existed at the time of publication (Debian buster), the results do not reproduce, unless pip compiles numpy.

FROM debian:bookworm-slim
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update -qq -y --allow-releaseinfo-change
RUN apt-get install -q --no-install-recommends -y make python3-dev cython3 python3-setuptools python3-wheel
RUN apt-get install -q --no-install-recommends -y build-essential python3-dev
RUN apt-get install -q --no-install-recommends -y python3-virtualenv python3-wheel python3-scipy python3-sklearn python3-statsmodels python3-kiwisolver python3-pyparsing python3-pil python3-pip
RUN apt-get clean
RUN python3 -m pip install --break-system-packages numpy==1.24.3 pandas==1.5.3 seaborn==0.10.1 matplotlib==3.4.3
RUN rm -rf /root/.local /root/.cache /var/lib/apt/lists/deb.debian.org*
RUN apt-get purge -y build-essential
RUN apt-get autoremove -y
RUN apt-get clean
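
To check which numpy a given container actually imports, the following sketch may help. It is a heuristic under stated assumptions: it relies on the Debian layout convention that apt-installed modules live under `/usr/lib/python3/dist-packages`, system-wide pip installs under `/usr/local/lib/python3.X/dist-packages`, and virtualenv installs under `site-packages`. The function names are made up for illustration.

```python
import importlib.util
from pathlib import PurePosixPath


def classify_install(path):
    """Guess who installed a module from its file path (Debian layout heuristic)."""
    parts = PurePosixPath(path).parts
    if "site-packages" in parts:
        return "pip (virtualenv or user install)"
    if "dist-packages" in parts:
        if parts[:3] == ("/", "usr", "local"):
            return "pip (system-wide)"
        return "apt (Debian package)"
    return "unknown"


def numpy_origin():
    """Return (path, guess) for the numpy that would be imported, or None."""
    spec = importlib.util.find_spec("numpy")  # locates numpy without importing it
    if spec is None or spec.origin is None:
        return None
    return spec.origin, classify_install(spec.origin)


print(numpy_origin())
print(classify_install("/usr/lib/python3/dist-packages/numpy/__init__.py"))
print(classify_install("/usr/local/lib/python3.11/dist-packages/numpy/__init__.py"))
```

This only tells apart the two installation routes discussed above. It does not show whether pip built numpy from source or used a prebuilt binary; the verbose pip output during the image build would be the place to look for that.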