Utilities for managing DICOM datasets

This repository contains a utility script that converts a selected DICOM folder into a DataLad dataset, in place.

Overview

The process starts by creating a DataLad dataset in the chosen directory and saving its contents with DataLad. The files are then packed into a tar archive (tarball), which is also saved in the dataset. Next, datalad addurls is used to record that each file is available from the tar archive (through datalad-next's archivist special remote). In the same step, selected properties are extracted from the DICOM header and stored as git-annex metadata. Finally, the content of all annex keys except the tarball is dropped.

This approach is derived from the INM-ICF utilities, and parts of the code are reused.

The resulting dataset is ready to be pushed to a desired location, such as a Forgejo instance.

Details

Why create the tar archive stored in the same repository if the files could be saved directly?

This reduces the number of annex keys that need to be stored alongside the Git repository to just one (the archive). Because the Git repository itself can also be packed (by standard Git means), this makes it viable to store such datasets on filesystems with inode limitations.

How is the archive generated?

The tar file is generated in a reproducible way (ensuring the same checksum when re-running) by standardizing the file permissions, ownership, and timestamps in the file information fed into tar. Notably, file timestamps are set to match StudyDate and StudyTime from the DICOM header.

The original directory layout is preserved inside the archive.
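The idea of a reproducible archive can be illustrated with a minimal sketch using only Python's standard tarfile module. This is not the script's actual code: the function names, the fixed modes (0o644 for files, 0o755 for directories), the zeroed ownership, and the interpretation of StudyDate/StudyTime as UTC are all assumptions made for the example.

```python
import io
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def study_timestamp(study_date: str, study_time: str) -> int:
    """Convert DICOM StudyDate (YYYYMMDD) and StudyTime (HHMMSS[.ffffff])
    into a Unix timestamp. Assumes a full HHMMSS value and UTC (both are
    assumptions of this sketch, not guarantees of the DICOM standard)."""
    parsed = datetime.strptime(study_date + study_time[:6], "%Y%m%d%H%M%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def normalized_info(info: tarfile.TarInfo, mtime: int) -> tarfile.TarInfo:
    """Replace host-specific header fields with fixed values."""
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    info.mode = 0o755 if info.isdir() else 0o644  # hypothetical fixed modes
    info.mtime = mtime
    return info


def build_archive(src_dir: Path, mtime: int) -> bytes:
    """Pack src_dir into an uncompressed tar whose bytes depend only on
    file names, file contents, and the given mtime."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Sorting fixes the member order; relative names keep the layout.
        for path in sorted(src_dir.rglob("*")):
            arcname = path.relative_to(src_dir).as_posix()
            info = normalized_info(tar.gettarinfo(str(path), arcname=arcname), mtime)
            if info.isreg():
                with path.open("rb") as fileobj:
                    tar.addfile(info, fileobj)
            else:
                tar.addfile(info)
    return buffer.getvalue()
```

Because member order, ownership, permissions, and timestamps are all fixed, calling build_archive twice on the same content yields identical bytes, and therefore the same checksum, even if the on-disk modification times or permissions changed in between.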

What metadata are stored in the Git repository?

A hardcoded set of metadata fields describing the acquisition (SeriesDescription, SeriesNumber, Modality, MRAcquisitionType, ProtocolName, PulseSequenceName) is extracted from the DICOM header and stored as git-annex metadata. These will be available in the Git repository (i.e., also in the absence of annexed file contents). Any other properties must be read from the DICOM headers, and doing so requires access to the annexed file contents.

Storing these properties should enable describing the acquisitions beyond the original folder names, without revealing potentially sensitive information. An example of using git-annex metadata can be found in the INM-ICF docs.
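As a rough illustration of the field selection described above, the sketch below filters a header down to the six listed fields and builds a git annex metadata command line for one file. It is not the script's implementation: the script records metadata through datalad addurls, whereas this example uses the plain git annex metadata --set option instead. The function names are hypothetical, and a plain dictionary stands in for a parsed DICOM header.

```python
from typing import Any, Mapping

# The field names listed in this README.
FIELDS = (
    "SeriesDescription",
    "SeriesNumber",
    "Modality",
    "MRAcquisitionType",
    "ProtocolName",
    "PulseSequenceName",
)


def select_metadata(header: Mapping[str, Any]) -> dict[str, str]:
    """Keep only the listed fields that are present and non-empty,
    converting every value to a string."""
    return {
        name: str(header[name])
        for name in FIELDS
        if header.get(name) not in (None, "")
    }


def annex_metadata_command(path: str, metadata: Mapping[str, str]) -> list[str]:
    """Build (but do not run) a git-annex call that sets each field on path."""
    command = ["git", "annex", "metadata"]
    for field, value in metadata.items():
        command += ["--set", f"{field}={value}"]
    return command + [path]
```

Fields outside the list, such as patient identifiers, never reach the command, which mirrors the goal of describing acquisitions without exposing sensitive header values.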

Usage

Installation

The code is written in Python. It is recommended to install the dependencies in a new virtual environment, which can be created with, for example:

virtualenv --python=python3 ~/env/trr-dicom-utilities
source ~/env/trr-dicom-utilities/bin/activate

The code uses the DataLad-next extension. The extension should be enabled in the Git config:

git config --global --add datalad.extensions.load next

Clone the repository, and install the requirements:

pip install -r requirements.txt

Running

Run as a Python script. Use python dataladify_dicom_dataset.py --help to display a help message.