Copy from hub.datalad/datalink/org: Neurobagel integration notes #16

Open
opened 2025-03-01 12:17:18 +00:00 by jsheunis · 0 comments
Owner

Source: https://hub.datalad.org/datalink/org/issues/2


Notes: see the update of 2024-09-11 below.

The main reason for deploying a (variant of a) neurobagel instance is data discoverability. This is applicable to many scientific data use cases, but specifically to INM-7 and TRR379 (https://www.trr379.de/).

Neurobagel is a graph querying tool for building cohorts from multiple human neuroimaging datasets. It queries a metadata graph to find data at the subject level that have been annotated with terms from a pre-specified data dictionary. If we define our own data dictionary with terms that we also derive from our own DataLad dataset model specification (https://github.com/psychoinformatics-de/datalad-concepts), then queries run on a graph could assemble a cohort for which a DataLad dataset is generated programmatically on demand, i.e. a metadata-query-to-actionable-dataset pipeline. A secondary, but still huge IMHO, benefit would therefore be dataset generation (in addition to discoverability).

Useful links:

  • Website: https://neurobagel.org
  • Demo: https://query.neurobagel.org
  • Data dictionary docs: https://neurobagel.org/dictionaries/
  • Toy examples: https://github.com/neurobagel/neurobagel_examples/tree/main/data-upload
  • Annotated OpenNeuro datasets: https://github.com/neurobagel/openneuro-annotations
  • Pydantic model implementation: https://github.com/neurobagel/bagel-cli/blob/main/bagel/dictionary_models.py

Integration with the DataLad world

We have ongoing efforts related to metadata:

  • datalad-concepts (https://github.com/psychoinformatics-de/datalad-concepts) to model a complete DataLad dataset or generic data distribution and to create use-case specific schemas (think a DataLad-compatible schema for a TRR379 dataset)
  • Schema-driven user interfaces with shacl-vue (https://psychoinformatics-de.github.io/shacl-vue/docs/): with a SHACL export of a specific dataset schema we can automatically generate self-validating forms for users to capture metadata about their datasets, with the result being a graph with valid semantic metadata (see the sketch below).
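
To illustrate the kind of SHACL input shacl-vue consumes, here is a minimal sketch using Python and rdflib. The `EX` namespace, the `SubjectShape`, and the `age` property are invented for illustration; a real DLCO/TRR379 schema export would define its own classes and properties:

```python
# Minimal sketch of a SHACL shape of the kind shacl-vue could render as a form.
# SubjectShape and the ex: terms are invented for illustration only.
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF, SH, XSD

EX = Namespace("https://example.org/schema/")

g = Graph()
shape = EX.SubjectShape
prop = BNode()

g.add((shape, RDF.type, SH.NodeShape))
g.add((shape, SH.targetClass, EX.Subject))
g.add((shape, SH.property, prop))
g.add((prop, SH.path, EX.age))
g.add((prop, SH.datatype, XSD.integer))
g.add((prop, SH.minCount, Literal(1)))  # makes the generated form field required

print(g.serialize(format="turtle"))
```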

If we can:

  1. Define a schema for a dataset (with subjects) using DLCO, taking care to identify/annotate the fields that are to be used in the "data dictionary" for neurobagel
  2. Find a way to export a (partial view of a) schema to a "data dictionary" that can feed Neurobagel
  3. Annotate all datasets we care about with metadata using forms generated from the same DLCO schema
  4. Make all this metadata available to Neurobagel for querying

we would have created a very capable and versatile framework.
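
To make the metadata-query-to-actionable-dataset idea concrete, here is a hedged sketch of what the final step could look like. The query endpoint, its parameters, and the response fields are assumptions standing in for a Neurobagel-style API, not the actual interface; only the DataLad calls are real `datalad.api` functions:

```python
# Hypothetical sketch: turn a cohort query result into a DataLad dataset.
# The query URL, parameters, and response fields are placeholders; consult
# the Neurobagel API docs for the real interface.
import requests
import datalad.api as dl

resp = requests.get(
    "https://api.example.org/query",  # assumed Neurobagel-style endpoint
    params={"diagnosis": "snomed:49049000", "min_age": 18},
)
resp.raise_for_status()
matches = resp.json()  # assumed: a list of subject-level hits

cohort = dl.create(path="my-cohort")  # new DataLad dataset for the cohort
for hit in matches:
    # assumed fields: source dataset URL and an identifier to name the subdataset
    dl.clone(
        source=hit["dataset_url"],
        path=f"my-cohort/{hit['dataset_id']}",
        dataset=cohort,
    )
cohort.save(message="Assemble cohort from metadata query")
```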

Possible next steps

  1. Put together a data dictionary of terms, starting from the default Neurobagel data dictionary augmented with BIDS terms, but also collating terms that would be generally useful for datasets in the INM-7, TRR379, SFB1451, and ABCD-J world. Related issue: https://hub.datalad.org/datalink/org/issues/1
  2. With the data dictionary as one source, we need to create a DLCO-based schema for a dataset (with subjects).
  3. The schema should then be used to generate web-based forms in order to create the annotation metadata for a group of datasets.
  4. We need to deploy Neurobagel in a way that it is disposable and its graph DB can be rebuilt programmatically whenever we choose (see the sketch after this list).
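
For step 4, the rebuild could be as simple as re-uploading all generated .jsonld files to a fresh graph store. A sketch, assuming a GraphDB-style RDF4J REST endpoint; the URL, repository name, and credentials are placeholders, and Neurobagel's own upload tooling should be preferred where it fits:

```python
# Hypothetical rebuild script: wipe a graph repository and re-upload all
# generated .jsonld files. Endpoint layout follows the RDF4J/GraphDB REST
# convention; URL, repo name, and credentials are placeholders.
from pathlib import Path

import requests

STATEMENTS = "http://localhost:7200/repositories/neurobagel/statements"
AUTH = ("admin", "admin")  # placeholder credentials

# Drop all existing triples, then upload every graph-ready file.
requests.delete(STATEMENTS, auth=AUTH).raise_for_status()
for jsonld in Path("graph-data").glob("*.jsonld"):
    r = requests.post(
        STATEMENTS,
        data=jsonld.read_bytes(),
        headers={"Content-Type": "application/ld+json"},
        auth=AUTH,
    )
    r.raise_for_status()
```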

More notes:

  • Based on an initial look into the neurobagel data dictionary, the concept looks quite straightforward and small. The goal for us would be to grow it to the point where it can exploit the metadata we already have (see the example after this list).
  • LinkML is currently the proposed schema-authoring approach. However, it could eventually be replaced with a web-based form if https://github.com/psychoinformatics-de/shacl-vue/issues/50 pans out.
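
For a sense of how small the concept is, a single annotated column in a Neurobagel data dictionary is shaped roughly like the following, paraphrased from the public examples; the SNOMED term URLs are illustrative:

```python
# Rough shape of one annotated column in a Neurobagel data dictionary,
# based on the public examples; the SNOMED term URLs are illustrative.
import json

data_dictionary = {
    "sex": {
        "Description": "Sex of the participant",
        "Levels": {"M": "male", "F": "female"},
        "Annotations": {
            "IsAbout": {"TermURL": "nb:Sex", "Label": "Sex"},
            "Levels": {
                "M": {"TermURL": "snomed:248153007", "Label": "Male"},
                "F": {"TermURL": "snomed:248152002", "Label": "Female"},
            },
        },
    }
}
print(json.dumps(data_dictionary, indent=2))  # the .json file the tools consume
```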

Update 2024-09-11

In the end, our pipeline has to end up with data that would be valid for a neurobagel graph. See also: Neurobagel graph data files (https://neurobagel.org/graph_data/). Looking at the docs and the neurobagel examples (https://github.com/neurobagel/neurobagel_examples/tree/main/data-upload), it looks like they generate this as jsonld (e.g. example_synthetic.jsonld and example_synthetic_pheno-bids.jsonld). The Python-based bagel-cli (https://github.com/neurobagel/bagel-cli) can be used to generate graph-ready data.

https://github.com/neurobagel/bagel-cli/blob/main/bagel/cli.py#L79-L85:

Process a tabular phenotypic file (.tsv) that has been successfully annotated
with the Neurobagel annotation tool. The annotations are expected to be stored
in a data dictionary (.json).

This command will create a valid, subject-level instance of the Neurobagel
graph data model for the provided phenotypic file in the .jsonld format.
You can upload this .jsonld file to the Neurobagel graph.

bagel-cli has a Pydantic implementation of the data dictionary schema (https://github.com/neurobagel/bagel-cli/blob/main/bagel/dictionary_models.py) as well as a somewhat implicit schema for datasets/subjects/samples/sessions/etc. (https://github.com/neurobagel/bagel-cli/blob/main/bagel/models.py); I say implicit because I couldn't find this covered explicitly in the docs, but I could be wrong. It uses both of these models/schemas to transform the provided participants TSV file and data dictionary into graph-ready data. The second schema is where the jsonld terms like hasSession, hasAcquisition, etc. originate.
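
To illustrate what "graph-ready" means here, one subject in the generated jsonld is shaped roughly like the following, paraphrased from the published synthetic examples; treat the field names other than hasSession/hasAcquisition as approximations:

```python
# Paraphrased shape of one subject in Neurobagel's graph-ready jsonld, based
# on the published synthetic examples; field names other than hasSession and
# hasAcquisition are approximations, not guaranteed to match the real model.
subject = {
    "identifier": "nb:some-uuid",  # approximation
    "hasLabel": "sub-01",
    "hasSession": [
        {
            "hasLabel": "ses-01",
            "hasAcquisition": [
                {"hasContrastType": {"identifier": "nidm:T1Weighted"}}
            ],
        }
    ],
}
```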

So if we want our pipeline to end up with neurobagel-graph-ready data, our schema also needs to model these classes/terms and their relationships to other classes in our schema. If we don't do this, we would only need to model the data dictionary terms and the basic columns of neurobagel's TSV files (participant_id, session_id, etc.) in order to generate the equivalent of neurobagel TSV files, and then we would need to depend on bagel-cli to convert those into graph-ready files.
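
The second, lighter-weight route would look something like this: our pipeline emits the TSV/dictionary pair, and bagel-cli does the graph conversion. A sketch; the bagel pheno invocation in the trailing comment follows the documented CLI pattern, but option names should be checked against the current bagel-cli docs:

```python
# Sketch of the lighter-weight route: emit a Neurobagel-style participants TSV
# plus data dictionary from our own metadata, then let bagel-cli convert them.
import csv
import json

rows = [
    {"participant_id": "sub-01", "session_id": "ses-01", "sex": "F"},
    {"participant_id": "sub-02", "session_id": "ses-01", "sex": "M"},
]
with open("pheno.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# A data dictionary shaped like the sketch further above would be written
# alongside the TSV (minimal stand-in here):
with open("pheno.json", "w") as f:
    json.dump({"participant_id": {"Description": "Unique participant ID"}}, f)

# Then convert to graph-ready jsonld with bagel-cli (option names may have
# changed; check the current docs):
#   bagel pheno --pheno pheno.tsv --dictionary pheno.json \
#         --name "TRR379 demo" --output graph-data/
```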
