Can the Neurobagel data structure and query interface be customized (or how complicated would it be to do so?) #5

Open

opened 2024-10-08 08:53:23 +00:00 by jsheunis · 0 comments

(Migrated from hub.datalad.org)

A question that has come up in discussion is:

> can Neurobagel tooling practically take any data dictionary we cook up?

The data dictionary defines the semantic annotations of the columns in the Neurobagel TSV file, so my understanding is that we could technically include any arbitrary columns and annotations as long as we stick to the data dictionary specification (i.e. only categorical columns, continuous columns, or identifier columns). What I am not sure about is the nature of the identifier columns. From my understanding of the docs about the Neurobagel TSV file, rows are equivalent to "participant-sessions", i.e. there are only two identifier columns (`Identifies: participant` and `Identifies: session`). Is this a hard requirement for `bagel-cli` and the query tool? Or can we include an arbitrary number of identifier columns (a single one, or many)? If that is possible, how will the query interface deal with it: automatically, or will it need development work to handle the changes? I assume that e.g. `Identifies: participant` has some internal mapping that is used when generating graph-ready data, so that if we say e.g. `Identifies: sample` or `Identifies: cuteLittlePuppy`, the process will fail?
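
To make the identifier question concrete, here is a minimal sketch of the kind of check I imagine happening somewhere in the pipeline. It is not taken from `bagel-cli`: the helper names are hypothetical, the dictionary layout (column name → `Annotations` → `Identifies`) is my assumption about how the annotation is stored, and the set of accepted roles simply mirrors the two identifiers mentioned above.

```python
# Hypothetical sketch, NOT part of bagel-cli or any Neurobagel package.
# Assumption: a data dictionary is a mapping from column name to an entry
# whose "Annotations" object may carry an "Identifies" value.
# Assumption: only these two identifier roles are understood downstream.
ASSUMED_IDENTIFIER_ROLES = {"participant", "session"}


def find_identifier_columns(data_dictionary):
    """Return {column name: role} for every column with an `Identifies` annotation."""
    roles = {}
    for column, entry in data_dictionary.items():
        annotations = entry.get("Annotations", {})
        if "Identifies" in annotations:
            roles[column] = annotations["Identifies"]
    return roles


def unsupported_identifier_columns(data_dictionary, supported=ASSUMED_IDENTIFIER_ROLES):
    """Return the identifier columns whose role is outside the supported set."""
    return {
        column: role
        for column, role in find_identifier_columns(data_dictionary).items()
        if role not in supported
    }


# Illustrative dictionary with one extra identifier column ("sample_id").
example_dictionary = {
    "participant_id": {"Annotations": {"Identifies": "participant"}},
    "session_id": {"Annotations": {"Identifies": "session"}},
    "sample_id": {"Annotations": {"Identifies": "sample"}},
    "age": {"Description": "Age in years"},
}

print(unsupported_identifier_columns(example_dictionary))
# → {'sample_id': 'sample'}
```

If the real tooling behaves like this sketch, a column such as `sample_id` would be rejected or ignored unless the set of supported roles were extended, which is exactly the customization I am asking about.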

As noted at the end of this comment https://hub.datalad.org/datalink/org/issues/2#issue-21, my understanding is that Neurobagel has its own internal schema for subjects, sessions, images, etc., which I assume follows BIDS to a major extent. I understand that `bagel-cli` can be used to generate phenotypic-only graph-ready data, i.e. a BIDS dataset does not have to accompany the process. But what happens if we have an accompanying scientific dataset that does not conform to BIDS (e.g. DNA sequencing or flow cytometry data), and we still want to make some or all of its content findable in a Neurobagel node via the query interface? Some aspects might map onto the "TSV-file/data-dictionary" paradigm as new columns, but others might not.

So in summary: will Neurobagel components be able to deal with this? If not out of the box, how complicated would it be to customize them? Or would they not be customizable at all?

Note: issue repeated here: https://github.com/neurobagel/query-tool/issues/307
