Convert BIDS datasets to `flat-data`

jsheunis commented

2025-07-02 08:40:22 +00:00

Owner

The conversion of the penguins dataset led to the script and helper metadata here: https://hub.datalad.org/edu/penguins/src/branch/main/code

This could be generalized further, so that the same script can also be used for BIDS datasets.


jsheunis commented

2025-07-02 13:27:30 +00:00

Author

Owner

Looking at the datasets that we have internally, there are a handful that have all three of these files:

dataset_description.json
participants.tsv
participants.json

It makes sense to me to start here, since these datasets offer the best way to verify the column headings in the participants.tsv file. Across these datasets, these are some common column headers of the participants.tsv files:

participant_id
site
age
sex
group

(Age is sometimes reported in months, sometimes in years.)
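
A minimal sketch of how this selection and check could be scripted. It is not the existing conversion script, and the helper names are made up for illustration. It assumes the three files sit at the top level of each dataset, and that the unit of `age` is declared under a `Units` key in participants.json, which will not hold for every dataset:

```python
import csv
import io
from pathlib import Path
from typing import Optional

REQUIRED_FILES = ("dataset_description.json", "participants.tsv", "participants.json")
COMMON_COLUMNS = ("participant_id", "site", "age", "sex", "group")


def has_required_files(dataset_root: Path) -> bool:
    """True if all three files exist at the top level of the dataset."""
    return all((dataset_root / name).is_file() for name in REQUIRED_FILES)


def read_header(tsv_text: str) -> list:
    """Return the column names from the first line of a participants.tsv."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return next(reader, [])


def missing_columns(header) -> list:
    """Common columns (as listed above) that this dataset does not provide."""
    return [column for column in COMMON_COLUMNS if column not in header]


def age_unit(sidecar: dict) -> Optional[str]:
    """Lower-cased unit of the age column, or None if it is not declared.

    Assumption: the sidecar looks like {"age": {"Units": "years"}}.
    """
    entry = sidecar.get("age")
    if not isinstance(entry, dict):
        return None
    units = entry.get("Units")
    return units.strip().lower() if isinstance(units, str) else None
```

Datasets for which `age_unit` returns `None` would need a manual check before the age values are annotated with a unit.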

Looking at dataset_description.json files, these are common dataset-level properties:

Acknowledgements
Authors
BIDSVersion
DatasetDOI
DatasetType
EthicsApprovals
Funding
HowToAcknowledge
License
Name
ReferencesAndLinks
GeneratedBy
Ethics

I


jsheunis commented

2025-07-02 14:32:10 +00:00

Author

Owner

Here's the structure that I can determine so far:


classDiagram

    class MyDataset {
        is_a: Dataset
        name: Name
        doi: DatasetDOI
        comments: Acknowledgements + HowToAcknowledge
    }
    class SexOfSubjectX {
        is_a: DataItem
    }
    class AgeOfSubjectX {
        is_a: DataItem
    }
    class Sex {
        is_a: Dimension
    }
    class Age {
        is_a: Dimension
    }
    class Site {
        is_a: Factor
    }

    class Years {
        is_a: Unit
        short_name: "yr"
        description: "..."
    }
    class Months {
        is_a: Unit
        short_name: "mo"
        description: "..."
    }
    class SubjectX {
        is_a: Subject
    }
    
    class DeterminingAgeOfSubjectX {
        is_a: StudyActivity
    }
    class DeterminingSexOfSubjectX {
        is_a: StudyActivity
    }

    class Human {
        is_a: SubjectType
    }
    
    
    DeterminingAgeOfSubjectX <.. AgeOfSubjectX : generated_by
    DeterminingSexOfSubjectX <.. SexOfSubjectX : generated_by
    
    MyDataset <.. SexOfSubjectX : part_of
    MyDataset <.. AgeOfSubjectX : part_of

    MyDataset ..> Sex : outcome_variable
    MyDataset ..> Age : outcome_variable

    SexOfSubjectX ..> Sex : outcome_variable
    AgeOfSubjectX ..> Age : outcome_variable

    SexOfSubjectX ..> SubjectX : derived_from
    AgeOfSubjectX ..> SubjectX : derived_from
    
    AgeOfSubjectX ..> Months : has_unit
    
    DeterminingAgeOfSubjectX ..> SubjectX : studied_subjects
    DeterminingAgeOfSubjectX ..> Site : influencing_factors
    DeterminingSexOfSubjectX ..> SubjectX : studied_subjects
    DeterminingSexOfSubjectX ..> Site : influencing_factors

    SubjectX ..> Human : is_of_subject_type
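
To make the diagram more concrete, here is a rough sketch of the records it implies for a single row of participants.tsv. The record layout (plain dicts with `id` and `value` keys) and the identifier naming are assumptions for illustration only, not the actual flat-data schema. The relation names are taken from the diagram:

```python
def records_for_participant(dataset_id, row, age_unit="Months"):
    """Build Subject, StudyActivity and DataItem records for one participant.

    `row` is one participants.tsv row as a dict. The "id" and "value" keys
    and the id naming scheme are hypothetical.
    """
    pid = row["participant_id"]
    records = [{"id": pid, "is_a": "Subject", "is_of_subject_type": "Human"}]

    for dimension in ("Age", "Sex"):
        value = row.get(dimension.lower())
        # BIDS marks missing values as "n/a"; skip those and empty cells.
        if value in (None, "", "n/a"):
            continue

        activity_id = f"Determining{dimension}Of_{pid}"
        activity = {
            "id": activity_id,
            "is_a": "StudyActivity",
            "studied_subjects": [pid],
        }
        # Only link the Site factor when the row actually reports a site.
        if row.get("site") not in (None, "", "n/a"):
            activity["influencing_factors"] = ["Site"]

        item = {
            "id": f"{dimension}Of_{pid}",
            "is_a": "DataItem",
            "value": value,
            "part_of": dataset_id,
            "outcome_variable": dimension,
            "derived_from": pid,
            "generated_by": activity_id,
        }
        if dimension == "Age":
            item["has_unit"] = age_unit

        records.extend([activity, item])

    return records
```

For a row such as `{"participant_id": "sub-01", "age": "34", "sex": "n/a", "site": "A"}` this yields one Subject, one StudyActivity and one DataItem, because the missing sex value is skipped.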


jsheunis commented

2025-07-02 14:56:50 +00:00

Author

Owner

From #17 (https://hub.psychoinformatics.de/inm7/annotate.inm7.de-data/issues/17#issue-1156):

The following are either missing in that the flat schema of Dataset does not include a specific property for it, or otherwise it is not immediately clear to me how these would be annotated in a hierarchical/linking sense:

author(s) - definitely necessary

homepage - could be convenient, but could also be achieved by exactMatch, or a "see-also" annotation

doi - could be convenient, but could also be achieved by adding an issuedIdentifier

license - not sure if this should rather be on the Distribution class though

funding - I am not sure where exactly this should be represented, perhaps on a different related class (like Study?)

publications - I am not sure where exactly this should be represented

I think the following properties from the dataset_description.json files would be good to include in this first data annotation round:

Acknowledgements: could go into the existing comments field
Authors: a new multivalued slot on Dataset with range Person?

DatasetDOI: a new slot on Dataset with range IssuedIdentifier? Or string, as per lines 202 to 208 of src/simpleinput/unreleased.yaml in inm7/inm7-concepts@44a1cf7:

   doi:
     description: >-
       Associated Digital Object Identifier (DOI; ISO 26324; see
       https://doi.org).  The value must be just the DOI without the URL
       project. So just `10.1038/s41597-022-01163-2` and not
       `https://doi.org/10.1038/s41597-022-01163-2`.
     range: string

Funding: maybe a new funding field. Could be range string to simplify it, or same as:
inm7/inm7-concepts – src/simpleinput/unreleased.yaml
Lines 222 to 226 in inm7/inm7-concepts@44a1cf7

   funding:
     description: >-
       Grant that provides resources for a project.
     range: Grant
     multivalued: true

HowToAcknowledge: could go into the existing comments field
License: my feeling is this would be a property of a distribution, so not immediately applicable if we don't annotate that yet
Name: maps directly to existing name field on Dataset
ReferencesAndLinks: perhaps it would be useful to have a generic see_also multivalued field to house these items? These could technically also be included as curation comments, although I feel seeAlso (https://www.w3.org/TR/rdf-schema/#ch_seealso) fits particularly well. Maybe the difference is the type, i.e. string vs. URI.
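
As a rough sketch of the mapping proposed above: the `doi` and `see_also` keys are the suggested new slots and do not exist on `Dataset` yet, and treating `comments` as a list is an assumption. The function names are hypothetical. The DOI helper follows the rule quoted above that the value must be the bare DOI rather than a URL:

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def bare_doi(value: str) -> str:
    """Strip a resolver URL or 'doi:' prefix, leaving just the DOI."""
    text = value.strip()
    lowered = text.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            return text[len(prefix):]
    return text


def to_flat_dataset(description: dict) -> dict:
    """Map a parsed dataset_description.json onto a flat Dataset record."""
    record = {"is_a": "Dataset"}
    if description.get("Name"):
        record["name"] = description["Name"]

    # Acknowledgements and HowToAcknowledge both go into comments.
    comments = [
        description[key]
        for key in ("Acknowledgements", "HowToAcknowledge")
        if description.get(key)
    ]
    if comments:
        record["comments"] = comments

    # Proposed slots, not part of the current schema.
    if description.get("DatasetDOI"):
        record["doi"] = bare_doi(description["DatasetDOI"])
    if description.get("ReferencesAndLinks"):
        record["see_also"] = list(description["ReferencesAndLinks"])

    return record
```

License, Authors and Funding are left out on purpose, since their target class or range is still open.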


jsheunis commented

2025-07-03 08:59:29 +00:00

Author

Owner

Internal discussions say:

subject, sex, age are the important aspects to start with
let's try and break the stack with this
the rest of the slots/properties can still be added afterwards, once their context and relevance are better understood


👍 1
