Convert BIDS datasets to flat-data #22

Open
opened 2025-07-02 08:40:22 +00:00 by jsheunis · 4 comments
Owner

The conversion of the penguins dataset led to the script and helper metadata here: https://hub.datalad.org/edu/penguins/src/branch/main/code

This could be generalized more, to allow the same script to be used for BIDS datasets.

Author
Owner

Looking at the datasets that we have internally, there are a handful that have all three of these files:

  • dataset_description.json
  • participants.tsv
  • participants.json
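
As a rough illustration of that selection step, here is a minimal sketch that lists the dataset directories containing all three files. It assumes the datasets sit side by side as subdirectories of one root folder; the function name `find_candidate_datasets` and that layout are hypothetical, not something taken from the existing penguins script.

```python
from pathlib import Path

# The three files named in the list above.
REQUIRED_FILES = ("dataset_description.json", "participants.tsv", "participants.json")


def find_candidate_datasets(root):
    """Return the directories directly under `root` that contain all three files.

    Assumption: each dataset is a direct subdirectory of `root`.
    """
    root = Path(root)
    return sorted(
        d
        for d in root.iterdir()
        if d.is_dir() and all((d / name).is_file() for name in REQUIRED_FILES)
    )
```

Calling `find_candidate_datasets("/path/to/datasets")` would then give the short list of datasets to start with.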

It makes sense to me to start here, since these datasets offer the best way to verify the column headings in the participants.tsv file. Of these datasets, here are some common column headers of the participants.tsv files:

  • participant_id
  • site
  • age
  • sex
  • group

(Age is sometimes reported in months, sometimes in years.)
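
Because the unit varies, a conversion script would probably need to read it from participants.json instead of guessing. The sketch below assumes the sidecar describes the `age` column with a `Units` entry (the key BIDS uses for tabular sidecars) and that missing values are written as `n/a`; the accepted unit spellings and both function names are assumptions for illustration only.

```python
import csv
import io

MONTHS_PER_YEAR = 12


def age_in_years(value, unit):
    """Convert an age string to years; return None for missing values."""
    if value in ("", "n/a"):
        return None
    age = float(value)
    unit = unit.strip().lower()
    # Assumed spellings; real sidecars may use others.
    if unit in ("year", "years", "yr", "y"):
        return age
    if unit in ("month", "months", "mo"):
        return age / MONTHS_PER_YEAR
    raise ValueError(f"unrecognized age unit: {unit!r}")


def read_ages(tsv_text, sidecar):
    """Map participant_id to age in years, using the sidecar's unit for `age`."""
    unit = sidecar.get("age", {}).get("Units")
    if unit is None:
        raise ValueError("participants.json does not state a unit for 'age'")
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["participant_id"]: age_in_years(row["age"], unit) for row in reader}
```

Raising an error when no unit is declared is a deliberate choice here: silently assuming years would hide exactly the ambiguity noted above.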

Looking at dataset_description.json files, these are common dataset-level properties:

  • Acknowledgements
  • Authors
  • BIDSVersion
  • DatasetDOI
  • DatasetType
  • EthicsApprovals
  • Funding
  • HowToAcknowledge
  • License
  • Name
  • ReferencesAndLinks
  • GeneratedBy
  • Ethics

I

Author
Owner

Here's the structure that I can determine so far (as a Mermaid class diagram):


classDiagram

    class MyDataset {
        is_a: Dataset
        name: Name
        doi: DatasetDOI
        comments: Acknowledgements + HowToAcknowledge
    }
    class SexOfSubjectX {
        is_a: DataItem
    }
    class AgeOfSubjectX {
        is_a: DataItem
    }
    class Sex {
        is_a: Dimension
    }
    class Age {
        is_a: Dimension
    }
    class Site {
        is_a: Factor
    }

    class Years {
        is_a: Unit
        short_name: "yr"
        description: "..."
    }
    class Months {
        is_a: Unit
        short_name: "mo"
        description: "..."
    }
    class SubjectX {
        is_a: Subject
    }
    
    class DeterminingAgeOfSubjectX {
        is_a: StudyActivity
    }
    class DeterminingSexOfSubjectX {
        is_a: StudyActivity
    }

    class Human {
        is_a: SubjectType
    }
    
    
    DeterminingAgeOfSubjectX <.. AgeOfSubjectX : generated_by
    DeterminingSexOfSubjectX <.. SexOfSubjectX : generated_by
    
    MyDataset <.. SexOfSubjectX : part_of
    MyDataset <.. AgeOfSubjectX : part_of

    MyDataset ..> Sex : outcome_variable
    MyDataset ..> Age : outcome_variable

    SexOfSubjectX ..> Sex : outcome_variable
    AgeOfSubjectX ..> Age : outcome_variable

    SexOfSubjectX ..> SubjectX : derived_from
    AgeOfSubjectX ..> SubjectX : derived_from
    
    AgeOfSubjectX ..> Months : has_unit
    
    DeterminingAgeOfSubjectX ..> SubjectX : studied_subjects
    DeterminingAgeOfSubjectX ..> Site : influencing_factors
    DeterminingSexOfSubjectX ..> SubjectX : studied_subjects
    DeterminingSexOfSubjectX ..> Site : influencing_factors

    SubjectX ..> Human : is_of_subject_type
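
To make the diagram more concrete, here is a minimal sketch of the flat records it implies for one row of participants.tsv. The class and relation names follow the diagram; the `id` and `value` keys, the ID naming scheme, and the function name `records_for_subject` are hypothetical and not part of the schema as discussed so far.

```python
def records_for_subject(dataset_id, participant_id, sex, age, age_unit="Months"):
    """Build flat records (plain dicts) for one subject, following the diagram.

    Assumptions: `id` and `value` are placeholder keys, and every subject
    is of subject type Human.
    """
    records = [
        {"id": participant_id, "is_a": "Subject", "is_of_subject_type": "Human"}
    ]
    for dimension, value, unit in (("Sex", sex, None), ("Age", age, age_unit)):
        activity_id = f"Determining{dimension}Of{participant_id}"
        # One StudyActivity per measured dimension, influenced by Site.
        records.append(
            {
                "id": activity_id,
                "is_a": "StudyActivity",
                "studied_subjects": participant_id,
                "influencing_factors": "Site",
            }
        )
        item = {
            "id": f"{dimension}Of{participant_id}",
            "is_a": "DataItem",
            "value": value,
            "part_of": dataset_id,
            "outcome_variable": dimension,
            "derived_from": participant_id,
            "generated_by": activity_id,
        }
        if unit is not None:
            # Only the age item carries a unit in the diagram.
            item["has_unit"] = unit
        records.append(item)
    return records
```

Each participant then yields five records: one Subject, plus a StudyActivity and a DataItem for each of sex and age.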

Author
Owner

From #17 (comment, https://hub.psychoinformatics.de/inm7/annotate.inm7.de-data/issues/17#issue-1156):

The following are either missing in that the flat schema of Dataset does not include a specific property for it, or otherwise it is not immediately clear to me how these would be annotated in a hierarchical/linking sense:

  • author(s) - definitely necessary
  • homepage - could be convenient, but could also be achieved by exactMatch, or a "see-also" annotation
  • doi - could be convenient, but could also be achieved by adding an issuedIdentifier
  • license - not sure if this should rather be on the Distribution class though
  • funding - I am not sure where exactly this should be represented, perhaps on a different related class (like Study?)
  • publications - I am not sure where exactly this should be represented

I think the following properties from the dataset_description.json files would be good to include in this first data annotation round:

  • Acknowledgements: could go into the existing comments field
  • Authors: a new multivalued slot on Dataset with range Person?
  • DatasetDOI: a new slot on Dataset with range IssuedIdentifier? or string as per https://hub.psychoinformatics.de/inm7/inm7-concepts/src/commit/44a1cf729240568e196119ff5fee961380b6654a/src/simpleinput/unreleased.yaml#L202-L208:
    doi:
      description: >-
        Associated Digital Object Identifier (DOI; ISO 26324; see
        https://doi.org). The value must be just the DOI without the URL
        project. So just `10.1038/s41597-022-01163-2` and not
        `https://doi.org/10.1038/s41597-022-01163-2`.
      range: string
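  • Funding: maybe a new funding field. Could be range string to simplify it, or same as https://hub.psychoinformatics.de/inm7/inm7-concepts/src/commit/44a1cf729240568e196119ff5fee961380b6654a/src/simpleinput/unreleased.yaml#L222-L226:
    funding:
      description: >-
        Grant that provides resources for a project.
      range: Grant
      multivalued: true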
  • HowToAcknowledge: could go into the existing comments field
  • License: my feeling is this would be a property of a distribution, so not immediately applicable if we don't annotate that yet
  • Name: maps directly to existing name field on Dataset
  • ReferencesAndLinks: perhaps it would be useful to have a generic see_also multivalued field to house these items? These could technically also be included as curation comments, although I feel seeAlso (https://www.w3.org/TR/rdf-schema/#ch_seealso) fits particularly well. Maybe the difference is the type, i.e. string vs. URI.
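
The mapping proposed in this list could be sketched roughly as follows. Only `name` and `comments` exist on Dataset today; `doi`, `authors`, and `see_also` are the slots proposed above and are assumptions, as are the function names and the choice to join the two acknowledgement fields into one `comments` string. Funding and License are left out because their placement is still open.

```python
DOI_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")


def bare_doi(value):
    """Strip a resolver URL or `doi:` prefix so only the DOI itself remains."""
    value = value.strip()
    for prefix in DOI_PREFIXES:
        if value.lower().startswith(prefix):
            return value[len(prefix):]
    return value


def dataset_record(description):
    """Map a parsed dataset_description.json (a dict) to a flat Dataset record."""
    record = {"is_a": "Dataset", "name": description.get("Name")}
    if description.get("DatasetDOI"):
        record["doi"] = bare_doi(description["DatasetDOI"])  # proposed slot
    comments = [
        description[key]
        for key in ("Acknowledgements", "HowToAcknowledge")
        if description.get(key)
    ]
    if comments:
        # Assumption: both fields are joined into the existing comments field.
        record["comments"] = "\n\n".join(comments)
    if description.get("Authors"):
        record["authors"] = list(description["Authors"])  # proposed slot
    if description.get("ReferencesAndLinks"):
        record["see_also"] = list(description["ReferencesAndLinks"])  # proposed slot
    return record
```

Stripping the resolver prefix follows the `doi` description quoted above, which asks for the DOI without the URL part.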
Author
Owner

Internal discussions say:

  • subject, sex, age are the important aspects to start with
  • let's try and break the stack with this
  • the rest of the slots/properties can still be added afterwards, once their context and relevance are better understood
Reference
inm7/annotate.inm7.de-data#22