Curation workflow brainstorming #16

Open

opened 2025-04-21 18:41:29 +00:00 by mih · 1 comment

mih commented

2025-04-21 18:41:29 +00:00

Owner

High-level curation process

The curation process that yields this curated metadata can be quite versatile. Here we need to settle on an initial draft. This draft should use the tools we have in a sensible fashion to offer a process that can be implemented with requirements that are as minimal as possible.

https://hub.psychoinformatics.de/inm7/knowledge is a datalad dataset. It makes sense for a curation step to have provenance captured via datalad. In order to enable this, the uncurated metadata are ideally tracked in a datalad dataset too, which is then tracked as one (or more) subdataset(s) of the curated knowledge repo.

This means that the token-stores need to be dataladified. It is possible for the dump-things-server to start committing submissions to the token dataset. However, I think this is not important, and maybe not even useful. Without committing, we still know the origin of a record submission, because it is given by the token identity. The time of submission is much less important than the time of merging the curated information into the knowledge repo. Until that happens, any number of fixes to a record within the token-space are possible and useful, all without knowing when and if they happened, because the record never left that token-space.

Individual record curation

In principle, any submission needs to be inspected, both for schema compliance and for meaning. The former is already taken care of by the API. The latter is either a manual process, or subject to a trusted implementation (e.g., auto-submission from a machine-source). Curation may involve adding, altering, or even removing information from an incoming record. This step requires expert knowledge.

Merging curated records

Two cases must be distinguished:

a) matching record already exists
b) no prior record exists

In case of (b), the curated incoming record can simply be accepted and placed into the knowledge base.

In case of (a), a merge may aim at extending an existing record, or at replacing it. There is no way to infer the intent. It must be signaled by a suitable annotation.
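The two cases, and the need for an explicit intent when a match exists, can be sketched as follows. This is a minimal illustration, not the dump-things implementation: the function `merge_curated`, the intent values `"extend"` and `"replace"`, and the flat dict record layout are assumptions made for this example. Only the `pid` key is taken from the discussion here. What "extend" should do on conflicting properties is also an assumption (incoming values win).

```python
from typing import Optional

Record = dict  # hypothetical record layout: a flat dict with a "pid" key


def merge_curated(knowledge: dict, incoming: Record, intent: Optional[str] = None) -> Record:
    """Merge one curated record into a pid-indexed collection.

    `intent` is the explicit annotation: "extend" or "replace".
    """
    pid = incoming["pid"]
    existing = knowledge.get(pid)

    if existing is None:
        # Case (b): no prior record, accept the curated record as-is.
        knowledge[pid] = dict(incoming)
    elif intent == "replace":
        # Case (a), replace: the incoming record supersedes the old one.
        knowledge[pid] = dict(incoming)
    elif intent == "extend":
        # Case (a), extend: keep existing properties, add or update
        # those present in the incoming record (an assumed policy).
        knowledge[pid] = {**existing, **incoming}
    else:
        # The intent cannot be inferred, so refuse to guess.
        raise ValueError(f"record {pid!r} exists; intent must be 'extend' or 'replace'")
    return knowledge[pid]
```

Raising an error on a match without an annotation keeps the decision with the curator instead of silently picking one of the two behaviors.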

In general it is not sensible to ever remove a metadata record entirely. Its pid may be referenced elsewhere, and removing the record would also remove the possibility to understand what used to be there. However, for various reasons it may be necessary to remove individual information items from a record. In the extreme case, the record is stripped of all information other than the pid (leaving an annotation that explains the state of the record). In the case that the pid itself is problematic and needs to be removed, the entire record must be removed. However, it is understood that the pid could be referenced in any number of places, carrying the same problematic information. Removal of a single record is unlikely to address the issue in a comprehensive way.
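Removing individual items while keeping the record addressable could look like the sketch below. The helper names `remove_items` and `strip_to_pid`, and the `curation_note` field used for the explanatory annotation, are hypothetical and only serve the illustration.

```python
def remove_items(record: dict, keys: set, reason: str) -> dict:
    """Return a copy of `record` without `keys`, plus a note explaining why.

    The pid is never removed here; `curation_note` is a hypothetical field.
    """
    kept = {k: v for k, v in record.items() if k == "pid" or k not in keys}
    kept["curation_note"] = reason
    return kept


def strip_to_pid(record: dict, reason: str) -> dict:
    """Extreme case: keep only the pid and the explanatory annotation."""
    return remove_items(record, set(record) - {"pid"}, reason)
```

Both helpers return a new dict and leave the input untouched, so the curator can still compare the before and after state prior to merging.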

Implications of data removal

For data removal to be an option in a situation where curated knowledge is version-controlled, the metadata record content cannot be tracked with Git directly. Removal would require history editing in this case. Such editing would invalidate any electronic signatures or sign-offs on the curated knowledge state.
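A toy model may help to see why indirection avoids history editing. It is not how Git or git-annex compute their identifiers. It only mimics the idea that the versioned history references a content key, while the content itself lives in a separate store. The names `content_key`, `commit_id`, and `store` are made up for this sketch.

```python
import hashlib
import json


def content_key(payload: bytes) -> str:
    """Key that stands in for the record content inside the history."""
    return hashlib.sha256(payload).hexdigest()


def commit_id(parent: str, tree: dict) -> str:
    """Toy commit id: depends on the parent id and on path -> key only."""
    body = json.dumps({"parent": parent, "tree": tree}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


# Content lives outside the versioned history, addressed by key.
store = {}
payload = b'{"pid": "ex:person-1", "name": "Jane Doe"}'
key = content_key(payload)
store[key] = payload

first = commit_id("", {"records/person-1.json": key})
second = commit_id(
    first,
    {"records/person-1.json": key, "records/README": content_key(b"docs")},
)

# Dropping the content does not change any commit id ...
del store[key]
assert commit_id("", {"records/person-1.json": key}) == first
# ... so `second`, which builds on `first`, and any sign-off on it stay valid.
```

Note that in this model the key, a hash of the removed content, remains in the history. Whether that is acceptable for a given removal request is a separate question.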

The aim of metadata curation is a collection of knowledge. Knowledge is a collection of curated metadata records, assembled in some meaningful way. https://hub.psychoinformatics.de/inm7/knowledge is an example of such a collection, following the dump-things specification and containing records that are compliant with the INM7 metadata model(s).

mih commented

2025-04-23 08:11:05 +00:00

Author

Owner

A sketch of a two-stage curation workflow.

- Rectangles are services
- Diamonds are manual/automated processes
- The rest are documents

Importantly, the information coming in via the simplified UI/forms is not necessarily the same as what ends up being curated for the purpose of enhancing this UI. The incoming information is curated by:

- assigning a suitable pid from the incoming information
- generating multiple inter-related instances from a single input record, by assigning roles and other properties to relationships
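As a rough illustration of these two steps, the sketch below turns one flat form submission into two inter-related records. Everything specific in it is an assumption made for the example: the input fields `name`, `organization`, and `role`, the `ex:` pid prefixes, the `relations` layout, and the idea of deriving a pid from a slug of the name. The `known_pids` argument stands for the lookup against existing knowledge base records.

```python
import re


def slug(text: str) -> str:
    """Lower-case `text` and collapse anything non-alphanumeric to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def expand_simple_input(simple: dict, known_pids: set) -> list:
    """Derive inter-related records from one flat input record."""
    # PID assignment from the incoming information (hypothetical scheme).
    person_pid = "ex:person/" + slug(simple["name"])
    org_pid = "ex:org/" + slug(simple["organization"])

    person = {
        "pid": person_pid,
        "name": simple["name"],
        # Qualified relation: the link to the organization carries a role.
        "relations": [{"object": org_pid, "roles": [simple["role"]]}],
    }
    records = [person]
    # Only emit an organization record if none is known under that pid.
    if org_pid not in known_pids:
        records.append({"pid": org_pid, "name": simple["organization"]})
    return records
```

A real pid assignment would likely need more than a name-based slug, for example to tell apart two people with the same name. That decision is part of the curation step.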

On output of records for the purpose of enhancing the data entry UIs:

- simplified and stripped object representations are generated
- only information useful for identifying existing records for selection is kept
- most importantly the pid is retained (and typically different from the generated pid of the original input)
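The output direction can be sketched in the same style. The function name `simplify_for_ui` and the default choice of `name` as the only identifying field are assumptions. Which fields help a user recognize an existing record depends on the record type.

```python
def simplify_for_ui(record: dict, keep: tuple = ("name",)) -> dict:
    """Reduce a knowledge base record to what a form needs for selection."""
    # The pid is always retained so that a selection refers to the
    # curated record rather than to a newly generated one.
    slim = {"pid": record["pid"]}
    # Keep only the identifying fields that are actually present.
    slim.update({k: record[k] for k in keep if k in record})
    return slim
```

For example, `simplify_for_ui({"pid": "ex:org/inm7", "name": "INM7", "address": "..."})` keeps only the `pid` and `name` entries.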

flowchart TB
SISchema@{shape: doc, label: Simpleinput schema }
KBSchema@{shape: doc, label: Knowledge base schema }
SIUI@{ shape: rect, label: Simpleinput UI }
KBUI@{ shape: rect, label: Knowledge base UI }
SIApi@{ shape: rect, label: Simpleinput API }
KBApi@{ shape: rect, label: Knowledge base API }
SITokenRecords@{ shape: docs, label: Simpleinput records<br>(token space) }
SIRecords@{ shape: docs, label: Simpleinput records<br>(curated) }
KBTokenRecords@{ shape: docs, label: Knowledge base records<br>(token space) }
KBRecords@{ shape: docs, label: Knowledge base records }
AssignPID@{shape: diamond, label: PID assignment }
Limit4SI@{shape: diamond, label: Limit information }
KBCuration@{shape: diamond, label: Curation<br>Add/merge/replace }
Transform4SI@{shape: diamond, label: Data model<br>transformation }
BuildRelations@{shape: diamond, label: Build<br>qualified relations}
Transform4KB@{shape: diamond, label: Data model<br>transformation }

KBSchema-->|uses|KBUI
KBSchema-->|uses|KBApi
KBSchema-->|compliant with|KBTokenRecords
KBSchema-->|compliant with|KBRecords
SISchema-->|uses|SIUI
SISchema-->|uses|SIApi
SISchema-->|compliant with|SITokenRecords
SISchema-->|compliant with|SIRecords
SISchema-->|informs|Transform4KB
KBSchema-->|informs|Transform4KB
SISchema-->|informs|Transform4SI
KBSchema-->|informs|Transform4SI
SIUI-->SIApi-->SITokenRecords

SITokenRecords-->Transform4KB-->BuildRelations-->AssignPID-->KBApi-->KBTokenRecords
KBRecords-->Transform4SI-->Limit4SI-->SIRecords
KBTokenRecords-->KBCuration-->KBRecords
KBUI-->KBApi-->KBTokenRecords
KBRecords-->|informs|KBUI
SIRecords-->|informs|SIUI
KBRecords-->|informs|AssignPID
KBRecords-->|informs|BuildRelations

Transform4KB ~~~ Transform4SI
