Curation workflow brainstorming #16
inm7/inm7-concepts#16
The aim of metadata curation is a collection of knowledge. Knowledge is a collection of curated metadata records, assembled in some meaningful way. https://hub.psychoinformatics.de/inm7/knowledge is an example of such a collection, following the dump-things specification and containing records that are compliant with the INM7 metadata model(s).
High-level curation process
The curation process that yields this curated metadata can take many forms. Here we need to settle on an initial draft. This draft should use the tools we already have in a sensible fashion, and offer a process that can be implemented with requirements that are as minimal as possible.
https://hub.psychoinformatics.de/inm7/knowledge is a datalad dataset. It makes sense for a curation step to have provenance captured via datalad. In order to enable this, the uncurated metadata are ideally tracked in a datalad dataset too, which is then tracked as one (or more) subdataset(s) of the curated knowledge repo.
This means that the token-stores need to be dataladified. It is possible that the dump-things-server starts committing submissions to the token dataset. However, I think this is not important, and maybe not even useful. Without committing, we still know the origin of a record submission, because it is given by the token identity. The time of submission is much less important than the time of merging the curated information into the knowledge repo. Until that happens, any number of fixes to a record within the token-space are possible and useful, all without knowing when and if they happened, because the record never left that token-space.
Individual record curation
In principle, any submission needs to be inspected: for schema compliance, and for meaning. The former is already taken care of by the API. The latter is either a manual process, or subject to a trusted implementation (e.g., auto-submission from a machine-source). Curation may involve adding, altering, or even removing information from an incoming record. This step requires expert knowledge.
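The split between manual inspection and a trusted implementation could be sketched as a small routing step. This is only an illustration: the `Submission` type, the token name, and the two outcome labels are hypothetical and not part of dump-things or the INM7 tooling. The sketch assumes that schema compliance has already been checked by the API.

```python
from dataclasses import dataclass

# Hypothetical: tokens whose submissions come from a trusted machine source.
TRUSTED_TOKENS = {"machine-source-token"}


@dataclass
class Submission:
    token: str    # identity of the token-space the record was submitted to
    record: dict  # assumed to have passed the API's schema check already


def route_submission(sub: Submission, trusted=TRUSTED_TOKENS) -> str:
    """Decide how the meaning of a schema-compliant record gets inspected."""
    if sub.token in trusted:
        # A trusted implementation vouches for the meaning of the record.
        return "auto-accept"
    # Everything else needs to be looked at by an expert.
    return "manual-review"
```

Keeping the decision keyed on the token identity matches the observation above that the origin of a submission is already given by the token.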
Merging curated records
Two cases must be distinguished:
a) matching record already exists
b) no prior record exists
In case of (b), the curated incoming record can simply be accepted and placed into the knowledge base.
In case of (a), a merge may aim at extending an existing record, or at replacing it. There is no way to infer the intent. It must be signaled by a suitable annotation.
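The two cases can be sketched as follows. This is a minimal sketch under stated assumptions, not a proposal for the actual implementation: the knowledge base is modeled as a plain mapping from `pid` to record, the intent annotation is passed as a hypothetical `intent` argument with the values `"extend"` and `"replace"`, and "extend" is assumed to mean a shallow update in which incoming values win on conflicting keys.

```python
import copy
from typing import Optional


def merge_record(knowledge: dict, incoming: dict, intent: Optional[str] = None) -> dict:
    """Merge one curated record into `knowledge` (a mapping pid -> record)."""
    pid = incoming["pid"]
    existing = knowledge.get(pid)

    if existing is None:
        # Case (b): no prior record exists, simply accept the incoming record.
        knowledge[pid] = copy.deepcopy(incoming)
    elif intent == "replace":
        # Case (a), replace: the incoming record supersedes the existing one.
        knowledge[pid] = copy.deepcopy(incoming)
    elif intent == "extend":
        # Case (a), extend: keep existing items and add the incoming ones.
        # Assumption: on conflicting keys the incoming value wins.
        merged = copy.deepcopy(existing)
        merged.update(copy.deepcopy(incoming))
        knowledge[pid] = merged
    else:
        # Case (a) without annotation: the intent cannot be inferred.
        raise ValueError(f"record {pid!r} exists; intent must be 'extend' or 'replace'")
    return knowledge[pid]
```

Raising an error when a matching record exists and no intent is given makes the missing annotation visible instead of silently guessing.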
In general it is not sensible to ever remove a metadata record entirely. Its pid may be referenced elsewhere, and removing the record would also remove the possibility to understand what used to be there. However, for various reasons it may be necessary to remove individual information items from a record. In the extreme case, the record is stripped of all information other than the pid (leaving an annotation that explains the state of the record). In the case that the pid itself is problematic and needs to be removed, the entire record must be removed. However, it is understood that the pid could be referenced in any number of places, carrying the same problematic information. Removal of a single record is unlikely to address the issue in a comprehensive way.
Implications of data removal
For data removal to be an option in a situation where curated knowledge is version-controlled, the metadata record content cannot be tracked with Git directly. Removal would require history editing in this case. Such editing would invalidate any electronic signatures or sign-offs regarding the curated knowledge state.
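Why history editing invalidates sign-offs can be illustrated with a simplified model. The sketch below is not Git's actual object format. It only mimics the property that each version identifier covers the content and the identifier of its predecessor. The helper names `redact`, `scrub_email`, and `history_ids`, as well as the example records, are hypothetical.

```python
import hashlib
import json


def redact(record: dict, reason: str) -> dict:
    """Strip a record down to its pid, leaving an explanatory annotation."""
    return {"pid": record["pid"], "annotation": reason}


def scrub_email(record: dict) -> dict:
    """Remove a single (hypothetical) problematic item from a record."""
    return {k: v for k, v in record.items() if k != "email"}


def history_ids(states: list) -> list:
    """Chain-hash a sequence of record states; each id covers its predecessor."""
    ids, parent = [], ""
    for state in states:
        payload = parent + json.dumps(state, sort_keys=True)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        ids.append(parent)
    return ids


v1 = {"pid": "ex:1", "name": "A", "email": "a@example.org"}
v2 = {**v1, "affiliation": "X"}
original = history_ids([v1, v2])

# Appending a redacted state keeps all earlier ids (and any sign-offs on
# them) valid, but the removed information still exists in v1 and v2.
appended = history_ids([v1, v2, redact(v2, "content removed on request")])

# Truly removing the item means rewriting v1 and v2, which changes every id
# from the first edited state onward.
rewritten = history_ids([scrub_email(v1), scrub_email(v2)])

print(appended[:2] == original)                           # True
print(any(a == b for a, b in zip(original, rewritten)))   # False
```

In this model a signature made on `original[1]` no longer matches anything in the rewritten history, which is the situation the paragraph above describes.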
A sketch of a two-stage curation workflow.
Importantly, the information coming in via the simplified UI/forms is not necessarily the same that ends up being curated for the purpose of enhancing this UI. The incoming information is curated by
pid from the incoming information
On output of records for the purpose of enhancing the data entry UIs
pid is retained (and typically different from the generated pid of the original input)