Some thoughts on curation workflows #53

Merged
mih merged 1 commit from flatentry into main 2025-05-07 14:20:48 +00:00

@ -59,7 +59,75 @@ flowchart LR
USER1 ~~~ USER2 ~~~ USER3
```
## Curation workflows
Depending on the nature of the metadata and the respective audiences for producing
and consuming metadata, curation workflows differ substantially. The following sections
collect some ideas and constraints to keep in mind when designing such workflows
in this context.
### PIDs also require curation
Persistent identifiers (PIDs) play a key role in this metadata concept. Data
models and vocabularies can change flexibly, but records still describe one and
the same `Thing` when the PID is identical.
Persistent identifiers allow referencing entities in contexts where not all
information about an entity is available. One can reference a `Person` without
having to reveal possibly sensitive information about that `Person` at the same
time. For example, a public `Person` record about an academic may only contain
a name and a work contact email (equivalent to the information available on
a corresponding author in a journal publication). At the same time, an internal
`Person` record would have additional information, like a private cell phone number.
The public record can be generated from the richer, internal record by stripping
information.
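
Deriving the public record from the internal one could look roughly like the following sketch. The field names (`pid`, `name`, `work_email`, `private_phone`), the example PID, and the allow-list are assumptions made for illustration; they are not taken from the actual data model.

```python
# Hypothetical allow-list of fields that may appear in a public `Person` record.
PUBLIC_PERSON_FIELDS = {"pid", "name", "work_email"}


def to_public_record(internal: dict) -> dict:
    """Return a copy of `internal` restricted to the allow-listed fields."""
    return {k: v for k, v in internal.items() if k in PUBLIC_PERSON_FIELDS}


# Illustrative internal record (all values are made up).
internal_person = {
    "pid": "ex:persons/jane-doe",
    "name": "Jane Doe",
    "work_email": "j.doe@example.org",
    "private_phone": "+49 000 0000000",
}

public_person = to_public_record(internal_person)
```

An allow-list is used here rather than a deny-list, so that fields added to the internal record later are not published by accident.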
#### PIDs may require mapping
However, an identifier itself can also carry information. For example, an ORCID
identifier can typically be used to reveal the name of a person. Hence, when an
ORCID is used as the PID for a metadata record, any place where the identifier
is mentioned also reveals the name of the person. If the identifiers used for
an internal, protected record and a corresponding public record are the same,
cross-referencing may be enabled unintentionally.
In such cases, it can be necessary to maintain mapping tables for PIDs of the
same entity in different contexts.
Maintaining a separate PID mapping is also an instrument to aid (future)
anonymization of records. When the mapping is destroyed (and other conditions
are fulfilled too), a PID-based re-identification is potentially made impossible.
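
A minimal sketch of such a mapping table, assuming an in-memory dictionary and random public PIDs. The class `PidMap`, the `ex:public/` prefix, and the example internal PID are hypothetical; a real deployment would need protected, persistent storage.

```python
import uuid


class PidMap:
    """Hypothetical mapping table from internal PIDs to public PIDs."""

    def __init__(self, public_prefix: str = "ex:public/"):
        self._public_prefix = public_prefix
        self._internal_to_public = {}

    def public_pid(self, internal_pid: str) -> str:
        """Return the public PID for `internal_pid`, minting one if needed."""
        if internal_pid not in self._internal_to_public:
            # A random public PID carries no information about the entity.
            minted = f"{self._public_prefix}{uuid.uuid4()}"
            self._internal_to_public[internal_pid] = minted
        return self._internal_to_public[internal_pid]

    def forget(self, internal_pid: str) -> None:
        """Destroy the mapping entry, removing the link between both PIDs."""
        self._internal_to_public.pop(internal_pid, None)


pid_map = PidMap()
internal_pid = "ex:internal/persons/jane-doe"
first_public = pid_map.public_pid(internal_pid)
same_public = pid_map.public_pid(internal_pid)  # stable while the entry exists
pid_map.forget(internal_pid)
```

After `forget()`, the table can no longer be used to connect the public PID to the internal record. As noted above, this alone does not guarantee anonymization; other conditions have to hold too.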
#### PIDs may require curation
When metadata records are submitted by non-experts, these records already need to have
PIDs in order to enable submission of multiple, interlinked records. It is advisable
to use dedicated (actually only temporarily persistent) PIDs for this purpose.
The reason is that a submitter cannot necessarily be trusted to use the PID of an
existing record to make further statements. Instead, they may create a new record,
with the same information as an existing one, and consequently use a new PID to link
information to this entity. While curation could keep both records and declare them
"same as" each other, this needlessly inflates the number of records, increases
the maintenance load, and complicates queries.
Instead, curation could merge the two records found to describe the same entity,
and retain only the already existing one, and therefore just one relevant PID.
Subsequently, all references to the duplicate record's PID in the submission
could be replaced with this original PID.
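
The merge step could be sketched as follows, assuming that a submission is a list of dictionaries with a `pid` key and that references to other records are plain PID strings. The function names, record layout, and example PIDs are hypothetical.

```python
def replace_pid(value, old_pid: str, new_pid: str):
    """Recursively replace every value equal to `old_pid` with `new_pid`."""
    if isinstance(value, dict):
        return {k: replace_pid(v, old_pid, new_pid) for k, v in value.items()}
    if isinstance(value, list):
        return [replace_pid(v, old_pid, new_pid) for v in value]
    # Note: any string equal to the old PID is replaced, in whatever field it appears.
    return new_pid if value == old_pid else value


def merge_duplicate(submission: list, duplicate_pid: str, existing_pid: str) -> list:
    """Drop the duplicate record and point all references to the existing PID."""
    kept = [record for record in submission if record["pid"] != duplicate_pid]
    return [replace_pid(record, duplicate_pid, existing_pid) for record in kept]


# Illustrative submission: a new `Person` record that duplicates an existing one,
# and a dataset record that references it.
submission = [
    {"pid": "inm7:pending/a1", "name": "Jane Doe"},
    {"pid": "inm7:pending/b2", "title": "Some dataset", "authors": ["inm7:pending/a1"]},
]

curated = merge_duplicate(submission, "inm7:pending/a1", "ex:persons/jane-doe")
```

In this sketch the duplicate is simply dropped. If it held information missing from the existing record, a curator would have to carry that over before discarding it.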
Using a dedicated PID space for pre-curation PIDs, such as
`inm7:pending/<random-id>`, can help the curation process by making them easier
to detect. Moreover, using random, auto-generated PIDs for new, pre-curation
records also eases the task for submitters. They do not have to learn and follow
possible rules for PID generation, such as using particular PID systems for certain
types of records (e.g., DOIs for publications, ORCIDs for researchers, ROR IDs for
organizations, RRIDs for resources, etc.). This task could be left to professional
curators.
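
Generating and detecting such pre-curation PIDs could be as simple as the sketch below. The `inm7:pending/` prefix is taken from the text above; using a UUID as the random part, and the function names, are assumptions.

```python
import uuid

PENDING_PREFIX = "inm7:pending/"


def new_pending_pid() -> str:
    """Mint a random pre-curation PID (the UUID format is an assumption)."""
    return f"{PENDING_PREFIX}{uuid.uuid4().hex}"


def is_pending(pid: str) -> bool:
    """Tell whether a PID lives in the pre-curation PID space."""
    return pid.startswith(PENDING_PREFIX)


def pending_pids(records: list) -> list:
    """Return the PIDs of all records that still await curation."""
    return [r["pid"] for r in records if is_pending(r["pid"])]


# Illustrative records: one freshly submitted, one already curated.
records = [
    {"pid": new_pending_pid(), "name": "Jane Doe"},
    {"pid": "ex:persons/john-doe", "name": "John Doe"},
]
todo = pending_pids(records)
```

Because the prefix alone marks a record as uncurated, a curator can list the outstanding work with a simple string match and does not have to inspect record contents.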
## Acknowledgements
This work was funded by the MKW-NRW: Ministerium für Kultur und Wissenschaft
des Landes Nordrhein-Westfalen under the Kooperationsplattformen 2022 program,
grant number: KP22-106A.