Note on curation load aware schema design #73

Merged
mih merged 1 commit from flatentry into main 2025-05-23 14:21:26 +00:00

@@ -11,11 +11,11 @@ uses their own data models. Each system allows for submission of additional
 or edited records to a staging area where submissions can be subjected to
 verification and curation, before they are accepted.
-Metadata records from each system can be transformed to be compliant with a
-generic use case agnostic data model. This generic data model facilitates the
-integration of information across applications and workflows. Transformed
-metadata records are, again, submitted for curation and integration into
-a central knowledge base.
+Metadata records from each system can be losslessly transformed to be compliant
+with a generic use case agnostic data model. This generic data model
+facilitates the integration of information across applications and workflows.
+Transformed metadata records are, again, submitted for curation and integration
+into a central knowledge base.
 This central knowledge base can be queried to produce integrated reports.
 Knowledge base records can also be exported to the data models of individual
@@ -66,11 +66,26 @@ consuming metadata, curation workflow differ substantially. The following sectio
 collect some ideas and constraints to keep in mind when designing such workflows
 in this context.
+### Design schemas to reduce churn
+
+Data models should be designed to prefer linkage to broader, more slowly evolving,
+less context-constrained entities. For example, the relationship between a
+container-type entity and its parts should be implemented by a `part_of`
+relationship, rather than a list of `parts` in the container. This enables
+the addition of a new part via the creation of a single, additional record
+-- as opposed to having to create the new record, and then also having to update
+the part-list.
+
+This design choice does not limit the on-demand construction of part-lists
+for "runtime" representations of knowledge for query-focused applications.
+But it reduces the load on data curation workflows, by reducing the number of
+events that require knowledge merge operations, in favor of plain additions.
+
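A minimal sketch of the `part_of` design described in the added section may help here. It assumes a flat, hypothetical `Record` type with only a `pid` and an optional `part_of` field; the class, the helper `build_part_lists`, and the `ex:` identifiers are illustrative and not taken from the data model under discussion.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical flat record; field names are illustrative only."""
    pid: str
    part_of: Optional[str] = None  # PID of the containing entity, if any


def build_part_lists(records):
    """Derive container PID -> list of part PIDs on demand from `part_of` links."""
    parts = defaultdict(list)
    for rec in records:
        if rec.part_of is not None:
            parts[rec.part_of].append(rec.pid)
    return dict(parts)


# Existing knowledge base: one container and one part.
records = [
    Record(pid="ex:container"),
    Record(pid="ex:part-1", part_of="ex:container"),
]

# Adding a part is a plain addition; the container record is not touched.
records.append(Record(pid="ex:part-2", part_of="ex:container"))

# The part-list is a runtime view, not a curated field.
part_lists = build_part_lists(records)
print(part_lists)
```

With a `parts` list stored in the container instead, the same change would need the new record plus an edit to the container record, which is the merge operation the section aims to avoid.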
 ### PIDs also require curation
 Persistent identifiers (PID) play a key role in this metadata concept. Data
 models and vocabularies can change flexibly, but records still describe one and
-the same `Thing` when the PID identical.
+the same `Thing` when the PID is identical.
 Persistent identifiers allow referencing entities in contexts where not all
 information about an entity is available. One can reference a `Person` without