Serialized and structured format for metadata conforming to dlco-based schemas #10
orinoco/tools#10
Having a serialized and structured format for storing metadata compliant with
dlco schemas will have important benefits:
The current idea is as follows:
rdf:type
A demonstrative example:
The format of the text file should also be structured. From an RDF viewpoint, the content would be all triples that have the relevant node ID as subject, meaning that the text file should contain a list of predicate-object pairs. E.g.
The question is how exactly this content should be structured.
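To make the RDF viewpoint concrete, here is a minimal sketch of reducing a set of triples to the predicate-object pairs of one subject. The triples, the `ex:` prefix, and the function name `predicate_object_pairs` are made up for illustration; nothing here is a proposal for the actual on-disk structure.

```python
from collections import defaultdict


def predicate_object_pairs(triples, subject):
    """Collect predicate -> list of objects for all triples with the given subject."""
    pairs = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            pairs[p].append(o)
    return dict(pairs)


# Hypothetical triples, invented for this example only.
triples = [
    ("https://example.org/ns/item1", "rdf:type", "ex:Thing"),
    ("https://example.org/ns/item1", "ex:name", "Item one"),
    ("https://example.org/ns/item2", "rdf:type", "ex:Thing"),
]

# The content of the text file for item1 would be derived from this mapping.
record = predicate_object_pairs(triples, "https://example.org/ns/item1")
```

Because the subject is implied by the file itself, only the predicates and objects need to be stored in it.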
Some open questions to consider:
Quick notes from @mih:
There cannot be blank node IDs; we can only ever use persistent IDs.
This means we cannot represent stand-alone records of non-Things. In practice, this should not be a limitation. All such cases are either attribute specifications or association classes, which are both technical helpers that yield no meaningful standalone record.
From a format definition POV, I see no need to constrain the format with a requirement. I think it makes sense to have YAML as the recommended default (because it is rather flexible and, importantly, can host comments). In general, multiple formats should be possible, identified by a proper file extension, and used as determined by conventions or IO needs.
There will be no blank nodes.
For short-term projects, a dedicated version level will feel over-complex. However, for anything that lives long enough, multiple versions will be a given. Also, having a dedicated version directory would make it easy to link to a dedicated repo with information in a specific version of a schema, without having to pile everything from the past into one and the same repo.
I'd go for a dedicated level.
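A rough sketch of what a dedicated version level could look like when building record locations. The layout `root/schema/version/filename`, the function name `record_path`, and the root directory name `metadata` are assumptions for illustration; the schema and version names are borrowed from the example URL later in this thread.

```python
from pathlib import PurePosixPath


def record_path(root, schema, version, filename):
    """Build a record location with a dedicated schema-version directory level."""
    return PurePosixPath(root) / schema / version / filename


# Hypothetical root and file name; "sdd" and "unreleased" mirror the example URL.
path = record_path("metadata", "sdd", "unreleased", "record.yaml")
```

With such a layout, a single version directory could be its own repository and be linked in without pulling in records for older schema versions.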
Re item file names: I think we cannot use IDs literally. They are URIs, and could be prohibitively long and complex. I'd say we fix the max length to something reasonable and hash the trailing end with md5, and include the hash.
Anyone who has the ID can determine the filename from the ID using the defined algorithm. Anyone who does not have the ID needs to search anyway, and once found, the ID can be read from the file content.
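One possible reading of that algorithm, as a sketch only: keep a readable head of the ID up to a fixed length, hash whatever follows with md5, and append the hash. The limit of 40 characters, the percent-encoding of the head (to keep characters like `/` and `:` out of the file name), the `-` separator, and the name `filename_for_id` are all assumptions, not something decided in this thread.

```python
import hashlib
from urllib.parse import quote

MAX_HEAD_LEN = 40  # assumed limit; the discussion does not fix a number


def filename_for_id(identifier, max_head_len=MAX_HEAD_LEN, ext=".yaml"):
    """Derive a deterministic file name from a (possibly long) URI identifier."""
    head = identifier[:max_head_len]
    tail = identifier[max_head_len:]
    # Percent-encode the readable head so it contains no path separators.
    safe_head = quote(head, safe="")
    # Hash the trailing end (possibly empty) and include the hex digest.
    digest = hashlib.md5(tail.encode("utf-8")).hexdigest()
    return f"{safe_head}-{digest}{ext}"


name = filename_for_id("https://gepris.dfg.de/gepris/projekt/431549029")
```

The mapping only goes one way: the file name can be computed from the ID, but the full ID has to be read from the file content. Hashing the entire ID instead, as in the `md5sum(...)` names used later in this thread, would be a simpler variant with no readable part.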
Thanks. Taking an example from https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/sdd/unreleased/examples/Resource-funding.yaml we have the original YAML data (note: I have changed the data slightly to use the more recent updates to Organization and I'm ignoring upcoming changes to the schemas):
When transforming this into our current structured metadata format we would get the following directory tree:
and the respective file content will be:
md5sum("exthisdsver:#").yaml:
md5sum("https://ror.org/018mejw64").yaml:
md5sum("https://gepris.dfg.de/gepris/projekt/431549029").yaml:
I think CURIEs will have to be resolved to full URIs. If not, the format should support some way of defining prefixes, or it would always depend on the relevant schema definitions for these prefixes.
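For reference, resolving a CURIE against a prefix map is a plain string operation, as in the sketch below. The function name `expand_curie` and the namespace URI assigned to `exthisdsver` are invented for the example; the real prefix definitions would come from the schema or from the format itself, which is exactly the open point.

```python
def expand_curie(value, prefixes):
    """Expand 'prefix:local' to a full URI if the prefix is known; else return as-is."""
    prefix, sep, local = value.partition(":")
    # Values like 'https://...' are already full URIs and are left untouched.
    if sep and prefix in prefixes and not local.startswith("//"):
        return prefixes[prefix] + local
    return value


# Hypothetical prefix map; the namespace URI is made up for illustration.
prefixes = {"exthisdsver": "https://example.org/dataset/v1"}

expanded = expand_curie("exthisdsver:#", prefixes)
unchanged = expand_curie("https://ror.org/018mejw64", prefixes)
```

This needs no RDF tooling or network access, only a prefix map that is available wherever the records are read.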
Thanks, this looks correct.
I am not convinced that we need to resolve URIs. I think any prefix that could be used in data records also needs to be defined in the schema for the record to be valid. If the prefix must be defined in the schema, it is guaranteed to be available for a specific schema-version, and all records are stored underneath schema-version specific locations in the directory structure.
I am reluctant to accept the need for any semantic processing, because it will be extremely slow in comparison to just slurping in the pure data structures. And it will also depend on external infrastructure to work -- which is a massive dependency from a longevity perspective.