Serialized and structured format for metadata conforming to `dlco`-based schemas

jsheunis commented

2024-12-12 22:37:08 +00:00

(Migrated from hub.datalad.org)

Having a serialized and structured format for storing metadata compliant with dlco schemas will have important benefits:

- Serialized and structured metadata can be version controlled with git, in a structured way such that humans can also inspect and understand diffs
- A standard structure for a directory tree and file content enables easy scripting based on a known format, which simplifies the development of interoperable tools (e.g. querying, find/edit, validation pipelines, harmonization)
- A structure that specifies schemas and classes will simplify validation

The current idea is as follows:

- Store metadata in text files in a git repository with a deterministic directory structure
- The first-level (top-level) directory names are recognized schema names (and versions)
- The second-level directory names are specific classes defined by these schemas
- Inside the second-level directories, i.e. class directories, are multiple plain text files describing nodes that have the respective class as `rdf:type`
- A text file name is the ID of the specific node
- The text file contains the metadata related to the specific node

A demonstrative example:

.
├── sdd-UNRELEASED
│   ├── Publication
│   │   └── https://doi.org/10.1038/s41597-022-01163-2.txt

The format of the text file should also be structured. From an RDF viewpoint, the content would be all triples that have the relevant node ID as subject, meaning that the text file should contain a list of predicate-object pairs. E.g.

title: "FAIRly big: A framework for computationally reproducible processing of large-scale data"

The question is how exactly this content should be structured.

Some open questions to consider:

- What about blank nodes? Will we use blank node IDs (as provided by any given graph, or otherwise randomly assigned UUIDs) as the node ID?
- Which format should text files have? TSV / YAML? Should blank nodes within text files be resolved recursively?
- Should the schema version be encoded in the same top-level directory name (as shown in the example above), or should another directory level be introduced?
- The schema (and version) is relevant and known when linked metadata is validated with a known schema version, meaning that linked metadata can be converted to this serialized+structured format once it has been validated. This implies that validation is a required step before data can be converted to this structure. If no validation is performed, conversion from e.g. RDF to such a structured format will lose the top-level directory, while the rest of the structure would still be technically viable.

jsheunis commented

2024-12-20 13:11:12 +00:00

(Migrated from hub.datalad.org)

Quick notes from @mih:

> had the thought that we should not use the object IDs as filenames verbatim (will cause accessibility issues and encoding issues). I'd say it should be a hash of the ID that is used for the filename, plus a format extension.

> Schema versions must be encoded. I have no strong opinion on how. Leaning towards a dedicated directory level. But arguments against doing that are easy to find too.

mih commented

2024-12-20 13:16:39 +00:00

(Migrated from hub.datalad.org)

> What about blank nodes? Will we use blank node IDs (as provided by any given graph, or otherwise randomly assigned UUIDs) as the node ID?

There cannot be blank node IDs. We can only ever use persistent IDs.

This means we cannot represent stand-alone records of non-`Thing`s. In practice, this should not be a limitation. All such cases are either attribute specifications or association classes, which are both technical helpers that yield no meaningful standalone record.

> Which format should text files have? TSV / YAML?

From a format definition POV, I see no need to constrain the format with a requirement. I think it makes sense to have YAML be a recommendation for a default (because it is rather flexible and, importantly, can host comments). In general, multiple formats should be possible, identified by a proper file extension, and used as determined by conventions or IO-needs.

> Should blank nodes within text files be resolved recursively?

There will be no blank nodes.

> Should the schema version be encoded in the same top-level directory name (as shown in the example above), or should another directory level be introduced

For short-term projects, a dedicated version level will feel over-complex. However, for anything that lives long enough, multi-version will be a given. Also, having a dedicated version directory would make it easy to link to a dedicated repo with information in a specific version of a schema, without having to pile everything from the past into one and the same repo.

I'd go for a dedicated level.
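A minimal sketch of what a dedicated version level could look like when building a record path. The function name `record_path`, its parameters, and the choice of a plain MD5 hex digest as the file stem are assumptions for illustration only; nothing here is an agreed specification.

```python
import hashlib
from pathlib import PurePosixPath


def record_path(schema, version, class_name, node_id, ext="yaml"):
    """Return <schema>/<version>/<class>/<md5 of ID>.<ext> as a relative path.

    Hypothetical helper: the layout follows the proposal in this thread,
    the hashing detail is an assumption.
    """
    # Hash the ID so that URIs never appear verbatim in file names.
    digest = hashlib.md5(node_id.encode("utf-8")).hexdigest()
    return PurePosixPath(schema, version, class_name, f"{digest}.{ext}")


path = record_path(
    "sdd",
    "UNRELEASED",
    "Publication",
    "https://doi.org/10.1038/s41597-022-01163-2",
)
print(path.parts[:3])  # ('sdd', 'UNRELEASED', 'Publication')
```

With the version as its own path component, a single version of a schema can live in its own repository (or subtree) without touching records of other versions.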

mih commented

2024-12-20 13:20:20 +00:00

(Migrated from hub.datalad.org)

re item file names: I think we cannot use IDs literally. They are URIs, and could be prohibitively long and complex. I'd say we fix the max length to something reasonable and hash the trailing end with md5, and include the hash.

Anyone who has the ID can determine the filename from the ID using the defined algorithm. Anyone who does not have the ID needs to search anyway, and once found, the ID can be read from the file content.
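One possible reading of such an algorithm, as a sketch only. The name `id_to_filename`, the limit `MAX_HEAD`, and the character whitelist are assumptions. Note one deliberate deviation from the wording above: the MD5 digest here is computed over the full ID rather than only its trailing end, so that two IDs whose readable heads sanitize to the same string still get different file names.

```python
import hashlib
import re

MAX_HEAD = 48  # assumed limit for the human-readable part of the name


def id_to_filename(node_id, ext="yaml", max_head=MAX_HEAD):
    """Derive a bounded-length, filesystem-safe file name from a node ID."""
    # Replace every run of characters outside a conservative whitelist.
    head = re.sub(r"[^A-Za-z0-9._-]+", "_", node_id)[:max_head].strip("_.")
    # Digest of the complete ID (assumption, see lead-in).
    digest = hashlib.md5(node_id.encode("utf-8")).hexdigest()
    return f"{head}-{digest}.{ext}" if head else f"{digest}.{ext}"


name = id_to_filename("https://gepris.dfg.de/gepris/projekt/431549029")
print(name.startswith("https_gepris.dfg.de_gepris_projekt_431549029-"))  # True
```

The name is a pure function of the ID, so anyone holding the ID can recompute it; the full ID still has to be stored inside the file, because the name cannot be reversed.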

jsheunis commented

2024-12-20 14:26:16 +00:00

(Migrated from hub.datalad.org)

Thanks. Taking an example from https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/sdd/unreleased/examples/Resource-funding.yaml we have the original YAML data (note: I have changed the data slightly to use the more recent updates to Organization and I'm ignoring upcoming changes to the schemas):

id: exthisdsver:#
relation:
  - id: https://gepris.dfg.de/gepris/projekt/431549029
    schema_type: dlsdd:Grant
    name: SFB1451
    sponsor: https://ror.org/018mejw64
  - id: https://ror.org/018mejw64
    schema_type: dlprov:Organization
    name: Deutsche Forschungsgemeinschaft
qualified_relations:
  - object: https://gepris.dfg.de/gepris/projekt/431549029
    had_role:
      - schema:funding
was_attributed_to:
  - https://ror.org/018mejw64

When transforming this into our current structured metadata format we would get the following directory tree:

.
├─ distribution
│   └─ UNRELEASED
│      └─ Resource
│          └─ md5sum("exthisdsver:#").yaml
├─ prov
│   └─ UNRELEASED
│      └─ Organization
│          └─ md5sum("https://ror.org/018mejw64").yaml
└─ sdd
    └─ UNRELEASED
       └─ Grant
           └─ md5sum("https://gepris.dfg.de/gepris/projekt/431549029").yaml

and the respective file content will be:

md5sum("exthisdsver:#").yaml:

id: exthisdsver:#
relation:
  - id: https://gepris.dfg.de/gepris/projekt/431549029
  - id: https://ror.org/018mejw64
qualified_relations:
  - object: https://gepris.dfg.de/gepris/projekt/431549029
    had_role:
      - schema:funding
was_attributed_to:
  - https://ror.org/018mejw64

md5sum("https://ror.org/018mejw64").yaml:

id: https://ror.org/018mejw64
schema_type: dlprov:Organization
name: Deutsche Forschungsgemeinschaft

md5sum("https://gepris.dfg.de/gepris/projekt/431549029").yaml:

id: https://gepris.dfg.de/gepris/projekt/431549029
schema_type: dlsdd:Grant
name: SFB1451
sponsor: https://ror.org/018mejw64

I think CURIEs will have to be resolved to full URIs. If not, the format should support some way of defining prefixes, or it would always depend on the relevant schema definitions for these prefixes.
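The transformation above could be sketched roughly as follows. This is an illustration under several assumptions: the helper names `record_path` and `split_record`, the prefix-to-directory table `SCHEMA_DIRS`, and the `dldist` prefix for the top-level `Resource` (the example record carries no `schema_type` of its own) are all hypothetical. Only one level of nesting under `relation` is handled.

```python
import hashlib
from pathlib import PurePosixPath

# Assumed mapping from schema_type prefix to top-level directory name.
SCHEMA_DIRS = {"dldist": "distribution", "dlprov": "prov", "dlsdd": "sdd"}


def record_path(node_id, schema_type, version="UNRELEASED"):
    """Build <schema>/<version>/<class>/<md5 of ID>.yaml for one record."""
    prefix, class_name = schema_type.split(":", 1)
    digest = hashlib.md5(node_id.encode("utf-8")).hexdigest()
    return PurePosixPath(SCHEMA_DIRS[prefix], version, class_name, f"{digest}.yaml")


def split_record(record, root_type):
    """Move inlined `relation` records into their own files.

    Returns a dict mapping relative path -> record; the root record keeps
    only `id` stubs for its relations.
    """
    files = {}
    stubs = []
    for related in record.get("relation", []):
        if "schema_type" in related:
            files[record_path(related["id"], related["schema_type"])] = dict(related)
        stubs.append({"id": related["id"]})
    root = dict(record)
    if stubs:
        root["relation"] = stubs
    files[record_path(record["id"], root_type)] = root
    return files


record = {
    "id": "exthisdsver:#",
    "relation": [
        {
            "id": "https://gepris.dfg.de/gepris/projekt/431549029",
            "schema_type": "dlsdd:Grant",
            "name": "SFB1451",
            "sponsor": "https://ror.org/018mejw64",
        },
        {
            "id": "https://ror.org/018mejw64",
            "schema_type": "dlprov:Organization",
            "name": "Deutsche Forschungsgemeinschaft",
        },
    ],
    "qualified_relations": [
        {
            "object": "https://gepris.dfg.de/gepris/projekt/431549029",
            "had_role": ["schema:funding"],
        }
    ],
    "was_attributed_to": ["https://ror.org/018mejw64"],
}

files = split_record(record, root_type="dldist:Resource")
for path in sorted(files):
    print(path.parent)
```

Serializing each value of `files` to YAML and writing it to its path would produce the tree shown above.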

mih commented

2025-01-11 08:20:15 +00:00

(Migrated from hub.datalad.org)

Thanks, this looks correct.

I am not convinced that we need to resolve CURIEs to full URIs. I think any prefix that could be used in data records also needs to be defined in the schema for the record to be valid. If the prefix must be defined in the schema, it is guaranteed to be available for a specific schema-version, and all records are stored underneath schema-version specific locations in the directory structure.

I am reluctant to accept the need for any semantic processing, because it will be extremely slow in comparison to just slurping in the pure data structures. And it will also depend on external infrastructure to work -- which is a massive dependency from a longevity perspective.
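To illustrate the point, a sketch of how a consumer could expand CURIEs on demand from a prefix table shipped with the schema version, using plain string handling and no network access. The function `expand_curie` and the table `PREFIXES` are hypothetical; in practice the table would be read from the schema definition for the given version.

```python
# Assumed prefix table, as it might be extracted once from a schema version.
PREFIXES = {"schema": "http://schema.org/"}


def expand_curie(value, prefixes):
    """Expand `prefix:local` using a local table; leave other values untouched."""
    prefix, sep, local = value.partition(":")
    # Full URLs such as https://... have no entry in the table and pass through.
    if sep and prefix in prefixes and not local.startswith("//"):
        return prefixes[prefix] + local
    return value


print(expand_curie("schema:funding", PREFIXES))  # http://schema.org/funding
print(expand_curie("https://ror.org/018mejw64", PREFIXES))  # unchanged
```

Records can still be loaded as plain data structures; expansion is only applied where a full URI is actually needed.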

Serialized and structured format for metadata conforming to dlco-based schemas #10
