Thoughts/ideas on the general "datalink" application landscape #8

Open
opened 2024-10-16 08:27:54 +00:00 by jsheunis · 0 comments
jsheunis commented 2024-10-16 08:27:54 +00:00 (Migrated from hub.datalad.org)

@mih these are some main points I took from our discussion recently, feel free to add/edit.

What do we want to achieve?

We have a few specific use cases that can all be generalized in order to develop reusable tools.

1. An electronic case report tool

Generally: https://en.wikipedia.org/wiki/Case_report_form

The idea is that a group wants to plan a study to collect data, typically (but not exclusively) from participants. They will have an idea of what the data would look like. They would get together and create a plan along the lines of:

  • The data will be collected from a collection of participants / sources having these characteristics
  • The data will be collected in the form of measurements / samples / observations using these methods or tools
  • The data will be collected over the course of these sessions and at these sites
  • The data will be produced in these formats

and so forth...

What they are basically doing is putting together a semantic structure for what data points in their eventual dataset(s) would look like, and according to which data should eventually be collected, entered, and validated. In effect, they are defining a schema for data entry and validation. Part of the electronic case reporting process is then to actually collect (and eventually validate) the data during the study, using such a tool.
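
To make this concrete, here is a minimal sketch of such a data-entry schema and a validation pass over a collected record. It assumes LinkML's Python validator; the CaseReport class and all of its fields are hypothetical:

```python
# Minimal sketch: define a data-entry schema, then validate a collected record.
# Assumes the `linkml` Python package; the CaseReport class is hypothetical.
from pathlib import Path
from linkml.validator import validate

SCHEMA = """
id: https://example.org/case-report
name: case-report
prefixes:
  linkml: https://w3id.org/linkml/
  ex: https://example.org/
default_prefix: ex
imports:
  - linkml:types
default_range: string

classes:
  CaseReport:
    attributes:
      participant_id:
        required: true
      age:
        range: integer
        minimum_value: 0
      diagnosis:
        multivalued: true
"""
Path("case-report.yaml").write_text(SCHEMA)

# A record as it might be entered during the study; the age is invalid
record = {"participant_id": "sub-01", "age": -3, "diagnosis": ["PD"]}

report = validate(record, "case-report.yaml", "CaseReport")
for result in report.results:
    print(result.message)  # reports the out-of-range age
```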

2. A data annotation tool/process

Similar to the data entry step in the electronic case reporting use case, here a group of people with similar research interests might want to describe their existing (or planned/evolving) datasets in a way that makes them findable in a common catalog, or annotate them with metadata fields that are required for archiving purposes at a particular research site. A common example is the metadata that someone is required to enter if they want to upload their dataset to some repository, such as OpenNeuro (https://openneuro.org/), a Dataverse (https://dataverse.org/) instance, or the SFB1451 catalog (https://data.sfb1451.de/) - think authors, data controllers, keywords, date created, linked resources, and the like. These attributes/annotations could of course be vastly different, semantically and structurally, depending on the use case.

Here, again, a planning step would involve putting together a semantic structure for what the annotations should look like. A form made available for data creators/maintainers to annotate their datasets should follow this structure and ideally validate the data entries. In addition, the outcome should be that the datasets and their content are linked to the new annotations, which would enable a machine-actionable process of finding actual data files by "searching" for annotations.
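
As a minimal sketch of that last point, annotations can be expressed as RDF triples that link a dataset and its files to the entered metadata (assuming rdflib; all URIs are hypothetical placeholders):

```python
# Minimal sketch: link a dataset and one of its files to annotations as RDF
# triples. Assumes rdflib; all URIs are hypothetical placeholders.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
ds = URIRef("https://example.org/dataset/1")

# Annotate the dataset itself ...
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.creator, Literal("Jane Doe")))
g.add((ds, DCAT.keyword, Literal("parkinson")))

# ... and link a contained file into the same annotation graph, so that a
# search over annotations can lead to actual data files
f = URIRef("https://example.org/dataset/1/sub-01/anat.nii.gz")
g.add((f, RDF.type, DCAT.Distribution))
g.add((ds, DCAT.distribution, f))

print(g.serialize(format="turtle"))
```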

3. A way to represent and find annotated (DataLad) datasets in a catalog

Discoverability is the core principle here, with accessibility as the eventual goal. People want or need a way to advertise their data, either privately to a contained group, or publicly (even when the actual data content should be kept private for data privacy/sensitivity/security reasons).

Users should be able to employ searching/filtering functionality in order to find relevant (parts of) datasets, e.g. show me all the datasets that have Parkinson's patients as participants and where the participants are older than 60 years, or show me all the samples of some protein that were extracted from a particular type of tree bark in a particular region during the year 2021.
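
The first of those example queries could be expressed as SPARQL over the annotation graph. A minimal sketch, assuming rdflib and a hypothetical ex: vocabulary with made-up participant data:

```python
# Minimal sketch: a cohort query over annotated participants.
# Assumes rdflib; the ex: vocabulary and all data are made up.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("https://example.org/vocab/")
g = Graph()

# Two annotated participants in two hypothetical datasets
for ds, sub, age in [("ds1", "p1", 72), ("ds2", "p2", 48)]:
    s = URIRef(f"https://example.org/{ds}/{sub}")
    g.add((s, RDF.type, EX.Participant))
    g.add((s, EX.partOf, URIRef(f"https://example.org/{ds}")))
    g.add((s, EX.diagnosis, Literal("Parkinson's")))
    g.add((s, EX.age, Literal(age, datatype=XSD.integer)))

# "show me all datasets with Parkinson's participants older than 60"
q = """
SELECT DISTINCT ?dataset WHERE {
    ?p a ex:Participant ;
       ex:partOf ?dataset ;
       ex:diagnosis "Parkinson's" ;
       ex:age ?age .
    FILTER(?age > 60)
}
"""
for row in g.query(q, initNs={"ex": EX}):
    print(row.dataset)  # only ds1 matches
```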

Once datasets are found, they should be presented in an intuitive way, and this should be the same information that a dataset would have been annotated with in order to form part of the catalog, and it might also even present the data files and even their content.

4. DataLad-based access to a non-DataLad dataset / store / portal

The goal here is to provide a simple means for e.g. a portal operator (such as the catalog described above) to expose essential metadata for automated, on-demand DataLad dataset generation, requiring little or no dedicated implementation for data access via DataLad. If such a service does not have to run DataLad/git, if it can make metadata available via standard access methods (e.g. http/s), and if the means exist to generate a DataLad dataset from a required set of metadata descriptors, it would mean (a sketch follows the list below):

  • lowering the threshold for adoption of DataLad for data/file retrieval
  • establishing compatibility between DataLad-based and non-DataLad-based systems without having to adopt a DataLad software stack
  • putting together custom datasets from different sources (think: run a query in a catalog to find data samples that satisfy specific criteria, aka find a "cohort", and generate a fully functional DataLad dataset from the results)
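
A minimal sketch of that last point, assuming the datalad Python API (create and addurls) and a hypothetical metadata table as it might be returned by a portal query; all URLs are placeholders:

```python
# Minimal sketch: generate a DataLad dataset from plain metadata descriptors.
# Assumes the `datalad` Python API; the file table and URLs are hypothetical.
import csv
import datalad.api as dl

# Metadata descriptors obtained from a portal, e.g. via a plain https request
records = [
    {"filename": "sub-01/anat.nii.gz",
     "url": "https://portal.example.org/files/abc123"},
    {"filename": "sub-02/anat.nii.gz",
     "url": "https://portal.example.org/files/def456"},
]
with open("cohort.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["filename", "url"])
    writer.writeheader()
    writer.writerows(records)

# Register the files by URL in a fresh dataset; content is not downloaded
# here, but becomes retrievable on demand with `datalad get`
ds = dl.create("cohort-dataset")
dl.addurls(dataset=ds, urlfile="cohort.csv",
           urlformat="{url}", filenameformat="{filename}")
```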

What underlies all of the above?

These use cases, though different in their purposes and application, are actually all part of the same problem space. By analysing them together, we can identify a few core aspects that underlie all of them, or from which all of them will benefit.

A) Schema/ontology development

Defining a semantic structure for the form that data can take lies at the core of each of these use cases. When you investigate each of them, it becomes evident that we need a schema for:

  • defining the structure of the data that will be collected in a study, using an electronic case reporting tool
  • validating data collected during the study using an electronic case reporting tool
  • defining the structure of the metadata that will be used to annotate datasets going into a catalog
  • validating the metadata collected for the purpose of representing a dataset in a catalog
  • defining the essential concepts required in a set of metadata in order to generate a functional DataLad dataset from it

These purposes can be summarized as: modeling and validation

Existing work

We have already covered ground in this domain, primarily developing the DataLad-Concepts-Ontology-based schemas (https://github.com/psychoinformatics-de/datalad-concepts) such as Thing, Entity, Agent, Activity, Distribution, etc. These were heavily influenced by DCATv3 (https://www.w3.org/TR/vocab-dcat-3/) and were created in YAML format using the linked data modeling tool (a.k.a. schema authoring tool) LinkML (https://linkml.io/).

The benefits of these schemas, and LinkML in particular, are:

  • the structured schema concepts (aka classes) provide the building blocks from which to build practical schemas to send out into the wild (e.g. a study with participants and collected measures and generated files)
  • the structured schema concepts (aka classes) provide the building blocks from which to describe the essential components of a DataLad dataset (e.g. commits and provenance related to file distributions, and data access specifications)
  • LinkML enables data validation according to a given schema
  • LinkML ensures access to the broader world of linked data and RDF by providing generators that translate schemas into various formats (e.g. SHACL, JSON Schema, OWL) and by providing translation between data formats (TTL, YAML, JSON-LD, etc.); see the sketch after this list
  • The expression of schemas and data as linked metadata gives us the framework for linking seemingly separate but related concepts through semantic relationships.
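
As an example of the generator aspect, the same schema can be turned into SHACL shapes or JSON Schema. A minimal sketch, reusing the hypothetical case-report.yaml schema from the earlier example:

```python
# Minimal sketch: derive SHACL and JSON Schema artifacts from one schema.
# Assumes the `linkml` generator classes and the hypothetical case-report.yaml.
from linkml.generators.jsonschemagen import JsonSchemaGenerator
from linkml.generators.shaclgen import ShaclGenerator

# JSON Schema, e.g. for validating form input
print(JsonSchemaGenerator("case-report.yaml").serialize())

# SHACL shapes, e.g. for validating RDF graphs built from the same schema
print(ShaclGenerator("case-report.yaml").serialize())
```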

B) Automatic user interface generation

A second, and essential, use of schemas in these use cases is user interface generation. In principle, if we have a machine-actionable definition of the structure of a dataset or some related concepts, i.e. a schema, we can use this to generate a user interface, or more specifically: editors and viewers. This means we can:

  • not only define what the collected data in a case reporting process should look like, but also use this definition to generate a form to actually collect the study data
  • not only define what the annotations for inclusion of a dataset in a catalog should look like, but also use this definition to generate a form for users to annotate and submit their datasets to the catalog
  • enable real-time validation of data entered into the automatically generated forms: if we know that a particular field is of type String, shouldn't be longer than 40 characters, and can have multiple values, i.e. form a list, then we can automatically generate input-specific validation rules that are assessed as the values are entered, preventing faulty or missing data (see the sketch after this list)
  • build automatic viewers/renderers based on the semantic structure of specific concepts/items: if we know that a particular metadata item is of type Dataset, we can build a viewer that displays a dataset and all of its expected fields in a deterministic way, as defined in the schema
  • build query interfaces that determine their query variables automatically from the structure of the linked data they should search
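
To sketch what such autogeneration could look like, form-field specifications can be derived directly from a schema with linkml-runtime's SchemaView; this reuses the hypothetical case-report.yaml schema from earlier, and the widget names are purely illustrative:

```python
# Minimal sketch: derive editor-field specs from a LinkML schema.
# Assumes linkml-runtime and the hypothetical case-report.yaml from above.
from linkml_runtime import SchemaView

sv = SchemaView("case-report.yaml")

def form_spec(class_name: str) -> list[dict]:
    """Turn the induced slots of a class into form-field descriptions."""
    fields = []
    for slot in sv.class_induced_slots(class_name):
        fields.append({
            "name": slot.name,
            "widget": "list-input" if slot.multivalued else "text-input",
            "required": bool(slot.required),
            "datatype": slot.range or sv.schema.default_range,
        })
    return fields

for field in form_spec("CaseReport"):
    print(field)  # e.g. {'name': 'age', 'widget': 'text-input', ...}
```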

Linking this back to LinkML and semantic+linked metadata:

  • semantically defined and widely agreed upon concepts (such as a schema:Person or dcat:Distribution) can receive generic viewer or editor components that can be reused across tools / applications / domains
  • if the schema language used to author a schema (such as LinkML YAML) is itself subject to its own schema constraints (which is the case for LinkML), such autogeneration tools can be used to create a generic schema authoring tool (aka "schema-ception")

Existing work

We have already done work in this domain of automatic user interface generation. The first iteration was datalad-catalog (https://github.com/datalad/datalad-catalog), which included automatic generation of a data catalog and its entries from a schema and metadata, although it was missing the linked data building block. The current iteration is shacl-vue (https://github.com/psychoinformatics-de/shacl-vue).

For future development, the particular tech stack (such as VueJS and RDF-Ext used in shacl-vue) is not necessarily the most important consideration, and can actually be a deterrent to progress when considering the bus factor. What is important at this stage is to continue developing in order to identify underlying functional components of such autogeneration tools, which can then be replicated and improved upon in future iterations irrespective of the chosen tech stack.

C) A serialized and portable format for (DataLad) datasets and their annotations

Expressing datasets and their relations as semantic and linked metadata in a simple format (think RDF triples in a plain text file) is something that follows naturally from the implementation and use of the tools described above. Such a process, and the resulting serialized metadata, have several benefits:

  • portability: a complete set of descriptions of a dataset and its content can be serialized and transported easily and separately from the data or content itself, in a plain and simple format that doesn't require proprietary readers or complex services to run on special infrastructure (DataLad doesn't have to be installed on the server, a graph store doesn't have to run wherever a metadata record is kept)
  • machine-actionability: basically, we can generate a DataLad dataset from the metadata, we can render a catalog record of the metadata, ...
  • interoperability: RDF is ubiquitous on the world wide web. By deciding to learn and "speak" it, we make DataLad datasets, and any other concepts defined and annotated by our tools, compatible with it, meaning a wide array of existing tools becomes available (see the sketch after this list)
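
A minimal sketch of such a portable record, assuming rdflib: a metadata record kept as plain Turtle text can be read and re-serialized without DataLad or a graph store anywhere in the loop:

```python
# Minimal sketch: a metadata record as a plain text file, re-serialized for
# other tools. Assumes rdflib; the record content is hypothetical.
from rdflib import Graph

ttl = """
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<https://example.org/dataset/1>
    a dcat:Dataset ;
    dcterms:title "Hypothetical tree-bark protein samples" .
"""

g = Graph().parse(data=ttl, format="turtle")

# Same record, different serialization: hand it to JSON-LD-speaking tools
print(g.serialize(format="json-ld"))
```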

Existing work

  • Initial work on generating metadata from a DataLad dataset was done in the datalad-metalad extension (https://github.com/datalad/datalad-metalad), and several extractors have been added to that repertoire (metalad_core for datasets and files, bids_dataset, datacite_gin, ...). Extractors take a structured dataset with specific attributes as input and output a JSON-serialized metadata record. For interoperability with datalad-catalog and its schema, appropriate translators were also developed.
  • The ebrains extension (https://github.com/datalad/datalad-ebrains) implements a conversion of a standardized metadata record, generated via a graph query, into an operational DataLad dataset.
  • datalad-tabby (https://github.com/psychoinformatics-de/datalad-tabby)

So what should our practical focus be?

Concisely:

  1. continued development of datalad-concepts to improve all building blocks (i.e. base and derived classes) for use in authoring custom schemas
  2. an intuitive and user-friendly schema authoring tool that can make use of these building blocks
  3. tool(s) for automatic generation of forms (editors) and catalogs (viewers) from schemas; currently => shacl-vue
    • for forms: the ability to "upload/import" a set of existing semantic metadata, add to and edit it (i.e. extend the graph), and then "download" the resulting metadata set
    • the ability to export a graph of linked metadata to various serialized formats
    • for catalogs: a scalable and automatic query interface
  4. tools for validating a set of linked metadata against a schema
  5. tools for serialization and de-serialization of DataLad datasets using the datalad-concepts building blocks; and more generally: exporting said serialized and linked/semantic metadata to common data standards (e.g. datacite, bdbag/bagit, rocrate) or importing from them, i.e. translation