Thoughts/ideas on the general "datalink" application landscape #8
@mih these are some main points I took from our discussion recently, feel free to add/edit.
What do we want to achieve?
We have a few specific use cases that can all be generalized in order to develop reusable tools.
1. An electronic case report tool
Generally: https://en.wikipedia.org/wiki/Case_report_form
The idea is that a group wants to plan a study to collect data, typically (but not exclusively) from participants. They will have an idea about what the data would look like. They would get together and create a plan, along the lines of:
and so forth...
What they are basically doing is putting together a semantic structure for what the data points in their eventual dataset(s) would look like, and according to which data should eventually be collected, entered, and validated. In other words, they are defining a schema for data entry and validation. Part of the electronic case reporting process is then to actually collect (and eventually validate) the data during the study, using such a tool.
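A minimal sketch of that idea, assuming nothing about the eventual tooling: the plan above can be captured as a field-by-field schema, and the same structure can drive validation at data-entry time (field names and rules here are invented for illustration).

```python
# Illustrative, tool-agnostic sketch of a case-report-form schema and its use for validation.
CRF_SCHEMA = {
    "participant_id": {"type": str, "required": True},
    "age":            {"type": int, "required": True, "minimum": 0, "maximum": 120},
    "diagnosis":      {"type": str, "required": False},
}

def validate_record(record: dict, schema: dict = CRF_SCHEMA) -> list[str]:
    """Return a list of human-readable problems; an empty list means the entry is valid."""
    problems = []
    for field, rules in schema.items():
        if field not in record:
            if rules.get("required"):
                problems.append(f"missing required field: {field}")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            problems.append(f"{field}: expected {rules['type'].__name__}")
        elif isinstance(value, int):
            if "minimum" in rules and value < rules["minimum"]:
                problems.append(f"{field}: below minimum {rules['minimum']}")
            if "maximum" in rules and value > rules["maximum"]:
                problems.append(f"{field}: above maximum {rules['maximum']}")
    return problems

print(validate_record({"participant_id": "sub-01", "age": 67}))  # -> []
print(validate_record({"age": "sixty-seven"}))                   # -> two problems
```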
2. A data annotation tool/process
Similar to the data entry step in the electronic case reporting use case, here a group of people with similar research interests might want to describe their existing (or planned/evolving) datasets in a way that makes them findable in a common catalog, or annotate them with metadata fields that are required for archiving purposes at a particular research site. A common example is the metadata that someone is required to enter if they want to upload their dataset to some repository, such as OpenNeuro or a Dataverse instance, or the SFB1451 catalog - think authors, data controllers, keywords, date created, linked resources, and the like. These attributes/annotations could of course be vastly different, semantically and structurally, depending on the use case.
Here, again, a planning step would involve putting together a semantic structure for what the annotations should look like, and a form that is made available for data creators/maintainers to annotate their datasets should follow this structure and ideally validate the data entries. In addition, the outcome should be that the datasets and their content are linked to the new annotations, which would enable a machine-actionable process of finding actual data files by "searching" for annotations.
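A hedged sketch of what that outcome could look like as linked metadata, using rdflib and DCAT terms (the identifiers and values are placeholders, not a prescribed vocabulary):

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

# Hypothetical identifiers; in practice these would come from the annotation form.
dataset = URIRef("https://example.org/datasets/study-01")
distribution = URIRef("https://example.org/datasets/study-01/files/participants.tsv")

g = Graph()
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Study 01 raw data")))
g.add((dataset, DCTERMS.creator, Literal("Jane Doe")))
g.add((dataset, DCAT.keyword, Literal("Parkinson's disease")))
# Link the dataset to one of its files, so the annotations become machine-actionable.
g.add((dataset, DCAT.distribution, distribution))
g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCAT.downloadURL, URIRef("https://example.org/dl/participants.tsv")))

print(g.serialize(format="turtle"))
```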
3. A way to represent and find annotated (DataLad) datasets in a catalog
Discoverability is the core principle here, with accessibility as the eventual goal. People want or need a way to advertise their data, either privately to a contained group, or publicly (even when the actual data content should be kept private for data privacy/sensitivity/security reasons).
Users should be able to employ searching/filtering functionality in order to find relevant (parts of) datasets, e.g. show me all the datasets that have Parkinson's patients as participants and where the participants are older than 60 years, or show me all the samples of some protein that were extracted from a particular type of tree bark in a particular region during the year 2021.
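As a rough illustration of that kind of query, assuming a toy vocabulary (the `ex:` terms and identifiers are invented), SPARQL over an annotation graph can answer the first question directly:

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical study vocabulary; a real catalog would use an agreed-upon schema.
EX = Namespace("https://example.org/terms/")

g = Graph()
for ds_id, diagnosis, age in [("ds-1", "Parkinson's", 72), ("ds-2", "Parkinson's", 45), ("ds-3", "control", 68)]:
    ds = URIRef(f"https://example.org/datasets/{ds_id}")
    participant = URIRef(f"https://example.org/datasets/{ds_id}/participant-1")
    g.add((ds, EX.hasParticipant, participant))
    g.add((participant, EX.diagnosis, Literal(diagnosis)))
    g.add((participant, EX.age, Literal(age)))

# "Show me all the datasets that have Parkinson's patients older than 60."
query = """
PREFIX ex: <https://example.org/terms/>
SELECT DISTINCT ?dataset WHERE {
  ?dataset ex:hasParticipant ?p .
  ?p ex:diagnosis "Parkinson's" ;
     ex:age ?age .
  FILTER(?age > 60)
}
"""
for row in g.query(query):
    print(row.dataset)  # only ds-1 matches
```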
Once datasets are found, they should be presented in an intuitive way. The information presented should be the same information that a dataset was annotated with in order to form part of the catalog, and it might even extend to the data files and their content.
4. DataLad-based access to a non-DataLad dataset / store / portal
The goal here is to provide a simple means for e.g. a portal operator (such as the catalog described above) to expose essential metadata for automated, on-demand DataLad dataset generation that requires no or minimal dedicated implementation for data access via DataLad. If such a service does not have to run DataLad/git, if it can make metadata available via standard access methods (e.g. http/s), and if the means exist to generate a DataLad dataset from a required set of metadata descriptors, it would mean that such a portal can offer DataLad-based access to its data with little to no dedicated implementation effort on its side.
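A hedged sketch of how such on-demand generation could work, assuming a hypothetical portal endpoint that serves a plain JSON metadata record over HTTP; the URL, field names, and the final `datalad addurls` invocation are illustrative, not a prescribed interface.

```python
import csv
import json
import urllib.request

# Hypothetical endpoint serving a plain JSON metadata record for one dataset;
# the record is assumed to list files with a download URL and a relative path.
RECORD_URL = "https://example.org/portal/datasets/1234/metadata.json"

with urllib.request.urlopen(RECORD_URL) as response:
    record = json.load(response)

# Write a table that a generic tool (e.g. `datalad addurls`) could consume to
# materialize a DataLad dataset on demand -- no DataLad/git needed on the portal side.
with open("files.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "path"])
    writer.writeheader()
    for item in record.get("files", []):
        writer.writerow({"url": item["url"], "path": item["path"]})

# The dataset could then be generated locally with something like:
#   datalad addurls -d my-dataset files.csv '{url}' '{path}'
```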
What underlies all of the above?
These use cases, though different in their purposes and application, are actually all part of the same problem space. By analysing them together, we can identify a few core aspects that underlie all of them, or from which all of them will benefit.
A) Schema/ontology development
Defining a semantic structure for the form that data can take lies at the core of each of these use cases. When you investigate each of them, it becomes evident that we need a schema for:
These purposes can be summarized as: modeling and validation
Existing work
We have already covered ground in this domain, primarily by developing the DataLad-Concepts-Ontology-based schemas such as `Thing`, `Entity`, `Agent`, `Activity`, `Distribution`, etc. These were heavily influenced by DCATv3 and were created in YAML format using the linked data modeling tool (a.k.a. schema authoring tool) LinkML. The benefits of these schemas, and of LinkML in particular, are:
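One practical consequence worth noting: LinkML schemas can be compiled into other artifacts such as JSON Schema, which downstream tools can use directly for validation. A minimal sketch with an inline stand-in schema (the class and slots are invented for illustration):

```python
import jsonschema  # pip install jsonschema

# Stand-in for what a LinkML schema could be compiled into (e.g. via LinkML's
# JSON Schema generator); the fields below are invented for illustration.
DATASET_SCHEMA = {
    "type": "object",
    "required": ["title", "license"],
    "properties": {
        "title": {"type": "string", "maxLength": 120},
        "license": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
    },
}

record = {"title": "Study 01 raw data", "license": "CC-BY-4.0", "keywords": ["PD"]}
jsonschema.validate(instance=record, schema=DATASET_SCHEMA)  # raises on invalid input
print("record conforms to the schema")
```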
B) Automatic user interface generation
A second, and essential, use of schemas in these use cases is user interface generation. In principle, if we have a machine-actionable definition of the structure of a dataset or some related concepts, i.e. a schema, we can use this to generate a user interface, or more specifically: editors and viewers. This means we can:
- e.g. if a field is defined as a `String` that shouldn't be longer than 40 characters and can have multiple values, i.e. form a list, then we can automatically generate input-specific validation rules that are assessed as these values are entered (which prevents faulty or missing data)
- e.g. if a class is defined as a `Dataset`, we can build a viewer that displays a dataset and all of its expected fields in a deterministic way, as defined in the schema

Linking this back to LinkML and semantic+linked metadata: concepts with common semantics (e.g. `schema:Person` or `dcat:Distribution`) can receive generic viewer or editor components that can be reused across tools / applications / domains.
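A deliberately small sketch of the "editors and viewers from a schema" idea; the schema dict and widget names are invented, and real tools such as shacl-vue derive this information from SHACL/LinkML artifacts instead:

```python
# Invented, minimal field definitions standing in for a real schema.
DATASET_CLASS = {
    "title":    {"range": "string", "max_length": 40, "multivalued": False, "required": True},
    "keywords": {"range": "string", "multivalued": True, "required": False},
}

def build_form_spec(class_def: dict) -> list[dict]:
    """Derive editor widgets plus per-input validation rules from the schema."""
    spec = []
    for name, slot in class_def.items():
        spec.append({
            "field": name,
            "widget": "text-list" if slot.get("multivalued") else "text",
            "rules": {
                "required": slot.get("required", False),
                "max_length": slot.get("max_length"),
            },
        })
    return spec

def build_viewer_rows(class_def: dict, instance: dict) -> list[tuple[str, str]]:
    """Derive a deterministic viewer layout: one row per schema-defined field."""
    return [(name, str(instance.get(name, ""))) for name in class_def]

print(build_form_spec(DATASET_CLASS))
print(build_viewer_rows(DATASET_CLASS, {"title": "Study 01", "keywords": ["PD"]}))
```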
Existing work

We have already done work in this domain of automatic user interface generation. The first iteration was datalad-catalog, which automatically generates a data catalog and its entries from a schema and metadata, although it was missing the linked data building block. The current iteration is shacl-vue.
For future development, the particular tech stack (such as VueJS and RDF-Ext used in shacl-vue) is not necessarily the most important consideration, and can actually be a deterrent to progress when considering the bus factor. What is important at this stage is to continue developing in order to identify underlying functional components of such autogeneration tools, which can then be replicated and improved upon in future iterations irrespective of the chosen tech stack.
C) A serialized and portable format for (DataLad) datasets and their annotations
Expressing datasets and their relations as semantic and linked metadata in a simple format (think RDF triples in a plain text file) is something that follows naturally from the implementation and use of the tools described above. Such a process, and the resulting serialized metadata, have several benefits:
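To make the "RDF triples in a plain text file" part concrete, a small sketch using rdflib (identifiers are placeholders): the same record can be written to and re-read from a plain Turtle file, independently of any particular tool.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
ds = URIRef("https://example.org/datasets/study-01")
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Study 01 raw data")))

# Serialize to a plain text file that can travel with (or alongside) the dataset ...
g.serialize("study-01.ttl", format="turtle")

# ... and load it back anywhere, with no DataLad/git required to read it.
g2 = Graph().parse("study-01.ttl", format="turtle")
print(g2.serialize(format="turtle"))
```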
Existing work
Metadata extraction is already supported via the `datalad-metalad` extension, and several extractors have been added to that repertoire (`metalad_core` for datasets and files, `bids_dataset`, `datacite_gin`, ...). Extractors take a structured dataset with specific attributes as input and output a JSON-serialized metadata record. For interoperability with `datalad-catalog` and its schema, appropriate translators were also developed.
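A rough, hypothetical sketch of the extract-then-translate idea; this is not the datalad-metalad extractor API, and the function names and fields are invented:

```python
import json
from pathlib import Path

def extract_core_metadata(dataset_path: str) -> dict:
    """Toy 'extractor': derive a few attributes from a dataset directory."""
    root = Path(dataset_path)
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "type": "dataset",
        "name": root.name,
        "file_count": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
    }

def translate_to_catalog(record: dict) -> dict:
    """Toy 'translator': map the extracted record onto a catalog-style schema."""
    return {
        "dataset_name": record["name"],
        "metadata_sources": ["toy_core_extractor"],
        "size": record["total_bytes"],
    }

record = extract_core_metadata(".")
print(json.dumps(translate_to_catalog(record), indent=2))
```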
So what should our practical focus be?

Concisely:
- continue developing `datalad-concepts` to improve all building blocks (i.e. base and derived classes) for use in authoring custom schemas
- continue developing `shacl-vue` based on the `datalad-concepts` building blocks
- more generally: exporting said serialized and linked/semantic metadata to common data standards, e.g. `datacite`, `bdbag`/`bagit`, `rocrate`, or importing from them, i.e. translation