Add enrich-via-doi #2

Merged
adina merged 10 commits from msz/knowledge-enrichment:enrich-doi into main 2026-03-26 04:30:57 +00:00
Member

This adds a script for enriching publication records through doi.org content negotiation, based on my earlier work on the TRR page generator.

It does author matching (by ORCID, which we've seen is rarely going to be complete, but should be unambiguous) and license ID mapping based on the SPDX file (e.g. from the Creative Commons canonical URLs used by Crossref to the SPDX PIDs which we use), on top of filling in fields like title, date, etc. It does not overwrite existing fields, only adds ones that are missing (the exception is the date, which is changed if the external source has more (yyyy, mm, dd) components).
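For illustration, the ORCID-based author matching could be sketched roughly as below. This is a stdlib-only sketch, not the script's actual code; the field names (`identifiers`, `pid`, `ORCID`) are assumptions modeled on the example records further down.

```python
def build_orcid_index(person_records):
    """Map bare ORCID -> person PID, scanning each record's identifiers.
    Field names are illustrative, not taken from the actual script."""
    index = {}
    for person in person_records:
        for ident in person.get("identifiers") or []:
            if "orcid.org/" in ident:
                # normalize "https://orcid.org/0000-...-0097" to the bare ID
                index[ident.rstrip("/").rsplit("/", 1)[-1]] = person["pid"]
    return index

def match_authors(external_authors, orcid_index):
    """Return PIDs of known persons among externally reported authors."""
    matched = []
    for author in external_authors:
        orcid = (author.get("ORCID") or "").rstrip("/").rsplit("/", 1)[-1]
        if orcid in orcid_index:
            matched.append(orcid_index[orcid])
    return matched
```

Matching on the bare ORCID rather than the full URL sidesteps http/https and trailing-slash variation between sources.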

There is some "flavour" which I'm not sure fits with the current repo. For example, $PWD/.cache is used for caching (doi.org requests and the SPDX license file), which speeds up re-runs - less important for actions, but very useful for hands-on experimentation. Also, the CLI is in the form of INPUT PERSONS OUTPUT (with - used for stdin/stdout), which is different from the other script in this repo, but I find it very natural (PERSONS in this case are all person records, used for cross-referencing by ORCID). I'm happy to discuss how this could be made more unified. Also, I am not adding an action currently, not knowing whether it would be useful.

Dividing into commits was useful mostly for me; if merging, squash would be a good option.

ORINOCO project people: do you think this would be of any use?

### Example usage (updated)

Preparation (get person records for cross-referencing):

```
export DUMPTHINGS_APIURL=https://pool.psychoinformatics.de/api
mkdir .cache
dtc get-records ${DUMPTHINGS_APIURL} public -C XYZPerson > .cache/Person.jsonl
dtc get-records ${DUMPTHINGS_APIURL} public -C Rule > .cache/Rule.jsonl
```

Take a record from the pool and (for demonstration purposes) trim it to just PID, identifiers, and schema type, then feed it to the enrichment script together with the previously fetched person and rule records:

```
dtc get-records ${DUMPTHINGS_APIURL} --pid xyzrins:publications/9103ab93-a2ae-45c2-b2b0-e922bbb4e94a public |
  jq -c "{pid, identifiers, schema_type}" |
  uv run .forgejo/tools/enrich-via-doi.py --persons .cache/Person.jsonl --rules .cache/Rule.jsonl - - |
  jq
```
Remove the TRR prefix, temporarily disable the cache, be flexible about
inlined/pid-only input, add help, and change the regular script to a uv script.
Because of the origins as part of a page generator, the Person
enrichment added entire Person records. API submission only needs PIDs.
This one-line change adapts the script to use in API submission. Page
generators can inline the records if needed.
Owner

There is some "flavour" which I'm not sure whether it fits with the current repo. For example, $PWD/.cache is used for caching (doi.org requests and spdx license file) which speeds up re-runs - less important for actions, but very useful for hands-on experimentation. Also, the CLI is in the form of INPUT PERSONS OUTPUT (with - used for stdin/stdout) which is different from the other script in this repo, but I find it very natural (PERSONS in this case are all person records, used for cross-referencing by ORCID). I'm happy to discuss how this could be made more unified.

I did not want to set a pattern with the very first enrichment tool I added to this repo, so I don't think there is a need to unify with anything in this repo. At the same time, I'm also personally fully fine with different tools working in various ways.

Also, I am not adding an action currently, not knowing whether it would be useful.

I think the tool is very useful. Thank you for bringing it here! I think there is general use for someone bulk processing lots of publications with this, but I'm wondering if it could also help an individual "be lazy" in the editor with the help of CI. Currently the workflow is "edit in the web editor" -> "submit" (-> "curate") -> "run script locally" -> "submit" -> "curate", and the shift from web to cmd is probably a bit of a hurdle (at least I dread it - either one at a time is fine, but I hate switching). If this could be an action that one can trigger on demand and that self-submits, one could maybe do "add doi of publication" -> "submit" -> "trigger action" -> "curate enriched record"?

Using the script for record enrichment (i.e. feeding data back into the
pool) means that the record we produce should not contain inlined items.
We still need access to all known person records (to match external
metadata with existing records), but in those we are only interested in
the ORCID IDs. So the input (publication) records do not require prior
inlining of attribution objects.

What we really need is a bidirectional mapping: from PID to ORCID (to
know which contributors are already credited) and from ORCID to PID (to
add more contributors). This functionality is conveniently provided by
bidict, which is an external dependency, but a tiny one (33 kB wheel).
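The bidirectional lookup this describes can be sketched with two plain dicts (the actual commit uses the bidict package, which maintains the same one-to-one invariant for you; names here are illustrative):

```python
class PidOrcidMap:
    """Bidirectional PID <-> ORCID mapping kept in two plain dicts.
    Illustrative sketch of what bidict provides: a one-to-one mapping
    queryable from either side."""

    def __init__(self):
        self.orcid_by_pid = {}
        self.pid_by_orcid = {}

    def add(self, pid, orcid):
        self.orcid_by_pid[pid] = orcid
        self.pid_by_orcid[orcid] = pid
```

With this, checking which contributors are already credited (PID -> ORCID) and resolving new contributors found in external metadata (ORCID -> PID) are both single dict lookups.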

This change allows the code to work with records containing attributions
in which objects are not inlined. In the current form, we lose the
ability to work with inlined records (this can be brought back by
looking up the record's pid), but in the enrichment context we don't need
that, and not inlining is leaner. In this context, any rendering (e.g. a
website) would probably use the record after it has been submitted back
into the pool.
Because a Rule record can declare exact mappings, we will use them
instead of an ad-hoc downloaded SPDX file. This makes the process more
self-contained.

This means that there is more reliance on the information maintained in
the pool (vs. reliance on the use of external identifiers and
information available through them from elsewhere) but in the case of
mapping license identifiers (in practice, between spdx and creative
commons namespaces) this seems to be in line with the spirit of things.

One thing I wasn't quite sure about is the trailing "/" on identifiers
("canonical URLs" for Creative Commons do have them) and whether it
should be allowed / expected in the exact mappings. To be on the safe
side, the comparisons for exact mappings are done with the trailing "/"
stripped.
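The safe-side comparison amounts to normalizing both sides before lookup, along these lines (function names are illustrative, not from the script; the example mapping target is a made-up PID):

```python
def normalize_identifier(url: str) -> str:
    """Strip the trailing slash so that e.g. a Creative Commons
    canonical URL matches an exact mapping written without it."""
    return url.rstrip("/")

def lookup_exact(mapping: dict, identifier: str):
    """Look up an identifier among exact mappings, ignoring a trailing '/'."""
    normalized = {normalize_identifier(k): v for k, v in mapping.items()}
    return normalized.get(normalize_identifier(identifier))
```

Both the mapping keys and the queried identifier are normalized, so the lookup succeeds regardless of which side carries the slash.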
This changes the CLI to only have INPUT and OUTPUT as arguments;
additional records (Person and Rule) now need to be provided as options.
If not provided, the respective part of enrichment won't be performed.

With neither Person nor Rule, enrichment can still add date, ISSN,
title, and abstract.
Author
Member

Thanks for the encouragement @adina! I made a few changes which I think make the way it works pretty nice (described in the commits, although I still expect them to be squashed). Now, licenses will be matched only against supplied Rule records (using PID and exact mappings), not the SPDX data. Both Person and Rule records now need to come as file-type options (so matching authors and licenses becomes optional). I updated the example above.

Regarding usage in actions: yes, I think "be lazy" is a great usage pattern. So far I used it like that but locally: get an incomplete record from the pool, run enrichment from my terminal (as in the example above, but without stripping any information), submit the resulting JSON, finish editing in the UI.

One thing I am not yet sure about with actions: should we limit the action to selected, incomplete, or most recent records (and how)? We currently have 72 publication records, and 72 requests to doi.org is probably not too much in the grand scheme of things, but I want to be nice and not re-query all publications every time the workflow runs. I'm open to hearing ideas.

Owner

One thing I am not yet sure about with actions: should we limit the action to selected, incomplete, or most recent records (and how)? We currently have 72 publication records, and 72 requests to doi.org is probably not too much in the grand scheme of things, but I want to be nice and not re-query all publications every time the workflow runs. I'm open to hearing ideas.

Good question. I think there are several valid use cases. I definitely think one may have a big bulk of publications and would want to run an overall check on whether they are complete, and as the pool could over time also grow in person records, this may even be something that one would want to run every once in a while (in my "being lazy" thinking, this would spare a person who adds a person record from working through all publications to add that person as an author). So for that use case, running it (occasionally, not every time) on entire collections (public/protected) would make sense to me.
But I also think there is the use case of "I only have one or two publications I just submitted". I wonder if specifying a user's inbox could be used to subselect records? Then one can submit the "unfinished" record, have an enricher complement it with additional information, and a user reviews and curates the finished record.

This adds a workflow which runs the publication enrichment via doi.org.

Given that the doi.org information will change very rarely, and we don't
(yet) have ways to say "this record is complete / needs no enrichment",
the workflow currently only has a "workflow dispatch" trigger.

Two optional inputs can be specified when dispatching the workflow: list
of PIDs and inbox label. These will limit processing to a subset of
records. Otherwise, all records will be processed.

Properties which can change based on the pool / data model (API URL,
collection name, class names) are kept as env variables to make tweaks
easier.

In the last step (process record), inputs are assigned (export) to
environment variables to avoid issues when the runner is filling them in
(e.g. an end of line after `<<<` when pids are not provided was a syntax
error). To supply the optional `--incoming label` argument to dtc
get-records, parameter expansion is used (`${parameter:+word}` expands
to nothing if parameter is null or unset; otherwise the expansion of
word is used).
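The `${parameter:+word}` behaviour can be seen in a tiny sketch (the variable name `INBOX` and the label value are illustrative, not from the workflow):

```shell
#!/bin/sh
# ${parameter:+word} expands to nothing when INBOX is unset or null,
# otherwise to the expansion of word (here, the optional flag + value).
INBOX=""
echo "without inbox: dtc get-records ${INBOX:+--incoming $INBOX}"
INBOX="curation"
echo "with inbox:    dtc get-records ${INBOX:+--incoming $INBOX}"
```

This keeps the command line free of a dangling `--incoming` flag when no label was provided, without any if/else in the step.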
Author
Member

@adina I added a forgejo workflow which would run the provided action. I tested it in another repo (with a different URL to fetch the script and with --dry-run added to the dtc post-record), and it seemed to work nicely.

Following your suggestions, I added parameters to the workflow (they can be entered when using the manual workflow dispatch, and probably also via Forgejo API), so that one or more PIDs can be specified, or an inbox label. These would restrict processing to those records. It complicated the bash script in the last step a little, but I think it should be pretty legible still.

One issue with the inbox - as long as we assume that the action runs with a token dedicated to actions (separate pool user, so to speak) then the inbox label maybe isn't really that helpful, as the records would end up in a different inbox, so the curation could be a bit involved. But maybe that's fine.

From my point of view, this PR should be complete now, but if you have further suggestions, feel free to let me know. Please do keep in mind that I'm not authorized to merge the PR, so in the end it will need a decision.

msz force-pushed enrich-doi from 4985aeaa80 to d340ec508c 2026-03-25 15:10:09 +00:00 Compare
adina merged commit 47b9ddc774 into main 2026-03-26 04:30:56 +00:00
Reference
orinoco/knowledge-enrichment!2