Add enrich-via-doi #2

Merged
adina merged 10 commits from msz/knowledge-enrichment:enrich-doi into main 2026-03-26 04:30:57 +00:00

10 commits

Author SHA1 Message Date
37b0d1d30d Mention the workflow in the readme 2026-03-25 16:49:45 +01:00
d340ec508c Add enrich publications workflow
This adds a workflow which runs the publication enrichment via doi.org.

Given that the information from doi.org changes very rarely, and we
don't (yet) have ways to say "this record is complete / needs no
enrichment", the workflow currently only has a "workflow dispatch"
trigger.

Two optional inputs can be specified when dispatching the workflow: a
list of PIDs and an inbox label. These limit processing to a subset of
records. Otherwise, all records are processed.

Properties which can change based on the pool / data model (API URL,
collection name, class names) are kept as env variables to make tweaks
easier.

In the last step (process record), inputs are exported to environment
variables to avoid issues when the runner fills them in (e.g., an end
of line after `<<<` when PIDs are not provided was a syntax error). To
supply the optional `--incoming label` argument to dtc get-records,
parameter expansion is used (`${parameter:+word}` expands to nothing if
parameter is null or unset; otherwise the expansion of word is used).
2026-03-25 16:09:47 +01:00
4dda0f3c8b Treat ISSN in csl+json metadata as optional 2026-03-25 15:07:10 +01:00
8a8c7bda20 Mention DOI enrichment in the README 2026-03-17 19:48:29 +01:00
1c4b1f4fe9 Make person records optional for DOI enrichment
This changes the CLI to only have INPUT and OUTPUT as arguments;
additional records (Person and Rule) now need to be provided as options.
If not provided, the respective part of the enrichment won't be performed.

With neither Person nor Rule, enrichment can still add date, ISSN,
title, and abstract.
2026-03-16 18:01:40 +01:00
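
The CLI shape described in 1c4b1f4fe9 (two positional arguments, with Person and Rule records as options) could be sketched roughly as below. This is only an illustration using `argparse`: the option names `--persons` and `--rules`, the program name, and the helper `enabled_steps` are hypothetical, and the actual script may use a different CLI library and different names.

```python
import argparse


def build_parser():
    """Build a parser with INPUT and OUTPUT as the only positional arguments."""
    parser = argparse.ArgumentParser(prog="enrich-via-doi")  # assumed name
    parser.add_argument("input", help="file with publication records")
    parser.add_argument("output", help="file to write enriched records to")
    # Hypothetical option names; both are optional and default to None.
    parser.add_argument("--persons", default=None, help="file with Person records")
    parser.add_argument("--rules", default=None, help="file with Rule records")
    return parser


def enabled_steps(args):
    """List the enrichment parts that can run with the given options."""
    # These four need neither Person nor Rule records.
    steps = ["date", "ISSN", "title", "abstract"]
    if args.persons is not None:
        steps.append("attributions")
    if args.rules is not None:
        steps.append("license")
    return steps


args = build_parser().parse_args(["in.jsonl", "out.jsonl", "--rules", "rules.jsonl"])
print(enabled_steps(args))
```

Leaving both options out still yields the four basic steps, which matches the behaviour described in the commit message.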
b7a6f70114 Use Rule records, not SPDX, for license discovery
Because a Rule record can declare exact mappings, we will use them
instead of an ad-hoc downloaded spdx file. This makes the process more
self-contained.

This means that there is more reliance on the information maintained in
the pool (vs. reliance on the use of external identifiers and
information available through them from elsewhere) but in the case of
mapping license identifiers (in practice, between spdx and creative
commons namespaces) this seems to be in line with the spirit of things.

One thing I wasn't quite sure about is the trailing "/" on the
identifiers ("canonical URLs" for creative commons do have them) and
whether it should be allowed / expected in the exact mappings. To be on
the safe side, the comparisons for exact mappings are done with the
trailing "/" stripped.
2026-03-16 17:50:11 +01:00
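
The trailing-"/" handling described in b7a6f70114 might look something like the following. This is a minimal sketch, not the code from the commit: the record layout (a `pid` key and an `exact_mappings` list) and the function names are assumptions made for illustration.

```python
def normalize(identifier):
    """Strip trailing slashes so both spellings of a URL compare equal."""
    return identifier.rstrip("/")


def find_by_exact_mapping(license_url, rules):
    """Return the PID of the first rule whose exact mappings contain the URL.

    `rules` is assumed to be a list of dicts with hypothetical keys
    "pid" and "exact_mappings"; the real Rule records may differ.
    """
    wanted = normalize(license_url)
    for rule in rules:
        mappings = rule.get("exact_mappings", [])
        if any(normalize(m) == wanted for m in mappings):
            return rule["pid"]
    return None


# Example data, made up for this sketch.
rules = [
    {
        "pid": "https://spdx.org/licenses/CC-BY-4.0",
        "exact_mappings": ["https://creativecommons.org/licenses/by/4.0/"],
    },
]

# Matches whether or not the incoming URL carries the trailing "/".
print(find_by_exact_mapping("https://creativecommons.org/licenses/by/4.0", rules))
```

Normalizing both sides means the mappings in the pool can be stored with or without the trailing "/" and still match.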
f83aee9da4 Switch DOI enrichment to work on PIDs, not inlined records
Using the script for record enrichment (i.e., feeding data back into the
pool) means that the record we produce should not contain inlined items.
We still need access to all known person records (to match external
metadata with existing records) but in those we are only interested in
the ORCID IDs. So the input (publication) records do not require prior
inlining of attribution objects.

What we really need is a bidirectional mapping: from PID to ORCID (to
know which contributors are already credited) and from ORCID to PID (to
add more contributors). This functionality is conveniently provided by
bidict, which is an external dependency but it is a tiny one (33 kB
wheel).

This change allows the code to work with records containing attributions
in which objects are not inlined. In the current form, we lose the
ability to work with inlined records (this can be brought back by
looking up the record's PID) but in the enrichment context we don't need
that, and not inlining is leaner. In this context, any rendering (e.g.
website) would probably use the record after it has been submitted back
into the pool.
2026-03-13 20:53:58 +01:00
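
The two lookup directions described in f83aee9da4 can be illustrated as follows. The commit uses the bidict package for this; to stay dependency-free, the sketch below uses two plain dicts as a stand-in. The record keys (`pid`, `orcid`), the function names, and the example PIDs are assumptions made for illustration.

```python
def build_maps(persons):
    """Build PID -> ORCID and ORCID -> PID lookups from Person records."""
    orcid_by_pid = {}
    pid_by_orcid = {}
    for person in persons:
        orcid = person.get("orcid")
        if orcid is None:
            continue  # persons without an ORCID cannot be matched
        orcid_by_pid[person["pid"]] = orcid
        pid_by_orcid[orcid] = person["pid"]
    return orcid_by_pid, pid_by_orcid


def pids_to_add(attributed_pids, external_orcids, orcid_by_pid, pid_by_orcid):
    """Return PIDs of known persons in external metadata not yet credited."""
    # PID -> ORCID: which ORCIDs are already credited on the record.
    credited = {orcid_by_pid[p] for p in attributed_pids if p in orcid_by_pid}
    # ORCID -> PID: which additional contributors exist as person records.
    return [
        pid_by_orcid[o]
        for o in external_orcids
        if o not in credited and o in pid_by_orcid
    ]


# Example data, made up for this sketch.
persons = [
    {"pid": "ex:person-a", "orcid": "0000-0002-1825-0097"},
    {"pid": "ex:person-b", "orcid": "0000-0001-5109-3700"},
    {"pid": "ex:person-c"},
]
orcid_by_pid, pid_by_orcid = build_maps(persons)
print(pids_to_add(
    ["ex:person-a"],
    ["0000-0002-1825-0097", "0000-0001-5109-3700"],
    orcid_by_pid,
    pid_by_orcid,
))
```

Only PIDs are returned, so the enriched record never needs inlined Person objects.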
7972ff4213 Add only PID when enriching publication with attributions
Because of the origins as part of a page generator, the Person
enrichment added entire Person records. API submission only needs PIDs.
This one-line change adapts the script for use in API submission. Page
generators can inline the records if needed.
2026-03-09 18:20:59 +01:00
a4916305ca Tweak doi enrichment
Remove TRR prefix, temporarily disable cache, accept both inlined and
PID-only records, add help, convert the regular script to a uv script.
2026-03-09 17:04:46 +01:00
c3ecf7e8a2 Copy enrich-via-doi from TRR's pool-publication-page 2026-03-09 17:02:57 +01:00