Add enrich-via-doi #2

Merged

adina merged 10 commits from msz/knowledge-enrichment:enrich-doi into main

2026-03-26 04:30:57 +00:00

Author	SHA1	Message	Date
Michał Szczepanik	37b0d1d30d	Mention the workflow in the readme	2026-03-25 16:49:45 +01:00
Michał Szczepanik	d340ec508c	Add enrich publications workflow This adds a workflow which runs the publication enrichment via doi.org. Given that the doi.org information will change very rarely, and we don't (yet) have ways to say "this record is complete / needs no enrichment", the workflow currently only has a "workflow dispatch" trigger. Two optional inputs can be specified when dispatching the workflow: a list of PIDs and an inbox label. These will limit processing to a subset of records. Otherwise, all records will be processed. Properties which can change based on the pool / data model (API URL, collection name, class names) are kept as env variables to make tweaks easier. In the last step (process record), inputs are assigned (export) to environment variables to avoid issues when the runner is filling them in (e.g. end of line after `<<<` when PIDs are not provided was a syntax error). To supply the optional `--incoming label` argument to dtc get-records, parameter expansion is used (`${parameter:+word}` expands to nothing if parameter is null or unset, otherwise the expansion of word is used).	2026-03-25 16:09:47 +01:00
Michał Szczepanik	4dda0f3c8b	Treat ISSN in csl+json metadata as optional	2026-03-25 15:07:10 +01:00
Michał Szczepanik	8a8c7bda20	Mention DOI enrichment in the README	2026-03-17 19:48:29 +01:00
Michał Szczepanik	1c4b1f4fe9	Make person records optional for DOI enrichment This changes the CLI to only have INPUT and OUTPUT as arguments; additional records (Person and Rule) now need to be provided as options. If not provided, the respective part of enrichment won't be performed. With neither Person nor Rule, enrichment can still add date, ISSN, title, and abstract.	2026-03-16 18:01:40 +01:00
Michał Szczepanik	b7a6f70114	Use Rule records, not SPDX, for license discovery Because a Rule record can declare exact mappings, we will use them instead of an ad-hoc downloaded SPDX file. This makes the process more self-contained. This means that there is more reliance on the information maintained in the pool (vs. reliance on the use of external identifiers and information available through them from elsewhere) but in the case of mapping license identifiers (in practice, between SPDX and Creative Commons namespaces) this seems to be in line with the spirit of things. One thing I wasn't quite sure about is the trailing "/" on the identifiers ("canonical URLs" for Creative Commons do have them) and whether they should be allowed / expected in the exact mappings. The comparisons for exact mappings are done with the trailing "/" stripped to be on the safe side.	2026-03-16 17:50:11 +01:00
Michał Szczepanik	f83aee9da4	Switch DOI enrichment to work on PIDs, not inlined records Using the script for record enrichment (i.e. feeding data back into the pool) means that the record we produce should not contain inlined items. We still need access to all known person records (to match external metadata with existing records) but in those we are only interested in the ORCID IDs. So the input (publication) records do not require prior inlining of attribution objects. What we really need is a bidirectional mapping: from PID to ORCID (to know which contributors are already credited) and from ORCID to PID (to add more contributors). This functionality is conveniently provided by bidict, which is an external dependency but it is a tiny one (33 kB wheel). This change allows the code to work with records containing attributions in which objects are not inlined. In the current form, we lose the ability to work with inlined records (this can be brought back by looking up the record's PID) but in the enrichment context we don't need that, and not inlining is leaner. In this context, any rendering (e.g. website) would probably use the record after it has been submitted back into the pool.	2026-03-13 20:53:58 +01:00
Michał Szczepanik	7972ff4213	Add only PID when enriching publication with attributions Because of the origins as part of a page generator, the Person enrichment added entire Person records. API submission only needs PIDs. This one-line change adapts the script to use in API submission. Page generators can inline the records if needed.	2026-03-09 18:20:59 +01:00
Michał Szczepanik	a4916305ca	Tweak DOI enrichment Remove TRR prefix, temporarily disable cache, be flexible for inlined/pid-only, add help, change the regular script to a uv script.	2026-03-09 17:04:46 +01:00
Michał Szczepanik	c3ecf7e8a2	Copy enrich-via-doi from TRR's pool-publication-page	2026-03-09 17:02:57 +01:00
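
Two ideas from the commit messages above (the PID/ORCID mapping in f83aee9da4 and the trailing-"/" comparison in b7a6f70114) can be sketched in a few lines of Python. This is only an illustration, not the code from the PR: the commit uses bidict, while the sketch keeps two plain dicts to stay stdlib-only, and the record keys ("pid", "orcid"), the function names, and all sample identifiers are assumptions made up for the example.

```python
# Illustrative sketch only. Record keys and function names are hypothetical;
# the PR itself relies on bidict for the two-way PID <-> ORCID lookup.


def build_pid_orcid_maps(persons):
    """Return (pid_to_orcid, orcid_to_pid) for person records that have both."""
    pid_to_orcid = {}
    orcid_to_pid = {}
    for person in persons:
        pid = person.get("pid")
        orcid = person.get("orcid")
        if not pid or not orcid:
            continue  # a person without an ORCID cannot be matched
        if orcid_to_pid.get(orcid, pid) != pid:
            # keep the mapping one-to-one, as a bidirectional map requires
            raise ValueError(f"ORCID {orcid} is claimed by more than one PID")
        pid_to_orcid[pid] = orcid
        orcid_to_pid[orcid] = pid
    return pid_to_orcid, orcid_to_pid


def missing_contributor_pids(credited_pids, external_orcids,
                             pid_to_orcid, orcid_to_pid):
    """PIDs of known persons listed in external metadata but not yet credited."""
    # PID -> ORCID: which ORCIDs are already credited on the publication
    credited = {pid_to_orcid[p] for p in credited_pids if p in pid_to_orcid}
    # ORCID -> PID: which additional known persons could be added
    return [
        orcid_to_pid[o]
        for o in external_orcids
        if o in orcid_to_pid and o not in credited
    ]


def same_identifier(left, right):
    """Compare two identifiers while ignoring trailing '/' characters."""
    return left.rstrip("/") == right.rstrip("/")
```

Only PIDs are returned, in line with 7972ff4213, which adds PIDs rather than whole Person records. ORCIDs that match no known person are skipped, so the function never invents a record.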