Add enrich-via-doi #2
This adds a script for enriching publication records through doi.org content negotiation, based on my earlier work on the TRR page generator.
It does author matching (by ORCID, which, as we've seen, will rarely be complete, but should be unambiguous) and license ID mapping based on the SPDX file (e.g. from the Creative Commons canonical URLs used by Crossref to the SPDX PIDs which we use), on top of filling in fields like title, date, etc. It will not overwrite existing fields, only add those that are missing (the exception is the date, which will be changed if the external source has more (yyyy, mm, dd) components).
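The fill-in-missing-fields policy can be sketched roughly as follows. This is an illustration, not the script itself: the field names (`title`, `date`, `publisher`) and the ISO-style date strings are assumptions, and the real record schema may differ.

```python
def date_components(value):
    """Number of yyyy/mm/dd components in an ISO-style date string."""
    return len(value.split("-")) if value else 0

def enrich(record, external):
    """Return a copy of `record` with missing fields filled from `external`."""
    result = dict(record)
    for key, value in external.items():
        if key == "date":
            # Special case: replace the date if the external one is more precise.
            if date_components(value) > date_components(result.get("date")):
                result[key] = value
        elif not result.get(key):
            # Only add fields that are missing; never overwrite existing ones.
            result[key] = value
    return result

record = {"title": "Existing title", "date": "2023"}
external = {"title": "DOI title", "date": "2023-05-17", "publisher": "ACME"}
print(enrich(record, external))
# {'title': 'Existing title', 'date': '2023-05-17', 'publisher': 'ACME'}
```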
There is some "flavour" which I'm not sure fits with the current repo. For example, `$PWD/.cache` is used for caching (doi.org requests and the SPDX license file), which speeds up re-runs; that is less important for actions, but very useful for hands-on experimentation. Also, the CLI is in the form of INPUT PERSONS OUTPUT (with `-` used for stdin/stdout), which is different from the other script in this repo, but I find it very natural (PERSONS in this case are all person records, used for cross-referencing by ORCID). I'm happy to discuss how this could be made more unified. Also, I am not adding an action currently, not knowing whether it would be useful.

Dividing into commits was useful mostly for me; if merging, squash would be a good option.
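The caching idea can be sketched like this. The cache layout (one file per URL, keyed by a hash) is an assumption; a temporary directory stands in for `$PWD/.cache` here, and a stub function stands in for the doi.org request.

```python
import hashlib
import tempfile
from pathlib import Path

def cached_fetch(url, fetch, cache_dir):
    """Return the cached body for `url`, calling `fetch(url)` only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / hashlib.sha256(url.encode()).hexdigest()
    if path.exists():
        return path.read_text()  # cache hit: no network request
    body = fetch(url)
    path.write_text(body)
    return body

calls = []
def fake_fetch(url):  # stand-in for a doi.org content-negotiation request
    calls.append(url)
    return "metadata for " + url

cache = Path(tempfile.mkdtemp())  # the real script would use Path(".cache")
url = "https://doi.org/10.5555/example"
first = cached_fetch(url, fake_fetch, cache)
second = cached_fetch(url, fake_fetch, cache)  # served from cache
print(len(calls))  # 1
```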
ORINOCO project people: do you think this is at all useful?
Example usage
(updated)
Preparation (get person records for cross-referencing):
Take a record from the pool and (for demonstration purposes) trim it to just PID, identifiers, and schema type, then feed it to the enrichment script together with the previously fetched person and rule records:
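A sketch of the trimming step, using hypothetical field names (the real record schema may differ):

```python
def trim(record, keep=("pid", "identifiers", "type")):
    """Keep only the PID, identifiers, and schema type of a record."""
    return {k: v for k, v in record.items() if k in keep}

full = {
    "pid": "pub-0001",
    "type": "Publication",
    "identifiers": ["doi:10.5555/example"],
    "title": "Will be re-filled by the enrichment script",
}
print(trim(full))
# {'pid': 'pub-0001', 'type': 'Publication', 'identifiers': ['doi:10.5555/example']}
```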
I did not want to set a pattern with the very first enrichment tool I added to this repo, so I don't think there is a need to unify with anything in this repo. At the same time, I'm also personally fully fine with different tools working in various ways.
I think the tool is very useful. Thank you for bringing it here! I think there is general use for someone bulk processing lots of publications with this, but I'm wondering if it could also help an individual "be lazy" in the editor with the help of CI. Currently the workflow is "edit in the web editor" -> "submit" (-> "curate") -> "run script locally" -> "submit" -> "curate", and the shift from web to cmd is probably a bit of a hurdle (at least I dread it - either one at a time is fine, but I hate switching). If this could be an action that one can trigger on demand and that self-submits, one could maybe do "add doi of publication" -> "submit" -> "trigger action" -> "curate enriched record"?
Because a Rule record can declare exact mappings, we will use them instead of an ad-hoc downloaded SPDX file. This makes the process more self-contained. It means more reliance on the information maintained in the pool (vs. reliance on external identifiers and the information available through them from elsewhere), but in the case of mapping license identifiers (in practice, between the SPDX and Creative Commons namespaces) this seems to be in line with the spirit of things. One thing I wasn't quite sure about is the trailing "/" on the identifiers ("canonical URLs" for Creative Commons do have them) and whether they should be allowed / expected in the exact mappings. The comparisons for exact mappings are done with the trailing "/" stripped, to be on the safe side.

Thanks for the encouragement @adina! I made a few changes which I think make the way it works pretty nice (described in the commits, although I still expect them to be squashed). Now, licenses are matched only against the supplied Rule records (using PID and exact mappings), not the SPDX data. Both Person and Rule records now need to come as file-type options (so matching authors and licenses becomes optional). I updated the example above.
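The matching against Rule records, with the trailing "/" stripped before comparison, can be sketched as below. The Rule record shape (`pid`, `exact_mappings`) is an assumption for illustration.

```python
def normalize(identifier):
    """Strip a trailing '/' so both URL variants compare equal."""
    return identifier.rstrip("/")

def match_license(url, rules):
    """Return the PID of the Rule whose exact mappings contain `url`, else None."""
    target = normalize(url)
    for rule in rules:
        if any(normalize(m) == target for m in rule["exact_mappings"]):
            return rule["pid"]
    return None

rules = [{
    "pid": "spdx:CC-BY-4.0",
    "exact_mappings": ["https://creativecommons.org/licenses/by/4.0/"],
}]
# Matches with or without the trailing "/":
print(match_license("https://creativecommons.org/licenses/by/4.0", rules))
# spdx:CC-BY-4.0
```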
Regarding usage in actions: yes, I think "be lazy" is a great usage pattern. So far I used it like that but locally: get an incomplete record from the pool, run enrichment from my terminal (as in the example above, but without stripping any information), submit the resulting JSON, finish editing in the UI.
One thing I am not yet sure about with actions: should we limit the action to selected, incomplete, or most recent records (and how)? We currently have 72 publication records, and 72 requests to doi.org is probably not too much in the grand scheme of things, but I want to be nice and not re-query all publications every time the workflow runs. I'm open to ideas.
Good question. I think there are several valid use cases. I definitely think one may have a big bulk of publications and would want to run an overall check of whether they are complete, and as the pool could over time also grow in person records, this may even be something that one would want to run every once in a while (in my "being lazy" thinking, this would spare a person who adds a person record the work of going through all publications to add that person as an author). So for that use case, running it (occasionally, not every time) on entire collections (public/protected) would make sense to me.
But I also think there is the use case of "I only have one or two publications I just submitted". I wonder if specifying a user's inbox could be used to subselect records? Then one could submit the "unfinished" record, have an enricher complement it with additional information, and a user reviews and curates the finished record.
This adds a workflow which runs the publication enrichment via doi.org. Given that the doi.org information will change very rarely, and we don't (yet) have a way to say "this record is complete / needs no enrichment", the workflow currently only has a "workflow dispatch" trigger. Two optional inputs can be specified when dispatching the workflow: a list of PIDs and an inbox label. These limit processing to a subset of records; otherwise, all records will be processed. Properties which can change based on the pool / data model (API URL, collection name, class names) are kept as env variables to make tweaks easier. In the last step (process record), inputs are assigned (`export`) to environment variables to avoid issues when the runner fills them in (e.g. an end of line after `<<<` when no PIDs are provided was a syntax error). To supply the optional `--incoming label` argument to `dtc get-records`, parameter expansion is used (`${parameter:+word}` expands to nothing if parameter is null or unset; otherwise the expansion of word is used).

@adina I added a forgejo workflow which would run the provided action. I tested it in another repo (with a different URL to fetch the script and with `--dry-run` added to the `dtc post-record`), and it seemed to work nicely. Following your suggestions, I added parameters to the workflow (they can be entered when using the manual workflow dispatch, and probably also via the Forgejo API), so that one or more PIDs, or an inbox label, can be specified; these restrict processing to those records. It complicated the bash script in the last step a little, but I think it is still pretty legible.
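To illustrate the `${parameter:+word}` behaviour the workflow relies on, here is a small demo that runs the expansion through `/bin/sh` and echoes the resulting command line instead of actually invoking `dtc` (the `dtc get-records --incoming` form is taken from the discussion above; everything else is scaffolding):

```python
import subprocess

def expand(label):
    """Echo the dtc command line that ${LABEL:+...} expansion would produce."""
    cmd = 'LABEL="$1"; echo dtc get-records ${LABEL:+--incoming "$LABEL"}'
    out = subprocess.run(["sh", "-c", cmd, "sh", label],
                         capture_output=True, text=True)
    return out.stdout.strip()

print(expand(""))       # dtc get-records
print(expand("inbox"))  # dtc get-records --incoming inbox
```

When the label is empty (null), the whole `--incoming` argument vanishes, so no conditional branching is needed in the workflow script.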
One issue with the inbox: as long as we assume that the action runs with a token dedicated to actions (a separate pool user, so to speak), the inbox label maybe isn't really that helpful, as the records would end up in a different inbox, so the curation could be a bit involved. But maybe that's fine.
From my point of view, this PR should be complete now, but if you have further suggestions, feel free to let me know. Please do keep in mind that I'm not authorized to merge the PR, so in the end it will need a decision.
`4985aeaa80` to `d340ec508c`