knowledge-enrichment/.forgejo/workflows/enrich_publications.yml
Michał Szczepanik d340ec508c Add enrich publications workflow
This adds a workflow which runs the publication enrichment via doi.org.

Given that the information from doi.org will change very rarely, and we
don't (yet) have ways to say "this record is complete / needs no
enrichment", the workflow currently only has a "workflow dispatch" trigger.

Two optional inputs can be specified when dispatching the workflow: a
list of PIDs and an inbox label. These will limit processing to a subset
of records. Otherwise, all records will be processed.

Properties which can change based on the pool / data model (API URL,
collection name, class names) are kept as env variables to make tweaks
easier.

In the last step (process records), inputs are assigned (via export) to
environment variables to avoid issues when the runner is filling them in
(e.g. with no PIDs provided, `<<<` followed directly by the end of line
was a syntax error). To supply the optional `--incoming <label>` argument
to `dtc get-records`, parameter expansion is used (`${parameter:+word}`
expands to nothing if parameter is null or unset; otherwise the expansion
of word is used).
2026-03-25 16:09:47 +01:00
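
The parameter expansion described in the commit message can be tried in isolation. The following is a minimal sketch, not part of the workflow: `count_args` is a hypothetical helper that only reports how many arguments it received, and the label values are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: ${parameter:+word} as a way to pass an optional CLI argument.
# count_args is a hypothetical stand-in for a real command such as
# `dtc get-records`; it prints the number of arguments it was given.
count_args() {
    echo "$#"
}

# Empty label: the expansion yields no words at all, so only the one
# fixed argument is passed.
label=""
count_args get-records ${label:+--incoming "$label"}   # prints 1

# Non-empty label: the expansion yields two words, the flag and its
# value. Quoting "$label" inside the expansion keeps a label that
# contains spaces together as a single argument.
label="my inbox"
count_args get-records ${label:+--incoming "$label"}   # prints 3
```

The outer expansion is left unquoted on purpose: quoting all of it would pass a single empty argument when the label is empty, instead of no argument.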

name: Enrich publications via doi.org
on:
  workflow_dispatch:
    inputs:
      pids:
        description: "Limit to these PIDs (comma-separated)"
        required: false
        default: ''
        type: string
      inbox:
        description: "Limit to inbox with this label"
        required: false
        default: ''
        type: string
env:
  DTC_TOKEN: ${{ secrets.POOLTOKEN }}
  DUMPTHINGS_APIURL: https://pool.psychoinformatics.de/api
  DUMPTHINGS_COLLECTION: public
  PERSON_CLASS: XYZPerson
  PUBLICATION_CLASS: XYZPublication
  RULE_CLASS: Rule
jobs:
  enrich-publications:
    name: Enrich publications
    runs-on: debian-latest
    defaults:
      run:
        shell: bash
    steps:
      - name: Install uv
        uses: astral-sh/setup-uv@v6
      - name: Install metadata tools
        run: |
          uv tool install https://hub.psychoinformatics.de/orinoco/query-things.git \
            --with-executables-from dump-things-pyclient
      - name: Fetch script
        run: |
          wget https://hub.psychoinformatics.de/orinoco/knowledge-enrichment/raw/branch/main/.forgejo/tools/enrich-via-doi.py
      - name: Pre-fetch data
        run: |
          mkdir -p .cache
          dtc get-records $DUMPTHINGS_APIURL public -C $PERSON_CLASS > .cache/Person.jsonl
          dtc get-records $DUMPTHINGS_APIURL public -C $RULE_CLASS > .cache/Rule.jsonl
      - name: Process records
        run: |
          # Assign inputs to variables first, so that empty inputs cannot
          # produce invalid shell syntax once the runner fills them in.
          export INBOX_LABEL="${{ inputs.inbox }}"
          export PIDS="${{ inputs.pids }}"
          if [ -n "$PIDS" ]
          then
            # Split the comma-separated PIDs and process them one by one.
            IFS=',' read -ra PID_ARRAY <<< "$PIDS"
            for pid in "${PID_ARRAY[@]}"
            do
              dtc get-records $DUMPTHINGS_APIURL $DUMPTHINGS_COLLECTION --pid "$pid" ${INBOX_LABEL:+--incoming "$INBOX_LABEL"} |
                uv run enrich-via-doi.py --persons .cache/Person.jsonl --rules .cache/Rule.jsonl - - |
                dtc post-records $DUMPTHINGS_APIURL $DUMPTHINGS_COLLECTION $PUBLICATION_CLASS
            done
          else
            # No PIDs given: process all records of the publication class.
            dtc get-records $DUMPTHINGS_APIURL $DUMPTHINGS_COLLECTION --class $PUBLICATION_CLASS ${INBOX_LABEL:+--incoming "$INBOX_LABEL"} |
              uv run enrich-via-doi.py --persons .cache/Person.jsonl --rules .cache/Rule.jsonl - - |
              dtc post-records $DUMPTHINGS_APIURL $DUMPTHINGS_COLLECTION $PUBLICATION_CLASS
          fi
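
The comma-splitting used in the "Process records" step can be checked locally without the runner or `dtc`. This is a small sketch under assumptions: `split_pids` is a hypothetical helper and the PID values are invented placeholders.

```shell
#!/usr/bin/env bash
# Sketch: split a comma-separated string into an array, as the workflow
# does for its `pids` input. split_pids is a hypothetical helper.
split_pids() {
    local pid_array pid
    # IFS=',' applies only to this read; the here-string supplies the input.
    IFS=',' read -ra pid_array <<< "$1"
    # Quoting the array expansion keeps each element as one word.
    for pid in "${pid_array[@]}"; do
        printf '%s\n' "$pid"
    done
}

# Placeholder PIDs; prints one PID per line.
split_pids "ns:one,ns:two,ns:three"
```

Whitespace around the commas is not trimmed by this approach, so PIDs are best passed without spaces.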