Triple Tools
A collection of small command line tools for dumpthings-curation and search support.
The tools are in an early state and not automatically tested. Do not use them on collections with valuable data unless you have a backup or are very brave or reckless.
Installation
Perform the following operations, preferably in a Python virtual environment.
> git clone https://hub.psychoinformatics.de/cmo/triple-tools.git
> cd triple-tools
> pip install .
The commands
This project provides the following CLI commands:
- auto-curate: automatically move records from inboxes to the curated area of a collection
- clean-incoming: delete all records from an inbox of a collection
- list-incoming: list records in inboxes of a collection
- post-records: read records from stdin and post them to inbox or curated area of a collection
- read-pages: read records from collection, curated area of a collection, or specific inboxes
- read-paginated-url: read records from any paginated service endpoints
- build-local-triple-store: read all records from a collection and emit N-Triples
The following sections show the help messages for these commands.
read-pages
Read all pages from a paginated endpoint.
usage: read-pages [-h] [-c CLASS_NAME] [-f FORMAT] [-p PID] [-i LABEL] [-C] [-m MATCHING] [-s PAGE_SIZE] [-F FIRST_PAGE] [-l LAST_PAGE] [--stats] [-P] service_url collection
Get records from a collection on a dump-things-service
This command lists records that are stored in a dump-things-service. By
default all records that are readable with the given token, or the default
token, will be displayed. The output format is JSONL (JSON lines), where
every line contains a record or a record with paging information. If `ttl`
is chosen as the format of the output records, the record content will be a
string that contains a TTL document.
The command supports reading from the curated area only, reading from incoming
areas, or reading records with a given PID.
Pagination information is returned for paginated results, when requested with
`-P/--pagination`. All results are paginated except "get a record with a given PID"
and "get the list of incoming zone labels".
If the environment variable "DUMPTHINGS_TOKEN" is set, its content will be used
as token to authenticate against the dump-things-service.
positional arguments:
service_url
collection
options:
-h, --help show this help message and exit
-c, --class CLASS_NAME
only read records of this class, ignored if "--pid" is provided
-f, --format FORMAT format of the output records ("json" or "ttl")
-p, --pid PID the pid of the record that should be read
-i, --incoming LABEL read from incoming area with the given label in the collection, if LABEL is "-", return the labels
-C, --curated read from the curated area of the collection
-m, --matching MATCHING
return only records that have a matching value (use % as wildcard). Ignored if "--pid" is provided. (NOTE: not all endpoints and backends support matching.)
-s, --page-size PAGE_SIZE
set the page size (1 - 100) (default: 100), ignored if "--pid" is provided
-F, --first-page FIRST_PAGE
the first page to return (default: 1), ignored if "--pid" is provided
-l, --last-page LAST_PAGE
the last page to return (default: None, return all pages), ignored if "--pid" is provided
--stats show the number of records and pages and exit, ignored if "--pid" is provided
-P, --pagination show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])
For a given <base_url> and <collection> the tool will read all pages
returned by <base_url>/<collection>/records/p/, or the respective inbox or the curated area.
The tool reads a token from the environment variable DUMPTHINGS_TOKEN if set.
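The page-by-page reading that read-pages performs can be sketched as follows. This is a minimal sketch, not the actual implementation: the `fetch_page` callback and the response keys `"items"` and `"pages"` are assumptions standing in for the real HTTP request against `<base_url>/<collection>/records/p/` and the service's response schema.

```python
import json

def read_all_pages(fetch_page, first_page=1, last_page=None, page_size=100):
    """Yield records from a paginated endpoint, page by page.

    `fetch_page(page, size)` stands in for the HTTP call; it is assumed
    to return a dict with keys "items" (the records on that page) and
    "pages" (the total number of pages).
    """
    page = first_page
    while True:
        result = fetch_page(page, page_size)
        for record in result["items"]:
            yield record
        # stop at the last available page, or at an explicit --last-page
        if page >= result["pages"] or (last_page is not None and page >= last_page):
            break
        page += 1

# Local stand-in for the paginated service endpoint
def fake_fetch(page, size):
    data = [{"pid": f"rec-{i}"} for i in range(5)]
    start = (page - 1) * size
    return {"items": data[start:start + size], "pages": 3}

# Emit the records as JSONL, like read-pages does
for record in read_all_pages(fake_fetch, page_size=2):
    print(json.dumps(record))
```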
post-records
Post records from JSON lines read from STDIN.
usage: post-records [-h] [--curated] base_url collection class
positional arguments:
base_url
collection
class
options:
-h, --help show this help message and exit
--curated bypass inbox, requires curator token
For a given <base_url>, <collection>, and <class> the tool will
read any JSON lines records from STDIN and post them to
<base_url>/<collection>/(curated/)record/<class>.
The tool reads a token from the environment variable DUMPTHINGS_TOKEN.
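The stdin-to-endpoint flow can be sketched as below. This is an illustrative sketch, not the actual implementation: the `post` callback is a stand-in for the real HTTP POST, while the URL shape `<base_url>/<collection>/(curated/)record/<class>` follows the description above.

```python
import json

def post_records(lines, post, base_url, collection, class_name, curated=False):
    """Parse JSON-lines records and hand each one to `post(url, record)`.

    `post` is an assumed stand-in for the real HTTP call.
    """
    area = "curated/" if curated else ""
    url = f"{base_url}/{collection}/{area}record/{class_name}"
    posted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        post(url, record)
        posted.append(record)
    return posted

# Collect posted (url, record) pairs instead of performing HTTP requests
sent = []
lines = ['{"pid": "a"}', "", '{"pid": "b"}']
post_records(lines, lambda url, rec: sent.append((url, rec)),
             "https://example.org/api", "demo", "Person")
print(len(sent))  # prints 2
```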
auto-curate
Move records from inboxes into the curated part of a collection.
usage: auto-curate [-h] [--destination-service-url DEST_SERVICE_URL] [--destination-collection DEST_COLLECTION] [--destination-token DEST_TOKEN] [-e EXCLUDE] [-l] [-r] [-o] [-p PID] SOURCE_SERVICE_URL SOURCE_COLLECTION
Automatically move records from the incoming areas of a
collection to the curated area of the same collection, or to
the curated area of another collection.
The environment variable "DUMPTHINGS_TOKEN" must contain a token
which is used to authenticate the requests. The token must have
curator rights.
positional arguments:
SOURCE_SERVICE_URL
SOURCE_COLLECTION
options:
-h, --help show this help message and exit
--destination-service-url DEST_SERVICE_URL
select a different dump-things-service, i.e. not SOURCE_SERVICE_URL, as destination for auto-curated records
--destination-collection DEST_COLLECTION
select a different collection, i.e. not the SOURCE_COLLECTION of SOURCE_SERVICE_URL, as destination for auto-curated records
--destination-token DEST_TOKEN
if provided, this token will be used for the destination service, otherwise $DUMPTHINGS_TOKEN will be used
-e, --exclude EXCLUDE
exclude an inbox on the source collection (repeatable)
-l, --list-labels list the inbox labels of the given source collection, do not perform any curation
-r, --list-records list records in the inboxes of the given source collection, do not perform any curation
-o, --list-only [DEPRECATED: use "--list-records"] list records in the inboxes of the given source collection, do not perform any curation
-p, --pid PID if provided, process only records that match the given PID
auto-curate requires that the environment variable DUMPTHINGS_TOKEN is set, and contains a valid curator-token.
build-local-triple-store
Get all records from a collection and emit N-Triples (which can be used by qlever)
usage: build-local-triple-store [-h] schema base_url collection
positional arguments:
schema
base_url
collection
options:
-h, --help show this help message and exit
The tool reads a token from the environment variable DUMPTHINGS_TOKEN if set.
Note: the tool requires a schema location because it performs the JSON->TTL conversion locally. That ensures that all records can be read from the server, even if some cannot be converted to TTL. (If TTL format is requested from the server, the server will not return any record of a page if at least one record on that page cannot be converted from JSON to TTL.)
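The skip-on-failure behavior described in the note can be sketched like this. The `fake_to_ttl` converter and the emitted triple are illustrative stand-ins, not the real schema-driven conversion; the point is that converting record by record lets one bad record be skipped instead of losing a whole page.

```python
def convert_records(records, to_ttl):
    """Convert records one by one, keeping the rest when one fails.

    `to_ttl` stands in for the local JSON->TTL conversion.
    """
    converted, failed = [], []
    for record in records:
        try:
            converted.append(to_ttl(record))
        except ValueError:
            failed.append(record)  # remember the record, keep going
    return converted, failed

# Hypothetical converter: records without a "pid" cannot be converted
def fake_to_ttl(record):
    if "pid" not in record:
        raise ValueError("record has no pid")
    return f'<{record["pid"]}> a <urn:example:Thing> .'

ok, bad = convert_records([{"pid": "x"}, {}, {"pid": "y"}], fake_to_ttl)
```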
clean-incoming
Delete all records from a given inbox of a given collection
usage: clean-incoming [-h] [--list-only] base_url collection label
positional arguments:
base_url
collection
label
options:
-h, --help show this help message and exit
--list-only, -l list records in the inbox, don't remove them
clean-incoming requires that the environment variable CURATOR_TOKEN is set, and contains a valid curator-token.
list-incoming
List the labels of all inboxes of a given collection
usage: list-incoming [-h] [-s] base_url collection
positional arguments:
base_url
collection
options:
-h, --help show this help message and exit
-s, --show-records show the records in the inboxes as well
list-incoming requires that the environment variable CURATOR_TOKEN is set, and contains a valid curator-token.
json2ttl
Convert a stream of JSON lines into a stream of TTL lines, i.e., JSON strings that each contain a TTL document, one per line.
usage: json2ttl [-h] schema
Read JSON records from stdin and convert them to TTL
This command reads one record per line from stdin, either in JSON format or as
a JSON string containing a TTL document, converts it to TTL or JSON, and
prints the result to stdout.
positional arguments:
schema URL of the schema that should be used
options:
-h, --help show this help message and exit
This can be used, for example, together with read-pages to convert all
records in a collection to TTL:
> read-pages 'https://pool.v0.edu.datalad.org/api' 'public'|json2ttl 'https://concepts.datalad.org/s/demo-research-assets/unreleased.yaml'
"@prefix ISSN: <http://identifiers.org/issn/> .\n@prefix bibo: <http://purl.org/ontology/bibo/> .\n@prefix dlcommonmx: <https://concepts.datalad.org/s/common-mixin/unreleased/> .\n@prefix dlrelationsmx: <https://concepts.datalad.org/s/relations-mixin/unreleased/> .\n@prefix xyzra: <https://concepts.datalad.org/s/demo-research-assets/unreleased/> .\n\nISSN:2475-9066 a xyzra:XYZPublicationVenue ;\n dlcommonmx:title \"Journal of Open Source Software\" ;\n dlrelationsmx:kind bibo:Journal .\n\n"
...
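Assuming, as the example output above shows, that each output line of json2ttl is a JSON-encoded string holding one complete TTL document, the raw Turtle text can be recovered with `json.loads`. The sample line below is a shortened, illustrative stand-in for a real output line.

```python
import json

# One (shortened) json2ttl output line: a JSON string wrapping TTL text,
# with newlines escaped as \n inside the JSON encoding.
line = '"@prefix bibo: <http://purl.org/ontology/bibo/> .\\n"'

# json.loads turns the JSON string back into the raw Turtle document
ttl_document = json.loads(line)
print(ttl_document)
```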
read-paginated-url
General tool to read from any paginated endpoint of a dump-things-service
usage: read-paginated-url [-h] [-s PAGE_SIZE] [-F FIRST_PAGE] [-l LAST_PAGE] [--stats] [-f FORMAT] [-m MATCHING] [-p] url
Read paginated endpoint
This command lists all records that are available via paginated endpoints from
a dump-things-service, e.g., from:
https://<service-location>/<collection>/records/p/
If the environment variable "DUMPTHINGS_TOKEN" is set, its content will be used
as token to authenticate against the dump-things-service.
positional arguments:
url url of the paginated endpoint of the dump-things-service
options:
-h, --help show this help message and exit
-s, --page-size PAGE_SIZE
set the page size (1 - 100) (default: 100)
-F, --first-page FIRST_PAGE
the first page to return (default: 1)
-l, --last-page LAST_PAGE
the last page to return (default: None, return all pages)
--stats show information about the number of records and pages and exit; the result is returned as [<total number of pages>, <page size>, <total number of items>]
-f, --format FORMAT format of the output records ("json" or "ttl"). (NOTE: not all endpoints support the format parameter.)
-m, --matching MATCHING
return only records that have a matching value (use % as wildcard). (NOTE: not all endpoints and backends support matching.)
-p, --pagination show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])
read-paginated-url reads a token from the environment variable DUMPTHINGS_TOKEN if it is set.
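A consumer of the `-p/--pagination` output can unpack each JSONL line as shown below. The sample line is illustrative, constructed to follow the five-element format stated in the help text above.

```python
import json

# One hypothetical output line in the -p/--pagination format:
# [<record>, <current page>, <total pages>, <page size>, <total items>]
line = '[{"pid": "rec-1"}, 1, 3, 100, 250]'

# json.loads yields a five-element list that unpacks directly
record, page, total_pages, page_size, total_items = json.loads(line)
```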
SPARQL search over a collection with qlever
To provide SPARQL search for a collection, the following steps are necessary:
- Create N-Triple representation of the records of the store
- Build a qlever index
- Start the qlever server
- Use qlever query to send SPARQL queries to the server
The following commands will perform those steps.
Set the environment variable DATADIR to a directory that should contain your search index,
then create and prepare the directory. Set the environment variable REPOROOT to the directory of the local triple-tools git repository.
> export DATADIR=<directory where the search index should live>
> mkdir -p ${DATADIR}
> cp ${REPOROOT}/recipes/Qleverfile ${DATADIR}
Download records and create the N-Triples store. The command lines below build the
search index for the collection https://pool.psychoinformatics.de/api/protected
> export DUMPTHINGS_TOKEN=<token>
> build-local-triple-store 'https://concepts.datalad.org/s/demo-rse-group/unreleased.yaml' 'https://pool.psychoinformatics.de/api' protected >${DATADIR}/knowledgebase.nt
The following command lines require Docker support.
Build the search index and start the server:
> cd ${DATADIR}
> qlever index --overwrite-existing
> qlever start --kill-existing-with-same-port
Perform a query (there are a number of premade queries for the collection https://pool.psychoinformatics.de/api/protected
in ${REPOROOT}/queries/q*.sparql):
> qlever query "$(cat ${REPOROOT}/queries/q0.sparql)"