Triple Tools
A collection of small command line tools for dumpthings-curation and search support.
The tools are in an early state and not automatically tested. Do not use them on collections with valuable data unless you have a backup or are very brave or reckless.
Installation
Perform the following operations, preferably in a Python virtual environment.
> git clone https://hub.psychoinformatics.de/cmo/triple-tools.git
> cd triple-tools
> pip install .
The commands
This project provides the following CLI commands:
- auto-curate: automatically move records from inboxes to the curated area of a collection
- clean-incoming: delete all records from an inbox of a collection
- list-incoming: list records in inboxes of a collection
- post-records: read records from stdin and post them to inbox or curated area of a collection
- read-pages: read records from collection, curated area of a collection, or specific inboxes
- read-paginated-url: read records from any paginated service endpoints
- build-local-triple-store: read all records from a collection and emit N-Triples
The following sections show the help messages for these commands.
read-pages
Read all pages from a paginated endpoint.
usage: read-pages [-h] [-c CLASS_NAME] [-f FORMAT] [-p PID] [-i LABEL] [-C] [-m MATCHING] [-s PAGE_SIZE] [-F FIRST_PAGE] [-l LAST_PAGE] [--stats] [-P] service_url collection
Get records from a collection on a dump-things-service
This command lists records that are stored in a dump-things-service. By
default all records that are readable with the given token, or the default
token, will be displayed. The output format is JSONL (JSON lines), where
every line contains a record or a record with paging information. If `ttl`
is chosen as the format of the output records, the record content will be a
string that contains a TTL document.
The command supports reading from the curated area only, reading from incoming
areas, or reading records with a given PID.
Pagination information is returned for paginated results, when requested with
`-P/--pagination`. All results are paginated except "get a record with a given PID"
and "get the list of incoming zone labels".
If the environment variable "DUMPTHINGS_TOKEN" is set, its content will be used
as token to authenticate against the dump-things-service.
positional arguments:
service_url
collection
options:
-h, --help show this help message and exit
-c, --class CLASS_NAME
only read records of this class, ignored if "--pid" is provided
-f, --format FORMAT format of the output records ("json" or "ttl")
-p, --pid PID the pid of the record that should be read
-i, --incoming LABEL read from incoming area with the given label in the collection, if LABEL is "-", return the labels
-C, --curated read from the curated area of the collection
-m, --matching MATCHING
return only records that have a matching value (use % as wildcard). Ignored if "--pid" is provided. (NOTE: not all endpoints and backends support matching.)
-s, --page-size PAGE_SIZE
set the page size (1 - 100) (default: 100), ignored if "--pid" is provided
-F, --first-page FIRST_PAGE
the first page to return (default: 1), ignored if "--pid" is provided
-l, --last-page LAST_PAGE
the last page to return (default: None, return all pages), ignored if "--pid" is provided
--stats show the number of records and pages and exit, ignored if "--pid" is provided
-P, --pagination show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])
For a given <base_url> and <collection> the tool will read all pages
returned by <base_url>/<collection>/records/p/, or the respective inbox or the curated area.
The tool reads a token from the environment variable DUMPTHINGS_TOKEN if set.
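The page-by-page reading that read-pages performs can be sketched as follows. This is a minimal sketch, not the actual implementation: the `fetch_page` callback and the response keys `"items"` and `"pages"` are assumptions standing in for the real HTTP request against `<base_url>/<collection>/records/p/` and the service's response schema.

```python
import json

def read_all_pages(fetch_page, first_page=1, last_page=None, page_size=100):
    """Yield records from a paginated endpoint, page by page.

    `fetch_page(page, size)` stands in for the HTTP call; it is assumed
    to return a dict with keys "items" (the records on that page) and
    "pages" (the total number of pages).
    """
    page = first_page
    while True:
        result = fetch_page(page, page_size)
        for record in result["items"]:
            yield record
        # stop at the last available page, or at an explicit --last-page
        if page >= result["pages"] or (last_page is not None and page >= last_page):
            break
        page += 1

# Local stand-in for the paginated service endpoint
def fake_fetch(page, size):
    data = [{"pid": f"rec-{i}"} for i in range(5)]
    start = (page - 1) * size
    return {"items": data[start:start + size], "pages": 3}

# Emit the records as JSONL, like read-pages does
for record in read_all_pages(fake_fetch, page_size=2):
    print(json.dumps(record))
```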
post-records
Post records from JSON lines read from STDIN.
usage: post-records [-h] [--curated] base_url collection class
positional arguments:
base_url
collection
class
options:
-h, --help show this help message and exit
--curated bypass inbox, requires curator token
For a given <base_url>, <collection>, and <class> the tool will
read any JSON lines records from STDIN and post them to
<base_url>/<collection>/(curated/)record/<class>.
The tool reads a token from the environment variable DUMPTHINGS_TOKEN.
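The stdin-to-endpoint flow can be sketched as below. This is an illustrative sketch, not the actual implementation: the `post` callback is a stand-in for the real HTTP POST, while the URL shape `<base_url>/<collection>/(curated/)record/<class>` follows the description above.

```python
import json

def post_records(lines, post, base_url, collection, class_name, curated=False):
    """Parse JSON-lines records and hand each one to `post(url, record)`.

    `post` is an assumed stand-in for the real HTTP call.
    """
    area = "curated/" if curated else ""
    url = f"{base_url}/{collection}/{area}record/{class_name}"
    posted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        post(url, record)
        posted.append(record)
    return posted

# Collect posted (url, record) pairs instead of performing HTTP requests
sent = []
lines = ['{"pid": "a"}', "", '{"pid": "b"}']
post_records(lines, lambda url, rec: sent.append((url, rec)),
             "https://example.org/api", "demo", "Person")
print(len(sent))  # prints 2
```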
auto-curate
Move records from inboxes into the curated part of a collection.
usage: auto-curate [-h] [--destination-service-url DEST_SERVICE_URL] [--destination-collection DEST_COLLECTION] [--destination-token DEST_TOKEN] [-e EXCLUDE] [-l] [-r] [-o] [-p PID] SOURCE_SERVICE_URL SOURCE_COLLECTION
Automatically move records from the incoming areas of a
collection to the curated area of the same collection, or to
the curated area of another collection.
The environment variable "DUMPTHINGS_TOKEN" must contain a token
which is used to authenticate the requests. The token must have
curator rights.
positional arguments:
SOURCE_SERVICE_URL
SOURCE_COLLECTION
options:
-h, --help show this help message and exit
--destination-service-url DEST_SERVICE_URL
select a different dump-things-service, i.e. not SOURCE_SERVICE_URL, as destination for auto-curated records
--destination-collection DEST_COLLECTION
select a different collection, i.e. not the SOURCE_COLLECTION of SOURCE_SERVICE_URL, as destination for auto-curated records
--destination-token DEST_TOKEN
if provided, this token will be used for the destination service, otherwise $DUMPTHINGS_TOKEN will be used
-e, --exclude EXCLUDE
exclude an inbox on the source collection (repeatable)
-l, --list-labels list the inbox labels of the given source collection, do not perform any curation
-r, --list-records list records in the inboxes of the given source collection, do not perform any curation
-o, --list-only [DEPRECATED: use "--list-records"] list records in the inboxes of the given source collection, do not perform any curation
-p, --pid PID if provided, process only records that match the given PID
auto-curate requires that the environment variable DUMPTHINGS_TOKEN is set, and contains a valid curator-token.
build-local-triple-store
Get all records from a collection and emit N-Triples (which can be used by qlever)
usage: build-local-triple-store [-h] schema base_url collection
positional arguments:
schema
base_url
collection
options:
-h, --help show this help message and exit
The tool reads a token from the environment variable DUMPTHINGS_TOKEN if set.
Note: the tool requires a schema location because it performs the JSON->TTL conversion locally. That ensures that all records can be read from the server, even if some cannot be converted to TTL. (If TTL format is requested from the server, the server will not return any record of a page if at least one record on that page cannot be converted from JSON to TTL.)
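The skip-on-failure behavior described in the note can be sketched like this. The `fake_to_ttl` converter and the emitted triple are illustrative stand-ins, not the real schema-driven conversion; the point is that converting record by record lets one bad record be skipped instead of losing a whole page.

```python
def convert_records(records, to_ttl):
    """Convert records one by one, keeping the rest when one fails.

    `to_ttl` stands in for the local JSON->TTL conversion.
    """
    converted, failed = [], []
    for record in records:
        try:
            converted.append(to_ttl(record))
        except ValueError:
            failed.append(record)  # remember the record, keep going
    return converted, failed

# Hypothetical converter: records without a "pid" cannot be converted
def fake_to_ttl(record):
    if "pid" not in record:
        raise ValueError("record has no pid")
    return f'<{record["pid"]}> a <urn:example:Thing> .'

ok, bad = convert_records([{"pid": "x"}, {}, {"pid": "y"}], fake_to_ttl)
```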
clean-incoming
Delete all records from a given inbox of a given collection
usage: clean-incoming [-h] [--list-only] base_url collection label
positional arguments:
base_url
collection
label
options:
-h, --help show this help message and exit
--list-only, -l list records in the inbox, don't remove them
clean-incoming requires that the environment variable CURATOR_TOKEN is set, and contains a valid curator-token.
list-incoming
List the labels of all inboxes of a given collection
usage: list-incoming [-h] [-s] base_url collection
positional arguments:
base_url
collection
options:
-h, --help show this help message and exit
-s, --show-records show the records in the inboxes as well
list-incoming requires that the environment variable CURATOR_TOKEN is set, and contains a valid curator-token.
json2ttl
Convert a stream of JSON lines into a stream of TTL lines, i.e., JSON strings that each contain a TTL document, one per line.
usage: json2ttl [-h] schema
Read JSON records from stdin and convert them to TTL
This command reads one record per line from stdin, either in JSON format or as
a JSON string containing a TTL document, converts it to TTL or JSON, and
prints the result to stdout.
positional arguments:
schema URL of the schema that should be used
options:
-h, --help show this help message and exit
This can be used, for example, together with read-pages to convert all
records in a collection to TTL:
> read-pages 'https://pool.v0.edu.datalad.org/api' 'public'|json2ttl 'https://concepts.datalad.org/s/demo-research-assets/unreleased.yaml'
"@prefix ISSN: <http://identifiers.org/issn/> .\n@prefix bibo: <http://purl.org/ontology/bibo/> .\n@prefix dlcommonmx: <https://concepts.datalad.org/s/common-mixin/unreleased/> .\n@prefix dlrelationsmx: <https://concepts.datalad.org/s/relations-mixin/unreleased/> .\n@prefix xyzra: <https://concepts.datalad.org/s/demo-research-assets/unreleased/> .\n\nISSN:2475-9066 a xyzra:XYZPublicationVenue ;\n dlcommonmx:title \"Journal of Open Source Software\" ;\n dlrelationsmx:kind bibo:Journal .\n\n"
...
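Assuming, as the example output above shows, that each output line of json2ttl is a JSON-encoded string holding one complete TTL document, the raw Turtle text can be recovered with `json.loads`. The sample line below is a shortened, illustrative stand-in for a real output line.

```python
import json

# One (shortened) json2ttl output line: a JSON string wrapping TTL text,
# with newlines escaped as \n inside the JSON encoding.
line = '"@prefix bibo: <http://purl.org/ontology/bibo/> .\\n"'

# json.loads turns the JSON string back into the raw Turtle document
ttl_document = json.loads(line)
print(ttl_document)
```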
read-paginated-url
General tool to read from any paginated endpoint of a dump-things-service
usage: read-paginated-url [-h] [-s PAGE_SIZE] [-F FIRST_PAGE] [-l LAST_PAGE] [--stats] [-f FORMAT] [-m MATCHING] [-p] url
Read paginated endpoint
This command lists all records that are available via paginated endpoints from
a dump-things-service, e.g., from:
https://<service-location>/<collection>/records/p/
If the environment variable "DUMPTHINGS_TOKEN" is set, its content will be used
as token to authenticate against the dump-things-service.
positional arguments:
url url of the paginated endpoint of the dump-things-service
options:
-h, --help show this help message and exit
-s, --page-size PAGE_SIZE
set the page size (1 - 100) (default: 100)
-F, --first-page FIRST_PAGE
the first page to return (default: 1)
-l, --last-page LAST_PAGE
the last page to return (default: None, return all pages)
--stats show information about the number of records and pages and exit; the result is returned as [<total number of pages>, <page size>, <total number of items>]
-f, --format FORMAT format of the output records ("json" or "ttl"). (NOTE: not all endpoints support the format parameter.)
-m, --matching MATCHING
return only records that have a matching value (use % as wildcard). (NOTE: not all endpoints and backends support matching.)
-p, --pagination show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])
read-paginated-url reads a token from the environment variable DUMPTHINGS_TOKEN if it is set.
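A consumer of the `-p/--pagination` output can unpack each JSONL line as shown below. The sample line is illustrative, constructed to follow the five-element format stated in the help text above.

```python
import json

# One hypothetical output line in the -p/--pagination format:
# [<record>, <current page>, <total pages>, <page size>, <total items>]
line = '[{"pid": "rec-1"}, 1, 3, 100, 250]'

# json.loads yields a five-element list that unpacks directly
record, page, total_pages, page_size, total_items = json.loads(line)
```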
SPARQL search over a collection with qlever
To provide SPARQL search for a collection, the following steps are necessary:
- Create N-Triple representation of the records of the store
- Build a qlever index
- Start the qlever server
- Use qlever query to send SPARQL queries to the server
The following commands will perform those steps.
Set the environment variable DATADIR to a directory that should contain your search index,
then create and prepare the directory. Set the environment variable REPOROOT to the directory of the local triple-tools git repository.
> export DATADIR=<directory where the search index should live>
> mkdir -p ${DATADIR}
> cp ${REPOROOT}/recipes/Qleverfile ${DATADIR}
Download records and create the N-Triples store. The command lines below build the
search index for the collection https://pool.psychoinformatics.de/api/protected
> export DUMPTHINGS_TOKEN=<token>
> build-local-triple-store 'https://concepts.datalad.org/s/demo-rse-group/unreleased.yaml' 'https://pool.psychoinformatics.de/api' protected >${DATADIR}/knowledgebase.nt
The following command lines require Docker support.
Build the search index and start the server:
> cd ${DATADIR}
> qlever index --overwrite-existing
> qlever start --kill-existing-with-same-port
Perform a query (there are a number of premade queries for the collection https://pool.psychoinformatics.de/api/protected
in ${REPOROOT}/queries/q*.sparql):
> qlever query "$(cat ${REPOROOT}/queries/q0.sparql)"