Use dump_things_pyclient to implement triple-tools #3

Merged
cmo merged 20 commits from use-pyclient into main 2025-12-11 19:40:24 +00:00
12 changed files with 610 additions and 280 deletions

README.md

@ -19,27 +19,73 @@ Perform the following operations, preferably in a Python-virtual environment.
## The commands

This project provides the following CLI commands:

- auto-curate: automatically move records from inboxes to the curated area of a collection
- clean-incoming: delete all records from an inbox of a collection
- list-incoming: list records in inboxes of a collection
- post-records: read records from stdin and post them to inbox or curated area of a collection
- read-pages: read records from collection, curated area of a collection, or specific inboxes
- read-paginated-url: read records from any paginated service endpoints
- build-local-triple-store: read all records from a collection and emit N-Triples

The following sections show the help messages for those commands.

#### read-pages

Read all pages from a paginated endpoint.

```
usage: read-pages [-h] [-c CLASS_NAME] [-f FORMAT] [-p PID] [-i LABEL] [-C] [-m MATCHING] [-s PAGE_SIZE] [-F FIRST_PAGE] [-l LAST_PAGE] [--stats] [-P] service_url collection
Get records from a collection on a dump-things-service
This command lists records that are stored in a dump-things-service. By
default all records that are readable with the given token, or the default
token, will be displayed. The output format is JSONL (JSON lines), where
every line contains a record or a record with paging information. If `ttl`
is chosen as format of the output records, the record content will be a string
that contains a TTL document.

The command supports reading from the curated area only, reading from incoming
areas, or reading records with a given PID.

Pagination information is returned for paginated results when requested with
`-P/--pagination`. All results are paginated except "get a record with a given PID"
and "get the list of incoming zone labels".

If the environment variable "DUMPTHINGS_TOKEN" is set, its content will be used
as token to authenticate against the dump-things-service.
positional arguments:
  service_url
  collection

options:
  -h, --help            show this help message and exit
  -c, --class CLASS_NAME
                        only read records of this class, ignored if "--pid" is provided
  -f, --format FORMAT   format of the output records ("json" or "ttl")
  -p, --pid PID         the pid of the record that should be read
  -i, --incoming LABEL  read from incoming area with the given label in the collection, if LABEL is "-", return the labels
  -C, --curated         read from the curated area of the collection
  -m, --matching MATCHING
                        return only records that have a matching value (use % as wildcard). Ignored if "--pid" is provided. (NOTE: not all endpoints and backends support matching.)
  -s, --page-size PAGE_SIZE
                        set the page size (1 - 100) (default: 100), ignored if "--pid" is provided
  -F, --first-page FIRST_PAGE
                        the first page to return (default: 1), ignored if "--pid" is provided
  -l, --last-page LAST_PAGE
                        the last page to return (default: None (return all pages)), ignored if "--pid" is provided
  --stats               show the number of records and pages and exit, ignored if "--pid" is provided
  -P, --pagination      show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])
```

For a given `<base_url>` and `<collection>` the tool will read all pages
returned by `<base_url>/<collection>/records/p/`, or the respective inbox or the curated area.

The tool reads a token from the environment variable `DUMPTHINGS_TOKEN` if set.
@ -73,10 +119,15 @@ The tool reads a token from the environment variable `DUMPTHINGS_TOKEN`.
Move records from inboxes into the curated part of a collection.

```
usage: auto-curate [-h] [--destination-service-url DEST_SERVICE_URL] [--destination-collection DEST_COLLECTION] [--destination-token DEST_TOKEN] [-e EXCLUDE] [-l] [-r] [-o] [-p PID] SOURCE_SERVICE_URL SOURCE_COLLECTION

Automatically move records from the incoming areas of a
collection to the curated area of the same collection, or to
the curated area of another collection.

The environment variable "DUMPTHINGS_TOKEN" must contain a token
which is used to authenticate the requests. The token must have
curator-rights.

positional arguments:
  SOURCE_SERVICE_URL
@ -84,21 +135,21 @@ positional arguments:
options:
  -h, --help            show this help message and exit
  --destination-service-url DEST_SERVICE_URL
                        select a different dump-thing-service, i.e. not SOURCE_SERVICE_URL, as destination for auto-curated records
  --destination-collection DEST_COLLECTION
                        select a different collection, i.e. not the SOURCE_COLLECTION of SOURCE_SERVICE_URL, as destination for auto-curated records
  --destination-token DEST_TOKEN
                        if provided, this token will be used for the destination service, otherwise $DUMPTHINGS_TOKEN will be used
  -e, --exclude EXCLUDE
                        exclude an inbox on the source collection (repeatable)
  -l, --list-labels     list the inbox labels of the given source collection, do not perform any curation
  -r, --list-records    list records in the inboxes of the given source collection, do not perform any curation
  -o, --list-only       [DEPRECATED: use "--list-records"] list records in the inboxes of the given source collection, do not perform any curation
  -p, --pid PID         if provided, process only records that match the given PIDs
```

`auto-curate` requires that the environment variable `DUMPTHINGS_TOKEN` is set, and contains a valid curator-token.
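The curation loop that `auto-curate` performs can be sketched with in-memory stand-ins for the service calls. The stub data and inline substitutes below are illustrative only; the real tool talks to a dump-things-service via `dump_things_pyclient`.

```python
import re

# Hypothetical in-memory stand-ins for the service: two inboxes and a curated area.
inboxes = {
    'alice': [{'pid': 'p1', 'schema_type': 'dlco:Person'}],
    'bob': [{'pid': 'p2', 'schema_type': 'dlco:Dataset'}],
}
curated = []

def auto_curate(exclude=()):
    # incoming_read_labels -> iterate inbox labels
    for label in list(inboxes):
        if label in exclude:
            continue
        # incoming_read_records -> iterate records of one inbox
        for record in list(inboxes[label]):
            # class name is the trailing part of `schema_type`
            class_name = re.search('([_A-Za-z0-9]*$)', record['schema_type']).group(0)
            curated.append((class_name, record))   # curated_write_record
            inboxes[label].remove(record)          # incoming_delete_record

auto_curate(exclude=['bob'])
print([name for name, _ in curated])  # → ['Person']
```

Records from excluded inboxes stay untouched; every curated record is removed from its inbox only after it was written to the destination.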
#### build-local-triple-store
@ -149,7 +200,7 @@ options:
List the labels of all inboxes of a given collection

```
usage: list-incoming [-h] [-s] base_url collection

positional arguments:
  base_url
@ -157,10 +208,10 @@ positional arguments:
options:
  -h, --help          show this help message and exit
  -s, --show-records  show the records in the inboxes as well
```

`list-incoming` requires that the environment variable `CURATOR_TOKEN` is set, and contains a valid curator-token.
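The output shape can be sketched as follows; the stub `inboxes` dict is hypothetical and stands in for the labels and records the service would return.

```python
import json

# Hypothetical in-memory stand-in for the inboxes of a collection.
inboxes = {'alice': [{'pid': 'p1'}], 'bob': []}

def list_incoming(show_records=False):
    # With -s/--show-records: a mapping of label -> records;
    # without it: just the list of labels.
    result = {label: list(records) for label, records in inboxes.items()}
    return result if show_records else list(result)

print(json.dumps(list_incoming(), indent=2))
print(json.dumps(list_incoming(show_records=True), indent=2))
```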
#### json2ttl
@ -171,8 +222,14 @@ contain TTL-documents with one string per line.
```
usage: json2ttl [-h] schema

Read JSON records from stdin and convert them to TTL

This command reads one record per line from stdin, either in JSON format or as
a JSON string containing a TTL document, converts it to TTL or JSON, and prints
the result to stdout.

positional arguments:
  schema      URL of the schema that should be used

options:
  -h, --help  show this help message and exit
@ -187,6 +244,44 @@ records in a collection to TTL:
...
```
#### read-paginated-url
General tool to read from any paginated endpoint of a dump-things-service
```
usage: read-paginated-url [-h] [-s PAGE_SIZE] [-F FIRST_PAGE] [-l LAST_PAGE] [--stats] [-f FORMAT] [-m MATCHING] [-p] url
Read paginated endpoint
This command lists all records that are available via paginated endpoints from
a dump-things-service, e.g., from:
https://<service-location>/<collection>/records/p/
If the environment variable "DUMPTHINGS_TOKEN" is set, its content will be used
as token to authenticate against the dump-things-service.
positional arguments:
  url                   url of the paginated endpoint of the dump-things-service

options:
  -h, --help            show this help message and exit
  -s, --page-size PAGE_SIZE
                        set the page size (1 - 100) (default: 100)
  -F, --first-page FIRST_PAGE
                        the first page to return (default: 1)
  -l, --last-page LAST_PAGE
                        the last page to return (default: None (return all pages))
  --stats               show information about the number of records and pages and exit; the output is returned as [<total number of pages>, <page size>, <total number of items>]
  -f, --format FORMAT   format of the output records ("json" or "ttl"). (NOTE: not all endpoints support the format parameter.)
  -m, --matching MATCHING
                        return only records that have a matching value (use % as wildcard). (NOTE: not all endpoints and backends support matching.)
  -p, --pagination      show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])
```
`read-paginated-url` reads a token from the environment variable `DUMPTHINGS_TOKEN` if it is set.
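The `-p/--pagination` output shape described above can be illustrated locally; this sketch only demonstrates the wrapping of records and does not contact a service.

```python
import json

def with_pagination(items, page_size):
    # Wrap each record as [<record>, <current page>, <total pages>,
    # <page size>, <total items>], mirroring the -p/--pagination output.
    pages = [items[i:i + page_size] for i in range(0, len(items), page_size)]
    for page_no, page in enumerate(pages, start=1):
        for record in page:
            yield [record, page_no, len(pages), page_size, len(items)]

rows = list(with_pagination([{'pid': 'p1'}, {'pid': 'p2'}, {'pid': 'p3'}], page_size=2))
for row in rows:
    print(json.dumps(row))
```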
## SPARQL search over a collection with qlever

To provide SPARQL search for a collection, the following steps are necessary:
@ -194,7 +289,7 @@ The provide SPARQL search for a collection the following steps are necessary:
1. Create an N-Triple representation of the records of the store
2. Build a qlever index
3. Start the qlever server
4. Use qlever query to send SPARQL queries to the server

----


@ -24,6 +24,7 @@ classifiers = [
    "Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = [
    "dump-things-pyclient",
    "dump-things-service",
    "progress",
    "qlever",
@ -44,6 +45,7 @@ list-incoming = "triple_tools.list_incoming:main"
post-records = "triple_tools.post_records:main"
read-pages = "triple_tools.read_pages:main"
json2ttl = "triple_tools.json2ttl:main"
read-paginated-url = "triple_tools.read_paginated_url:main"

[tool.hatch.build.targets.wheel]
exclude = [


@ -1 +1 @@
__version__ = '0.2.3'


@ -1,33 +1,47 @@
from __future__ import annotations

import argparse
import json
import logging
import os
import re
import sys

from dump_things_pyclient.communicate import (
    HTTPError,
    curated_write_record,
    incoming_delete_record,
    incoming_read_labels,
    incoming_read_records,
)

logger = logging.getLogger('auto_curate')

token_name = 'DUMPTHINGS_TOKEN'

stl_info = False

description = f"""
Automatically move records from the incoming areas of a
collection to the curated area of the same collection, or to
the curated area of another collection.

The environment variable "{token_name}" must contain a token
which is used to authenticate the requests. The token must have
curator-rights.
"""


def _main():
    argument_parser = argparse.ArgumentParser(
        description=description,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    argument_parser.add_argument('service_url', metavar='SOURCE_SERVICE_URL')
    argument_parser.add_argument('collection', metavar='SOURCE_COLLECTION')
    argument_parser.add_argument(
        '--destination-service-url',
        default=None,
        metavar='DEST_SERVICE_URL',
        help='select a different dump-thing-service, i.e. not SOURCE_SERVICE_URL, as destination for auto-curated records',
@ -42,71 +56,144 @@ def main():
        '--destination-token',
        default=None,
        metavar='DEST_TOKEN',
        help=f'if provided, this token will be used for the destination service, otherwise ${token_name} will be used',
    )
    argument_parser.add_argument(
        '-e', '--exclude',
        action='append',
        default=[],
        help='exclude an inbox on the source collection (repeatable)',
    )
    argument_parser.add_argument(
        '-l', '--list-labels',
        action='store_true',
        help='list the inbox labels of the given source collection, do not perform any curation',
    )
    argument_parser.add_argument(
        '-r', '--list-records',
        action='store_true',
        help='list records in the inboxes of the given source collection, do not perform any curation',
    )
    argument_parser.add_argument(
        '-o', '--list-only',
        action='store_true',
        help='[DEPRECATED: use "--list-records"] list records in the inboxes of the given source collection, do not perform any curation',
    )
    argument_parser.add_argument(
        '-p', '--pid',
        action='append',
        help='if provided, process only records that match the given PIDs',
    )
    arguments = argument_parser.parse_args()

    curator_token = os.environ.get(token_name)
    if curator_token is None:
        print(f'ERROR: environment variable "{token_name}" not set', file=sys.stderr, flush=True)
        return 1

    destination_url = arguments.destination_service_url or arguments.service_url
    destination_collection = arguments.destination_collection or arguments.collection
    destination_token = arguments.destination_token or curator_token

    output = None

    # If --list-labels and --list-records are provided, keep only the latter,
    # because it includes listing of labels
    if arguments.list_records:
        if arguments.list_labels:
            print('WARNING: `-l/--list-labels` and `-r/--list-records` defined, ignoring `-l/--list-labels`', file=sys.stderr, flush=True)
            arguments.list_labels = False
        output = {}
    if arguments.list_labels:
        output = []

    for label in incoming_read_labels(
            service_url=arguments.service_url,
            collection=arguments.collection,
            token=curator_token):
        if label in arguments.exclude:
            logger.debug('ignoring excluded incoming label: %s', label)
            continue
        if arguments.list_labels:
            output.append(label)
            continue
        if arguments.list_records:
            output[label] = []
        for record, _, _, _, _ in incoming_read_records(
                service_url=arguments.service_url,
                collection=arguments.collection,
                label=label,
                token=curator_token):
            if arguments.pid:
                if record['pid'] not in arguments.pid:
                    logger.debug(
                        'ignoring record with non-matching pid: %s',
                        record['pid'])
                    continue
            if arguments.list_records or arguments.list_only:
                output[label].append(record)
                continue
            # Get the class name from the `schema_type` attribute. This requires
            # that the schema type is either stored in the record or that the
            # store has a "Schema Type Layer", i.e., the store type is
            # `record_dir+stl`, or `sqlite+stl`.
            try:
                class_name = re.search('([_A-Za-z0-9]*$)', record['schema_type']).group(0)
            except KeyError:
                global stl_info
                if not stl_info:
                    print(
                        f"""Could not find `schema_type` attribute in record with
pid {record['pid']}. Please ensure that `schema_type` is stored in
the records or that the associated incoming area store has a backend
with a "Schema Type Layer", i.e., "record_dir+stl" or
"sqlite+stl".""",
                        file=sys.stderr,
                        flush=True)
                    stl_info = True
                print(
                    f'WARNING: ignoring record with pid {record["pid"]}, `schema_type` attribute is missing.',
                    file=sys.stderr,
                    flush=True)
                continue

            # Store record in destination collection
            curated_write_record(
                service_url=destination_url,
                collection=destination_collection,
                class_name=class_name,
                record=record,
                token=destination_token)

            # Delete record from incoming area
            incoming_delete_record(
                service_url=arguments.service_url,
                collection=arguments.collection,
                label=label,
                pid=record['pid'],
                token=curator_token,
            )

    if output is not None:
        print(json.dumps(output, ensure_ascii=False))
    return 0


def main():
    try:
        return _main()
    except HTTPError as e:
        print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
        return 1


if __name__ == '__main__':
    sys.exit(main())


@ -9,10 +9,13 @@ import sys
from dump_things_service.converter import Format, FormatConverter
from rdflib import Graph

from dump_things_pyclient.communicate import (
    HTTPError,
    get_paginated,
)


def _main():
    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('schema')
    argument_parser.add_argument('base_url')
@ -22,8 +25,7 @@ def main():
    token = os.environ.get('DUMPTHINGS_TOKEN')
    if token is None:
        print('WARNING: environment variable DUMPTHINGS_TOKEN not set', file=sys.stderr, flush=True)

    print(f'Creating converter for schema {arguments.schema} ...', file=sys.stderr, end='', flush=True)
    converter = FormatConverter(
@ -41,7 +43,7 @@ def main():
    )

    g = Graph()
    for json_object in get_paginated(url_base, page_size=100, token=os.environ.get('DUMPTHINGS_TOKEN')):
        object_class = json_object.get('schema_type')
        if object_class is None:
            raise ValueError(f'No schema_type in {json_object}')
@ -51,7 +53,7 @@ def main():
        try:
            ttl = converter.convert(json_object, class_name)
        except ValueError as ve:
            print(f'WARNING: could not convert record {json_object["pid"]}: {ve}', file=sys.stderr, flush=True)
            continue
        g.parse(io.StringIO(ttl), format='n3')
@ -59,5 +61,13 @@ def main():
    return 0


def main():
    try:
        return _main()
    except HTTPError as e:
        print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
        return 1


if __name__ == '__main__':
    sys.exit(main())


@ -4,28 +4,29 @@ import argparse
import os
import sys

from dump_things_pyclient.communicate import (
    HTTPError,
    incoming_delete_record,
    incoming_read_records,
)


def _main():
    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('base_url')
    argument_parser.add_argument('collection')
    argument_parser.add_argument('label')
    argument_parser.add_argument('--list-only', '-l', action='store_true', help="list records in the inbox, don't remove them")
    arguments = argument_parser.parse_args()

    curator_token = os.environ.get('CURATOR_TOKEN')
    if curator_token is None:
        print('ERROR: environment variable CURATOR_TOKEN not set', file=sys.stderr, flush=True)
        return 1

    for record, _, _, _, _ in incoming_read_records(
        service_url=arguments.base_url,
        collection=arguments.collection,
        label=arguments.label,
        token=curator_token,
@ -35,13 +36,24 @@ def main():
            continue

        # Delete record from incoming area
        incoming_delete_record(
            service_url=arguments.base_url,
            collection=arguments.collection,
            label=arguments.label,
            pid=record['pid'],
            token=curator_token,
        )
    return 0


def main():
    try:
        return _main()
    except HTTPError as e:
        print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
        return 1


if __name__ == '__main__':
    sys.exit(main())


@ -1,130 +0,0 @@
from __future__ import annotations

from collections.abc import Iterable
from urllib.parse import quote_plus

import requests
from progress.bar import Bar


def _create_url(
    url_base: str,
    parameters: dict[str, str] | None = None,
    page_number: int | None = None,
):
    parameters = parameters or {}
    parameters.update({'page': str(page_number)})
    all_parameters = [f'{k}={quote_plus(v)}' for k, v in parameters.items()]
    return url_base + '?' + '&'.join(all_parameters)


def _get_page(
    url_base: str,
    token: str | None = None,
    parameters: Iterable[str] | None = None,
    page_number: int | None = None,
):
    return get_from_url(_create_url(url_base, parameters, page_number), token)


def get_all(
    url_base: str,
    token: str | None = None,
    parameters: dict[str, str] | None = None,
    show_progress: bool = False,
):
    # Get the first result and the number of pages
    result = _get_page(url_base, token, parameters, page_number=1)
    total_pages = result['pages']
    if total_pages == 0:
        return
    if show_progress:
        bar = Bar('Pages', max=total_pages, suffix='%(index)d/%(max)d - %(eta_td)s')
        yield from result['items']
        bar.next()
    else:
        yield from result['items']
    # Get remaining results
    for page in range(2, total_pages + 1):
        result = _get_page(url_base, token, parameters, page_number=page)
        yield from result['items']
        if show_progress:
            bar.next()
    if show_progress:
        bar.finish()


def check_result(
    result: requests.Response,
    method: str,
    url: str
):
    if not 200 <= result.status_code < 300:
        msg = f'HTTP {method} {url} failed: {result.status_code}: {result.text}'
        raise RuntimeError(msg)


def get_from_url(
    url: str,
    token: str,
):
    r = requests.get(
        url,
        headers=({
            'x-dumpthings-token': token,
        } if token else {}),
    )
    check_result(r, 'GET', url)
    return r.json()


def post_to_url(
    url: str,
    token: str | None,
    content: list | dict
):
    r = requests.post(
        url,
        headers=({
            'x-dumpthings-token': token,
        } if token else {}),
        json=content,
    )
    check_result(r, 'POST', url)
    return r.json()


def delete_url(
    url: str,
    token: str | None,
):
    r = requests.delete(
        url,
        headers=({
            'x-dumpthings-token': token,
        } if token else {}),
    )
    check_result(r, 'DELETE', url)
    return r.json()


def get_labels(
    url_base: str,
    collection: str,
    token: str | None = None,
):
    yield from get_from_url(f'{url_base}/{collection}/incoming/', token)


def get_records_from_label(
    url_base: str,
    collection,
    label: str,
    token: str | None = None,
    parameters: dict[str, str] | None = None,
):
    label_url = f'{url_base}/{collection}/incoming/{label}/records/p/'
    yield from get_all(label_url, token=token, parameters=parameters)


@ -11,9 +11,21 @@ from dump_things_service.converter import (
)

description = """Read JSON records from stdin and convert them to TTL

This command reads one record per line from stdin, either in JSON format or as
a JSON string containing a TTL document, converts it to TTL or JSON, and prints
the result to stdout.
"""


def main():
    argument_parser = argparse.ArgumentParser(
        description=description,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    argument_parser.add_argument('schema', help='URL of the schema that should be used')
    arguments = argument_parser.parse_args()
@ -26,16 +38,16 @@ def main():
    print(' done', file=sys.stderr, flush=True)

    error = False
    for line in sys.stdin:
        json_object = json.loads(line)
        object_class = json_object.get('schema_type')
        if object_class is None:
            error = True
            print(f'ERROR: No schema_type in {json_object}', file=sys.stderr, flush=True)
            continue
        class_name = re.search('([_A-Za-z0-9]*$)', object_class).group(0)
        try:
            ttl = converter.convert(json_object, class_name)
        except ValueError as ve:


@ -1,45 +1,60 @@
from __future__ import annotations

import argparse
import json
import os
import sys
from collections import defaultdict

from dump_things_pyclient.communicate import (
    HTTPError,
    incoming_read_labels,
    incoming_read_records,
)


def _main():
    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('base_url')
    argument_parser.add_argument('collection')
    argument_parser.add_argument('-s', '--show-records', action='store_true', help='show the records in the inboxes as well')
    arguments = argument_parser.parse_args()

    curator_token = os.environ.get('CURATOR_TOKEN')
    if curator_token is None:
        print('ERROR: environment variable CURATOR_TOKEN not set', file=sys.stderr, flush=True)
        return 1

    result = {}
    for label in incoming_read_labels(
        service_url=arguments.base_url,
        collection=arguments.collection,
        token=curator_token,
    ):
        result[label] = []
        if arguments.show_records:
            for record, _, _, _, _ in incoming_read_records(
                service_url=arguments.base_url,
                collection=arguments.collection,
                label=label,
                token=curator_token,
            ):
                result[label].append(record)

    if arguments.show_records is False:
        result = list(result)
    print(json.dumps(result, indent=2, ensure_ascii=False))
    return 0


def main():
    try:
        return _main()
    except HTTPError as e:
        print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
        return 1


if __name__ == '__main__':
    sys.exit(main())


@@ -5,42 +5,51 @@

```python
import json
import os
import sys

from dump_things_pyclient.communicate import (
    collection_write_record,
    curated_write_record,
)


def main():
    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('base_url')
    argument_parser.add_argument('collection')
    argument_parser.add_argument('cls', metavar='class')
    argument_parser.add_argument('--curated', action='store_true', help='bypass inbox, requires curator token')
    arguments = argument_parser.parse_args()

    token = os.environ.get('DUMPTHINGS_TOKEN')
    if token is None:
        print(
            'WARNING: environment variable DUMPTHINGS_TOKEN not set',
            file=sys.stderr,
            flush=True,
        )

    if arguments.curated:
        write_record = curated_write_record
    else:
        write_record = collection_write_record

    posted = False
    for line in sys.stdin:
        record = json.loads(line)
        try:
            write_record(
                service_url=arguments.base_url,
                collection=arguments.collection,
                class_name=arguments.cls,
                record=record,
                token=token,
            )
        except Exception as e:
            print(f'Error: {e}', file=sys.stderr, flush=True)
        else:
            posted = True
            print('.', end='', flush=True)
    if posted:
        # final newline
        print('')
```
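post-records consumes one JSON record per stdin line, i.e. JSONL input. A small sketch of producing such input (the record fields are illustrative, not a real collection schema):

```python
import json

# Two example records; real records must match the target class's schema.
records = [
    {'pid': 'example:1', 'name': 'first'},
    {'pid': 'example:2', 'name': 'second'},
]

# One compact JSON document per line, exactly what json.loads(line)
# reads back in the stdin loop above.
jsonl = '\n'.join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl)
```

Piping this output into the command posts each line as a separate record.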
@@ -4,41 +4,172 @@

```python
import argparse
import json
import os
import sys
from functools import partial

from dump_things_pyclient.communicate import (
    HTTPError,
    collection_read_records,
    collection_read_records_of_class,
    collection_read_record_with_pid,
    curated_read_records,
    curated_read_records_of_class,
    curated_read_record_with_pid,
    incoming_read_labels,
    incoming_read_records,
    incoming_read_records_of_class,
    incoming_read_record_with_pid,
)

token_name = 'DUMPTHINGS_TOKEN'

description = f"""Get records from a collection on a dump-things-service

This command lists records that are stored in a dump-things-service. By
default, all records that are readable with the given token, or the default
token, are displayed. The output format is JSONL (JSON Lines), where every
line contains a record, or a record with paging information. If `ttl` is
chosen as the output format, the record content is a string that contains a
TTL document.

The command supports reading from the curated area only, reading from
incoming areas, and reading records with a given PID.

Pagination information is returned for paginated results when requested with
`-P/--pagination`. All results are paginated except "get a record with a
given PID" and "get the list of incoming zone labels".

If the environment variable "{token_name}" is set, its content is used as
the token to authenticate against the dump-things-service.
"""


def _main():
    argument_parser = argparse.ArgumentParser(
        description=description,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    argument_parser.add_argument('service_url')
    argument_parser.add_argument('collection')
    argument_parser.add_argument('-c', '--class', dest='class_name', help='only read records of this class, ignored if "--pid" is provided')
    argument_parser.add_argument('-f', '--format', help='format of the output records ("json" or "ttl")')
    argument_parser.add_argument('-p', '--pid', help='the PID of the record that should be read')
    argument_parser.add_argument('-i', '--incoming', metavar='LABEL', help='read from the incoming area with the given label in the collection; if LABEL is "-", return the labels')
    argument_parser.add_argument('-C', '--curated', action='store_true', help='read from the curated area of the collection')
    argument_parser.add_argument('-m', '--matching', help='return only records that have a matching value (use %% as wildcard), ignored if "--pid" is provided (NOTE: not all endpoints and backends support matching)')
    argument_parser.add_argument('-s', '--page-size', type=int, help='set the page size (1 - 100) (default: 100), ignored if "--pid" is provided')
    argument_parser.add_argument('-F', '--first-page', type=int, help='the first page to return (default: 1), ignored if "--pid" is provided')
    argument_parser.add_argument('-l', '--last-page', type=int, default=None, help='the last page to return (default: None, i.e. return all pages), ignored if "--pid" is provided')
    argument_parser.add_argument('--stats', action='store_true', help='show the number of records and pages and exit, ignored if "--pid" is provided')
    argument_parser.add_argument('-P', '--pagination', action='store_true', help='show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])')
    arguments = argument_parser.parse_args()

    token = os.environ.get(token_name)
    if token is None:
        print(f'WARNING: {token_name} not set', file=sys.stderr, flush=True)

    if arguments.incoming and arguments.curated:
        print(
            'ERROR: -i/--incoming and -C/--curated are mutually exclusive',
            file=sys.stderr,
            flush=True)
        return 1

    kwargs = dict(
        service_url=arguments.service_url,
        collection=arguments.collection,
        token=token,
    )
    if arguments.incoming == '-':
        result = incoming_read_labels(**kwargs)
        print('\n'.join(
            map(
                partial(json.dumps, ensure_ascii=False),
                result)))
        return 0
    elif arguments.pid:
        for argument_value, argument_name in (
            (arguments.matching, '-m/--matching'),
            (arguments.page_size, '-s/--page-size'),
            (arguments.first_page, '-F/--first-page'),
            (arguments.last_page, '-l/--last-page'),
            (arguments.stats, '--stats'),
            (arguments.class_name, '-c/--class'),
        ):
            if argument_value:
                print(
                    f'WARNING: {argument_name} ignored because "-p/--pid" is provided',
                    file=sys.stderr,
                    flush=True)
        kwargs['pid'] = arguments.pid
        if arguments.curated:
            result = curated_read_record_with_pid(**kwargs)
        elif arguments.incoming:
            kwargs['label'] = arguments.incoming
            result = incoming_read_record_with_pid(**kwargs)
        else:
            kwargs['format'] = arguments.format
            result = collection_read_record_with_pid(**kwargs)
        print(json.dumps(result, ensure_ascii=False))
        return 0
    elif arguments.class_name:
        kwargs.update(dict(
            class_name=arguments.class_name,
            matching=arguments.matching,
            page=arguments.first_page or 1,
            size=arguments.page_size or 100,
            last_page=arguments.last_page,
        ))
        if arguments.curated:
            result = curated_read_records_of_class(**kwargs)
        elif arguments.incoming:
            kwargs['label'] = arguments.incoming
            result = incoming_read_records_of_class(**kwargs)
        else:
            kwargs['format'] = arguments.format
            result = collection_read_records_of_class(**kwargs)
    else:
        kwargs.update(dict(
            matching=arguments.matching,
            page=arguments.first_page or 1,
            size=arguments.page_size or 100,
            last_page=arguments.last_page,
        ))
        if arguments.curated:
            result = curated_read_records(**kwargs)
        elif arguments.incoming:
            kwargs['label'] = arguments.incoming
            result = incoming_read_records(**kwargs)
        else:
            kwargs['format'] = arguments.format
            result = collection_read_records(**kwargs)

    if arguments.pagination:
        for record in result:
            print(json.dumps(record, ensure_ascii=False))
    else:
        for record in result:
            print(json.dumps(record[0], ensure_ascii=False))
    return 0


def main():
    try:
        return _main()
    except HTTPError as e:
        print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
        return 1


if __name__ == '__main__':
```
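Each paginated read yields `[<record>, <current page number>, <total number of pages>, <page size>, <total number of items>]`; with `-P/--pagination` the whole tuple is printed as one JSONL line, otherwise only the record. A sketch of that emission step over stand-in tuples (the sample data is invented, the tuple shape matches the client's output):

```python
import json

# Stand-in for a paginated result iterator: two records, one per page.
result = [
    ({'pid': 'p1'}, 1, 2, 1, 2),
    ({'pid': 'p2'}, 2, 2, 1, 2),
]

def emit(result, pagination):
    # Mirrors the final output loop: full tuple with -P, bare record without.
    lines = []
    for record in result:
        payload = record if pagination else record[0]
        lines.append(json.dumps(payload, ensure_ascii=False))
    return '\n'.join(lines)

print(emit(result, pagination=False))
```

Downstream tools can therefore consume the default output with any JSONL reader and opt into paging metadata only when they need it.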
@@ -0,0 +1,87 @@

```python
from __future__ import annotations

import argparse
import json
import os
import sys

from dump_things_pyclient.communicate import (
    HTTPError,
    get_paginated,
)

token_name = 'DUMPTHINGS_TOKEN'

description = f"""Read a paginated endpoint

This command lists all records that are available via paginated endpoints of
a dump-things-service, e.g., from:

    https://<service-location>/<collection>/records/p/

If the environment variable "{token_name}" is set, its content is used as
the token to authenticate against the dump-things-service.
"""


def _main():
    argument_parser = argparse.ArgumentParser(
        description=description,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    argument_parser.add_argument('url', help='URL of the paginated endpoint of the dump-things-service')
    argument_parser.add_argument('-s', '--page-size', type=int, default=100, help='set the page size (1 - 100) (default: 100)')
    argument_parser.add_argument('-F', '--first-page', type=int, default=1, help='the first page to return (default: 1)')
    argument_parser.add_argument('-l', '--last-page', type=int, default=None, help='the last page to return (default: None, i.e. return all pages)')
    argument_parser.add_argument('--stats', action='store_true', help='show the number of records and pages and exit, returned as [<total number of pages>, <page size>, <total number of items>]')
    argument_parser.add_argument('-f', '--format', help='format of the output records ("json" or "ttl") (NOTE: not all endpoints support the format parameter)')
    argument_parser.add_argument('-m', '--matching', help='return only records that have a matching value (use %% as wildcard) (NOTE: not all endpoints and backends support matching)')
    argument_parser.add_argument('-p', '--pagination', action='store_true', help='show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])')
    arguments = argument_parser.parse_args()

    token = os.environ.get(token_name)
    if token is None:
        print(f'WARNING: {token_name} not set', file=sys.stderr, flush=True)

    result = get_paginated(
        url=arguments.url,
        token=token,
        first_page=arguments.first_page,
        page_size=arguments.page_size,
        last_page=arguments.last_page,
        parameters={
            'format': arguments.format,
            **({'matching': arguments.matching}
               if arguments.matching is not None
               else {}
               ),
        }
    )
    if arguments.stats:
        record = next(result)
        print(json.dumps(record[2:], ensure_ascii=False))
        return 0
    if arguments.pagination:
        for record in result:
            print(json.dumps(record, ensure_ascii=False))
    else:
        for record in result:
            print(json.dumps(record[0], ensure_ascii=False))
    return 0


def main():
    try:
        return _main()
    except HTTPError as e:
        print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
        return 1


if __name__ == '__main__':
    sys.exit(main())
```
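With `--stats`, the command takes the first yielded tuple and prints its `[2:]` slice, i.e. `[<total number of pages>, <page size>, <total number of items>]`, dropping the record and the current page number. A sketch with a stand-in tuple (the numbers are invented):

```python
import json

# (record, current_page, total_pages, page_size, total_items), the tuple
# shape yielded by the paginated reader.
first = ({'pid': 'p1'}, 1, 3, 100, 250)

# --stats keeps only the collection-wide counters.
stats = first[2:]
print(json.dumps(stats, ensure_ascii=False))
```

This lets a caller size a full download (pages times page size) before fetching any further pages.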