Use dump_things_pyclient to implement triple-tools #3

Merged
cmo merged 20 commits from use-pyclient into main 2025-12-11 19:40:24 +00:00
12 changed files with 610 additions and 280 deletions

README.md

@@ -19,27 +19,73 @@ Perform the following operations, preferably in a Python-virtual environment.
## The commands
This project provides the following CLI commands:
- auto-curate: automatically move records from inboxes to the curated area of a collection
- clean-incoming: delete all records from an inbox of a collection
- list-incoming: list records in inboxes of a collection
- post-records: read records from stdin and post them to inbox or curated area of a collection
- read-pages: read records from collection, curated area of a collection, or specific inboxes
- read-paginated-url: read records from any paginated service endpoints
- build-local-triple-store: read all records from a collection and emit N-Triples
The following sections show the help messages for those commands.
#### read-pages
Read all pages from a paginated endpoint.
```
usage: read_pages [-h] [-s SIZE] [-p PARAMETER] base_url collection
usage: read-pages [-h] [-c CLASS_NAME] [-f FORMAT] [-p PID] [-i LABEL] [-C] [-m MATCHING] [-s PAGE_SIZE] [-F FIRST_PAGE] [-l LAST_PAGE] [--stats] [-P] service_url collection
Get records from a collection on a dump-things-service
This command lists records that are stored in a dump-things-service. By
default, all records that are readable with the given token, or the default
token, will be displayed. The output format is JSONL (JSON lines), where
every line contains a record or a record with paging information. If `ttl`
is chosen as the output format, the record content will be a string
that contains a TTL-document.
The command supports reading from the curated area only, reading from incoming
areas, and reading records with a given PID.
Pagination information is returned for paginated results when requested with
`-P/--pagination`. All results are paginated except "get a record with a given PID"
and "get the list of incoming zone labels".
If the environment variable "DUMPTHINGS_TOKEN" is set, its content will be used
as token to authenticate against the dump-things-service.
positional arguments:
base_url
service_url
collection
options:
-h, --help show this help message and exit
-s, --size SIZE default: 100
-p, --parameter PARAMETER (key=value)
-c, --class limit to a particular class (name)
-c, --class CLASS_NAME
only read records of this class, ignored if "--pid" is provided
-f, --format FORMAT format of the output records ("json" or "ttl")
-p, --pid PID the pid of the record that should be read
-i, --incoming LABEL read from incoming area with the given label in the collection, if LABEL is "-", return the labels
-C, --curated read from the curated area of the collection
-m, --matching MATCHING
return only records that have a matching value (use % as wildcard). Ignored if "--pid" is provided. (NOTE: not all endpoints and backends support matching.)
-s, --page-size PAGE_SIZE
set the page size (1 - 100) (default: 100), ignored if "--pid" is provided
-F, --first-page FIRST_PAGE
the first page to return (default: 1), ignored if "--pid" is provided
-l, --last-page LAST_PAGE
the last page to return (default: None, i.e. return all pages), ignored if "--pid" is provided
--stats show the number of records and pages and exit, ignored if "--pid" is provided
-P, --pagination show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])
```
For a given `<base_url>` and `<collection>` the tool will read all pages
returned by `<base_url>/<collection>/records/p/`.
returned by `<base_url>/<collection>/records/p/`, or the respective inbox or the curated area.
The tool reads a token from the environment variable `DUMPTHINGS_TOKEN` if set.
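With `-P/--pagination`, each output line wraps the record in a five-element JSON array. The following is a minimal sketch of consuming such a stream, assuming the `[<record>, <current page>, <total pages>, <page size>, <total items>]` shape described in the help above (the sample data is hypothetical):

```python
import json

def split_paginated_output(lines):
    """Split `read-pages -P` style JSONL output into records and paging info.

    Assumes each line is a JSON array of the form
    [<record>, <current page>, <total pages>, <page size>, <total items>].
    """
    records = []
    paging = []
    for line in lines:
        record, page, pages, size, total = json.loads(line)
        records.append(record)
        paging.append((page, pages, size, total))
    return records, paging

sample = [
    '[{"pid": "ex:a"}, 1, 2, 100, 150]',
    '[{"pid": "ex:b"}, 2, 2, 100, 150]',
]
records, paging = split_paginated_output(sample)
```

Without `-P`, each line is just the bare record and can be parsed with a plain `json.loads` per line.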
@@ -57,8 +103,8 @@ positional arguments:
class
options:
-h, --help show this help message and exit
--curated bypass inbox, requires curator token
-h, --help show this help message and exit
--curated bypass inbox, requires curator token
```
For a given `<base_url>`, `<collection>`, and `<class>` the tool will
@@ -73,10 +119,15 @@ The tool reads a token from the environment variable `DUMPTHINGS_TOKEN`.
Move records from inboxes into the curated part of a collection.
```
usage: auto_curate [-h] [--destination-base-url DEST_SERVICE_URL] [--destination-collection DEST_COLLECTION] [--destination-token DEST_TOKEN] [--exclude [EXCLUDE ...]] [--list-labels] [--list-only] [-p PID]
SOURCE_SERVICE_URL SOURCE_COLLECTION
usage: auto-curate [-h] [--destination-service-url DEST_SERVICE_URL] [--destination-collection DEST_COLLECTION] [--destination-token DEST_TOKEN] [-e EXCLUDE] [-l] [-r] [-o] [-p PID] SOURCE_SERVICE_URL SOURCE_COLLECTION
Automatically move records from the incoming areas of a collection to the curated area of the same collection, or to the incoming area of another collection.
Automatically move records from the incoming areas of a
collection to the curated area of the same collection, or to
the curated area of another collection.
The environment variable "DUMPTHINGS_TOKEN" must contain a token
which is used to authenticate the requests. The token must have
curator-rights.
positional arguments:
SOURCE_SERVICE_URL
@@ -84,21 +135,21 @@ positional arguments:
options:
-h, --help show this help message and exit
--destination-base-url DEST_SERVICE_URL
--destination-service-url DEST_SERVICE_URL
select a different dump-things-service, i.e. not SOURCE_SERVICE_URL, as destination for auto-curated records
--destination-collection DEST_COLLECTION
select a different collection, i.e. not the SOURCE_COLLECTION of SOURCE_SERVICE_URL, as destination for auto-curated records
--destination-token DEST_TOKEN
if provided, this token will be used for the destination service, otherwise ${CURATOR_TOKEN} will be used
--exclude, -e [EXCLUDE ...]
exclude an inbox on the source collection
--list-labels, -l
--list-only, -o
-p, --pid PID if provided, process only records that match the given PIDs. NOTE: matching does not involve CURIE-resolution!
if provided, this token will be used for the destination service, otherwise $DUMPTHINGS_TOKEN will be used
-e, --exclude EXCLUDE
exclude an inbox on the source collection (repeatable)
-l, --list-labels list the inbox labels of the given source collection, do not perform any curation
-r, --list-records list records in the inboxes of the given source collection, do not perform any curation
-o, --list-only [DEPRECATED: use "--list-records"] list records in the inboxes of the given source collection, do not perform any curation
-p, --pid PID if provided, process only records that match the given PIDs
```
`auto-curate` requires that the environment variable `CURATOR_TOKEN` is set, and contains a valid curator-token.
`auto-curate` requires that the environment variable `DUMPTHINGS_TOKEN` is set, and contains a valid curator-token.
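When moving a record, auto-curate derives the destination class name from the record's `schema_type` attribute by taking the trailing run of word characters, using the same regular expression the tool applies internally. A small sketch of that step (the example type values are hypothetical):

```python
import re

def class_name_from_schema_type(schema_type: str) -> str:
    # The trailing run of [_A-Za-z0-9] characters is taken as the class
    # name, whether the type is a CURIE or a full URL.
    return re.search(r'([_A-Za-z0-9]*$)', schema_type).group(0)
```

For example, a CURIE like `dlco:Person` and a URL ending in `#Person` both yield the class name `Person`.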
#### build-local-triple-store
@@ -149,7 +200,7 @@ options:
List the labels of all inboxes of a given collection
```
usage: list-incoming [-h] [--show-records] base_url collection
usage: list-incoming [-h] [-s] base_url collection
positional arguments:
base_url
@@ -157,10 +208,10 @@ positional arguments:
options:
-h, --help show this help message and exit
--show-records, -s show the records in the inboxes as well
-s, --show-records show the records in the inboxes as well
```
`list-incoming` requires that the environment variable `CURATOR_TOKEN` is set, and contains a valid curator-token
`list-incoming` requires that the environment variable `CURATOR_TOKEN` is set, and contains a valid curator-token.
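The JSON that list-incoming emits is either a plain list of inbox labels or, with `-s/--show-records`, a mapping from label to records. A sketch of that output shape (the inbox contents here are hypothetical):

```python
import json

def render_inboxes(inboxes, show_records=False):
    # `inboxes` maps inbox labels to lists of records.
    # Without --show-records, only the labels are emitted.
    result = inboxes if show_records else list(inboxes)
    return json.dumps(result, indent=2, ensure_ascii=False)

inboxes = {'alice': [{'pid': 'ex:1'}], 'bob': []}
```

Calling `render_inboxes(inboxes)` prints just `["alice", "bob"]`, while `show_records=True` includes each inbox's records.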
#### json2ttl
@@ -171,8 +222,14 @@ contain TTL-documents with one string per line.
```
usage: json2ttl [-h] schema
Read JSON records from stdin and convert them to TTL
This command reads one record per line from stdin, either in JSON format or
as a JSON-string containing a TTL-document, converts it to TTL or JSON, and
prints the result to stdout.
positional arguments:
schema
schema URL of the schema that should be used
options:
-h, --help show this help message and exit
@@ -187,6 +244,44 @@ records in a collection to TTL:
...
```
#### read-paginated-url
General tool to read from any paginated endpoint of a dump-things-service
```
usage: read-paginated-url [-h] [-s PAGE_SIZE] [-F FIRST_PAGE] [-l LAST_PAGE] [--stats] [-f FORMAT] [-m MATCHING] [-p] url
Read paginated endpoint
This command lists all records that are available via paginated endpoints from
a dump-things-service, e.g., from:
https://<service-location>/<collection>/records/p/
If the environment variable "DUMPTHINGS_TOKEN" is set, its content will be used
as token to authenticate against the dump-things-service.
positional arguments:
url url of the paginated endpoint of the dump-things-service
options:
-h, --help show this help message and exit
-s, --page-size PAGE_SIZE
set the page size (1 - 100) (default: 100)
-F, --first-page FIRST_PAGE
the first page to return (default: 1)
-l, --last-page LAST_PAGE
the last page to return (default: None, i.e. return all pages)
--stats show information about the number of records and pages and exit; the format is [<total number of pages>, <page size>, <total number of items>]
-f, --format FORMAT format of the output records ("json" or "ttl"). (NOTE: not all endpoints support the format parameter.)
-m, --matching MATCHING
return only records that have a matching value (use % as wildcard). (NOTE: not all endpoints and backends support matching.)
-p, --pagination show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])
```
`read-paginated-url` reads a token from the environment variable `DUMPTHINGS_TOKEN` if it is set.
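The paging loop behind read-paginated-url can be sketched independently of HTTP. In this sketch, `fetch_page` stands in for the actual request; the `items` and `pages` keys mirror the shape that dump-things-service paginated endpoints return in this project:

```python
def read_all_pages(fetch_page, page_size=100, first_page=1, last_page=None):
    # Yield every item, page by page, until the last requested
    # (or the last available) page has been consumed.
    page = first_page
    while True:
        result = fetch_page(page, page_size)
        yield from result['items']
        if page >= result['pages']:
            break
        if last_page is not None and page >= last_page:
            break
        page += 1

# A fake endpoint with 3 pages of 2 items each:
def fake_fetch(page, size):
    return {'items': [f'item-{page}-{i}' for i in range(2)], 'pages': 3}

items = list(read_all_pages(fake_fetch, last_page=2))
```

Injecting the fetch function this way also makes the loop easy to unit-test without a running service.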
## SPARQL search over a collection with qlever
To provide SPARQL search for a collection, the following steps are necessary:
@@ -194,7 +289,7 @@ To provide SPARQL search for a collection, the following steps are necessary:
1. Create N-Triple representation of the records of the store
2. Build a qlever index
3. Start the qlever server
4. Use alever query to send SPARQL queries to the server
4. Use qlever query to send SPARQL queries to the server
----


@@ -24,6 +24,7 @@ classifiers = [
"Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = [
"dump-things-pyclient",
"dump-things-service",
"progress",
"qlever",
@@ -44,6 +45,7 @@ list-incoming = "triple_tools.list_incoming:main"
post-records = "triple_tools.post_records:main"
read-pages = "triple_tools.read_pages:main"
json2ttl = "triple_tools.json2ttl:main"
read-paginated-url = "triple_tools.read_paginated_url:main"
[tool.hatch.build.targets.wheel]
exclude = [


@@ -1 +1 @@
__version__ = '0.2.2'
__version__ = '0.2.3'


@@ -1,33 +1,47 @@
from __future__ import annotations
import argparse
import json
import logging
import os
import re
import sys
from urllib.parse import quote_plus
from triple_tools.communicate import (
delete_url,
get_labels,
get_records_from_label,
post_to_url,
from dump_things_pyclient.communicate import (
HTTPError,
curated_write_record,
incoming_delete_record,
incoming_read_labels,
incoming_read_records,
)
def main():
logger = logging.getLogger('auto_curate')
token_name = 'DUMPTHINGS_TOKEN'
stl_info = False
description = f"""
Automatically move records from the incoming areas of a
collection to the curated area of the same collection, or to
the curated area of another collection.
The environment variable "{token_name}" must contain a token
which is used to authenticate the requests. The token must have
curator-rights.
"""
def _main():
argument_parser = argparse.ArgumentParser(
prog='auto_curate',
description="""
Automatically move records from the incoming areas of a
collection to the curated area of the same collection, or to
the incoming area of another collection.
"""
description=description,
formatter_class=argparse.RawDescriptionHelpFormatter,
)
argument_parser.add_argument('base_url', metavar='SOURCE_SERVICE_URL')
argument_parser.add_argument('service_url', metavar='SOURCE_SERVICE_URL')
argument_parser.add_argument('collection', metavar='SOURCE_COLLECTION')
argument_parser.add_argument(
'--destination-base-url',
'--destination-service-url',
default=None,
metavar='DEST_SERVICE_URL',
help='select a different dump-things-service, i.e. not SOURCE_SERVICE_URL, as destination for auto-curated records',
@@ -42,71 +56,144 @@ def main():
'--destination-token',
default=None,
metavar='DEST_TOKEN',
help='if provided, this token will be used for the destination service, otherwise ${CURATOR_TOKEN} will be used',
help=f'if provided, this token will be used for the destination service, otherwise ${token_name} will be used',
)
argument_parser.add_argument('--exclude', '-e', nargs='*', default=[], help='exclude an inbox on the source collection')
argument_parser.add_argument('--list-labels', '-l', action='store_true')
argument_parser.add_argument('--list-only', '-o', action='store_true')
argument_parser.add_argument(
'-p', '--pid', action='append',
help='if provided, process only records that match the given PIDs. NOTE: matching does not involve CURIE-resolution!',
'-e', '--exclude',
action='append',
default=[],
help='exclude an inbox on the source collection (repeatable)',
)
argument_parser.add_argument(
'-l', '--list-labels',
action='store_true',
help='list the inbox labels of the given source collection, do not perform any curation',
)
argument_parser.add_argument(
'-r', '--list-records',
action='store_true',
help='list records in the inboxes of the given source collection, do not perform any curation',
)
argument_parser.add_argument(
'-o', '--list-only',
action='store_true',
help='[DEPRECATED: use "--list-records"] list records in the inboxes of the given source collection, do not perform any curation',
)
argument_parser.add_argument(
'-p', '--pid',
action='append',
help='if provided, process only records that match the given PIDs',
)
arguments = argument_parser.parse_args()
curator_token = os.environ.get('CURATOR_TOKEN')
curator_token = os.environ.get(token_name)
if curator_token is None:
print('ERROR: CURATOR_TOKEN not set', file=sys.stderr, flush=True)
print(f'ERROR: environment variable "{token_name}" not set', file=sys.stderr, flush=True)
return 1
destination_url = arguments.destination_base_url or arguments.base_url
destination_url = arguments.destination_service_url or arguments.service_url
destination_collection = arguments.destination_collection or arguments.collection
destination_token = arguments.destination_token or curator_token
for label in get_labels(
url_base=arguments.base_url,
collection=arguments.collection,
token=curator_token
):
output = None
# If --list-labels and --list-records are provided, keep only the latter,
# because it includes listing of labels
# --list-only is the deprecated alias of --list-records
if arguments.list_records or arguments.list_only:
if arguments.list_labels:
print(label)
continue
print('WARNING: `-l/--list-labels` and `-r/--list-records` defined, ignoring `-l/--list-labels`', file=sys.stderr, flush=True)
arguments.list_labels = False
output = {}
if arguments.list_labels:
output = []
for label in incoming_read_labels(
service_url=arguments.service_url,
collection=arguments.collection,
token=curator_token):
if label in arguments.exclude:
logger.debug('ignoring excluded incoming label: %s', label)
continue
for record in get_records_from_label(
url_base=arguments.base_url,
collection=arguments.collection,
label=label,
token=curator_token
):
if arguments.list_labels:
output.append(label)
continue
if arguments.list_records or arguments.list_only:
output[label] = []
for record, _, _, _, _ in incoming_read_records(
service_url=arguments.service_url,
collection=arguments.collection,
label=label,
token=curator_token):
if arguments.pid:
if record['pid'] not in arguments.pid:
logger.debug(
'ignoring record with non-matching pid: %s',
record['pid'])
continue
if arguments.list_only:
print(f'{label}:\t{record}')
if arguments.list_records or arguments.list_only:
output[label].append(record)
continue
class_name = re.search('([_A-Za-z0-9]*$)', record['schema_type']).group(0)
# Store record in collection
post_to_url(
f'{destination_url}/{destination_collection}/curated/record/{class_name}',
token=destination_token,
content=record,
)
# Get the class name from the `schema_type` attribute. This requires
# that the schema type is either stored in the record or that the
# store has a "Schema Type Layer", i.e., the store type is
# `record_dir+stl`, or `sqlite+stl`.
try:
class_name = re.search('([_A-Za-z0-9]*$)', record['schema_type']).group(0)
except KeyError:
global stl_info
if not stl_info:
print(
f"""Could not find `schema_type` attribute in record with
pid {record['pid']}. Please ensure that `schema_type` is stored in
the records or that the associated incoming area store has a backend
with a "Schema Type Layer", i.e., "record_dir+stl" or
"sqlite+stl".""",
file=sys.stderr,
flush=True)
stl_info = True
print(
f'WARNING: ignoring record with pid {record["pid"]}, `schema_type` attribute is missing.',
file=sys.stderr,
flush=True)
continue
# Store record in destination collection
curated_write_record(
service_url=destination_url,
collection=destination_collection,
class_name=class_name,
record=record,
token=destination_token)
# Delete record from incoming area
url = f'{arguments.base_url}/{arguments.collection}/incoming/{label}/record?pid={quote_plus(record["pid"])}'
delete_url(
url=url,
incoming_delete_record(
service_url=arguments.service_url,
collection=arguments.collection,
label=label,
pid=record['pid'],
token=curator_token,
)
if output is not None:
print(json.dumps(output, ensure_ascii=False))
return 0
def main():
try:
return _main()
except HTTPError as e:
print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
return 1
if __name__ == '__main__':
sys.exit(main())


@@ -9,10 +9,13 @@ import sys
from dump_things_service.converter import Format, FormatConverter
from rdflib import Graph
from triple_tools.communicate import get_all
from dump_things_pyclient.communicate import (
HTTPError,
get_paginated,
)
def main():
def _main():
argument_parser = argparse.ArgumentParser()
argument_parser.add_argument('schema')
argument_parser.add_argument('base_url')
@@ -22,8 +25,7 @@ def main():
token = os.environ.get('DUMPTHINGS_TOKEN')
if token is None:
print('WARNING: DUMPTHINGS_TOKEN not set', file=sys.stderr, flush=True)
print('WARNING: environment variable DUMPTHINGS_TOKEN not set', file=sys.stderr, flush=True)
print(f'Creating converter for schema {arguments.schema} ...', file=sys.stderr, end='', flush=True)
converter = FormatConverter(
@@ -41,7 +43,7 @@ def main():
)
g = Graph()
for json_object in get_all(url_base, os.environ.get('DUMPTHINGS_TOKEN'), {'size': '100'}, show_progress=True):
for json_object in get_paginated(url_base, page_size=100, token=os.environ.get('DUMPTHINGS_TOKEN')):
object_class = json_object.get('schema_type')
if object_class is None:
raise ValueError(f'No schema_type in {json_object}')
@@ -51,7 +53,7 @@ def main():
try:
ttl = converter.convert(json_object, class_name)
except ValueError as ve:
print(f'\nWARNING: could not convert record {json_object["pid"]}: {ve}', file=sys.stderr, flush=True)
print(f'WARNING: could not convert record {json_object["pid"]}: {ve}', file=sys.stderr, flush=True)
continue
g.parse(io.StringIO(ttl), format='n3')
@@ -59,5 +61,13 @@ def main():
return 0
def main():
try:
return _main()
except HTTPError as e:
print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
return 1
if __name__ == '__main__':
sys.exit(main())


@@ -4,28 +4,29 @@ import argparse
import os
import sys
from triple_tools.communicate import (
delete_url,
get_records_from_label,
from dump_things_pyclient.communicate import (
HTTPError,
incoming_delete_record,
incoming_read_records,
)
def main():
def _main():
argument_parser = argparse.ArgumentParser()
argument_parser.add_argument('base_url')
argument_parser.add_argument('collection')
argument_parser.add_argument('label')
argument_parser.add_argument('--list-only', '-l', action='store_true')
argument_parser.add_argument('--list-only', '-l', action='store_true', help="list records in the inbox, don't remove them")
arguments = argument_parser.parse_args()
curator_token = os.environ.get('CURATOR_TOKEN')
if curator_token is None:
print('ERROR: CURATOR_TOKEN not set', file=sys.stderr, flush=True)
print('ERROR: environment variable CURATOR_TOKEN not set', file=sys.stderr, flush=True)
return 1
for record in get_records_from_label(
url_base=arguments.base_url,
for record, _, _, _, _ in incoming_read_records(
service_url=arguments.base_url,
collection=arguments.collection,
label=arguments.label,
token=curator_token,
@@ -35,13 +36,24 @@ def main():
continue
# Delete record from incoming area
label_url = f'{arguments.base_url}/{arguments.collection}/incoming/{arguments.label}'
delete_url(
url = f'{label_url}/record?pid={record["pid"]}',
incoming_delete_record(
service_url=arguments.base_url,
collection=arguments.collection,
label=arguments.label,
pid=record['pid'],
token=curator_token,
)
return 0
def main():
try:
return _main()
except HTTPError as e:
print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
return 1
if __name__ == '__main__':
sys.exit(main())


@@ -1,130 +0,0 @@
from __future__ import annotations
from collections.abc import Iterable
from urllib.parse import quote_plus
import requests
from progress.bar import Bar
def _create_url(
url_base: str,
parameters: dict[str, str] | None = None,
page_number: int | None = None,
):
parameters = parameters or {}
parameters.update({'page': str(page_number)})
all_parameters = [f'{k}={quote_plus(v)}' for k, v in parameters.items()]
return url_base + '?' + '&'.join(all_parameters)
def _get_page(
url_base: str,
token: str | None = None,
parameters: Iterable[str] | None = None,
page_number: int | None = None,
):
return get_from_url(_create_url(url_base, parameters, page_number), token)
def get_all(
url_base: str,
token: str | None = None,
parameters: dict[str, str] | None = None,
show_progress: bool = False,
):
# Get the first result and the number of pages
result = _get_page(url_base, token, parameters, page_number=1)
total_pages = result['pages']
if total_pages == 0:
return
if show_progress:
bar = Bar('Pages', max=total_pages, suffix='%(index)d/%(max)d - %(eta_td)s')
yield from result['items']
bar.next()
else:
yield from result['items']
# Get remaining results
for page in range(2, total_pages + 1):
result = _get_page(url_base, token, parameters, page_number=page)
yield from result['items']
if show_progress:
bar.next()
if show_progress:
bar.finish()
def check_result(
result: requests.Response,
method: str,
url: str
):
if not 200 <= result.status_code < 300:
msg = f'HTTP {method} {url} failed: {result.status_code}: {result.text}'
raise RuntimeError(msg)
def get_from_url(
url: str,
token: str,
):
r = requests.get(
url,
headers=({
'x-dumpthings-token': token,
} if token else {}),
)
check_result(r, 'GET', url)
return r.json()
def post_to_url(
url: str,
token: str | None,
content: list | dict
):
r = requests.post(
url,
headers=({
'x-dumpthings-token': token,
} if token else {}),
json=content,
)
check_result(r, 'POST', url)
return r.json()
def delete_url(
url: str,
token: str | None,
):
r = requests.delete(
url,
headers=({
'x-dumpthings-token': token,
} if token else {}),
)
check_result(r, 'DELETE', url)
return r.json()
def get_labels(
url_base: str,
collection: str,
token: str | None = None,
):
yield from get_from_url(f'{url_base}/{collection}/incoming/', token)
def get_records_from_label(
url_base: str,
collection,
label: str,
token: str | None = None,
parameters: dict[str, str] | None = None,
):
label_url = f'{url_base}/{collection}/incoming/{label}/records/p/'
yield from get_all(label_url, token=token, parameters=parameters)


@@ -11,9 +11,21 @@ from dump_things_service.converter import (
)
description = """Read JSON records from stdin and convert them to TTL
This command reads one record per line from stdin, either in JSON format or
as a JSON-string containing a TTL-document, converts it to TTL or JSON, and
prints the result to stdout.
"""
def main():
argument_parser = argparse.ArgumentParser()
argument_parser.add_argument('schema')
argument_parser = argparse.ArgumentParser(
description=description,
formatter_class=argparse.RawDescriptionHelpFormatter,
)
argument_parser.add_argument('schema', help='URL of the schema that should be used')
arguments = argument_parser.parse_args()
@@ -26,16 +38,16 @@ def main():
print(' done', file=sys.stderr, flush=True)
error = False
for line in sys.stdin:
json_object = json.loads(line)
object_class = json_object.get('schema_type')
if object_class is None:
error = True
print(f'ERROR: No schema_type in {json_object}', file=sys.stderr, flush=True)
continue
class_name = re.search('([_A-Za-z0-9]*$)', object_class).group(0)
try:
ttl = converter.convert(json_object, class_name)
except ValueError as ve:


@@ -1,45 +1,60 @@
from __future__ import annotations
import argparse
import json
import os
import sys
from collections import defaultdict
from triple_tools.communicate import (
get_labels,
get_records_from_label,
from dump_things_pyclient.communicate import (
HTTPError,
incoming_read_labels,
incoming_read_records,
)
def main():
def _main():
argument_parser = argparse.ArgumentParser()
argument_parser.add_argument('base_url')
argument_parser.add_argument('collection')
argument_parser.add_argument('--show-records', '-s', action='store_true')
argument_parser.add_argument('-s', '--show-records', action='store_true', help='show the records in the inboxes as well')
arguments = argument_parser.parse_args()
curator_token = os.environ.get('CURATOR_TOKEN')
if curator_token is None:
print('ERROR: CURATOR_TOKEN not set', file=sys.stderr, flush=True)
print('ERROR: environment variable CURATOR_TOKEN not set', file=sys.stderr, flush=True)
return 1
for label in get_labels(
url_base=arguments.base_url,
result = {}
for label in incoming_read_labels(
service_url=arguments.base_url,
collection=arguments.collection,
token=curator_token,
):
print(label)
result[label] = []
if arguments.show_records:
for record in get_records_from_label(
url_base=arguments.base_url,
for record, _, _, _, _ in incoming_read_records(
service_url=arguments.base_url,
collection=arguments.collection,
label=label,
token=curator_token,
):
print('\t', record)
result[label].append(record)
if arguments.show_records is False:
result = list(result)
print(json.dumps(result, indent=2, ensure_ascii=False))
return 0
def main():
try:
return _main()
except HTTPError as e:
print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
return 1
if __name__ == '__main__':
sys.exit(main())


@@ -5,42 +5,51 @@ import json
import os
import sys
from triple_tools.communicate import post_to_url
from dump_things_pyclient.communicate import (
collection_write_record,
curated_write_record,
)
def main():
argument_parser = argparse.ArgumentParser()
argument_parser.add_argument('base_url')
argument_parser.add_argument('collection')
argument_parser.add_argument('cls')
argument_parser.add_argument('--curated', action='store_true')
argument_parser.add_argument('cls', metavar='class')
argument_parser.add_argument('--curated', action='store_true', help='bypass inbox, requires curator token')
arguments = argument_parser.parse_args()
token = os.environ.get('DUMPTHINGS_TOKEN')
if token is None:
print('WARNING: DUMPTHINGS_TOKEN not set', file=sys.stderr, flush=True)
print(
'WARNING: environment variable DUMPTHINGS_TOKEN not set',
file=sys.stderr,
flush=True,
)
url = (
arguments.base_url
+ ('' if arguments.base_url.endswith('/') else '/')
+ arguments.collection
+ '/'
)
if arguments.curated:
url += f'curated/'
url += f'record/{arguments.cls}'
write_record = curated_write_record
else:
write_record = collection_write_record
posted = False
for line in sys.stdin:
rec = json.loads(line)
record = json.loads(line)
try:
post_to_url(url, token, rec)
write_record(
service_url=arguments.base_url,
collection=arguments.collection,
class_name=arguments.cls,
record=record,
token=token,
)
except Exception as e:
print(e)
print(f'Error: {e}', file=sys.stderr, flush=True)
else:
posted = True
print('.', end='', flush=True)
if posted:
# final newline
print('')


@@ -4,41 +4,172 @@ import argparse
import json
import os
import sys
from functools import partial
from triple_tools.communicate import get_all
from dump_things_pyclient.communicate import (
HTTPError,
collection_read_records,
collection_read_records_of_class,
collection_read_record_with_pid,
curated_read_records,
curated_read_records_of_class,
curated_read_record_with_pid,
incoming_read_labels,
incoming_read_records,
incoming_read_records_of_class,
incoming_read_record_with_pid,
)
token_name = 'DUMPTHINGS_TOKEN'
description = f"""Get records from a collection on a dump-things-service
This command lists records that are stored in a dump-things-service. By
default, all records that are readable with the given token, or the default
token, will be displayed. The output format is JSONL (JSON lines), where
every line contains a record or a record with paging information. If `ttl`
is chosen as the output format, the record content will be a string
that contains a TTL-document.
The command supports reading from the curated area only, reading from incoming
areas, and reading records with a given PID.
Pagination information is returned for paginated results when requested with
`-P/--pagination`. All results are paginated except "get a record with a given PID"
and "get the list of incoming zone labels".
If the environment variable "{token_name}" is set, its content will be used
as token to authenticate against the dump-things-service.
"""
def _main():
argument_parser = argparse.ArgumentParser(
description=description,
formatter_class=argparse.RawDescriptionHelpFormatter,
)
argument_parser.add_argument('service_url')
argument_parser.add_argument('collection')
argument_parser.add_argument('-c', '--class', dest='class_name', help='only read records of this class, ignored if "--pid" is provided')
argument_parser.add_argument('-f', '--format', help='format of the output records ("json" or "ttl")')
argument_parser.add_argument('-p', '--pid', help='the pid of the record that should be read')
argument_parser.add_argument('-i', '--incoming', metavar='LABEL', help='read from incoming area with the given label in the collection, if LABEL is "-", return the labels')
argument_parser.add_argument('-C', '--curated', action='store_true', help='read from the curated area of the collection')
# NOTE: a literal `%` must be escaped as `%%` in argparse help strings
argument_parser.add_argument('-m', '--matching', help='return only records that have a matching value (use %% as wildcard). Ignored if "--pid" is provided. (NOTE: not all endpoints and backends support matching.)')
argument_parser.add_argument('-s', '--page-size', type=int, help='set the page size (1 - 100) (default: 100), ignored if "--pid" is provided')
argument_parser.add_argument('-F', '--first-page', type=int, help='the first page to return (default: 1), ignored if "--pid" is provided')
argument_parser.add_argument('-l', '--last-page', type=int, default=None, help='the last page to return (default: None, i.e. return all pages), ignored if "--pid" is provided')
argument_parser.add_argument('--stats', action='store_true', help='show the number of records and pages and exit, ignored if "--pid" is provided')
    argument_parser.add_argument('-P', '--pagination', action='store_true', help='show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])')
arguments = argument_parser.parse_args()
token = os.environ.get(token_name)
if token is None:
print(f'WARNING: {token_name} not set', file=sys.stderr, flush=True)
if arguments.incoming and arguments.curated:
print(
            'ERROR: -i/--incoming and -C/--curated are mutually exclusive',
file=sys.stderr,
flush=True)
return 1
kwargs = dict(
service_url=arguments.service_url,
collection=arguments.collection,
token=token,
)
if arguments.incoming == '-':
result = incoming_read_labels(**kwargs)
print('\n'.join(
map(
partial(json.dumps, ensure_ascii=False),
result)))
return 0
elif arguments.pid:
for argument_value, argument_name in (
(arguments.matching, '-m/--matching'),
            (arguments.page_size, '-s/--page-size'),
            (arguments.first_page, '-F/--first-page'),
            (arguments.last_page, '-l/--last-page'),
(arguments.stats, '--stats'),
(arguments.class_name, '-c/--class'),
):
if argument_value:
print(
f'WARNING: {argument_name} ignored because "-p/--pid" is provided',
file=sys.stderr,
flush=True)
kwargs['pid'] = arguments.pid
if arguments.curated:
result = curated_read_record_with_pid(**kwargs)
elif arguments.incoming:
kwargs['label'] = arguments.incoming
result = incoming_read_record_with_pid(**kwargs)
else:
kwargs['format'] = arguments.format
result = collection_read_record_with_pid(**kwargs)
print(json.dumps(result, ensure_ascii=False))
return 0
elif arguments.class_name:
kwargs.update(dict(
class_name=arguments.class_name,
matching=arguments.matching,
page=arguments.first_page or 1,
size=arguments.page_size or 100,
last_page=arguments.last_page,
))
if arguments.curated:
result = curated_read_records_of_class(**kwargs)
elif arguments.incoming:
kwargs['label'] = arguments.incoming
result = incoming_read_records_of_class(**kwargs)
else:
kwargs['format'] = arguments.format
result = collection_read_records_of_class(**kwargs)
else:
kwargs.update(dict(
matching=arguments.matching,
page=arguments.first_page or 1,
size=arguments.page_size or 100,
last_page=arguments.last_page,
))
if arguments.curated:
result = curated_read_records(**kwargs)
elif arguments.incoming:
kwargs['label'] = arguments.incoming
result = incoming_read_records(**kwargs)
else:
kwargs['format'] = arguments.format
result = collection_read_records(**kwargs)
if arguments.pagination:
for record in result:
print(json.dumps(record, ensure_ascii=False))
else:
for record in result:
print(json.dumps(record[0], ensure_ascii=False))
return 0
def main():
try:
return _main()
except HTTPError as e:
print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
return 1
if __name__ == '__main__':
    sys.exit(main())


@ -0,0 +1,87 @@
from __future__ import annotations
import argparse
import json
import os
import sys
from dump_things_pyclient.communicate import (
HTTPError,
get_paginated,
)
token_name = 'DUMPTHINGS_TOKEN'
description = f"""Read records from a paginated endpoint
This command lists all records that are available via paginated endpoints from
a dump-things-service, e.g., from:
https://<service-location>/<collection>/records/p/
If the environment variable "{token_name}" is set, its content will be used
as token to authenticate against the dump-things-service.
"""
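The endpoint pattern named above can be assembled as sketched below; `records_url` is a hypothetical helper for illustration, not part of dump_things_pyclient:

```python
# Hypothetical helper that assembles the paginated-records endpoint:
# https://<service-location>/<collection>/records/p/[<class>/]
def records_url(service_url, collection, class_name=None):
    url = service_url.rstrip('/') + f'/{collection}/records/p/'
    if class_name:
        # The service also exposes a per-class variant of the endpoint.
        url += f'{class_name}/'
    return url
```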
def _main():
argument_parser = argparse.ArgumentParser(
description=description,
formatter_class=argparse.RawDescriptionHelpFormatter,
)
argument_parser.add_argument('url', help='url of the paginated endpoint of the dump-things-service')
argument_parser.add_argument('-s', '--page-size', type=int, default=100, help='set the page size (1 - 100) (default: 100)')
argument_parser.add_argument('-F', '--first-page', type=int, default=1, help='the first page to return (default: 1)')
    argument_parser.add_argument('-l', '--last-page', type=int, default=None, help='the last page to return (default: None, i.e. return all pages)')
    argument_parser.add_argument('--stats', action='store_true', help='show information about the number of records and pages and exit, the output format is [<total number of pages>, <page size>, <total number of items>]')
argument_parser.add_argument('-f', '--format', help='format of the output records ("json" or "ttl"). (NOTE: not all endpoints support the format parameter.)')
argument_parser.add_argument('-m', '--matching', help='return only records that have a matching value (use %% as wildcard). (NOTE: not all endpoints and backends support matching.)')
    argument_parser.add_argument('-p', '--pagination', action='store_true', help='show pagination information (each record from a paginated endpoint is returned as [<record>, <current page number>, <total number of pages>, <page size>, <total number of items>])')
arguments = argument_parser.parse_args()
token = os.environ.get(token_name)
if token is None:
print(f'WARNING: {token_name} not set', file=sys.stderr, flush=True)
result = get_paginated(
url=arguments.url,
token=token,
first_page=arguments.first_page,
page_size=arguments.page_size,
last_page=arguments.last_page,
        parameters={
            **({'format': arguments.format}
               if arguments.format is not None
               else {}
               ),
            **({'matching': arguments.matching}
               if arguments.matching is not None
               else {}
               ),
        }
)
if arguments.stats:
record = next(result)
print(json.dumps(record[2:], ensure_ascii=False))
return 0
if arguments.pagination:
for record in result:
print(json.dumps(record, ensure_ascii=False))
else:
for record in result:
print(json.dumps(record[0], ensure_ascii=False))
return 0
def main():
try:
return _main()
except HTTPError as e:
print(f'ERROR: {e}: {e.response.text}', file=sys.stderr, flush=True)
return 1
if __name__ == '__main__':
sys.exit(main())
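As a footnote on the `--stats` branch above: `record[2:]` keeps only the tail of the pagination tuple. A sketch with hypothetical sample values:

```python
# A pagination tuple as described in the help texts (sample values):
# [<record>, <current page>, <total pages>, <page size>, <total items>]
record = [{'pid': 'example:a'}, 1, 5, 100, 451]

# --stats drops the record and the current page number and prints the rest:
stats = record[2:]  # [<total pages>, <page size>, <total items>]
```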