Field name encoding specification for bidirectional conversion of nested JSON to a table representation

Find a file

mih e25086253b Compare to RFC6901		2025-10-18 17:03:06 +00:00
README.md	Compare to RFC6901	2025-10-18 17:03:06 +00:00

README.md

Field name encoding specification for bidirectional conversion of nested JSON to a table representation

Challenge

Spreadsheets are a widely known data representation and using spreadsheet applications is a common skill. Being able to present information in this format (if possible) helps making data more accessible.

Records compliant with the schema toolbox at https://concepts.datalad.org are often nested structures that cannot easily represented as a single table.

Representing a nested structure (using JSON terminology as an example here), requires putting all fields -- traversing items in arrays and objects -- into a single, one-dimensional sequence.

Moreover, if (back-)conversion from a table to the original nested structure is necessary, this original structure needs to be encoded in the table somehow.

Proposed solution

The original nested structure is encoded into the column headers/names of the table.
Without nesting (e.g., for a top-level object), the column name equals the field name (object key) exactly.
With nesting (involving an array or an object), the column name is wrapped into any single, specific (and customizable) character -- the first and last character of the column name are identical, and the identity of this character defines the nesting indicator/delimiter. Examples:
- name (no nesting)
- $....name$ (nesting with $ delimiter)
When a nesting indicator is present, one or more of the following conditions must be met in addition, or the indicator character is considered as a part of a regular column header. For readability, we use $ as the indicator character below, but any character could be chosen.

a. An array position indicator must be present somewhere after the top-level fieldname. This is the nesting indicator character, followed by a sequence of digits, representing an integer value of a zero-based array position. Example:
- $name$41$ (column name for the 41st value in the array at the field name)
b. A nested object key indicator must be present somewhere after the top-level fieldname. This is the nesting indicator character occurring twice in immediate succession, followed by the nested field name. Examples:
- $name$$first$
- $name$12$$first$
- $name$$other$12$
Non-sequentional array position indicators are illegal. For example, a column name $name$1$ will be treated literally, unless there is also a column with the name $name$0$.
The order of columns has no significance.
Rows in the table correspond to individual top-level items of the same type. There is no requirement for nested structures to be of the same type or number across table rows.

Examples

$$: literal
$$$: literal
$$$$: literal
$0: literal, no top-level fieldname
$$subfield: literal, no top-level fieldname
$top$0$$sub$0$0$$bottom$ ->

    {
        "top": [
            {
                "sub": [
                    [
                        {
                            "bottom": <value>
                        }
                    ]
                ]
            }
        ]
    }

Why not RFC6901?

RFC6901 is the specification of a "JSON Pointer". This is a related solution to the problem of determining a single string (a "column header") that encodes an arbitrary position in an arbitrary JSON data structure.

However, the aims are different. RFC6901 does not allow for converting a JSON data structure to a table (using JSON pointers as column headers) and back to the same JSON structure. This isn't possible, because the data type of individual elements is not encoded in the pointer. Pointers are furthermore designed to be parts of URL, hence do not permit unencoded characters that cannot be part of URLs. For a spreadsheet in UTF-8 encoding, this is an unnecessary constraint.