1. Introduction

What This Project Does

spreadsheet-handling is a Python toolkit for roundtrip workflows between structured data and human-editable spreadsheets. A configurable pipeline projects data into spreadsheet form for review and editing, then re-imports it deterministically. Semantics are kept explicit: cross-references, coded cell values, and FK helper columns are all configuration-driven, not inferred from sheet layout.

Two Repositories

The library (spreadsheet-handling) and a demo integration (spreadsheet-handling-demo) are separate repositories. The demo is the main user entry point and contains runnable examples and sample data.

This guide documents the framework surface shipped with a specific library release.

Scope of This Guide
  • Pipeline workflow overview and starting points

  • Transformation catalog: step names, parameters, roundtrip guarantees

Demo data, runnable scripts, and integration walkthroughs live in the demo repository.

2. Workflow

The Core Idea

A pipeline transforms data bidirectionally between a normalized relation form (YAML-friendly) and a human-editable spreadsheet. The same configuration drives both directions, so re-import is deterministic.

Practical workflows tend to grow in rings: start with a plain tabular roundtrip (data in → spreadsheet out → back in), then add FK helper columns for display, then introduce cross-table or compact-cell encoding for multi-valued relations. Each ring adds one transform pair from the catalog.

A runnable walkthrough of this spiral is available in the spreadsheet-handling-demo repository.

Pipeline YAML Structure

A pipeline is a YAML list of step entries. Each entry names a registered transform and supplies its parameters inline:

pipeline:
  - step: expand_xref
    matrix: role_permissions
    output: permission_rows
    row_keys: role_id
    drop_empty: true

Available step names, required parameters, and roundtrip guarantees are listed in the Transformation Catalog.

Running a Pipeline

sheets-run is the primary runtime command for executing a documented pipeline. For an ad-hoc pipeline file like the one above, provide the steps file and input/output locations. The pipeline steps transform the loaded frames; the CLI input and output options choose how those frames are read and written:

sheets-run --steps pipeline.yaml \
  --in-kind yaml_dir --in-path data/input \
  --out-kind yaml_dir --out-path build/output

The spreadsheet-handling-demo repository contains the full first-hour walkthrough with checked-in input data, pipeline files, and expected outputs. The user guide focuses on the configuration concepts and transform catalog rather than duplicating that tutorial.

One-Shot Schema Column Maintenance

Use sheets-schema-maintain when you need an explicit, reviewable column schema change over loaded Frames: add a column, drop a column, rename a column, or reorder columns. It is a maintenance command, not a normal pipeline step. Use ordinary roundtrip editing when the change belongs in the spreadsheet payload itself; use schema maintenance when the column contract has changed and you want the tool to check recognized metadata references before writing output.

The command is dry-run by default. In dry-run mode the loaded Frames pass through the orchestrator and then into the output-only discard saver, so no output artifact is written. Add --write to allow a user-selected output write; write mode requires both --out-kind and --out-path.

Every run emits a machine-readable JSON report outside Frames and outside _meta. If --report is omitted, the report is printed to stdout. If --report is supplied, the report is written to that path. Blocking failures stop the write before the saver runs, but the report is still emitted.

Supported operations:

  • add_column

  • drop_column

  • rename_column

  • reorder_columns

Metadata handling is deliberately conservative. Recognized metadata references are updated, pruned, or blocked according to explicit rules. Ambiguous, malformed, or unsupported structured references block the operation. Unknown plugin-owned metadata is not string-searched or heuristically rewritten; it blocks only when a structured frame/column reference is detected. _meta.derived is non-authoritative and does not drive schema decisions.

Known limitations:

  • no public sheets-run YAML step and no public pipeline registry step;

  • no generic migration framework;

  • no workbook-header inference;

  • no dynamic projection schema handling;

  • FK relation drops remain blocked.

Dry-run rename with the report printed to stdout:

sheets-schema-maintain \
  --in-kind json_dir --in-path data/input \
  --op rename_column \
  --frame characters \
  --source-column name \
  --target-column display_name

Write-mode rename from json_dir to json_dir with a report file:

sheets-schema-maintain \
  --in-kind json_dir --in-path data/input \
  --write \
  --out-kind json_dir --out-path build/schema-updated \
  --op rename_column \
  --frame characters \
  --source-column name \
  --target-column display_name \
  --report build/schema-maintenance-report.json

Drop with explicit pruning of supported references:

sheets-schema-maintain \
  --in-kind json_dir --in-path data/input \
  --write \
  --out-kind json_dir --out-path build/schema-updated \
  --op drop_column \
  --frame characters \
  --source-column obsolete_note \
  --prune \
  --report build/drop-obsolete-note-report.json

--prune only removes explicitly supported references. Unsupported structured references still block, including FK relations that touch the dropped column.

3. Transformation Catalog

Reading This Catalog

Step names used in pipeline YAML must match the runtime_name values in pipeline_step_registry.json (under core/registries/), which is the authoritative registry artifact.

Inverse kinds
bijective

Fully reversible. The inverse step reconstructs the original exactly.

conditionally bijective

Reversible when the stated conditions hold (see each entry).

relation-preserving

No data is lost, but column order or frame shape may change.

projection-only

One-way; the inverse step cannot reconstruct the original.

validation-only

Read-only; no frames are changed.

configuration-only

Writes workbook metadata only; no data frames are changed.

Groups
Bijective transforms

Paired expand/contract steps that form the core of the bidirectional workflow.

Supporting steps

FK helpers, validation, configuration, and infrastructure steps listed in a reference table.

Bijective Transforms

These steps come in expand/contract pairs. The expand direction converts a compact or matrix representation into an explicit relation (long form). The contract direction reverses it. Losslessness is conditional — see each entry.

expand_xref / contract_xref

Category: primitive · Inverse: conditionally bijective

Expand a matrix frame (rows × columns of marker cells) into explicit long-form relation rows, or contract them back. Typical use: a role × permission table where each cell marks whether the role has the permission.

See also: concept doc technical_model/ch04_concepts/crosstab_binary_xref.adoc and compact_transforms.adoc.

expand_xref parameters

Parameter Required Description

matrix

yes

Name of the input matrix frame.

output

yes

Name of the output relation frame to produce.

row_keys

yes

Column name (string) or list of column names that form the row identifier.

value_columns

no

Columns to expand. Defaults to all non-key columns.

column_key

no

Field name in the output relation that stores the column-axis address value (default: column_key). This is a field name, not a list of axis values — see column_keys on contract_xref for that.

value

no

Column name in the output holding the cell value (default: value).

drop_empty

no

If true, omit rows where the cell value is empty or blank. Default: false. With base_relation: a blank in-scope cell removes the matching base tuple. Without base_relation: a blank cell either produces a tuple with empty value (false) or is omitted (true).

base_relation

no

Optional name of an existing relation frame for scoped recomposition. When absent, expand_xref converts the matrix to a new relation (standard mode). When present, the matrix cells define the update scope and the matching base-relation tuples are replaced or removed; relation rows whose addresses fall outside the matrix scope are preserved and appended to the output. First-run pipelines with no prior relation omit this parameter.

name

no

Override the step name shown in logs.

Losslessness condition: lossless when drop_empty: false, or when the original matrix had no empty cells in the expanded columns.

- step: expand_xref
  matrix: role_permissions
  output: permission_rows
  row_keys: role_id
  drop_empty: true

contract_xref parameters

Parameter Required Description

relation

yes

Name of the input long-form relation frame.

output

yes

Name of the output matrix frame to produce.

row_keys

yes

Column name(s) forming the row identifier (must match expand_xref config).

column_key

no

Field name in the relation that holds the column-axis address values. Defaults to the value stored in meta by the preceding expand_xref (or the literal column_key). This is a field name, not a list of axis values — see column_keys below.

value

no

Column in the relation holding cell values. Defaults to meta-persisted value.

column_keys

no

Explicit ordered list of column-axis values that become the visible matrix columns. (column_key names the field; column_keys lists the values in that field.) When supplied together with dense_axes.columns_from, both must resolve to the same ordered list — a mismatch fails fast.

fill_value

no

Value to place in cells with no matching relation row. Default: empty string.

dense_axes

no

Map with optional rows_from and/or columns_from keys. Each specifies an entity frame and key field whose unique values drive the corresponding axis. When columns_from is supplied, the resolved value list must match any explicit column_keys exactly (consistency contract, not precedence).

name

no

Override the step name shown in logs.

Losslessness condition: lossless when relation rows are unique per (row_keys, column_key) combination.

- step: contract_xref
  relation: permission_rows
  output: role_permissions
  row_keys: role_id
column_key vs column_keys

column_key and column_keys look similar but carry different information:

Parameter Meaning

column_key

Field name — the name of the field in the long/relation frame that holds the column-axis address (e.g. character_name, context_id, product_id).

column_keys

Value list — an explicit, ordered list of concrete axis values that become the visible matrix columns (e.g. [Galli, Donzo, Rexi]).

Example: character_name is the field name; Galli, Donzo, Rexi are the values.

- step: contract_xref
  relation: story_cast_role_named
  output: story_cast_role_matrix
  row_keys: [story_id]
  column_key: character_name      # field name in the relation
  column_keys: [Galli, Donzo, Rexi]   # concrete axis values
  value: value

column_key does not have to be a technical primary key. A configured unique alternate key (such as character_name) is valid when it is declared unique.

Column-axis key sources (contract_xref only):

  1. column_keys only → list used directly.

  2. dense_axes.columns_from only → list resolved from the entity frame at runtime.

  3. Both supplied → must match exactly; a mismatch fails fast (consistency contract, not precedence).

  4. Neither → falls back to column_keys stored in meta from a prior run, then to relation column order.

Sparse-default cells and expand_xref

expand_xref does not interpret sparse-default metadata. If blank cells in the matrix represent a compressed default (introduced by sparse_collapse), run sparse_expand before expand_xref to restore explicit default values. sparse_collapse / sparse_expand are separate transformations with their own registered steps.


decode_cell_values / encode_cell_values

Category: primitive · Inverse: conditionally bijective

Decode a compact cell string (e.g. "A,B,C") into one row per code, or encode code rows back into a compact cell string. Used inside the compact multiaxis chain or independently for coded single-column cells.

See also: concept doc technical_model/ch04_concepts/compact_transforms.adoc.

decode_cell_values parameters

Parameter Required Description

source

yes

Name of the input frame containing encoded cell strings.

output

yes

Name of the output frame with one row per decoded token.

value

no

Column in source holding the encoded string. Default: value.

code

no

Column name in output for the decoded token. Default: code.

passthrough_columns

no

Columns from source to copy unchanged into every output row.

drop_empty

no

If true, omit rows for empty tokens. Default: false.

mode

no

Decoding mode. delimited (default) splits on delimiter; positional treats each character as a code.

delimiter

no

Token separator for delimited mode. Default: ,.

allowed_codes

no

Explicit list of valid code strings. Validation fails on unknown codes.

allowed_tokens

no

Alias for allowed_codes.

allowed_from_legend

no

Name of a legend frame from which to derive the allowed-code set.

normalize_case

no

upper, lower, or omit to leave case unchanged.

strip

no

If true, strip whitespace from each token. Default: false.

name

no

Override the step name shown in logs.

Losslessness condition: lossless when drop_empty: false and token order is preserved (or canonical_order is set on the encode side).

- step: decode_cell_values
  source: assignments
  output: assignment_codes
  value: codes
  code: code
  delimiter: ","
  passthrough_columns: [employee_id, project_id]

encode_cell_values parameters

Parameter Required Description

source

yes

Name of the input frame with one row per code.

output

yes

Name of the output frame with encoded cell strings.

group_by

yes

Column name(s) to group rows before joining into a cell string.

code

no

Column containing the token to encode. Default: code.

value

no

Column name in output for the encoded string. Default: value.

mode

no

Encoding mode. Matches decode_cell_values mode.

delimiter

no

Separator. Default: ,.

canonical_order

no

Explicit ordered list of codes; determines output token order.

allowed_codes / allowed_tokens

no

Validation whitelist.

allowed_from_legend

no

Legend frame name for allowed-code derivation.

normalize_case

no

upper, lower, or omit.

strip

no

Strip whitespace before encoding.

name

no

Override the step name shown in logs.

- step: encode_cell_values
  source: assignment_codes
  output: assignments
  group_by: [employee_id, project_id]
  code: code
  value: codes
  delimiter: ","

expand_compact_multiaxis / contract_compact_multiaxis

Category: composite · Inverse: conditionally bijective

Compose expand_xref and decode_cell_values (or their inverses) to handle compact multi-axis matrices where each cell encodes multiple coded values. Parameters are a superset of the two constituent steps.

Full parameter documentation planned.


split_by_discriminator / merge_by_discriminator

Category: primitive · Inverse: conditionally bijective

Split one frame into multiple named frames partitioned by a discriminator column value, recording frame-name → discriminator mapping in meta. merge_by_discriminator reconstructs the original frame from the split parts.

Full parameter documentation planned.


sparse_collapse / sparse_expand

Category: primitive · Inverse: conditionally bijective

Replace configured default values in editable crosstable cells with a sparse blank placeholder (sparse_collapse), or restore blanks to the configured default (sparse_expand). Used to make sparse matrices more readable in spreadsheet form.

Full parameter documentation planned.


flatten_headers / unflatten_headers

Category: primitive · Inverse: conditionally bijective

Flatten MultiIndex or tuple-like DataFrame column headers into scalar string labels (flatten_headers), or split scalar string headers back into MultiIndex form (unflatten_headers).

Full parameter documentation planned.

Supporting Steps

The following steps handle FK helpers, validation, configuration, and infrastructure concerns. Full parameter documentation is planned; this table gives the step name, category, inverse kind, and purpose for quick reference.

FK Helpers and Lookup

FK-helper primitives are policy consumers: they read the v2 relation policy under meta.helper_policies.fk (and derived helper provenance under _meta.derived.sheets.*.helper_columns). Either configure_fk_helpers (explicit / manual policy) or infer_fk_relations (heuristic policy) must run before add_fk_helpers, reorder_fk_helpers, validate_fk_helpers, and remove_fk_helpers. Missing policy is a clear error that names the producer step; primitives no longer infer relations from column names. The cell-level id(<target>) convention is the wire format on roundtrip and the default heuristic of infer_fk_relations; it is no longer interpreted by primitives.

Step Category Inverse Purpose

infer_fk_relations

configuration

configuration-only

Heuristic producer: scan data frames and write v2 FK relation policy under meta.helper_policies.fk with produced_by.mode: naming_convention (currently the only mode). Use it when FK columns follow the id(<target>) convention.

configure_fk_helpers

configuration

configuration-only

Explicit / manual producer: take per-target policy from the YAML and write the v2 relation list under _meta.helper_policies.fk with produced_by.mode: explicit. Use it when you want to spell relations, helper fields, and target_key out by hand.

add_fk_helpers

primitive

projection-only

Materialize FK helper columns from the v2 relation policy. Requires that configure_fk_helpers or infer_fk_relations ran earlier. Writes derived helper provenance for later cleanup and validation.

reorder_fk_helpers

primitive

relation-preserving

Move FK helper columns next to their FK column using derived helper provenance and the v2 relation policy. Does not consult FK_PATTERN.

validate_fk_helpers

primitive

validation-only

Validate FK-helper consistency against the v2 relation policy and derived helper provenance (no convention inference).

remove_fk_helpers

primitive

projection-only

Remove materialized FK helper columns using the derived provenance written by add_fk_helpers. Frames without provenance or explicit policy are an error. Recovery from externally-edited workbooks without provenance is a separate opt-in concern and is not part of this step.

add_lookup_helpers

primitive

projection-only

Join a source frame to a lookup frame and materialize helper columns.

configure_lookup_helpers

configuration

configuration-only

Resolve lookup-helper policy and persist it as canonical metadata.

Validation
Step Category Inverse Purpose

add_validations

configuration

validation-only

Write validation constraints into workbook metadata for later rendering.

validate_references

primitive

validation-only

Validate PK, uniqueness, FK, reference-tuple integrity, and no_helper_columns assertion rules against frame data.

validate_graph

primitive

validation-only

Validate node/edge networks for endpoint existence and duplicate edge identities.

validate

primitive

validation-only

Validate duplicate IDs and unresolved FK references with severity policy. (transitional status — prefer validate_references)

add_validations supports explicit column targets and role-based targets. For dynamic workbook views, prefer an explicit frame: when using roles:. The sheet: field names the visible workbook sheet where the validation is rendered. The frame: field names the frame used for column-role resolution. In projected workbook views, this is usually the semantic source frame that carries row-key, xref, helper-column, or other role metadata; it is not necessarily the final display/helper frame used to render the visible sheet. frame: may be omitted when the visible sheet and role-resolution frame are unambiguous, but keeping it explicit is recommended for consumer pipelines that use projected workbook views.

- step: add_validations
  rules:
    - target:
        sheet: story_groups
        frame: story_group_matrix
        roles: [matrix_value]
      rule:
        type: from_legend
        legend: story_group_codes
        include_empty: true

roles: accepts the foundation roles row_identity, display_helper, and matrix_value. It is mutually exclusive with explicit column: or columns: targets. Use roles: [matrix_value] for dynamic matrix value columns that previously required consumer-side fan-out plugins such as add_from_legend_constraints_for_matrix.

Configuration
Step Category Inverse Purpose

configure_workbook_view

configuration

configuration-only

Declare which frames render as workbook sheets (names, order, lifecycle, options).

apply_workbook_view_sheet_mappings

primitive

relation-preserving

Re-key visible sheets to logical workbook-view frames on readback using persisted mappings.

apply_derived_column_policy

primitive

projection-only

Drop helper/derived columns using registered _meta.derived provenance, falling back to durable workbook-view helper declarations (_meta.sheets.<frame>.helper_columns) for drop identity after workbook reimport. Pair with validate_references / no_helper_columns when the pipeline should assert that cleanup happened before writing canonical data.

Infrastructure and Output
Step Category Inverse Purpose

bootstrap_meta

infrastructure

Merge profile defaults, persisted metadata, and runtime overrides into payload metadata.

apply_overrides

infrastructure

Load or apply workbook and sheet override metadata.

extract_frame

primitive

extraction

Derive a named frame by selecting columns, filtering rows, renaming, adding constants.

pivot_frame

primitive

projection-only

Derive a display frame by pivoting long-form rows into configured index and value columns.

join_frames

primitive

projection-only

Derive a view frame by joining two frames with configured keys, mode, and projections.

normalize_resource_overrides

primitive

projection-only

Normalize long-form resource override tuples according to a default-context and empty-value policy.

write_structured_yaml

infrastructure

extraction

Write nested YAML files from tabular frames and record a generated-file report frame.

write_key_value_resources

infrastructure

projection-only

Write key-value (.properties-style) files partitioned by template column values.

write_artifact_manifest

infrastructure

projection-only

Merge writer report frames into a deterministic manifest; optionally compute checksums.

identity

infrastructure

bijective

Return the pipeline payload unchanged. Useful as a no-op placeholder.

plugin

plugin

Run a caller-provided dotted callable as an explicit pipeline extension point.