Project spreadsheet-handling

1. Introduction

What This Project Does

spreadsheet-handling is a Python toolkit for roundtrip workflows between structured data and human-editable spreadsheets. A configurable pipeline projects data into spreadsheet form for review and editing, then re-imports it deterministically. Semantics are kept explicit: cross-references, coded cell values, and FK helper columns are all configuration-driven, not inferred from sheet layout.

Two Repositories

The library (spreadsheet-handling) and a demo integration (spreadsheet-handling-demo) are separate repositories. The demo is the main user entry point and contains runnable examples and sample data.

This guide documents the framework surface shipped with a specific library release.

Scope of This Guide

Pipeline workflow overview and starting points
Transformation catalog: step names, parameters, roundtrip guarantees

Demo data, runnable scripts, and integration walkthroughs live in the demo repository.

2. Workflow

The Core Idea

A pipeline transforms data bidirectionally between a normalized relation form (YAML-friendly) and a human-editable spreadsheet. The same configuration drives both directions, so re-import is deterministic.

Practical workflows tend to grow in rings: start with a plain tabular roundtrip (data in → spreadsheet out → back in), then add FK helper columns for display, then introduce cross-table or compact-cell encoding for multi-valued relations. Each ring adds one transform pair from the catalog.

A runnable walkthrough of this spiral is available in the spreadsheet-handling-demo repository.

Pipeline YAML Structure

A pipeline is a YAML list of step entries. Each entry names a registered transform and supplies its parameters inline:

pipeline:
  - step: expand_xref
    matrix: role_permissions
    output: permission_rows
    row_keys: role_id
    drop_empty: true

Available step names, required parameters, and roundtrip guarantees are listed in the Transformation Catalog.

Running a Pipeline

sheets-run is the primary runtime command for executing a documented pipeline. For an ad-hoc pipeline file like the one above, provide the steps file and input/output locations. The pipeline steps transform the loaded frames; the CLI input and output options choose how those frames are read and written:

sheets-run --steps pipeline.yaml \
  --in-kind yaml_dir --in-path data/input \
  --out-kind yaml_dir --out-path build/output

The spreadsheet-handling-demo repository contains the full first-hour walkthrough with checked-in input data, pipeline files, and expected outputs. The user guide focuses on the configuration concepts and transform catalog rather than duplicating that tutorial.

3. Transformation Catalog

Reading This Catalog

Step names used in pipeline YAML must match the runtime_name values in pipeline_step_registry.json (under core/registries/), which is the authoritative registry artifact.

Inverse kinds

bijective	Fully reversible. The inverse step reconstructs the original exactly.
conditionally bijective	Reversible when the stated conditions hold (see each entry).
relation-preserving	No data is lost, but column order or frame shape may change.
projection-only	One-way; the inverse step cannot reconstruct the original.
validation-only	Read-only; no frames are changed.
configuration-only	Writes workbook metadata only; no data frames are changed.

Groups

Bijective transforms: Paired expand/contract steps that form the core of the bidirectional workflow.
Supporting steps: FK helpers, validation, configuration, and infrastructure steps listed in a reference table.

Bijective Transforms

These steps come in expand/contract pairs. The expand direction converts a compact or matrix representation into an explicit relation (long form). The contract direction reverses it. Losslessness is conditional — see each entry.

`expand_xref` / `contract_xref`

Category: primitive · Inverse: conditionally bijective

Expand a matrix frame (rows × columns of marker cells) into explicit long-form relation rows, or contract them back. Typical use: a role × permission table where each cell marks whether the role has the permission.

See also: concept doc technical_model/ch04_concepts/crosstab_binary_xref.adoc and compact_transforms.adoc.

expand_xref parameters

Parameter Required Description

Parameter	Required	Description
`matrix`	yes	Name of the input matrix frame.
`output`	yes	Name of the output relation frame to produce.
`row_keys`	yes	Column name (string) or list of column names that form the row identifier.
`value_columns`	no	Columns to expand. Defaults to all non-key columns.
`column_key`	no	Column name in the output holding the original column label (default: `column_key`).
`value`	no	Column name in the output holding the cell value (default: `value`).
`drop_empty`	no	If `true`, omit rows where the cell value is empty or falsy. Default: `false`.
`name`	no	Override the step name shown in logs.

matrix

yes

Name of the input matrix frame.

output

yes

Name of the output relation frame to produce.

row_keys

yes

Column name (string) or list of column names that form the row identifier.

value_columns

Columns to expand. Defaults to all non-key columns.

column_key

Column name in the output holding the original column label (default: column_key).

value

Column name in the output holding the cell value (default: value).

drop_empty

If true, omit rows where the cell value is empty or falsy. Default: false.

name

Override the step name shown in logs.

Losslessness condition: lossless when drop_empty: false, or when the original matrix had no empty cells in the expanded columns.

- step: expand_xref
  matrix: role_permissions
  output: permission_rows
  row_keys: role_id
  drop_empty: true

contract_xref parameters

Parameter Required Description

Parameter	Required	Description
`relation`	yes	Name of the input long-form relation frame.
`output`	yes	Name of the output matrix frame to produce.
`row_keys`	yes	Column name(s) forming the row identifier (must match `expand_xref` config).
`column_key`	no	Column in the relation holding the column labels. Defaults to value persisted in meta by the preceding `expand_xref`.
`value`	no	Column in the relation holding cell values. Defaults to meta-persisted value.
`column_keys`	no	Explicit ordered list of output columns to generate (pins column order).
`fill_value`	no	Value to place in cells with no matching relation row. Default: empty string.
`name`	no	Override the step name shown in logs.

relation

yes

Name of the input long-form relation frame.

output

yes

Name of the output matrix frame to produce.

row_keys

yes

Column name(s) forming the row identifier (must match expand_xref config).

column_key

Column in the relation holding the column labels. Defaults to value persisted in meta by the preceding expand_xref.

value

Column in the relation holding cell values. Defaults to meta-persisted value.

column_keys

Explicit ordered list of output columns to generate (pins column order).

fill_value

Value to place in cells with no matching relation row. Default: empty string.

name

Override the step name shown in logs.

Losslessness condition: lossless when relation rows are unique per (row_keys, column_key) combination.

- step: contract_xref
  relation: permission_rows
  output: role_permissions
  row_keys: role_id

`decode_cell_values` / `encode_cell_values`

Category: primitive · Inverse: conditionally bijective

Decode a compact cell string (e.g. "A,B,C") into one row per code, or encode code rows back into a compact cell string. Used inside the compact multiaxis chain or independently for coded single-column cells.

See also: concept doc technical_model/ch04_concepts/compact_transforms.adoc.

decode_cell_values parameters

Parameter Required Description

Parameter	Required	Description
`source`	yes	Name of the input frame containing encoded cell strings.
`output`	yes	Name of the output frame with one row per decoded token.
`value`	no	Column in `source` holding the encoded string. Default: `value`.
`code`	no	Column name in `output` for the decoded token. Default: `code`.
`passthrough_columns`	no	Columns from `source` to copy unchanged into every output row.
`drop_empty`	no	If `true`, omit rows for empty tokens. Default: `false`.
`mode`	no	Decoding mode. `delimited` (default) splits on `delimiter`; `positional` treats each character as a code.
`delimiter`	no	Token separator for `delimited` mode. Default: `,`.
`allowed_codes`	no	Explicit list of valid code strings. Validation fails on unknown codes.
`allowed_tokens`	no	Alias for `allowed_codes`.
`allowed_from_legend`	no	Name of a legend frame from which to derive the allowed-code set.
`normalize_case`	no	`upper`, `lower`, or omit to leave case unchanged.
`strip`	no	If `true`, strip whitespace from each token. Default: `false`.
`name`	no	Override the step name shown in logs.

source

yes

Name of the input frame containing encoded cell strings.

output

yes

Name of the output frame with one row per decoded token.

value

Column in source holding the encoded string. Default: value.

code

Column name in output for the decoded token. Default: code.

passthrough_columns

Columns from source to copy unchanged into every output row.

drop_empty

If true, omit rows for empty tokens. Default: false.

mode

Decoding mode. delimited (default) splits on delimiter; positional treats each character as a code.

delimiter

Token separator for delimited mode. Default: ,.

allowed_codes

Explicit list of valid code strings. Validation fails on unknown codes.

allowed_tokens

Alias for allowed_codes.

allowed_from_legend

Name of a legend frame from which to derive the allowed-code set.

normalize_case

upper, lower, or omit to leave case unchanged.

strip

If true, strip whitespace from each token. Default: false.

name

Override the step name shown in logs.

Losslessness condition: lossless when drop_empty: false and token order is preserved (or canonical_order is set on the encode side).

- step: decode_cell_values
  source: assignments
  output: assignment_codes
  value: codes
  code: code
  delimiter: ","
  passthrough_columns: [employee_id, project_id]

encode_cell_values parameters

Parameter Required Description

Parameter	Required	Description
`source`	yes	Name of the input frame with one row per code.
`output`	yes	Name of the output frame with encoded cell strings.
`group_by`	yes	Column name(s) to group rows before joining into a cell string.
`code`	no	Column containing the token to encode. Default: `code`.
`value`	no	Column name in `output` for the encoded string. Default: `value`.
`mode`	no	Encoding mode. Matches `decode_cell_values` mode.
`delimiter`	no	Separator. Default: `,`.
`canonical_order`	no	Explicit ordered list of codes; determines output token order.
`allowed_codes` / `allowed_tokens`	no	Validation whitelist.
`allowed_from_legend`	no	Legend frame name for allowed-code derivation.
`normalize_case`	no	`upper`, `lower`, or omit.
`strip`	no	Strip whitespace before encoding.
`name`	no	Override the step name shown in logs.

source

yes

Name of the input frame with one row per code.

output

yes

Name of the output frame with encoded cell strings.

group_by

yes

Column name(s) to group rows before joining into a cell string.

code

Column containing the token to encode. Default: code.

value

Column name in output for the encoded string. Default: value.

mode

Encoding mode. Matches decode_cell_values mode.

delimiter

Separator. Default: ,.

canonical_order

Explicit ordered list of codes; determines output token order.

allowed_codes / allowed_tokens

Validation whitelist.

allowed_from_legend

Legend frame name for allowed-code derivation.

normalize_case

upper, lower, or omit.

strip

Strip whitespace before encoding.

name

Override the step name shown in logs.

- step: encode_cell_values
  source: assignment_codes
  output: assignments
  group_by: [employee_id, project_id]
  code: code
  value: codes
  delimiter: ","

`expand_compact_multiaxis` / `contract_compact_multiaxis`

Category: composite · Inverse: conditionally bijective

Compose expand_xref and decode_cell_values (or their inverses) to handle compact multi-axis matrices where each cell encodes multiple coded values. Parameters are a superset of the two constituent steps.

Full parameter documentation planned.

`split_by_discriminator` / `merge_by_discriminator`

Category: primitive · Inverse: conditionally bijective

Split one frame into multiple named frames partitioned by a discriminator column value, recording frame-name → discriminator mapping in meta. merge_by_discriminator reconstructs the original frame from the split parts.

Full parameter documentation planned.

`sparse_collapse` / `sparse_expand`

Category: primitive · Inverse: conditionally bijective

Replace configured default values in editable crosstable cells with a sparse blank placeholder (sparse_collapse), or restore blanks to the configured default (sparse_expand). Used to make sparse matrices more readable in spreadsheet form.

Full parameter documentation planned.

`flatten_headers` / `unflatten_headers`

Category: primitive · Inverse: conditionally bijective

Flatten MultiIndex or tuple-like DataFrame column headers into scalar string labels (flatten_headers), or split scalar string headers back into MultiIndex form (unflatten_headers).

Full parameter documentation planned.

Supporting Steps

The following steps handle FK helpers, validation, configuration, and infrastructure concerns. Full parameter documentation is planned; this table gives the step name, category, inverse kind, and purpose for quick reference.

FK Helpers and Lookup

FK-helper primitives are policy consumers: they read the v2 relation policy under meta.helper_policies.fk (and derived helper provenance under _meta.derived.sheets.*.helper_columns). Either configure_fk_helpers (explicit / manual policy) or infer_fk_relations (heuristic policy) must run before add_fk_helpers, reorder_fk_helpers, validate_fk_helpers, and remove_fk_helpers. Missing policy is a clear error that names the producer step; primitives no longer infer relations from column names. The cell-level id(<target>) convention is the wire format on roundtrip and the default heuristic of infer_fk_relations; it is no longer interpreted by primitives.

Step Category Inverse Purpose

Step	Category	Inverse	Purpose
`infer_fk_relations`	configuration	configuration-only	Heuristic producer: scan data frames and write v2 FK relation policy under `meta.helper_policies.fk` with `produced_by.mode: naming_convention` (currently the only mode). Use it when FK columns follow the `id``(<target>)` convention.
`configure_fk_helpers`	configuration	configuration-only	Explicit / manual producer: take per-target policy from the YAML and write the v2 relation list under `_meta.helper_policies.fk` with `produced_by.mode: explicit`. Use it when you want to spell relations, helper fields, and `target_key` out by hand.
`add_fk_helpers`	primitive	projection-only	Materialize FK helper columns from the v2 relation policy. Requires that `configure_fk_helpers` or `infer_fk_relations` ran earlier. Writes derived helper provenance for later cleanup and validation.
`reorder_fk_helpers`	primitive	relation-preserving	Move FK helper columns next to their FK column using derived helper provenance and the v2 relation policy. Does not consult `FK_PATTERN`.
`validate_fk_helpers`	primitive	validation-only	Validate FK-helper consistency against the v2 relation policy and derived helper provenance (no convention inference).
`remove_fk_helpers`	primitive	projection-only	Remove materialized FK helper columns using the derived provenance written by `add_fk_helpers`. Frames without provenance or explicit policy are an error. Recovery from externally-edited workbooks without provenance is a separate opt-in concern and is not part of this step.
`add_lookup_helpers`	primitive	projection-only	Join a source frame to a lookup frame and materialize helper columns.
`configure_lookup_helpers`	configuration	configuration-only	Resolve lookup-helper policy and persist it as canonical metadata.

infer_fk_relations

configuration

configuration-only

Heuristic producer: scan data frames and write v2 FK relation policy under meta.helper_policies.fk with produced_by.mode: naming_convention (currently the only mode). Use it when FK columns follow the id(<target>) convention.

configure_fk_helpers

configuration

configuration-only

Explicit / manual producer: take per-target policy from the YAML and write the v2 relation list under _meta.helper_policies.fk with produced_by.mode: explicit. Use it when you want to spell relations, helper fields, and target_key out by hand.

add_fk_helpers

primitive

projection-only

Materialize FK helper columns from the v2 relation policy. Requires that configure_fk_helpers or infer_fk_relations ran earlier. Writes derived helper provenance for later cleanup and validation.

reorder_fk_helpers

primitive

relation-preserving

Move FK helper columns next to their FK column using derived helper provenance and the v2 relation policy. Does not consult FK_PATTERN.

validate_fk_helpers

primitive

validation-only

Validate FK-helper consistency against the v2 relation policy and derived helper provenance (no convention inference).

remove_fk_helpers

primitive

projection-only

Remove materialized FK helper columns using the derived provenance written by add_fk_helpers. Frames without provenance or explicit policy are an error. Recovery from externally-edited workbooks without provenance is a separate opt-in concern and is not part of this step.

add_lookup_helpers

primitive

projection-only

Join a source frame to a lookup frame and materialize helper columns.

configure_lookup_helpers

configuration

configuration-only

Resolve lookup-helper policy and persist it as canonical metadata.

Validation

Step Category Inverse Purpose

Step	Category	Inverse	Purpose
`add_validations`	configuration	validation-only	Write validation constraints into workbook metadata for later rendering.
`validate_references`	primitive	validation-only	Validate PK, uniqueness, FK, reference-tuple integrity, and `no_helper_columns` assertion rules against frame data.
`validate_graph`	primitive	validation-only	Validate node/edge networks for endpoint existence and duplicate edge identities.
`validate`	primitive	validation-only	Validate duplicate IDs and unresolved FK references with severity policy. (transitional status — prefer `validate_references`)

add_validations

configuration

validation-only

Write validation constraints into workbook metadata for later rendering.

validate_references

primitive

validation-only

Validate PK, uniqueness, FK, reference-tuple integrity, and no_helper_columns assertion rules against frame data.

validate_graph

primitive

validation-only

Validate node/edge networks for endpoint existence and duplicate edge identities.

validate

primitive

validation-only

Validate duplicate IDs and unresolved FK references with severity policy. (transitional status — prefer validate_references)

Configuration

Step Category Inverse Purpose

Step	Category	Inverse	Purpose
`configure_workbook_view`	configuration	configuration-only	Declare which frames render as workbook sheets (names, order, lifecycle, options).
`apply_workbook_view_sheet_mappings`	primitive	relation-preserving	Re-key visible sheets to logical workbook-view frames on readback using persisted mappings.
`apply_derived_column_policy`	primitive	projection-only	Drop helper/derived columns using registered `_meta.derived` provenance, falling back to durable workbook-view helper declarations (`_meta.sheets.<frame>.helper_columns`) for drop identity after workbook reimport. Pair with `validate_references` / `no_helper_columns` when the pipeline should assert that cleanup happened before writing canonical data.

configure_workbook_view

configuration

configuration-only

Declare which frames render as workbook sheets (names, order, lifecycle, options).

apply_workbook_view_sheet_mappings

primitive

relation-preserving

Re-key visible sheets to logical workbook-view frames on readback using persisted mappings.

apply_derived_column_policy

primitive

projection-only

Drop helper/derived columns using registered _meta.derived provenance, falling back to durable workbook-view helper declarations (_meta.sheets.<frame>.helper_columns) for drop identity after workbook reimport. Pair with validate_references / no_helper_columns when the pipeline should assert that cleanup happened before writing canonical data.

Infrastructure and Output

Step Category Inverse Purpose

Step	Category	Inverse	Purpose
`bootstrap_meta`	infrastructure	—	Merge profile defaults, persisted metadata, and runtime overrides into payload metadata.
`apply_overrides`	infrastructure	—	Load or apply workbook and sheet override metadata.
`extract_frame`	primitive	extraction	Derive a named frame by selecting columns, filtering rows, renaming, adding constants.
`pivot_frame`	primitive	projection-only	Derive a display frame by pivoting long-form rows into configured index and value columns.
`join_frames`	primitive	projection-only	Derive a view frame by joining two frames with configured keys, mode, and projections.
`normalize_resource_overrides`	primitive	projection-only	Normalize long-form resource override tuples according to a default-context and empty-value policy.
`write_structured_yaml`	infrastructure	extraction	Write nested YAML files from tabular frames and record a generated-file report frame.
`write_key_value_resources`	infrastructure	projection-only	Write key-value (`.properties`-style) files partitioned by template column values.
`write_artifact_manifest`	infrastructure	projection-only	Merge writer report frames into a deterministic manifest; optionally compute checksums.
`identity`	infrastructure	bijective	Return the pipeline payload unchanged. Useful as a no-op placeholder.
`plugin`	plugin	—	Run a caller-provided dotted callable as an explicit pipeline extension point.

bootstrap_meta

infrastructure

—

Merge profile defaults, persisted metadata, and runtime overrides into payload metadata.

apply_overrides

infrastructure

—

Load or apply workbook and sheet override metadata.

extract_frame

primitive

extraction

Derive a named frame by selecting columns, filtering rows, renaming, adding constants.

pivot_frame

primitive

projection-only

Derive a display frame by pivoting long-form rows into configured index and value columns.

join_frames

primitive

projection-only

Derive a view frame by joining two frames with configured keys, mode, and projections.

normalize_resource_overrides

primitive

projection-only

Normalize long-form resource override tuples according to a default-context and empty-value policy.

write_structured_yaml

infrastructure

extraction

Write nested YAML files from tabular frames and record a generated-file report frame.

write_key_value_resources

infrastructure

projection-only

Write key-value (.properties-style) files partitioned by template column values.

write_artifact_manifest

infrastructure

projection-only

Merge writer report frames into a deterministic manifest; optionally compute checksums.

identity

infrastructure

bijective

Return the pipeline payload unchanged. Useful as a no-op placeholder.

plugin

plugin

—

Run a caller-provided dotted callable as an explicit pipeline extension point.

Project spreadsheet-handling — User Guide

1. Introduction

What This Project Does

Two Repositories

Scope of This Guide

2. Workflow

The Core Idea

Pipeline YAML Structure

Running a Pipeline

3. Transformation Catalog

Reading This Catalog

Inverse kinds

Groups

Bijective Transforms

expand_xref / contract_xref

decode_cell_values / encode_cell_values

expand_compact_multiaxis / contract_compact_multiaxis

split_by_discriminator / merge_by_discriminator

sparse_collapse / sparse_expand

flatten_headers / unflatten_headers

Supporting Steps

FK Helpers and Lookup

Validation

Configuration

Infrastructure and Output

Project `spreadsheet-handling` — User Guide

`expand_xref` / `contract_xref`

`decode_cell_values` / `encode_cell_values`

`expand_compact_multiaxis` / `contract_compact_multiaxis`

`split_by_discriminator` / `merge_by_discriminator`

`sparse_collapse` / `sparse_expand`

`flatten_headers` / `unflatten_headers`