1. Introduction

What This Project Does

spreadsheet-handling is a Python toolkit for roundtrip workflows between structured data and human-editable spreadsheets. A configurable pipeline projects data into spreadsheet form for review and editing, then re-imports it deterministically. Semantics are kept explicit: cross-references, coded cell values, and FK helper columns are all configuration-driven, not inferred from sheet layout.

Two Repositories

The library (spreadsheet-handling) and a demo integration (spreadsheet-handling-demo) are separate repositories. The demo is the main user entry point and contains runnable examples and sample data.

This guide documents the framework surface shipped with a specific library release.

Scope of This Guide
  • Pipeline workflow overview and starting points

  • Transformation catalog: step names, parameters, roundtrip guarantees

Demo data, runnable scripts, and integration walkthroughs live in the demo repository.

2. Workflow

The Core Idea

A pipeline transforms data bidirectionally between a normalized relation form (YAML-friendly) and a human-editable spreadsheet. The same configuration drives both directions, so re-import is deterministic.

Practical workflows tend to grow in rings: start with a plain tabular roundtrip (data in → spreadsheet out → back in), then add FK helper columns for display, then introduce cross-table or compact-cell encoding for multi-valued relations. Each ring adds one transform pair from the catalog.

A runnable walkthrough of this spiral is available in the spreadsheet-handling-demo repository.

Pipeline YAML Structure

A pipeline is a YAML list of step entries. Each entry names a registered transform and supplies its parameters inline:

pipeline:
  - step: expand_xref
    matrix: role_permissions
    output: permission_rows
    row_keys: role_id
    drop_empty: true

Available step names, required parameters, and roundtrip guarantees are listed in the Transformation Catalog.

Running a Pipeline

sheets-run is the primary runtime command for executing a documented pipeline. For an ad-hoc pipeline file like the one above, provide the steps file and input/output locations. The pipeline steps transform the loaded frames; the CLI input and output options choose how those frames are read and written:

sheets-run --steps pipeline.yaml \
  --in-kind yaml_dir --in-path data/input \
  --out-kind yaml_dir --out-path build/output

The spreadsheet-handling-demo repository contains the full first-hour walkthrough with checked-in input data, pipeline files, and expected outputs. The user guide focuses on the configuration concepts and transform catalog rather than duplicating that tutorial.

3. Transformation Catalog

Reading This Catalog

Step names used in pipeline YAML must match the runtime_name values in pipeline_step_registry.json (under core/registries/), which is the authoritative registry artifact.

Inverse kinds
bijective

Fully reversible. The inverse step reconstructs the original exactly.

conditionally bijective

Reversible when the stated conditions hold (see each entry).

relation-preserving

No data is lost, but column order or frame shape may change.

projection-only

One-way; the inverse step cannot reconstruct the original.

validation-only

Read-only; no frames are changed.

configuration-only

Writes workbook metadata only; no data frames are changed.

Groups
Bijective transforms

Paired expand/contract steps that form the core of the bidirectional workflow.

Supporting steps

FK helpers, validation, configuration, and infrastructure steps listed in a reference table.

Bijective Transforms

These steps come in expand/contract pairs. The expand direction converts a compact or matrix representation into an explicit relation (long form). The contract direction reverses it. Losslessness is conditional — see each entry.

expand_xref / contract_xref

Category: primitive · Inverse: conditionally bijective

Expand a matrix frame (rows × columns of marker cells) into explicit long-form relation rows, or contract them back. Typical use: a role × permission table where each cell marks whether the role has the permission.

See also: concept doc technical_model/ch04_concepts/crosstab_binary_xref.adoc and compact_transforms.adoc.

expand_xref parameters

Parameter Required Description

matrix

yes

Name of the input matrix frame.

output

yes

Name of the output relation frame to produce.

row_keys

yes

Column name (string) or list of column names that form the row identifier.

value_columns

no

Columns to expand. Defaults to all non-key columns.

column_key

no

Column name in the output holding the original column label (default: column_key).

value

no

Column name in the output holding the cell value (default: value).

drop_empty

no

If true, omit rows where the cell value is empty or falsy. Default: false.

name

no

Override the step name shown in logs.

Losslessness condition: lossless when drop_empty: false, or when the original matrix had no empty cells in the expanded columns.

- step: expand_xref
  matrix: role_permissions
  output: permission_rows
  row_keys: role_id
  drop_empty: true

contract_xref parameters

Parameter Required Description

relation

yes

Name of the input long-form relation frame.

output

yes

Name of the output matrix frame to produce.

row_keys

yes

Column name(s) forming the row identifier (must match expand_xref config).

column_key

no

Column in the relation holding the column labels. Defaults to value persisted in meta by the preceding expand_xref.

value

no

Column in the relation holding cell values. Defaults to meta-persisted value.

column_keys

no

Explicit ordered list of output columns to generate (pins column order).

fill_value

no

Value to place in cells with no matching relation row. Default: empty string.

name

no

Override the step name shown in logs.

Losslessness condition: lossless when relation rows are unique per (row_keys, column_key) combination.

- step: contract_xref
  relation: permission_rows
  output: role_permissions
  row_keys: role_id

decode_cell_values / encode_cell_values

Category: primitive · Inverse: conditionally bijective

Decode a compact cell string (e.g. "A,B,C") into one row per code, or encode code rows back into a compact cell string. Used inside the compact multiaxis chain or independently for coded single-column cells.

See also: concept doc technical_model/ch04_concepts/compact_transforms.adoc.

decode_cell_values parameters

Parameter Required Description

source

yes

Name of the input frame containing encoded cell strings.

output

yes

Name of the output frame with one row per decoded token.

value

no

Column in source holding the encoded string. Default: value.

code

no

Column name in output for the decoded token. Default: code.

passthrough_columns

no

Columns from source to copy unchanged into every output row.

drop_empty

no

If true, omit rows for empty tokens. Default: false.

mode

no

Decoding mode. delimited (default) splits on delimiter; positional treats each character as a code.

delimiter

no

Token separator for delimited mode. Default: ,.

allowed_codes

no

Explicit list of valid code strings. Validation fails on unknown codes.

allowed_tokens

no

Alias for allowed_codes.

allowed_from_legend

no

Name of a legend frame from which to derive the allowed-code set.

normalize_case

no

upper, lower, or omit to leave case unchanged.

strip

no

If true, strip whitespace from each token. Default: false.

name

no

Override the step name shown in logs.

Losslessness condition: lossless when drop_empty: false and token order is preserved (or canonical_order is set on the encode side).

- step: decode_cell_values
  source: assignments
  output: assignment_codes
  value: codes
  code: code
  delimiter: ","
  passthrough_columns: [employee_id, project_id]

encode_cell_values parameters

Parameter Required Description

source

yes

Name of the input frame with one row per code.

output

yes

Name of the output frame with encoded cell strings.

group_by

yes

Column name(s) to group rows before joining into a cell string.

code

no

Column containing the token to encode. Default: code.

value

no

Column name in output for the encoded string. Default: value.

mode

no

Encoding mode. Matches decode_cell_values mode.

delimiter

no

Separator. Default: ,.

canonical_order

no

Explicit ordered list of codes; determines output token order.

allowed_codes / allowed_tokens

no

Validation whitelist.

allowed_from_legend

no

Legend frame name for allowed-code derivation.

normalize_case

no

upper, lower, or omit.

strip

no

Strip whitespace before encoding.

name

no

Override the step name shown in logs.

- step: encode_cell_values
  source: assignment_codes
  output: assignments
  group_by: [employee_id, project_id]
  code: code
  value: codes
  delimiter: ","

expand_compact_multiaxis / contract_compact_multiaxis

Category: composite · Inverse: conditionally bijective

Compose expand_xref and decode_cell_values (or their inverses) to handle compact multi-axis matrices where each cell encodes multiple coded values. Parameters are a superset of the two constituent steps.

Full parameter documentation planned.


split_by_discriminator / merge_by_discriminator

Category: primitive · Inverse: conditionally bijective

Split one frame into multiple named frames partitioned by a discriminator column value, recording frame-name → discriminator mapping in meta. merge_by_discriminator reconstructs the original frame from the split parts.

Full parameter documentation planned.


sparse_collapse / sparse_expand

Category: primitive · Inverse: conditionally bijective

Replace configured default values in editable crosstable cells with a sparse blank placeholder (sparse_collapse), or restore blanks to the configured default (sparse_expand). Used to make sparse matrices more readable in spreadsheet form.

Full parameter documentation planned.


flatten_headers / unflatten_headers

Category: primitive · Inverse: conditionally bijective

Flatten MultiIndex or tuple-like DataFrame column headers into scalar string labels (flatten_headers), or split scalar string headers back into MultiIndex form (unflatten_headers).

Full parameter documentation planned.

Supporting Steps

The following steps handle FK helpers, validation, configuration, and infrastructure concerns. Full parameter documentation is planned; this table gives the step name, category, inverse kind, and purpose for quick reference.

FK Helpers and Lookup

FK-helper primitives are policy consumers: they read the v2 relation policy under meta.helper_policies.fk (and derived helper provenance under _meta.derived.sheets.*.helper_columns). Either configure_fk_helpers (explicit / manual policy) or infer_fk_relations (heuristic policy) must run before add_fk_helpers, reorder_fk_helpers, validate_fk_helpers, and remove_fk_helpers. Missing policy is a clear error that names the producer step; primitives no longer infer relations from column names. The cell-level id(<target>) convention is the wire format on roundtrip and the default heuristic of infer_fk_relations; it is no longer interpreted by primitives.

Step Category Inverse Purpose

infer_fk_relations

configuration

configuration-only

Heuristic producer: scan data frames and write v2 FK relation policy under meta.helper_policies.fk with produced_by.mode: naming_convention (currently the only mode). Use it when FK columns follow the id(<target>) convention.

configure_fk_helpers

configuration

configuration-only

Explicit / manual producer: take per-target policy from the YAML and write the v2 relation list under _meta.helper_policies.fk with produced_by.mode: explicit. Use it when you want to spell relations, helper fields, and target_key out by hand.

add_fk_helpers

primitive

projection-only

Materialize FK helper columns from the v2 relation policy. Requires that configure_fk_helpers or infer_fk_relations ran earlier. Writes derived helper provenance for later cleanup and validation.

reorder_fk_helpers

primitive

relation-preserving

Move FK helper columns next to their FK column using derived helper provenance and the v2 relation policy. Does not consult FK_PATTERN.

validate_fk_helpers

primitive

validation-only

Validate FK-helper consistency against the v2 relation policy and derived helper provenance (no convention inference).

remove_fk_helpers

primitive

projection-only

Remove materialized FK helper columns using the derived provenance written by add_fk_helpers. Frames without provenance or explicit policy are an error. Recovery from externally-edited workbooks without provenance is a separate opt-in concern and is not part of this step.

add_lookup_helpers

primitive

projection-only

Join a source frame to a lookup frame and materialize helper columns.

configure_lookup_helpers

configuration

configuration-only

Resolve lookup-helper policy and persist it as canonical metadata.

Validation
Step Category Inverse Purpose

add_validations

configuration

validation-only

Write validation constraints into workbook metadata for later rendering.

validate_references

primitive

validation-only

Validate PK, uniqueness, FK, reference-tuple integrity, and no_helper_columns assertion rules against frame data.

validate_graph

primitive

validation-only

Validate node/edge networks for endpoint existence and duplicate edge identities.

validate

primitive

validation-only

Validate duplicate IDs and unresolved FK references with severity policy. (transitional status — prefer validate_references)

Configuration
Step Category Inverse Purpose

configure_workbook_view

configuration

configuration-only

Declare which frames render as workbook sheets (names, order, lifecycle, options).

apply_workbook_view_sheet_mappings

primitive

relation-preserving

Re-key visible sheets to logical workbook-view frames on readback using persisted mappings.

apply_derived_column_policy

primitive

projection-only

Drop helper/derived columns using registered _meta.derived provenance, falling back to durable workbook-view helper declarations (_meta.sheets.<frame>.helper_columns) for drop identity after workbook reimport. Pair with validate_references / no_helper_columns when the pipeline should assert that cleanup happened before writing canonical data.

Infrastructure and Output
Step Category Inverse Purpose

bootstrap_meta

infrastructure

Merge profile defaults, persisted metadata, and runtime overrides into payload metadata.

apply_overrides

infrastructure

Load or apply workbook and sheet override metadata.

extract_frame

primitive

extraction

Derive a named frame by selecting columns, filtering rows, renaming, adding constants.

pivot_frame

primitive

projection-only

Derive a display frame by pivoting long-form rows into configured index and value columns.

join_frames

primitive

projection-only

Derive a view frame by joining two frames with configured keys, mode, and projections.

normalize_resource_overrides

primitive

projection-only

Normalize long-form resource override tuples according to a default-context and empty-value policy.

write_structured_yaml

infrastructure

extraction

Write nested YAML files from tabular frames and record a generated-file report frame.

write_key_value_resources

infrastructure

projection-only

Write key-value (.properties-style) files partitioned by template column values.

write_artifact_manifest

infrastructure

projection-only

Merge writer report frames into a deterministic manifest; optionally compute checksums.

identity

infrastructure

bijective

Return the pipeline payload unchanged. Useful as a no-op placeholder.

plugin

plugin

Run a caller-provided dotted callable as an explicit pipeline extension point.