feat(samples): rebuild and modernise the tpch_sample #108

Open
rederik76 wants to merge 4 commits into main from feature/tpch-sample-rebuild

rederik76 commented Jun 30, 2026

Collaborator

Summary

Rebuilds the long-unfinished tpch_sample bundle into the framework's most comprehensive, end-to-end medallion reference. The sample was started long ago but never completed; this revisits it and brings it up to feature-samples / pattern-samples standards, turning the samples.tpch dataset into a fully streaming medallion warehouse from multi-source ingestion through a governed gold star schema.

What's included

  • Multi-source, schema-on-read bronze — Parquet staging from 8 simulated source systems (one bronze schema per system), ingested via Auto Loader with schema inference + evolution.
  • Conformed, historized silver — SCD2 dimensions, SCD1 reference data, append-only facts, and data-quality expectations with quarantine (quarantineMode: table).
  • Governed gold star schema: xxhash64 surrogate keys, point-in-time (as-of) joins, dim_date fiscal calendar, row tracking for incremental MV refresh, and two metrics approaches side by side (pre-aggregated MVs + UC metric views).
  • Template specs — one bronze ingestion template and per-archetype silver templates (SCD2 / SCD1 / append) to collapse repetitive flows.
  • Three-run incremental simulation — full refresh, then incremental Runs 2–3 covering SCD2 changes, fact growth, and a backdated out-of-order correction.
  • "Real world complexity" — schema evolution (new loyalty_tier column auto-evolved in bronze), CDC deletes/tombstones via apply_as_deletes (cdc_operation flag; Run 3 supplier delete), and late-arriving dimensions with an unknown member (-1) on the MV dims + COALESCE in the fact.
  • Orchestration + docs — dedicated deploy_and_test_tpch.sh / deploy_tpch.sh / destroy_tpch.sh, a full README.md (background, design choices, setup, demo flow, validation queries)
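
The point-in-time join with an unknown-member fallback listed above can be sketched in plain Python. This is an illustration only, not the sample's actual SQL: the row shapes, the surrogate-key values, the dates, and the helper name `resolve_part_sk` are all hypothetical, and the validity window is assumed to be half-open (`start_at` inclusive, `end_at` exclusive).

```python
from datetime import date

UNKNOWN_SK = -1  # unknown-member surrogate key, as described in the PR

# Hypothetical SCD2 history for one part; end_at=None marks the open version.
dim_part = [
    {"part_sk": 101, "part_id": 9000001,
     "start_at": date(2026, 2, 1), "end_at": date(2026, 3, 1)},
    {"part_sk": 102, "part_id": 9000001,
     "start_at": date(2026, 3, 1), "end_at": None},
]


def resolve_part_sk(part_id, as_of, dim_rows):
    """Return the surrogate key of the version valid at `as_of`, else UNKNOWN_SK."""
    for row in dim_rows:
        if row["part_id"] != part_id:
            continue
        started = row["start_at"] <= as_of
        not_ended = row["end_at"] is None or as_of < row["end_at"]
        if started and not_ended:
            return row["part_sk"]
    # No matching version yet (late-arriving dimension): fall back to -1,
    # which plays the role of COALESCE(part_sk, -1) in the fact.
    return UNKNOWN_SK
```

Called with an empty dimension, the lookup returns -1; once the dimension rows exist, the same fact row resolves to a real key, which is the behaviour the late-arriving-dimension scenario is meant to demonstrate.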

Removed

Legacy WIP scaffolding: old per-entity bronze specs, schema-on-read schema files, snapshot-fact specs/DML, stale resource layout, and pytest fixtures.

Notes

  • tpch_sample remains separate from the main deploy.sh / deploy_and_test.sh (feature + pattern samples), with its own orchestrator that reuses common.sh.

Test plan

  • deploy_and_test_tpch.sh — deploy + setup + Run 1/2/3 all TERMINATED SUCCESS (no pipeline/expectation errors)
  • Schema evolution: loyalty_tier present in bronze_crm.customer; only Run-2 demo rows populated, baseline NULL
  • Supplier delete — supplier 3 in silver.supplier shows a closed __END_AT (tombstone) and 0 open versions
  • Late-arriving dimension — part 9000001 in fct_order_lines has a -1 row (Run 2) and a real part_sk (Run 3); unknown member present in dim_part / dim_supplier / dim_location
  • DQ quarantine — rows captured in customer_address_quarantine and orders_quarantine across runs
  • Both metric views and all pre-aggregated MVs build in Run 1
  • Reviewer: spot-check the README demo flow / validation queries against a fresh deploy

…on reference

The tpch_sample bundle was started long ago but never finished. This revisits and
completes it as the framework's most comprehensive sample, built on the samples.tpch
dataset and aligned with feature-samples / pattern-samples conventions.

Highlights:
- Multi-source, schema-on-read bronze: Parquet staging from 8 simulated source systems
  (one bronze schema per system) ingested via Auto Loader with schema inference + evolution.
- Conformed, historized silver: SCD2 dimensions, SCD1 reference data, append-only facts,
  and data-quality expectations with quarantine (quarantineMode: table).
- Governed gold star schema: xxhash64 surrogate keys, point-in-time (as-of) joins,
  dim_date fiscal calendar, row tracking for incremental MV refresh, and two metrics
  approaches side by side (pre-aggregated MVs + UC metric views).
- Template specs: one bronze ingestion template and per-archetype silver templates
  (scd2 / scd1 / append) to collapse repetitive flows.
- Three-run incremental simulation: full refresh then incremental Runs 2-3 covering SCD2
  changes, fact growth, and a backdated out-of-order correction.
- Tier-3 messiness: schema evolution (new loyalty_tier column auto-evolved in bronze),
  CDC deletes/tombstones via apply_as_deletes (cdc_operation flag; Run 3 supplier delete),
  and late-arriving dimensions with an unknown member (-1) on MV dims + COALESCE in the fact.
- Orchestration + docs: dedicated deploy_and_test_tpch.sh / deploy_tpch.sh / destroy_tpch.sh,
  a full README (design choices, setup, demo flow, validation queries), and a planning doc.
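
To make the CDC delete/tombstone behaviour concrete, here is a minimal pure-Python sketch of how a delete event closes the open SCD2 version without opening a new one. It only imitates the effect of `apply_as_deletes`; it is not the framework's implementation. The event shape, the `"DELETE"` / `"UPSERT"` flag values, the dates, and the helper names are assumptions, and events are assumed to arrive in timestamp order.

```python
def apply_cdc(history, event):
    """Apply one CDC event to an SCD2 history list (mutates and returns it)."""
    key, ts = event["supplier_id"], event["ts"]
    # Close the currently open version for this key, if any.
    for row in history:
        if row["supplier_id"] == key and row["end_at"] is None:
            row["end_at"] = ts
    # A delete only closes the old version; anything else opens a new one.
    if event["cdc_operation"] != "DELETE":
        history.append({"supplier_id": key, "name": event["name"],
                        "start_at": ts, "end_at": None})
    return history


def open_versions(history, supplier_id):
    """Count versions for a key that are still open (end_at is None)."""
    return sum(1 for r in history
               if r["supplier_id"] == supplier_id and r["end_at"] is None)


# Hypothetical three-run event stream for supplier 3.
events = [
    {"supplier_id": 3, "name": "Acme", "ts": "2026-01-01", "cdc_operation": "UPSERT"},
    {"supplier_id": 3, "name": "Acme Ltd", "ts": "2026-02-01", "cdc_operation": "UPSERT"},
    {"supplier_id": 3, "name": None, "ts": "2026-03-01", "cdc_operation": "DELETE"},
]

history = []
for ev in events:
    apply_cdc(history, ev)
```

After the third event the history holds two closed versions and no open one, which is the shape the supplier-delete validation looks for.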

Removed the legacy WIP scaffolding (old per-entity bronze specs, schema-on-read schema
files, stale resources, pytest fixtures).

Validated end-to-end on e2-demo-field-eng (serverless, catalog main, _es): deploy + setup +
Run 1/2/3 all green, with data-level checks confirming schema evolution, supplier tombstone,
late-arriving part unknown-member resolution, and DQ quarantine capture.

…t.sh; bump version to v0.18.0

- Rename the tpch deploy+test orchestrator for naming consistency and update all
  references (script usage header + tpch_sample README).
- Bump VERSION to v0.18.0.
- Minor docs wording fix in what_is_lakeflow_framework.rst (SDP -> Spark Declarative Pipelines).

Deploy a Genie space over the tpch gold schema for natural-language analytics,
wired as a post-pipeline notebook task in Run 1 (after create_metric_views). The
space registers the gold facts, dimensions, and both UC metric views, with worked
example SQL demonstrating both star-schema joins and MEASURE()-based metric-view
queries.

Genie deployment is optional and never blocks the sample:
- new optional `warehouse_id` bundle variable (default empty = skip)
- deploy scripts prompt for it interactively (TTY only); leaving it blank skips Genie
- create_genie_space notebook no-ops with a clear message when no warehouse is
  supplied, and is idempotent (find-or-update the space by title)
- destroy_tpch.sh trashes matching Genie spaces via the CLI during teardown
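
The skip-when-empty and find-or-update-by-title behaviour described above might look roughly like the sketch below. It deliberately avoids any real Genie or Databricks API: `spaces` is an in-memory list standing in for the workspace, and `ensure_genie_space`, the dict fields, the space title, and the warehouse ID are hypothetical names chosen for illustration.

```python
def ensure_genie_space(spaces, title, warehouse_id, tables):
    """Create or update a space by title; no-op when no warehouse is given."""
    if not warehouse_id:
        # Optional dependency missing: report and skip instead of failing.
        print("No warehouse_id supplied; skipping Genie space deployment.")
        return None
    for space in spaces:
        if space["title"] == title:
            # Idempotent path: update the existing space in place.
            space["warehouse_id"] = warehouse_id
            space["tables"] = list(tables)
            return space
    space = {"title": title, "warehouse_id": warehouse_id, "tables": list(tables)}
    spaces.append(space)
    return space


workspace = []  # in-memory stand-in for the spaces that already exist
skipped = ensure_genie_space(workspace, "TPCH Gold", "", ["fct_order_lines"])
first = ensure_genie_space(workspace, "TPCH Gold", "wh-123", ["fct_order_lines"])
second = ensure_genie_space(workspace, "TPCH Gold", "wh-123",
                            ["fct_order_lines", "dim_part"])
```

Running the function twice with the same title leaves a single space whose table list reflects the latest call, so a rerun of Run 1 would not pile up duplicates.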

Validated end-to-end (deploy + setup + runs 1-3 incl. metric views and Genie) on
serverless with the quarantine double-qualification fix deployed.

README updated with the optional-warehouse prerequisite, Day-1 demo flow, cleanup
note, and design rationale.

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Modernise the tpch_sample bundle into a comprehensive end-to-end medallion reference