feat(samples): rebuild and modernise the tpch_sample #108
Open
rederik76 wants to merge 4 commits into
…on reference

The tpch_sample bundle was started long ago but never finished. This revisits and completes it as the framework's most comprehensive sample, built on the samples.tpch dataset and aligned with feature-samples / pattern-samples conventions.

Highlights:
- Multi-source, schema-on-read bronze: Parquet staging from 8 simulated source systems (one bronze schema per system) ingested via Auto Loader with schema inference + evolution.
- Conformed, historized silver: SCD2 dimensions, SCD1 reference data, append-only facts, and data-quality expectations with quarantine (quarantineMode: table).
- Governed gold star schema: xxhash64 surrogate keys, point-in-time (as-of) joins, dim_date fiscal calendar, row tracking for incremental MV refresh, and two metrics approaches side by side (pre-aggregated MVs + UC metric views).
- Template specs: one bronze ingestion template and per-archetype silver templates (scd2 / scd1 / append) to collapse repetitive flows.
- Three-run incremental simulation: full refresh, then incremental Runs 2-3 covering SCD2 changes, fact growth, and a backdated out-of-order correction.
- Tier-3 messiness: schema evolution (new loyalty_tier column auto-evolved in bronze), CDC deletes/tombstones via apply_as_deletes (cdc_operation flag; Run 3 supplier delete), and late-arriving dimensions with an unknown member (-1) on MV dims + COALESCE in the fact.
- Orchestration + docs: dedicated deploy_and_test_tpch.sh / deploy_tpch.sh / destroy_tpch.sh, a full README (design choices, setup, demo flow, validation queries), and a planning doc.

Removed the legacy WIP scaffolding (old per-entity bronze specs, schema-on-read schema files, stale resources, pytest fixtures).

Validated end-to-end on e2-demo-field-eng (serverless, catalog main, _es): deploy + setup + Run 1/2/3 all green, with data-level checks confirming schema evolution, supplier tombstone, late-arriving part unknown-member resolution, and DQ quarantine capture.
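To make the SCD2-plus-tombstone behaviour in the commit above concrete, here is a minimal pure-Python sketch of closing a version on a CDC delete. It is not the framework's implementation and does not call `apply_as_deletes`; the `Version` class, the `apply_changes` helper, the event fields, and the `"UPSERT"` / `"DELETE"` flag values are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    key: int
    attrs: dict
    start_at: int                  # sequence value at which this version opened
    end_at: Optional[int] = None   # None means the version is still open


def apply_changes(history: list, events: list) -> list:
    """Apply CDC events in sequence order, SCD2 style (illustrative only).

    Each event is a dict with assumed fields: key, seq, cdc_operation, attrs.
    """
    for ev in sorted(events, key=lambda e: e["seq"]):
        open_version = next(
            (v for v in history if v.key == ev["key"] and v.end_at is None),
            None,
        )
        if ev["cdc_operation"] == "DELETE":
            # Tombstone: close the open version and do not open a new one.
            if open_version is not None:
                open_version.end_at = ev["seq"]
            continue
        if open_version is not None:
            if open_version.attrs == ev["attrs"]:
                continue  # no attribute change, keep the current version
            open_version.end_at = ev["seq"]
        history.append(Version(ev["key"], ev["attrs"], ev["seq"]))
    return history


# Hypothetical supplier history: insert, change, then delete.
events = [
    {"key": 7, "seq": 1, "cdc_operation": "UPSERT", "attrs": {"name": "Acme"}},
    {"key": 7, "seq": 2, "cdc_operation": "UPSERT", "attrs": {"name": "Acme Ltd"}},
    {"key": 7, "seq": 3, "cdc_operation": "DELETE", "attrs": {}},
]
history = apply_changes([], events)
open_versions = [v for v in history if v.end_at is None]
```

After the delete, every version of the key has an end marker and none remains open, which is the shape the supplier tombstone check looks for.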
…t.sh; bump version to v0.18.0

- Rename the tpch deploy+test orchestrator for naming consistency and update all references (script usage header + tpch_sample README).
- Bump VERSION to v0.18.0.
- Minor docs wording fix in what_is_lakeflow_framework.rst (SDP -> Spark Declarative Pipelines).
Deploy a Genie space over the tpch gold schema for natural-language analytics, wired as a post-pipeline notebook task in Run 1 (after create_metric_views). The space registers the gold facts, dimensions, and both UC metric views, with worked example SQL demonstrating both star-schema joins and MEASURE()-based metric-view queries.

Genie deployment is optional and never blocks the sample:
- new optional `warehouse_id` bundle variable (default empty = skip)
- deploy scripts prompt for it interactively (TTY only); leaving it blank skips Genie
- create_genie_space notebook no-ops with a clear message when no warehouse is supplied, and is idempotent (find-or-update the space by title)
- destroy_tpch.sh trashes matching Genie spaces via the CLI during teardown

Validated end-to-end (deploy + setup + runs 1-3 incl. metric views and Genie) on serverless with the quarantine double-qualification fix deployed. README updated with the optional-warehouse prerequisite, Day-1 demo flow, cleanup note, and design rationale.
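The skip-when-empty and find-or-update-by-title behaviour described in the commit above can be sketched as follows. This is only an illustration of the idempotency pattern: `deploy_space` and the in-memory `spaces` dict are hypothetical stand-ins, not the notebook's code and not a Databricks API.

```python
from typing import Optional


def deploy_space(spaces: dict, title: str, config: dict,
                 warehouse_id: str = "") -> Optional[str]:
    """Create or update a space keyed by title; skip when no warehouse is given.

    `spaces` is an assumed in-memory stand-in for the workspace, mapping
    space id -> {"title": ..., "config": ..., "warehouse_id": ...}.
    """
    if not warehouse_id:
        # Optional feature: report the skip and leave everything untouched.
        print("No warehouse_id supplied; skipping Genie space deployment.")
        return None
    payload = {"title": title, "config": config, "warehouse_id": warehouse_id}
    for space_id, space in spaces.items():
        if space["title"] == title:
            spaces[space_id] = payload  # update in place, id unchanged
            return space_id
    space_id = f"space-{len(spaces) + 1}"  # made-up id scheme
    spaces[space_id] = payload
    return space_id
```

Calling it twice with the same title updates the existing entry instead of creating a second one, so re-running the task after a redeploy does not accumulate duplicates.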
Summary
Rebuilds the long-unfinished `tpch_sample` bundle into the framework's most comprehensive, end-to-end medallion reference. The sample was started long ago but never completed; this revisits it and brings it up to `feature-samples` / `pattern-samples` standards, turning the `samples.tpch` dataset into a fully streaming medallion warehouse from multi-source ingestion through a governed gold star schema.
What's included
- Conformed, historized silver: SCD2 dimensions, SCD1 reference data, append-only facts, and data-quality expectations with quarantine (`quarantineMode: table`).
- Governed gold star schema: `xxhash64` surrogate keys, point-in-time (as-of) joins, `dim_date` fiscal calendar, row tracking for incremental MV refresh, and two metrics approaches side by side (pre-aggregated MVs + UC metric views).
- Tier-3 messiness: schema evolution (new `loyalty_tier` column auto-evolved in bronze), CDC deletes/tombstones via `apply_as_deletes` (`cdc_operation` flag; Run 3 supplier delete), and late-arriving dimensions with an unknown member (`-1`) on the MV dims + `COALESCE` in the fact.
- Orchestration + docs: dedicated `deploy_and_test_tpch.sh` / `deploy_tpch.sh` / `destroy_tpch.sh`, a full `README.md` (background, design choices, setup, demo flow, validation queries)
Removed
Legacy WIP scaffolding: old per-entity bronze specs, schema-on-read schema files, snapshot-fact specs/DML, stale resource layout, and pytest fixtures.
Notes
`tpch_sample` remains separate from the main `deploy.sh` / `deploy_and_test.sh` (feature + pattern samples), with its own orchestrator that reuses `common.sh`.
Test plan
- `deploy_and_test_tpch.sh`: deploy + setup + Run 1/2/3 all `TERMINATED SUCCESS` (no pipeline/expectation errors)
- Schema evolution: `loyalty_tier` present in `bronze_crm.customer`; only Run-2 demo rows populated, baseline NULL
- Supplier tombstone: `silver.supplier` shows a closed `__END_AT` (tombstone) and 0 open versions
- Late-arriving part: `9000001` in `fct_order_lines` has a `-1` row (Run 2) and a real `part_sk` (Run 3); unknown member present in `dim_part` / `dim_supplier` / `dim_location`
- DQ quarantine: `customer_address_quarantine` and `orders_quarantine` across runs
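The late-arriving part check in the test plan can be illustrated with a small as-of lookup that falls back to the unknown member. This is a sketch under stated assumptions, not the sample's SQL: `resolve_part_sk`, the row layout, the timestamps, and the surrogate key `4242` are hypothetical (the sample derives its keys with `xxhash64`), and `9000001` is assumed to be the part's business key.

```python
UNKNOWN_SK = -1  # surrogate key of the unknown member


def resolve_part_sk(dim_part: list, part_key: int, as_of: int) -> int:
    """Return the surrogate key of the dim version valid at `as_of`, else -1."""
    for row in dim_part:
        still_open = row["end_at"] is None
        if (row["part_key"] == part_key
                and row["start_at"] <= as_of
                and (still_open or as_of < row["end_at"])):
            return row["part_sk"]
    # No matching version yet: plays the role of COALESCE(part_sk, -1).
    return UNKNOWN_SK


fact_row = {"part_key": 9000001, "order_ts": 20}

run2_dim = []  # Run 2: the part has not arrived in the dimension yet
run3_dim = [   # Run 3: the part arrives with a validity window covering the order
    {"part_key": 9000001, "part_sk": 4242, "start_at": 10, "end_at": None},
]

run2_sk = resolve_part_sk(run2_dim, fact_row["part_key"], fact_row["order_ts"])
run3_sk = resolve_part_sk(run3_dim, fact_row["part_key"], fact_row["order_ts"])
```

Here `run2_sk` resolves to the unknown member and `run3_sk` to the real key, mirroring the `-1` row in Run 2 and the real `part_sk` in Run 3.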