Migrate docs from MyST to Quarto + prototype auto-generated reference by MaxGhenis · Pull Request #301 · PolicyEngine/policyengine.py

MaxGhenis · 2026-04-20T13:02:15Z

Two changes bundled:

1. MyST → Quarto toolchain swap

Replace `docs/myst.yml` with `docs/_quarto.yml` (same navigation).
Convert `{literalinclude}` directives in `docs/examples.md` to Quarto include syntax (`{.python include="..."}`) — the only MyST-specific construct in the existing docs.
`make docs` now runs `quarto render docs`; `make docs-serve` runs `quarto preview docs`.
`.github/workflows/pr_docs_changes.yaml` installs Quarto via the official action instead of npm-installing mystmd.
Update `docs/.gitignore` for `_site` / `_freeze` instead of `_build`.

All 11 existing MyST pages render cleanly under Quarto without prose changes — CommonMark overlap is near-total.

2. Prototype: auto-generated reference

`docs/_generator/build_reference.py` — introspects a country model's `TaxBenefitSystem` and writes one `.qmd` per variable.
`make docs-generate-reference` target wires it into the build.
Verified on `policyengine-us`: emits ~5k variable pages plus a program-coverage page derived from `programs.yaml`. CHIP filter produces 34 variable pages + 1 programs page + 56 tree indices; Quarto compiles all cleanly.

Generated pages are not yet committed to the repo — the generator is scaffolded, but when to run it (per-release vs in CI) is a follow-up decision.
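
A minimal sketch of the one-page-per-variable idea, for readers who have not opened `docs/_generator/build_reference.py`. It is not that script: the real generator introspects a country model's `TaxBenefitSystem`, whereas this sketch takes a plain dict of stand-in metadata. The function names, the metadata keys, and the page layout are all assumptions made for illustration.

```python
from pathlib import Path


def render_variable_page(name: str, meta: dict) -> str:
    """Return the text of one .qmd page (YAML front matter plus a short body)."""
    lines = [
        "---",
        f'title: "{name}"',
        "---",
        "",
        meta.get("documentation", "No documentation provided."),
        "",
        f"- Entity: `{meta['entity']}`",
        f"- Value type: `{meta['value_type']}`",
    ]
    return "\n".join(lines) + "\n"


def write_reference(variables: dict, out_dir: Path) -> list:
    """Write one .qmd file per variable and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name in sorted(variables):
        path = out_dir / f"{name}.qmd"
        path.write_text(render_variable_page(name, variables[name]), encoding="utf-8")
        written.append(path)
    return written


# Hypothetical stand-in metadata; the real generator reads this from the country model.
SAMPLE_VARIABLES = {
    "adjusted_gross_income": {
        "entity": "tax_unit",
        "value_type": "float",
        "documentation": "Adjusted gross income.",
    },
    "age": {"entity": "person", "value_type": "int"},
}
```

Calling `write_reference(SAMPLE_VARIABLES, Path("docs/reference"))` would leave two `.qmd` files for Quarto to render; the output directory name here is also hypothetical.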

Why Quarto

One source, multiple outputs (HTML site + PDF paper from the same `.qmd`).
Live code execution during build (tutorials track the installed package).
Computational blocks can call PolicyEngine directly to regenerate reference on each release.
Quarto reads MyST-flavored markdown natively; migration cost for the existing content was ~30 min.

Test plan

`make docs` builds cleanly (11/11 pages render)
CHIP-filtered reference generator runs and renders in Quarto
CI passes

Collapses the household-calculator journey into one obvious call:

    import policyengine as pe

    result = pe.us.calculate_household(
        people=[{"age": 35, "employment_income": 60000}],
        tax_unit={"filing_status": "SINGLE"},
        year=2026,
        reform={"gov.irs.deductions.standard.amount.SINGLE": 5000},
        extra_variables=["adjusted_gross_income"],
    )
    print(result.tax_unit.income_tax, result.tax_unit.adjusted_gross_income)

Design goal: a fresh coding session with no prior context and a 20-file browse budget reaches a correct number in two tool calls: one to `import policyengine as pe`, one for `pe.us.calculate_household(...)`.

The old surface forced an agent to pick among three entry points (`calculate_household_impact`, `managed_microsimulation`, raw `Simulation`), build a pydantic `Input` wrapper, construct a `Policy` object with `ParameterValue`s, then dig into a `list[dict[str, Any]]` to get the number. Every one of those layers is gone.

Changes:

- Populate `policyengine/__init__.py` (previously empty) with `us`, `uk`, and `Simulation` accessors.
- Add `tax_benefit_models/{us,uk}/household.py` with a kwargs-based `calculate_household` that builds a policyengine_us/uk Simulation with a situation dict and returns a dot-access HouseholdResult.
- Add `tax_benefit_models/common/` with:
  - `compile_reform(dict) -> core reform dict` (scalar or `{effective_date: value}` shapes)
  - `dispatch_extra_variables(names)`: flat list, library looks up each name's entity via `variables_by_name`
  - `EntityResult(dict)` with `__getattr__` for dot access + paste-able-fix AttributeError on unknown names
  - `HouseholdResult(dict)` with `.to_dict()` / `.write(path)`
- Add `utils/household_validation.py` that catches typo'd variable names in entity dicts with difflib close-match suggestions.
- Remove `USHouseholdInput`, `UKHouseholdInput`, `USHouseholdOutput`, `UKHouseholdOutput`, and `calculate_household_impact` from both country modules (v4 breaking).
- Each country `__init__.py` exposes `model` (the pinned `TaxBenefitModelVersion`) alongside the existing `us_latest` / `uk_latest` so agents can guess either name.
- Rewrite `tests/test_household_impact.py` (19 tests) around the new API: kwargs inputs, dot-access results, flat `extra_variables`, error messages with paste-able fixes, JSON serialization.
- Rewrite `tests/test_us_reform_application.py` around reform-dict inputs instead of `Policy(parameter_values=[...])`.
- Update `tests/fixtures/us_reform_fixtures.py` to store household fixtures as plain kwargs dicts that splat into `calculate_household(**fixture)`.

223 tests pass locally.

Downstream migration (policyengine-api-v2-alpha, the sole consumer of the 3.x surface): replace `calculate_household_impact(input, policy=p)` with `calculate_household(**input, reform=reform_dict)`. Fixture script grep of call sites suggests ~25 LOC touched. The migration guide will show the before/after.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
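
The dot-access result with a "paste-able fix" error can be sketched in a few lines. This is an illustration of the pattern the commit describes (`dict` subclass, `__getattr__`, difflib suggestions), not the code in `tax_benefit_models/common/`; the exact error wording below is an assumption.

```python
import difflib


class EntityResult(dict):
    """Dict with attribute access; unknown names raise with a suggested fix."""

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails,
        # so dict methods such as .keys() keep working.
        try:
            return self[name]
        except KeyError:
            close = difflib.get_close_matches(name, list(self), n=1)
            hint = f" Did you mean '{close[0]}'?" if close else ""
            # Hypothetical message format; the real library's wording may differ.
            raise AttributeError(
                f"'{name}' was not computed.{hint} "
                f"To request it, pass extra_variables=['{name}']."
            ) from None


tax_unit = EntityResult(income_tax=1234.0, adjusted_gross_income=60000.0)
```

With this sketch, `tax_unit.income_tax` returns the stored value, while a typo such as `tax_unit.incom_tax` raises an `AttributeError` whose message names the closest computed variable.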

The review called out five ship-blockers. This commit fixes all five plus the three footguns:

1. Entity-aware validation. Placing `filing_status` on `people` instead of `tax_unit` now raises with the correct entity and the exact kwarg-swap to make: `tax_unit={'filing_status': <value>}`.
2. Realistic docstring examples. Top-of-module examples in us/household.py and uk/household.py are now lone-parent-with-child cases that exercise every grouping decision (state_code on household, is_tax_unit_dependent on person, would_claim_child_benefit on benunit), not single-adult-no-state cases that hide them.
3. Reform-path validation. `compile_reform` now takes `model_version` and raises with a difflib close-match suggestion on unknown parameter paths, matching the validator quality on variable names.
4. Scalar reform default date. Scalar reform values previously defaulted to `date.today().isoformat()`, so a caller running a year=2026 sim mid-2026 got a mid-year effective date and a blended result. Now defaults to `{year}-01-01` (passed through from calculate_household).
5. Unexpected-kwargs catcher. UK `calculate_household(tax_unit=...)` and US `calculate_household(benunit=...)` now raise a TypeError that names the correct country-specific kwarg. Other unexpected kwargs get a difflib close-match from the allowed set.

Also added:

- `people=[]` check with an explicit error before the calc blows up inside policyengine_us.
- Tests for all new error paths (`test__variable_on_wrong_entity`, `test__empty_people`, `test__unknown_reform_path`, `test__us_kwarg_on_uk`, `test__uk_kwarg_on_us`).

151 tests pass locally across the facade + reform + regression suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
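
Points 3 and 4 above (path validation and the `{year}-01-01` default) are easy to see in a small sketch. This is not the library's `compile_reform`: the real function takes `model_version`, while this version takes a plain set of known paths, and the output shape `{path: {date: value}}` is an assumption.

```python
import difflib


def compile_reform(reform: dict, year: int, known_paths: set) -> dict:
    """Normalize a flat reform dict to {path: {effective_date: value}}."""
    compiled = {}
    for path, value in reform.items():
        if path not in known_paths:
            close = difflib.get_close_matches(path, sorted(known_paths), n=1)
            hint = f" Did you mean '{close[0]}'?" if close else ""
            raise ValueError(f"Unknown parameter path '{path}'.{hint}")
        if isinstance(value, dict):
            # Already in {effective_date: value} shape; keep the caller's dates.
            compiled[path] = {str(date): v for date, v in value.items()}
        else:
            # Scalar: take effect on 1 January of the simulation year,
            # not today's date, so the year is not split by a mid-year change.
            compiled[path] = {f"{year}-01-01": value}
    return compiled


# Stand-in for the parameter paths a real model version would expose.
KNOWN = {"gov.irs.deductions.standard.amount.SINGLE"}
```

Here `compile_reform({"gov.irs.deductions.standard.amount.SINGLE": 5000}, 2026, KNOWN)` yields a single entry dated `2026-01-01`, and a misspelled path raises instead of silently producing a no-op reform.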

Separates release-manifest + TRACE TRO emission from the core value layer. Consumers who only need Simulation / Policy / Variable / Parameter no longer transitively import h5py through scoping_strategy / constituency_impact / local_authority_impact.

File moves:

- core/release_manifest.py -> provenance/manifest.py
- core/trace_tro.py -> provenance/trace.py

New provenance/__init__.py re-exports the public surface (get_release_manifest, build_trace_tro_from_release_bundle, serialize_trace_tro, canonical_json_bytes, etc.). core/__init__.py drops the 20 provenance re-exports and keeps only value objects (Dataset, Variable, Parameter*, Policy, Dynamic, Simulation, Region, scoping strategies, TaxBenefitModel, TaxBenefitModelVersion). Explicit core -> provenance import in tax_benefit_model_version.py.

Lazy h5py:

- core/scoping_strategy.py: h5py no longer at top of module; imported inside WeightReplacementStrategy.apply() only.
- outputs/constituency_impact.py: same.
- outputs/local_authority_impact.py: same.

Internal callers migrated:

- tax_benefit_models/{us,uk}/model.py
- tax_benefit_models/{us,uk}/datasets.py
- countries/{us,uk}/regions.py
- cli.py
- results/trace_tro.py
- scripts/generate_trace_tros.py
- tests/test_{release_manifests,trace_tro,manifest_version_mismatch}.py
- docs/release-bundles.md

216 tests pass locally across the v4 surface. `from policyengine.core import Simulation` + `from policyengine.provenance import get_release_manifest` both work without h5py installed (verified by temporarily uninstalling and retrying). The full `import policyengine as pe` still pulls h5py because policyengine_us / policyengine_uk import it eagerly (upstream); that's outside our control.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
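
The "lazy h5py" change is the standard deferred-import pattern: move the import from module scope into the one method that needs it. A minimal sketch follows; the class name and its body are hypothetical and only stand in for `WeightReplacementStrategy.apply()`.

```python
class WeightReplacementSketch:
    """Importing or instantiating this class never touches h5py."""

    def __init__(self, weights_path: str):
        self.weights_path = weights_path

    def apply(self) -> list:
        # Deferred import: h5py is only required by callers that reach this line.
        import h5py

        with h5py.File(self.weights_path, "r") as f:
            # Return the top-level dataset names as a stand-in for real work.
            return list(f.keys())


strategy = WeightReplacementSketch("weights.h5")  # works without h5py installed
```

The trade-off is that a missing dependency surfaces as an `ImportError` at call time rather than at import time, which is the intended behaviour for consumers who never call `apply()`.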

Two byte-identical classes split only by British/American spelling (program_name vs programme_name). Collapsed into a single policyengine.outputs.ProgramStatistics; both country analysis helpers import it from there now. Saves ~106 LOC of duplication and removes an API-surface footgun for cross-country code.

Changes:

- Add policyengine/outputs/program_statistics.py with the unified class.
- Re-export from policyengine/outputs/__init__.py.
- Delete tax_benefit_models/us/outputs.py and tax_benefit_models/uk/outputs.py.
- us/__init__.py and uk/__init__.py re-export from policyengine.outputs.
- uk/analysis.py: rename programme_name -> program_name, programme_statistics -> program_statistics, programmes -> programs, programme_df/collection -> program_df/collection. Field on PolicyReformAnalysis also changes.

Migration for callers:

- from policyengine.tax_benefit_models.uk import ProgrammeStatistics -> from policyengine.outputs import ProgramStatistics
- stats.programme_name -> stats.program_name

205 tests pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pulls ~300 lines of shared init/save/load logic out of PolicyEngineUSLatest and PolicyEngineUKLatest into a MicrosimulationModelVersion base in tax_benefit_models.common. The base handles:

- Release-manifest fetch + installed-version warning
- Data-release certification
- Variable/parameter population from the country system
- save() / load() + output-dataset filepath convention
- _build_entity_relationships via declared group_entities

Subclasses declare country_code, package_name, group_entities, entity_variables, and implement four thin hooks (_load_system, _load_region_registry, _dataset_class, _get_runtime_data_build_metadata).

run() intentionally stays per-country: the US applies reforms at Microsimulation construction and manually copies structural columns, while the UK wraps inputs as UKSingleYearDataset and applies reforms after construction. Hiding those behind a shared skeleton would mask real divergence.

Behaviour preservation is guarded by a byte-level snapshot test (tests/test_base_extraction_snapshot.py) covering four US and four UK household cases plus a model-surface snapshot. All 391 tests pass with zero snapshot drift.
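
The shape of that extraction (shared logic in a base, thin hooks in subclasses, `run()` left per-country) can be sketched with an abstract base class. Everything here is illustrative: only one of the four hooks is shown, the "system" is a plain dict, and the class names and sample values are assumptions rather than the real `MicrosimulationModelVersion`.

```python
from abc import ABC, abstractmethod


class ModelVersionBaseSketch(ABC):
    """Shared skeleton: subclasses declare metadata and implement thin hooks."""

    country_code: str
    package_name: str
    group_entities: tuple

    def __init__(self):
        # Shared init logic lives here once instead of in each country class.
        system = self._load_system()
        self.variables = sorted(system["variables"])
        self.parameters = sorted(system["parameters"])

    @abstractmethod
    def _load_system(self) -> dict:
        """Hook: return the country system (a plain dict in this sketch)."""

    @abstractmethod
    def run(self, situation: dict) -> dict:
        """Kept per-country because reform-application order differs."""


class USSketch(ModelVersionBaseSketch):
    country_code = "us"
    package_name = "policyengine-us"
    group_entities = ("tax_unit", "spm_unit", "household")  # illustrative subset

    def _load_system(self) -> dict:
        return {
            "variables": ["income_tax", "adjusted_gross_income"],
            "parameters": ["gov.irs"],
        }

    def run(self, situation: dict) -> dict:
        # Stand-in for "apply reforms at construction" in the US path.
        return {"country": self.country_code, "people": len(situation["people"])}
```

Leaving `run()` abstract rather than templated mirrors the commit's reasoning: the two countries differ in when reforms are applied, so a shared implementation would hide that difference.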

- README and core-concepts now lead with pe.uk/pe.us entry points and pe.uk.calculate_household / pe.us.calculate_household (flat kwargs, dot-access result, dict reforms).
- economic-impact-analysis, country-models-{uk,us}, and regions-and-scoping switched from `from policyengine.tax_benefit_models...` to the top-level facade.
- Removed the "Legacy filter fields" section from regions-and-scoping now that filter_field/filter_value have been dropped (v4 breaking).
- dev.md package-layout diagram updated to mention common/ base, provenance/ subpackage, and the MicrosimulationModelVersion extraction.
- examples/household_impact_example.py rewritten against the v4 API and verified end-to-end against both UK and US models.

Unifies the v4 reform surface: the same flat {"param.path": value} / {"param.path": {date: value}} dict already accepted by pe.{uk,us}.calculate_household(reform=...) now works on population Simulation too. Dicts are compiled to Policy / Dynamic objects in a model_validator(mode="after") using tax_benefit_model_version for parameter-path validation and dataset.year for scalar effective-date defaulting.

Adds compile_reform_to_policy / compile_reform_to_dynamic helpers in tax_benefit_models.common.reform, tested directly in tests/test_dict_reforms_on_simulation.py (6 tests covering scalar defaulting, effective-date mappings, path validation, pass-through of existing Policy objects, and the "no model_version" error path). Unknown parameter paths raise with close-match suggestions (same behaviour as the household calculator) so agents don't silently get a no-op reform from a typo.

397/397 tests pass. End-to-end microsim with Simulation(policy={"gov.irs.credits.ctc.amount.base[0].amount": 3000}) produces the same -$25.5B revenue impact as the manual Policy+ParameterValue construction it replaces.

The fixture is already registered in conftest.py; pytest auto-injects it by parameter name. Importing it explicitly triggered F811.

Bumps to 4.0.0 and addresses three reviewer passes (practitioner, code-simplifier, end-to-end verification) before v4 ships:

Version / branding

- pyproject.toml: 3.6.0 -> 4.0.0
- release_manifests/{us,uk}.json: bundle_id and policyengine_version bumped to 4.0.0 so the bundle TRO URLs point at the right git tag
- test_release_manifests.py: assertion values updated

API ergonomics

- Simulation class now carries a full __doc__ with the canonical dict-reform call shape; help(pe.Simulation) used to return Pydantic boilerplate, which hid the headline v4 feature from any agent that hits help() before reading source.
- RowFilterStrategy.variable_value: Union[str, int, float]. Numeric columns (state_fips, county_fips) are now scopable; "state_code" doesn't exist on enhanced_cps_2024 so docs directed users at a column that would crash.
- pe.__all__ now exports `outputs` so a fresh agent can tab-complete from pe. to the Aggregate/ChangeAggregate family without reading source.

Docs

- README: state_code_str -> state_code (consistent with us/household.py)
- core-concepts.md: "Reform as a dict" section leads, "Reform as a Policy object" relegated to the escape-hatch appendix
- economic-impact-analysis.md: both US and UK examples collapsed to single-line reform dicts (was 20 lines each of Parameter/ParameterValue boilerplate)
- country-models-{us,uk}.md: "Common policy reforms" sections rewritten as one-liners (lost ~130 lines of deprecated-ceremony boilerplate)
- regions-and-scoping.md: variable_name="state_code" (broken) -> variable_name="state_fips", variable_value=6
- reform.py module docstring: document the [N].amount / [N].threshold indexed-parameter convention so agents don't hit the bracket-head trap

Code simplification (simplifier review)

- model_version.py: except (ValueError, Exception) -> except Exception
- reform.py: compile_reform_to_policy / compile_reform_to_dynamic now share a private _compile_reform_to() helper (was 25 lines of copy-paste)
- simulation.py: _compile_dict_reforms loops over (field, compiler) pairs instead of branching twice by hand
- tests/test_base_extraction_snapshot.py renamed to test_household_calculator_snapshot.py (matches what it actually pins, not the refactor that motivated it); fixture dir follows

397 tests pass. ruff clean.

New subpackage for querying PolicyEngine Variable dependency structure by AST-walking source trees. No runtime dependency on country models: the extractor is pure static analysis, so it works on any `policyengine-us` / `policyengine-uk` checkout (or fork) regardless of whether the jurisdiction is installed.

Motivation: grep answers "who mentions this symbol" but not "what is the dataflow DAG." PolicyEngine variables form deep dependency chains (state EITC depends on federal EITC depends on earned income, etc.). A graph-shaped API beats grep for refactor-impact checks, docs generation, and code-introspection queries from policyengine-claude (once migrated to call policyengine.py as the programmatic contract).

Public API (from policyengine.graph):

- extract_from_path(path) -> VariableGraph
- VariableGraph.deps(var): direct dependencies
- VariableGraph.impact(var): transitive downstream
- VariableGraph.path(src, dst): shortest dependency chain

Reference patterns recognized in v1:

1. <entity>("<var>", <period>): person/tax_unit/spm_unit/household/family/marital_unit/benunit direct call with string literal arg
2. add(<entity>, <period>, ["v1", "v2"]): sum helper; each string in the list becomes an edge (also handles `aggr`)

Tracked for v2: parameter edges (parameters(period).gov...), entity.sum("var") method calls, dynamic variable names (string concatenation / f-strings).

Real-data smoke: indexes policyengine-us (4,577 variables) in 0.8s. impact("adjusted_gross_income") returns 504 transitively-dependent variables; direct deps are exactly {irs_gross_income, above_the_line_deductions, basic_income} matching the formula body.

Tests: tests/test_graph/test_extractor.py (9 pass). The tests load the graph submodule via importlib rather than going through `policyengine/__init__.py`, because the full package init eagerly imports country models which can fail in dev environments with missing release manifests. The graph subpackage is dep-light (stdlib + networkx) so this workaround is both clean and well-motivated.

Limitations noted in the module docstrings:

- Parameter references not yet captured (v2).
- Dynamic variable names skipped (low prevalence).
- entity.sum("var") method calls not yet recognized (v2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
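
The two v1 reference patterns can be recognized with the standard-library `ast` module alone. The sketch below is a simplified stand-in for the extractor, not the `policyengine.graph` implementation: it returns a plain dict instead of a `VariableGraph`, treats every top-level class as a variable, and the function names are hypothetical.

```python
import ast
from collections import defaultdict

ENTITY_NAMES = {
    "person", "tax_unit", "spm_unit", "household",
    "family", "marital_unit", "benunit",
}
SUM_HELPERS = {"add", "aggr"}


def _is_str(node) -> bool:
    return isinstance(node, ast.Constant) and isinstance(node.value, str)


def extract_edges(source: str) -> dict:
    """Map each class name to the variable names its body references."""
    deps = defaultdict(set)
    tree = ast.parse(source)  # parses only; nothing in `source` is executed
    for cls in (n for n in tree.body if isinstance(n, ast.ClassDef)):
        for call in (n for n in ast.walk(cls) if isinstance(n, ast.Call)):
            if not isinstance(call.func, ast.Name):
                continue
            fn, args = call.func.id, call.args
            # Pattern 1: <entity>("<var>", <period>)
            if fn in ENTITY_NAMES and args and _is_str(args[0]):
                deps[cls.name].add(args[0].value)
            # Pattern 2: add(<entity>, <period>, ["v1", "v2"])
            elif fn in SUM_HELPERS and len(args) >= 3 and isinstance(args[2], ast.List):
                deps[cls.name].update(e.value for e in args[2].elts if _is_str(e))
    return dict(deps)


# Made-up source text in the style of a country-model variable file.
SAMPLE = '''
class adjusted_gross_income(Variable):
    def formula(tax_unit, period, parameters):
        gross = tax_unit("irs_gross_income", period)
        ald = tax_unit("above_the_line_deductions", period)
        return gross - ald

class household_benefits(Variable):
    def formula(household, period, parameters):
        return add(household, period, ["snap", "ssi"])
'''

edges = extract_edges(SAMPLE)
```

Because the source is only parsed, `Variable` and `add` never need to be importable, which is the property that lets the extractor run on a checkout without the country package installed.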


- Replace docs/myst.yml with docs/_quarto.yml (same navigation layout).
- Convert {literalinclude} directives to Quarto include syntax in examples.md.
- Update Makefile docs/docs-serve targets to invoke quarto.
- Update CI workflow to install Quarto instead of mystmd and run quarto render.
- Add docs-generate-reference Makefile target wiring the new auto-generator.
- Drop docs/_build; use docs/_site/ and docs/_freeze/ for Quarto artifacts.

Content: all 11 existing MyST pages render cleanly under Quarto without prose changes (CommonMark overlap is near-total). The generator script from the earlier prototype is included but does not run automatically during the PR check.

Replaces the ported MyST docs with a fresh user-journey-oriented set written against the actual policyengine 4.0.0 API surface:

- index.md: one-pager with install, minimal example, 'where to go' table
- getting-started.md (new): install, first household, first reform
- households.md (new): calculate_household reference, entities, reforms, results shape; replaces the households portion of the old 731-line core-concepts.md
- reforms.md (new): parametric + time-varying + structural reforms, combining them, validation
- microsim.md (new): Simulation, ensure_datasets, managed_microsimulation, memory/performance notes
- impact-analysis.md: economic_impact_analysis walkthrough; replaces the longer economic-impact-analysis.md
- outputs.md (new): one section per output type with usage examples (Aggregate, ChangeAggregate, DecileImpact, IntraDecileImpact, Poverty, Inequality, geographic impacts, ProgramStatistics, OutputCollection)
- regions.md: US states/districts and UK constituencies/local authorities
- countries.md: US vs UK entity differences, default datasets, poverty measures, programs; replaces the long country-models-uk/us pair
- visualisation.md: to_plotly helpers, household curves, palette
- release-bundles.md: provenance manifests and content addressing
- examples.md: runnable examples list with Quarto include directives
- dev.md: setup, tests, docs, CI, architecture

Deleted: core-concepts.md (731 lines -> split across new pages), advanced-outputs.md (replaced by outputs.md), regions-and-scoping.md (-> regions.md), country-models-{uk,us}.md (-> countries.md), economic-impact-analysis.md (-> impact-analysis.md).

Total went from 2986 lines to 847 lines with better coverage. Navigation in _quarto.yml reorganized around user journey.

MaxGhenis · 2026-04-20T13:17:30Z

Quick take — I'd like to see this split before landing.

What's actually in the PR (pre-rebase)

Four distinct commits:

3216920 — new policyengine.graph module: AST-based variable dependency extractor (no runtime import of country code)
18cd8e2 — reference-generator prototype using the graph
7a421da — MyST → Quarto toolchain swap (~5 files)
7abae0a — full docs rewrite (2986 → 847 lines, user-journey split)

Strong positives

policyengine.graph is genuinely useful. Static AST means no install needed, which matters for agent sessions. Solves the "grep shows mentions, but what's the dataflow DAG" gap.
Docs rewrite direction is right. The user-journey split (getting-started → households → reforms → microsim → outputs) is better for fresh-session discoverability than the old 731-line core-concepts.md.
Quarto swap is defensible — native multi-output (HTML + PDF from one source), better figure handling.
Reference generator correctly scaffolded — prototype, generated pages not checked in.

Concerns

Too bundled. Four independent concerns in one PR. Review cost scales with coupling; each could stand alone:
- (A) Quarto toolchain swap
- (B) policyengine.graph + reference generator (depends on A if you want Quarto-native output)
- (C) Docs rewrite (depends on A, independent of B)
Suggest splitting so A lands first (small, mechanical), then B and C can review in parallel.
"Rewrite from scratch" may throw away the work from #295 ([v4] Refresh docs + example for agent-first surface), which we just shipped this week (docs refreshed for v4). Need to confirm the new content isn't re-deriving things we already fixed (e.g., dict-reform-first examples, state_code not state_code_str, [N].amount indexing convention). Worth diffing commit 7abae0a against current docs/ to make sure the wins from #295 carried over.
Timing. Branched pre-rebase so the 95-file diff is inflated. After #298 (Launch v4: agent-first surface, shared microsim base, dict reforms, refreshed docs) merges to main and this rebases, the real diff should be ~40 files and much easier to review.
src/policyengine/cli.py snuck in. Not mentioned in any of the 4 commit messages — looks like it got bundled accidentally. Verify intent before shipping; CLI is its own discussion (argparse vs typer, entry point registration, etc.).
Quarto install in CI. quarto render pulls the ~120MB Quarto binary. First-run cost is real. MyST was npm-light. Maybe fine, maybe not — depends on how often docs jobs run. Worth measuring a cold CI run before accepting the trade.
Graph module as PyPI surface. Should it be an optional extra (policyengine[graph])? It's useful but not core to tax-benefit microsim. Keeps the base install lean.

Recommendation

Don't merge as-is. Split into 2-3 PRs, rebase after #298. I'd review A (Quarto swap) in ~15 min, B (graph + generator) in ~45 min, C (docs rewrite) in ~1 hour — bundled, needs a full afternoon and approval confidence is lower.

MaxGhenis · 2026-04-20T13:18:38Z

All six points valid. Plan:

Close this PR. Confirmed the branch was based off a stale local main. That's why unrelated v4 work (cli.py, extra changelog fragments, a half-edited core-concepts.md that #295 already rewrote) showed up in the diff. Not salvageable by rebase alone; cleaner to restart.
After #298 (Launch v4: agent-first surface, shared microsim base, dict reforms, refreshed docs) merges, open three separate PRs stacked on current main:
- PR A: Quarto toolchain swap. Just _quarto.yml replacing myst.yml, the Makefile change, the CI workflow change, the {literalinclude} → {.python include=} conversion in examples.md. ~5 files. Should review in ~15 min.
- PR B: policyengine.graph + reference generator. Graph module as an optional extra (policyengine[graph] per your suggestion) so base install stays lean. Generator script under docs/_generator/ emits .qmd pages.
- PR C: Docs rewrite. Starts from the #298 docs (not the pre-#295 versions), diffs against my user-journey rewrite, keeps #295's dict-reform-first examples / state_code naming / [N].amount indexing conventions. Essentially: use my new navigation + page structure, but port over the specific factual updates from #295.
Measure Quarto CI cost before PR A lands — will add timing numbers to the PR description.
For PR C, once #298 is in I'll diff docs/ as-merged against my rewrites file by file and list specifically what's preserved, what's restructured, and what's new. That way the review is "did the rewrite keep all the v4 correctness wins," not "read 2000 lines of new prose."

Closing this and will reopen as A/B/C once #298 merges.

MaxGhenis and others added 16 commits April 19, 2026 16:53

Snapshot test: freeze US/UK household+model outputs pre-base-extraction

cce3d29

Add changelog fragment for MicrosimulationModelVersion extraction

33c9e44

Drop quotes from inline Simulation type annotations (ruff UP037)

af8b9ea

ruff format + drop redundant us_test_dataset import

9d2609b


Prototype: auto-generate reference .qmd pages from country model metadata

18cd8e2

MaxGhenis closed this Apr 20, 2026

MaxGhenis deleted the prototype-auto-reference branch April 20, 2026 13:18

This was referenced Apr 20, 2026

Swap docs toolchain from MyST to Quarto #304

Merged

Add policyengine.graph and reference-generator prototype #306

Merged

MaxGhenis wants to merge 16 commits into main from prototype-auto-reference
