feat(cli): emit info.json and lineage_diff.json in recce init --cloud [DRC-3296] #1336

Merged
even-wei merged 3 commits into main from feature/drc-3296-a2-info-and-lineage-diff-emission on Apr 27, 2026

@even-wei
Contributor

PR checklist

  • Ensure you have added or run the appropriate tests for your PR.
  • DCO signed

What type of PR is this?

feat

What this PR does / why we need it:

Adds info.json and lineage_diff.json to the cloud-mode precompute artifacts emitted by recce init --cloud. These serve Cloud's /info and /select endpoints directly from S3 without re-computing lineage from raw dbt artifacts on every request.

Both files are derived purely from the adapter's already-loaded manifests and catalogs:

  • info.json — the artifact-sourced portion of the /info response (adapter_type + merged lineage), built via build_merged_lineage (same helper the OSS /api/info endpoint uses). Cloud-specific fields (org_id, project_id, cloud_mode, pull_request, etc.) are injected by the Cloud API at request time.
  • lineage_diff.json — the raw LineageDiff (base / current / diff) emitted by adapter.get_lineage_diff().

Both are raw JSON (no compression, per scope decision). No new dependencies; no SQL execution beyond what recce init already does.

Upload contract: info.json uploaded via info_url, lineage_diff.json via lineage_diff_url keys on get_upload_urls_by_session_id. Graceful degradation: if either key is missing in the upload URL response, we log a warning and continue (exit 0) — keeps old CLI + new Cloud (and vice-versa) compatible during the rollout.
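
A minimal sketch of the graceful-degradation rule above, not the code in recce/cli.py. Only the `info_url` / `lineage_diff_url` keys and the file names come from this PR; `plan_metadata_uploads`, `METADATA_ARTIFACTS`, and the logger name are hypothetical names chosen for illustration.

```python
import logging
from pathlib import Path
from typing import Dict, List, Optional, Tuple

logger = logging.getLogger("recce.sketch")  # hypothetical logger name

# Artifact file name -> key expected in the upload-URL response (keys from the PR text).
METADATA_ARTIFACTS = {
    "info.json": "info_url",
    "lineage_diff.json": "lineage_diff_url",
}


def plan_metadata_uploads(
    upload_urls: Dict[str, Optional[str]],
    scratch_dir: Path,
) -> Tuple[List[Tuple[Path, str]], List[str]]:
    """Return (uploads to perform, artifacts skipped because their URL key is missing)."""
    planned: List[Tuple[Path, str]] = []
    skipped: List[str] = []
    for filename, url_key in METADATA_ARTIFACTS.items():
        url = upload_urls.get(url_key)
        if not url:
            # Older Cloud backend: warn and keep going instead of failing the command.
            logger.warning("No %s in upload URLs; skipping %s", url_key, filename)
            skipped.append(filename)
            continue
        planned.append((scratch_dir / filename, url))
    return planned, skipped
```

With a response that only carries `info_url`, this plans one upload and reports `lineage_diff.json` as skipped, so the caller can still exit 0.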

Scratch dir: cloud mode uses a recce-metadata-* tempdir (distinct prefix from A1's recce-cll-*), cleaned up via shutil.rmtree on successful upload, preserved on failure for debugging.
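
The scratch-dir lifecycle described above (delete on success, keep on failure) could be sketched as a context manager. This is an illustration under stated assumptions: the `metadata_scratch` helper name is hypothetical, and only the `recce-metadata-` prefix and the use of `shutil.rmtree` come from this description.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def metadata_scratch(prefix: str = "recce-metadata-") -> Iterator[Path]:
    """Yield a temp dir; delete it on success, keep it if the body raises."""
    scratch = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield scratch
    except Exception:
        # Leave the directory in place so the emitted files can be inspected.
        raise
    else:
        shutil.rmtree(scratch, ignore_errors=True)
```

Usage would look like `with metadata_scratch() as scratch: ...`, with emit and upload both happening inside the `with` body.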

Which issue(s) this PR fixes:

Resolves DRC-3296 — https://linear.app/recce/issue/DRC-3296

Special notes for your reviewer:

  • Parallel to A1 (feat: precompute per_node.db in recce init --cloud [DRC-3295] #1334). Expect a small recce/cli.py rebase conflict in the cloud-upload block when one merges first — trivial to resolve (both PRs append upload branches side by side).
  • Cloud-side follow-up: info_url and lineage_diff_url keys need to be added to get_upload_urls_by_session_id in the cloud backend. This PR ships the CLI-side emit+upload logic with graceful missing-key handling; the cloud-side work is tracked separately and not in this PR's scope.
  • The emitter (recce/util/info_emitter.py) reuses build_merged_lineage and adapter.get_lineage_diff() so the JSON shapes stay in sync with the live-instance /api/info and /api/select responses.

Tests:

  • Unit tests (tests/test_info_emitter.py, 9 tests) — adapter-type resolution, payload shape/contract, atomic write, .tmp cleanup on failure, both-files write, parent-dir creation, pydantic round-trip.
  • CLI integration tests (tests/test_cli_info_emitter.py, 6 tests) — happy path both-URLs-present, missing info_url warn+continue, missing lineage_diff_url warn+continue, both-URLs-missing warn+no-upload, local-mode non-regression (no emit, no upload), metadata-scratch prefix distinctness + cleanup.
  • Full suite: 1093 passed / 18 deselected. Only pre-existing test_spa_route_* failures remain (SPA build missing in OSS test env, unchanged from baseline).
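
For reviewers who want the behaviors pinned by the unit tests in one place (atomic write, .tmp cleanup on failure, parent-dir creation), here is a rough sketch of that pattern. It is an assumption about the general technique, not a copy of recce/util/info_emitter.py, and `write_json_atomic` is a hypothetical name.

```python
import json
import os
from pathlib import Path
from typing import Any


def write_json_atomic(path: Path, payload: Any) -> None:
    """Write payload as JSON to path via a sibling .tmp file and an atomic rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_name(path.name + ".tmp")
    try:
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
        # os.replace swaps the file in one step, so readers never see a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial .tmp file, then let the caller see the original error.
        tmp_path.unlink(missing_ok=True)
        raise
```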

Does this PR introduce a user-facing change?:

NONE

@codecov

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 96.17978% with 17 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tests/test_cli_info_emitter.py | 96.21% | 10 Missing ⚠️ |
| recce/cli.py | 87.17% | 5 Missing ⚠️ |
| recce/util/info_emitter.py | 95.23% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| tests/test_info_emitter.py | 100.00% <100.00%> (ø) |
| recce/util/info_emitter.py | 95.23% <95.23%> (ø) |
| recce/cli.py | 66.71% <87.17%> (+0.57%) ⬆️ |
| tests/test_cli_info_emitter.py | 96.21% <96.21%> (ø) |

... and 4 files with indirect coverage changes

@even-wei even-wei self-assigned this Apr 24, 2026
DRC-3296. Adds info.json + lineage_diff.json to the cloud-mode precompute
artifacts so Cloud's /info and /select endpoints can serve responses
directly from S3 without re-computing lineage from raw dbt artifacts on
every request.

Both files are derived purely from the adapter's already-loaded manifests
and catalogs via build_merged_lineage (same helper the OSS /api/info
endpoint uses) and adapter.get_lineage_diff() — no SQL, no extra I/O,
no new dependencies.

Upload is keyed by info_url / lineage_diff_url in
get_upload_urls_by_session_id. Missing keys produce a warning and
continue (exit 0), keeping old-CLI + new-Cloud and vice-versa compatible.
Written as raw JSON (no compression, per DRC-3296 scope).

Uses a distinct recce-metadata-* tempdir prefix (distinct from A1's
recce-cll-*) and shutil.rmtree's it on successful upload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
@even-wei even-wei force-pushed the feature/drc-3296-a2-info-and-lineage-diff-emission branch from 0aaeea4 to 10db78a on April 24, 2026 07:29
@even-wei
Contributor Author

Code Review: PR #1336

Files reviewed: 4 (recce/util/info_emitter.py, recce/cli.py, tests/test_info_emitter.py, tests/test_cli_info_emitter.py)
Categories: Business logic, Tests
Passes run: A, C, D, E, F, G, H

Validation Results

Pass A: Correctness & Logic — PASS (minor)

  • Emitter correctly reuses build_merged_lineage (same helper used by the server /api/info endpoint at recce/server.py:597), and model_dump(exclude_none=True, by_alias=True) matches the server's serialization exactly — wire-format drift risk is contained.
  • _resolve_adapter_type preference order (manifest metadata → adapter.type() → None) is sensible; falls back cleanly when manifests are absent.
  • NodeDiff / LineageDiff contract from recce/models/types.py verified — lineage_diff.json round-trip test is valid.
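
The preference order noted for `_resolve_adapter_type` (manifest metadata, then `adapter.type()`, then None) can be illustrated roughly as follows. The attribute names `curr_manifest` and `base_manifest` are assumptions made for this sketch, and the function below is a stand-in, not the reviewed code.

```python
from typing import Any, Optional


def resolve_adapter_type(adapter: Any) -> Optional[str]:
    """Prefer manifest metadata, then adapter.type(), then None."""
    # Hypothetical attribute names; the real adapter surface may differ.
    for manifest_attr in ("curr_manifest", "base_manifest"):
        manifest = getattr(adapter, manifest_attr, None)
        metadata = getattr(manifest, "metadata", None)
        adapter_type = getattr(metadata, "adapter_type", None)
        if adapter_type:
            return adapter_type
    type_fn = getattr(adapter, "type", None)
    if callable(type_fn):
        try:
            return type_fn()
        except Exception:
            # No usable adapter: fall through to "unknown".
            return None
    return None
```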

Pass C: Cross-Reference Consistency — PASS

  • Upload-URL keys info_url / lineage_diff_url match sibling PR recce-cloud-infra#1243 (verified against api_server/apis/sessions_api.py). Contract is aligned.
  • adapter.get_lineage_diff() signature matches BaseAdapter.get_lineage_diff in recce/adapter/base.py:23.
  • Test mocks (_make_mock_adapter, _make_adapter) correctly mirror the real adapter attribute surface used by the emitter.

Pass D: Error Handling & Edge Cases — FAIL

ISSUE: recce/cli.py:709-713 combined with cli.py:884-911. A partial-emit failure silently produces a misleading success message. If emit_info_and_lineage_diff raises after writing info.json but before lineage_diff.json (e.g., disk full on the second write, or any exception during the second pydantic model_dump), both info_path and lineage_diff_path are reset to None. Later, in the upload loop, when local_path is None but metadata_upload_url is truthy, neither branch of the if/elif fires: the skip is silent and nothing is appended to upload_failures. The terminal summary then prints "Cloud upload complete." even though metadata was never uploaded. The emit-time warning is the only signal, and a user or operator scanning only the final status line will miss it. Fix options: (a) append both filenames to upload_failures inside the emit-except block, or (b) add an else arm covering "URL present but local file missing" that warns and appends.
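
To make fix option (b) concrete, here is a hedged sketch of an upload loop with an explicit "URL present but local file missing" arm. It is not the code at cli.py:884-911: `upload_metadata`, `summary`, the injected `put` callable, and the exact wording of the warning summary are all assumptions for illustration.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple

# (display name, URL key, local path or None when the emit step failed)
Artifact = Tuple[str, str, Optional[Path]]


def upload_metadata(
    artifacts: List[Artifact],
    upload_urls: Dict[str, Optional[str]],
    put: Callable[[str, Path], None],
    warn: Callable[[str], None] = print,
) -> List[str]:
    """Upload each artifact and return the names that did not make it to Cloud."""
    upload_failures: List[str] = []
    for name, url_key, local_path in artifacts:
        url = upload_urls.get(url_key)
        if not url:
            # Rollout compatibility: a missing key is a warning, not a failure.
            warn(f"{name}: no {url_key} in upload URLs, skipping")
        elif local_path is None or not local_path.exists():
            # URL present but nothing to send: record it so the summary is honest.
            warn(f"{name}: upload URL present but local file is missing")
            upload_failures.append(name)
        else:
            try:
                put(url, local_path)
            except Exception as exc:
                warn(f"{name}: upload failed ({exc})")
                upload_failures.append(name)
    return upload_failures


def summary(upload_failures: List[str]) -> str:
    """Pick the final status line from the recorded failures."""
    if upload_failures:
        return "Cloud upload completed with warnings: " + ", ".join(upload_failures)
    return "Cloud upload complete."
```

A test for the missing path can pass `None` for both local paths with both URLs present and assert that the summary is not the plain success line.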

NOTE: recce/util/info_emitter.py:92. json.dumps(..., default=str) silently coerces unexpected non-JSON-serializable types (datetime, UUID, Decimal, etc.) to their str() form. Since LineageDiff.base/current are untyped dict, anything deep in manifest_metadata could be stringified without warning. For the current contract (round-trip to LineageDiff.model_validate, which also accepts dicts-of-anything) this is harmless, but it hides future data-type regressions rather than surfacing them.
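
A small standalone illustration of the coercion described in this note, using only the standard library. The `generated_at` key is made up for the example and is not a field from the emitted payloads.

```python
import json
from datetime import datetime, timezone

payload = {"generated_at": datetime(2026, 4, 24, tzinfo=timezone.utc)}

# Lenient: the datetime is silently turned into its str() form.
lenient = json.dumps(payload, default=str)

# Strict: the same payload raises, which surfaces the unexpected type.
try:
    json.dumps(payload)
    strict_error = None
except TypeError as exc:
    strict_error = exc
```

Here `lenient` holds a plain string where the datetime used to be, while `strict_error` is a `TypeError`. That difference is the trade-off the note points at.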

Pass E: Test Coverage & Quality — PASS (one gap)

NOTE: No test exercises the partial-emit failure path (first file written, second write raises). The existing test_cleans_up_tmp_on_failure covers _write_json_atomic internals, not the two-file sequencing in emit_info_and_lineage_diff. Combined with the Pass D issue above, this is the highest-value missing test: mock a second-write failure and assert the final CLI summary reports a partial failure rather than "Cloud upload complete.".

  • Happy-path, missing-info_url, missing-lineage_diff_url, both-missing, local-mode non-regression, scratch-dir prefix + cleanup: all covered. 15/15 pass locally.
  • Round-trip test through LineageDiff.model_validate is a strong contract test — good.

Pass F: Diff-Specific Checks — PASS

  • Rebase onto A1 (feat: precompute per_node.db in recce init --cloud [DRC-3295] #1334) merged cleanly: per_node_scratch and metadata_scratch coexist in one try/finally block with independent cleanup. Scratch prefixes (recce-per-node-, recce-metadata-) do not collide. Test test_scratch_dir_has_distinct_prefix_and_is_cleaned_up pins this.
  • Control flow change is structurally sound: outer if is_cloud: wraps emit + upload under a single try/finally for cleanup, with if cloud_client: guarding the upload block and if upload_urls is not None: guarding the URL-dependent section. No code path leaks scratch dirs.
  • _UPLOAD_TIMEOUT = 600s applied consistently to all four upload calls.

Pass G: Performance — PASS

  • No SQL execution, no N+1. Single model_dump call per payload (two payloads total). File writes are atomic via sibling .tmp + replace.
  • No streaming needed at current artifact scale (KB-range JSON).

Pass H: Async/Concurrency — N/A

  • All synchronous code paths. No asyncio, no threads.

Verification Results

  • uv run pytest tests/test_info_emitter.py tests/test_cli_info_emitter.py — 15 passed / 0 failed.
  • uv run ruff check on changed files — clean.
  • uv run black --check --line-length 120 (repo's actual formatter) — clean.

Verdict: NEEDS-CHANGES

Issues

  1. recce/cli.py:709-713 + cli.py:884-911 — Partial emit failure silently produces a false "Cloud upload complete." terminal message because the None local_path path neither uploads nor appends to upload_failures. Either append on emit-except, or add an explicit missing-file branch in the upload loop.

Notes

  1. recce/util/info_emitter.py:92: default=str in json.dumps silently stringifies unexpected types; harmless today but masks future regressions in manifest_metadata typing.
  2. tests/test_cli_info_emitter.py: No test for the partial-emit failure path (info.json written, lineage_diff.json write raises). Adding this would pin the fix for Issue #1.
  3. The metadata_uploads loop at cli.py:880-911 could be extracted to a small helper (_upload_artifact(...)) — the same PUT-and-handle-response pattern repeats three times in this function and diverges only in display name, URL key, and content type. Non-blocking; just easier to maintain as Phase III adds more artifacts.
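
One possible shape for the helper suggested in Note 3, sketched with the standard library only. How the CLI actually performs the PUT is not shown in this PR page, so the use of `urllib.request`, the `opener` parameter, and the `upload_artifact` name are assumptions. The 600-second timeout is the value quoted in Pass F.

```python
import urllib.request
from pathlib import Path
from typing import Callable

_UPLOAD_TIMEOUT = 600  # seconds; the value quoted in the review


def upload_artifact(
    display_name: str,
    url: str,
    local_path: Path,
    content_type: str,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """PUT one file to an upload URL; return True on a 2xx response."""
    request = urllib.request.Request(
        url,
        data=local_path.read_bytes(),
        method="PUT",
        headers={"Content-Type": content_type},
    )
    try:
        with opener(request, timeout=_UPLOAD_TIMEOUT) as response:
            ok = 200 <= response.status < 300
    except OSError as exc:  # urllib's URLError and HTTPError both derive from OSError
        print(f"Failed to upload {display_name}: {exc}")
        return False
    if not ok:
        print(f"Failed to upload {display_name}: HTTP {response.status}")
    return ok
```

Injecting `opener` keeps the helper testable without a network: a test can pass a fake that records the request and returns a canned response.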

What I could not verify

What I looked for and did not find

  • Secrets or PII in emitted JSON: none (payload is derived from manifest/catalog structure, not query results or auth data).
  • Mock drift: all mocks match real signatures (get_upload_urls_by_session_id, get_lineage_diff, build_full_cll_map, adapter.type()).
  • Resource leaks: file handles use with open(...) in all four upload paths; scratch dirs cleaned up in finally.
  • Cross-rebase breakage with A1's per_node.db flow: coexistence is correct; per_node emit is gated on per_node_db_url presence and uses a distinct scratch dir.
  • Off-by-one / boundary issues in the emit-vs-upload gate: none — control flow is linear and correctly guarded.
  • Wire-format drift between CLI emit and server /api/info: identical model_dump(exclude_none=True, by_alias=True) path; contract is pinned by reuse of build_merged_lineage.

Contributor Author

@even-wei even-wei left a comment

Claude Code Review: 1 ISSUE + 3 NOTEs (NEEDS-CHANGES). Partial-emit failure silently reports 'Cloud upload complete.' — see review comment for details.

Addresses self-review finding on #1336 (DRC-3296) — previously, partial
failures of emit_info_and_lineage_diff silently reported "Cloud upload
complete.". When the emitter wrote info.json but then raised before
lineage_diff.json was written, both local paths were reset to None. The
upload loop's if/elif then fell through — upload_failures stayed empty
and the success banner printed, masking the real failure.

Add an explicit "URL present + local file missing" arm to the metadata
upload loop. On that path we append the artifact to upload_failures and
print a warning, so the end-of-run summary reports the partial failure.
Chose approach B (upload-loop guard) over approach A (broader try/except
around the emit) so emit exceptions stay separate from upload errors and
the emit-time warning still fires.

New test covers the partial-emit failure path: mock emit to write info.json
then raise, assert no false success banner, both artifacts listed in
upload_failures, and the "completed with warnings" summary fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
@gcko gcko self-requested a review April 27, 2026 00:58
@gcko
Contributor

gcko commented Apr 27, 2026

Updated Review — PR #1336

Summary

Re-review after a24c1cf6 fix(cli): drop misleading scratch path from emit success messages. The first nit from the prior review (misleading "Metadata saved to {scratch_dir}" path) is fixed — both the metadata and per_node.db success messages now print …emitted (size, elapsed) without referencing the temp dir that's about to be deleted in finally. Verification re-run on a24c1cf6: make flake8 clean; pytest tests/test_info_emitter.py tests/test_cli_info_emitter.py → 16 passed.

Findings

[Resolved] Misleading "Metadata saved to …" path is deleted on the way out

File: recce/cli.py:705
Status: Fixed in a24c1cf6. Message now reads Metadata emitted (info.json X.X KB, lineage_diff.json X.X KB, X.Xs). Same fix applied to the per_node.db emit message at recce/cli.py:777.

[Unaddressed — low priority] Two tempfile.mkdtemp calls when sometimes neither is used

File: recce/cli.py:690-691
Issue: Still allocates per_node_scratch and metadata_scratch unconditionally before checking which upload URLs are present. Wasted I/O on the rollout edge where Cloud hasn't shipped the new URL keys yet. Carried over from the prior review — restating for completeness, not blocking.
Suggestion: Defer mkdtemp until after get_upload_urls_by_session_id returns. Low priority — emit cost is negligible vs. the surrounding upload work.
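
For reference, the deferred allocation suggested here might look like the sketch below. It only illustrates the idea and is not proposed as a patch: `allocate_scratch_dirs`, `cleanup_scratch_dirs`, and `_SCRATCH_SPECS` are hypothetical names, while the prefixes and URL keys are the ones mentioned elsewhere in this thread.

```python
import shutil
import tempfile
from typing import Dict, Mapping, Optional

# scratch name -> (mkdtemp prefix, URL keys that make the directory necessary)
_SCRATCH_SPECS = {
    "metadata": ("recce-metadata-", ("info_url", "lineage_diff_url")),
    "per_node": ("recce-per-node-", ("per_node_db_url",)),
}


def allocate_scratch_dirs(upload_urls: Mapping[str, Optional[str]]) -> Dict[str, str]:
    """Create a temp dir only for artifact groups that have at least one upload URL."""
    scratch: Dict[str, str] = {}
    for name, (prefix, url_keys) in _SCRATCH_SPECS.items():
        if any(upload_urls.get(key) for key in url_keys):
            scratch[name] = tempfile.mkdtemp(prefix=prefix)
    return scratch


def cleanup_scratch_dirs(scratch: Mapping[str, str]) -> None:
    """Remove every directory that allocate_scratch_dirs created."""
    for path in scratch.values():
        shutil.rmtree(path, ignore_errors=True)
```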

Verdict

Approved. The actionable nit is resolved, the remaining nit is genuinely low-priority and can be deferred or skipped. Ship it.

Contributor

@gcko gcko left a comment

Claude Code Review: No critical issues found.

Address PR review feedback (gcko, Nit 1): both ``metadata_scratch`` and
``per_node_scratch`` are unconditionally cleaned up in the surrounding
``finally`` block, so printing "saved to <path>" pointed users at a
directory that no longer existed by the time the command exited. Drop
the path from both messages and keep only the load-bearing info
(file sizes + elapsed time).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
@even-wei
Contributor Author

@gcko Pushed a24c1cf addressing review nits.

Nit 1 (misleading "Metadata saved to {metadata_scratch}") — fixed. Dropped the path from the success message; kept the load-bearing info (file sizes + elapsed). Applied the same fix to the per_node.db print one block down — same pattern (per_node_scratch is also cleaned up unconditionally in finally), so left to itself it would have read inconsistently against the new metadata message.

Nit 2 (eager tempfile.mkdtemp for metadata_scratch / per_node_scratch before knowing which artifacts get uploaded) — agreed in principle, but holding off here. The cost is two empty-dir mkdtemp calls plus, in the rollout-edge case, two small JSON writes that don't get uploaded — well under any user-perceptible threshold. Restructuring the control flow to defer mkdtemp until after get_upload_urls_by_session_id returns is a non-trivial refactor that would also reshuffle the surrounding try/finally cleanup logic. Better as a follow-up once Phase III lands more artifacts and the pattern repeats.

Verification: 16/16 tests pass (test_info_emitter.py + test_cli_info_emitter.py), black/flake8 clean.

Contributor

@gcko gcko left a comment

Claude Code Review: Updated — prior nit fixed in a24c1cf, no critical issues.

@even-wei even-wei merged commit 1b950f6 into main Apr 27, 2026
22 checks passed
@even-wei even-wei deleted the feature/drc-3296-a2-info-and-lineage-diff-emission branch April 27, 2026 01:44