feat: implement Phase-4 per-element decisions ledger #13

Open
abimaelmartell wants to merge 1 commit into main from feat/phase4-decisions-ledger
@abimaelmartell
Member

Summary

The decisions ledger was a stub: ExtractResult.decisions was hardcoded None and output_decisions did nothing — a public option that silently lied. This implements it.

When output_decisions (NAPI: outputDecisions) is set, the result carries:

  • the kept main container (kept: true), then
  • one Decision per boilerplate block that post-clean dropped (kept: false),

each with a CSS-selector-shaped selector (tag + sorted .classes + #id), the element's text score (share of the kept subtree's text), and a keep/drop confidence.
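The selector shape described above (tag, then sorted classes, then id) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ElementInfo` is a hypothetical stand-in for the crate's element type, and lowercasing the tag and collapsing duplicate class names are assumptions.

```rust
/// Hypothetical stand-in for the crate's element type; only the parts
/// the selector needs are modelled here.
pub struct ElementInfo<'a> {
    pub tag: &'a str,
    pub classes: Vec<&'a str>,
    pub id: Option<&'a str>,
}

/// Builds `tag.class_a.class_b#id` with the classes sorted, so the same
/// element yields the same string regardless of class order in the markup.
pub fn selector(el: &ElementInfo<'_>) -> String {
    let mut classes = el.classes.clone();
    classes.sort_unstable();
    classes.dedup(); // assumption: repeated class names collapse to one

    // Assumption: tag names are normalised to lowercase.
    let mut out = el.tag.to_ascii_lowercase();
    for class in classes {
        out.push('.');
        out.push_str(class);
    }
    // An empty id attribute is treated as absent.
    if let Some(id) = el.id.filter(|id| !id.is_empty()) {
        out.push('#');
        out.push_str(id);
    }
    out
}
```

Sorting the classes is what makes the signature deterministic: `class="sidebar ad"` and `class="ad sidebar"` map to the same string, which matters if the learner groups dropped containers by signature.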

Off by default — zero cost on the normal path (the ledger Vec is only built when requested).
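One way to get the "only built when requested" behaviour is to hold the ledger as an `Option<Vec<_>>` and construct entries lazily. The sketch below is illustrative only: the `Ledger` wrapper is hypothetical, and the `score` and `confidence` field names are assumed from the description rather than taken from the crate.

```rust
/// Assumed shape of a ledger entry; the field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct Decision {
    pub selector: String,
    pub kept: bool,
    pub score: f64,
    pub confidence: f64,
}

/// Hypothetical wrapper: `None` when the option is off, so the normal
/// path never builds a Decision.
pub struct Ledger(Option<Vec<Decision>>);

impl Ledger {
    pub fn new(output_decisions: bool) -> Self {
        // `Vec::new` does not allocate until the first push.
        Ledger(output_decisions.then(Vec::new))
    }

    /// Takes a closure so the entry (including its selector string) is
    /// only constructed when the ledger is enabled.
    pub fn record(&mut self, make: impl FnOnce() -> Decision) {
        if let Some(entries) = self.0.as_mut() {
            entries.push(make());
        }
    }

    pub fn into_decisions(self) -> Option<Vec<Decision>> {
        self.0
    }
}
```

Passing a closure rather than a ready-made `Decision` keeps selector construction off the default path as well, not just the `Vec` push.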

Why

To mine per-domain boilerplate signatures, the offline rule-learner needs to know which containers the extractor dropped; its extractWithCandidates path depends on exactly this and was blocked on it. The selector shape matches the learner's CandidateSignature.signature, so it can consume the ledger directly and derive the text/HTML/n-gram fields from the raw HTML it already holds. The ledger is also generally useful as "why did the extractor drop this?" telemetry.

Scope

This is the minimal ledger: it records the container-level keep/drop decisions where boilerplate stripping actually happens (post_clean), not every node, and doesn't reconstruct sample text / outer HTML in Rust (the consumer already has the raw HTML).

Implementation

  • post_clean records a Decision per drop into CleanedRoot.decisions (guarded by output_decisions).
  • lib.rs prepends the kept-root anchor and sets result.decisions.
  • Element::selector() builds the deterministic signature.
  • Decision exported from the crate root; surfaced on the NAPI ExtractResult.
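The flow in the list above (drops recorded during post-clean, then the kept-root anchor prepended in lib.rs) might look roughly like this. It is a sketch under assumptions, not the PR's code: the `CleanedRoot` fields other than `decisions`, the `assemble_ledger` helper, and the root's `score` and `confidence` values of 1.0 are all hypothetical.

```rust
/// Assumed shape of a ledger entry; the field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct Decision {
    pub selector: String,
    pub kept: bool,
    pub score: f64,
    pub confidence: f64,
}

/// Hypothetical reduction of the post-clean output: the kept root's
/// selector plus the drops recorded along the way (None when the
/// output_decisions option is off).
pub struct CleanedRoot {
    pub root_selector: String,
    pub decisions: Option<Vec<Decision>>,
}

/// Puts the kept-root anchor first, followed by the recorded drops.
/// Returns None when no ledger was requested.
pub fn assemble_ledger(cleaned: CleanedRoot) -> Option<Vec<Decision>> {
    let drops = cleaned.decisions?;
    let mut ledger = Vec::with_capacity(drops.len() + 1);
    ledger.push(Decision {
        selector: cleaned.root_selector,
        kept: true,
        // Assumption: the root holds the whole kept subtree's text.
        score: 1.0,
        // Assumption: placeholder confidence for the kept root.
        confidence: 1.0,
    });
    ledger.extend(drops);
    Some(ledger)
}
```

A consumer can then treat the first entry as the anchor and every later entry as a dropped container.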

Test plan

  • cargo test -p html-extractor — 33 unit + 1 golden + 10 integration (2 new: ledger on/off) + 1 doctest pass
  • node --test: 11 NAPI tests (1 new: outputDecisions ledger + default-off)
  • cargo fmt --check + cargo clippy --workspace --all-targets -- -D warnings clean
  • Golden corpus unchanged (ledger is additive; default path untouched)

🤖 Generated with Claude Code

The decisions ledger was a stub: ExtractResult.decisions was always
None and output_decisions was a no-op. Implement it so the offline
rule-learner can mine which boilerplate containers the extractor
dropped (its extractWithCandidates path needs exactly this).

When output_decisions is set, the result carries the kept main
container followed by a Decision per boilerplate block post-clean
dropped, each with a CSS-selector-shaped signature (tag + sorted
classes + #id), the element's text share, and a keep/drop
confidence. Off by default — zero cost on the normal path (the
ledger Vec is only built when requested).

Plumbed through post_clean (where drops happen) -> CleanedRoot ->
lib.rs, exported Decision from the crate root, and surfaced on the
NAPI ExtractResult. Adds Rust + NAPI tests; the selector format
matches the learner's CandidateSignature so it can use it directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>