feat: implement Phase-4 per-element decisions ledger #13
Open
abimaelmartell wants to merge 1 commit into
The decisions ledger was a stub: ExtractResult.decisions was always None and output_decisions was a no-op. Implement it so the offline rule-learner can mine which boilerplate containers the extractor dropped (its extractWithCandidates path needs exactly this).

When output_decisions is set, the result carries the kept main container followed by a Decision per boilerplate block post-clean dropped, each with a CSS-selector-shaped signature (tag + sorted classes + #id), the element's text share, and a keep/drop confidence. Off by default, with zero cost on the normal path (the ledger Vec is only built when requested).

Plumbed through post_clean (where drops happen) -> CleanedRoot -> lib.rs, exported Decision from the crate root, and surfaced on the NAPI ExtractResult. Adds Rust + NAPI tests; the selector format matches the learner's CandidateSignature so it can use it directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
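As a rough illustration of the signature shape described above (tag + sorted classes + #id), here is a minimal sketch. The free function `selector` and its parameters are hypothetical stand-ins, not the PR's actual `Element::selector()`; de-duplicating classes and skipping an empty id are assumptions made here to keep the output deterministic.

```rust
/// Hypothetical stand-in for the PR's `Element::selector()`.
/// Builds `tag.class1.class2#id` with classes sorted lexicographically.
fn selector(tag: &str, classes: &[&str], id: Option<&str>) -> String {
    // Sort so the signature does not depend on attribute order in the HTML.
    let mut sorted: Vec<&str> = classes.to_vec();
    sorted.sort_unstable();
    // Assumption: repeated class names collapse to one entry.
    sorted.dedup();

    let mut out = String::from(tag);
    for class in sorted {
        out.push('.');
        out.push_str(class);
    }
    // Assumption: an empty id attribute is treated as absent.
    if let Some(id) = id.filter(|s| !s.is_empty()) {
        out.push('#');
        out.push_str(id);
    }
    out
}

fn main() {
    // prints "div.ad.sidebar#promo"
    println!("{}", selector("div", &["sidebar", "ad"], Some("promo")));
    // prints "nav"
    println!("{}", selector("nav", &[], None));
}
```

Sorting the classes is what lets two elements written as `class="sidebar ad"` and `class="ad sidebar"` map to the same signature, which is presumably why the learner can match on it directly.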
Summary
The decisions ledger was a stub: `ExtractResult.decisions` was hardcoded `None` and `output_decisions` did nothing, so a public option silently lied. This implements it.

When `output_decisions` (NAPI: `outputDecisions`) is set, the result carries:

- the kept main container (`kept: true`), then
- a `Decision` per boilerplate block post-clean dropped (`kept: false`),

each with a CSS-selector-shaped `selector` (`tag` + sorted `.classes` + `#id`), the element's text `score` (share of the kept subtree's text), and a keep/drop `confidence`.

Off by default, with zero cost on the normal path (the ledger `Vec` is only built when requested).

Why
The offline rule-learner needs to know which containers the extractor dropped to mine per-domain boilerplate signatures; its `extractWithCandidates` path needs exactly this and was blocked on it. The `selector` shape matches the learner's `CandidateSignature.signature`, so it can consume the ledger directly and derive the text/HTML/n-gram fields from the raw HTML it already holds. The ledger is also generally useful as "why did the extractor drop this?" telemetry.

Scope
This is the minimal ledger: it records the container-level keep/drop decisions where boilerplate stripping actually happens (`post_clean`), not every node, and doesn't reconstruct sample text / outer HTML in Rust (the consumer already has the raw HTML).

Implementation
- `post_clean` records a `Decision` per drop into `CleanedRoot.decisions` (guarded by `output_decisions`).
- `lib.rs` prepends the kept-root anchor and sets `result.decisions`.
- `Element::selector()` builds the deterministic signature.
- `Decision` exported from the crate root; surfaced on the NAPI `ExtractResult`.

Test plan
- `cargo test -p html-extractor`: 33 unit + 1 golden + 10 integration (2 new: ledger on/off) + 1 doctest pass
- `node --test`: 11 NAPI tests (1 new: `outputDecisions` ledger + default-off)
- `cargo fmt --check` + `cargo clippy --workspace --all-targets -- -D warnings` clean

🤖 Generated with Claude Code
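For readers who want a concrete picture of the ledger layout described in this PR (kept root first, then one entry per dropped block, built only when requested), the following is a minimal sketch rather than the actual implementation. The field types, the `Dropped` input struct, the `build_ledger` function, and the choice to give the kept root a confidence of 1.0 are all assumptions for illustration; only the field names `selector`, `score`, `confidence`, and `kept` come from the description.

```rust
/// Field names follow the PR description; the types are assumed.
#[derive(Debug, Clone, PartialEq)]
struct Decision {
    selector: String,
    score: f64,      // share of the kept subtree's text
    confidence: f64, // keep/drop confidence
    kept: bool,
}

/// Hypothetical record of a block that cleaning removed.
struct Dropped {
    selector: String,
    text_len: usize,
    confidence: f64,
}

/// Hypothetical assembly step: returns `None` unless the ledger was requested,
/// so the normal path allocates nothing.
fn build_ledger(
    output_decisions: bool,
    root_selector: &str,
    root_text_len: usize,
    dropped: &[Dropped],
) -> Option<Vec<Decision>> {
    if !output_decisions {
        return None;
    }
    // Text share relative to the kept subtree; guard against division by zero.
    let share = |len: usize| {
        if root_text_len == 0 {
            0.0
        } else {
            len as f64 / root_text_len as f64
        }
    };

    let mut ledger = Vec::with_capacity(dropped.len() + 1);
    // The kept root goes first as the anchor entry.
    ledger.push(Decision {
        selector: root_selector.to_string(),
        score: share(root_text_len),
        confidence: 1.0, // assumption: the sketch treats the kept root as certain
        kept: true,
    });
    // One entry per dropped boilerplate block, in drop order.
    for d in dropped {
        ledger.push(Decision {
            selector: d.selector.clone(),
            score: share(d.text_len),
            confidence: d.confidence,
            kept: false,
        });
    }
    Some(ledger)
}

fn main() {
    let dropped = vec![Dropped {
        selector: "div.ad.sidebar".to_string(),
        text_len: 25,
        confidence: 0.9,
    }];
    if let Some(ledger) = build_ledger(true, "article#main", 100, &dropped) {
        for d in &ledger {
            println!("{} kept={} score={} conf={}", d.selector, d.kept, d.score, d.confidence);
        }
    }
}
```

Returning `Option<Vec<Decision>>` mirrors the default-off behaviour: callers that never set the flag see `None` and pay no allocation cost.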