Incremental sink reads: per-row ingest-seq watermark for forward + blob sinks (LLP 0039/0040) by philcunliffe · Pull Request #159 · hyparam/hypaware

philcunliffe · 2026-06-26T00:47:17Z

Implements the incremental sink reads change set — sinks read only rows added since their last successful export instead of re-reading the whole partition every tick.

Request: LLP 0039 (escalated from central forward sink has no cursor — re-reads & re-sends the whole dataset every tick #122) · Design: LLP 0040 · Plan: LLP 0042
T1 stamp a monotonic _hyp_ingest_seq at the decorateRow write chokepoint (crash-safe allocator)
T2 extend storage.readRows with cursor-aware since/continuation + null-seq migration
T3 persist a per-(sink instance, partition) watermark keyed by logical partition path
T4 wire the central forward sink to incremental read
T5 wire the core blob sink (local-fs + s3)
T6 exactly-once tests across retention prune + compaction generation swap

Each task landed as its own verified `--no-ff` merge with green CI. The server idempotency ledger is retained as the safety net for in-flight retries.
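The incremental read described above can be pictured as a per-tick loop that reads past a stored watermark, exports, and only then advances the watermark. The following is a minimal sketch under stated assumptions: `runSinkTick`, `exportBatch`, the two-argument `readRowsSince`, and the Map-backed watermark store are hypothetical stand-ins and not the PR's actual API. Only the `_hyp_ingest_seq` column name and the `source=` partition naming come from the description.

```javascript
// Minimal sketch. All names here (runSinkTick, readRowsSince, exportBatch,
// the Map-backed watermark store) are illustrative assumptions, not the PR's API.

// One export tick for one (sink instance, partition) pair.
function runSinkTick({ sinkId, partitionPath, readRowsSince, exportBatch, watermarks }) {
  const key = `${sinkId}|${partitionPath}`;
  const since = watermarks.has(key) ? watermarks.get(key) : -1n;

  // Only rows stamped after the last successful export are read.
  const rows = readRowsSince(partitionPath, since);
  if (rows.length === 0) return 0;

  // If this throws, the watermark is left untouched and the same rows are
  // offered again on the next tick.
  exportBatch(rows);

  let maxSeq = since;
  for (const row of rows) {
    if (row._hyp_ingest_seq > maxSeq) maxSeq = row._hyp_ingest_seq;
  }
  watermarks.set(key, maxSeq);
  return rows.length;
}

// In-memory stand-ins for storage and the remote endpoint.
const partition = [
  { _hyp_ingest_seq: 0n, msg: 'a' },
  { _hyp_ingest_seq: 1n, msg: 'b' },
];
const sent = [];
const deps = {
  sinkId: 'forward-1',
  partitionPath: 'source=demo',
  readRowsSince: (_path, since) => partition.filter((r) => r._hyp_ingest_seq > since),
  exportBatch: (rows) => sent.push(...rows.map((r) => r.msg)),
  watermarks: new Map(),
};

const firstTick = runSinkTick(deps);               // ships a and b
partition.push({ _hyp_ingest_seq: 5n, msg: 'c' }); // gaps in seq are fine
const secondTick = runSinkTick(deps);              // ships only c
const thirdTick = runSinkTick(deps);               // nothing new to ship
```

In this sketch a crash between `exportBatch` returning and the watermark write would cause the same rows to be sent again on the next tick, which is presumably the case the retained idempotency ledger is there to absorb.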

Change-Set: incremental-sink-reads

Cover LLP 0039 with a neutral-minted design for a per-(sink, partition) watermark so the central forward sink and the core blob sink read and ship only rows added since their last successful export. Recommends a monotonic per-row _hyp_ingest_seq column over snapshot ancestry (which does not survive a compaction generation swap) and over a content-addressed seen-set (which cannot meet the bounded-read goal). Specifies the readRows since/continuation extension, the persisted watermark contract keyed by the generation-stable logical partition path, application to both sinks, and the exactly-once argument across retention prunes and compaction swaps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
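To illustrate why a row-resident sequence number keyed by the logical partition path survives both a retention prune and a compaction generation swap, here is a minimal sketch. The signature of `readRowsSince` is not shown in this description, so the shape used below (`{ since, limit }` in, `{ rows, continuation, done }` out) is an assumption, as is modelling a partition as an array of data files. The sketch also assumes every row already carries a seq, so it does not cover the null-seq migration.

```javascript
// Minimal sketch. readRowsSince's real signature is not shown in the source;
// this shape ({ since, limit } in, { rows, continuation, done } out) is assumed.
function readRowsSince(files, { since = -1n, limit = 100 } = {}) {
  const bySeq = (a, b) =>
    a._hyp_ingest_seq < b._hyp_ingest_seq ? -1 : a._hyp_ingest_seq > b._hyp_ingest_seq ? 1 : 0;

  // A real reader would prune files by column statistics; this one scans everything.
  const fresh = files.flat().filter((r) => r._hyp_ingest_seq > since).sort(bySeq);
  const rows = fresh.slice(0, limit);
  const continuation = rows.length > 0 ? rows[rows.length - 1]._hyp_ingest_seq : since;
  return { rows, continuation, done: fresh.length <= limit };
}

const row = (seq) => ({ _hyp_ingest_seq: seq });

// Generation 1 of the partition: two small data files.
const gen1 = [[row(0n), row(1n)], [row(2n)]];

// The watermark is keyed by the logical partition path, not by file or snapshot.
const watermarks = new Map();
const page1 = readRowsSince(gen1, { limit: 2 });
watermarks.set('sink-a|source=demo', page1.continuation);

// Retention prunes seq 0, compaction rewrites the rest into one file
// (in a different physical order), and a new row with seq 3 arrives.
const gen2 = [[row(3n), row(1n), row(2n)]];
const page2 = readRowsSince(gen2, { since: watermarks.get('sink-a|source=demo') });
```

Because the comparison is on the value stored in each row, the second read returns only the rows with seq 2 and 3 even though no file from generation 1 still exists.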

Refine LLP 0040 into six small, independently-mergeable tasks along the producer -> read-API -> persistence -> consumer seam:

T1 stamp _hyp_ingest_seq at the decorateRow chokepoint (deps: [])
T2 readRows since/continuation + readRowsSince (deps: T1)
T3 per-(sink,partition) watermark store keyed by logical path (deps: T2)
T4 wire the central forward sink (deps: T2,T3)
T5 wire the core blob sink (deps: T2,T3)
T6 exactly-once tests across retention prune + compaction swap (deps: T4,T5)

Verified with `neutral ready incremental-sink-reads --json`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… chokepoint

Adds the row-resident, append-monotonic int64 watermark column that the incremental-sink-reads design (LLP 0040, Candidate B) is built on. This is the producer half of the seam: nothing reads the column yet, and it is stripped by INTERNAL_FIELDS from every existing readRows consumer, so it merges with zero behavioural change.

- New `createIngestSeqAllocator` (src/core/cache/ingest-seq.js): a crash-safe, never-regressing monotonic int64 allocator. Reserve-before-stamp: a block of seqs is durably persisted (nextSeq advanced via atomic write-rename) before any seq in it is handed to a row, so a resumed flush never re-issues a seq <= one already stamped/exported. Gaps are tolerated; regressions are not. The counter is cache-global (<cacheRoot>/_hyp_ingest_seq.json), not a per-partition cursor.json, because decorateRow runs before rows are grouped into source= partitions and two spool paths (live + backfill) can feed one partition, so only a cache-wide counter keeps every partition's seq subsequence strictly increasing. (LLP 0040 §7 records this refinement of risk #2.)
- streaming-reader.js: decorateRow stamps `_hyp_ingest_seq` (the cache_row_id hash is still computed over the original row, so seq does not perturb dedup); the chunk's columns gain the additive nullable INT64 column so it lands in the Iceberg schema and rides a compaction generation swap verbatim; the field joins INTERNAL_FIELDS.
- spool.js wires one cache-global allocator into the flush loop.
- Tests: allocator monotonicity / never-regress-across-restart / reserve-before-stamp / concurrency; streamFlushFile stamping; and a storage round-trip proving the seq persists in Iceberg, increases per row, and is stripped from readRows. Verified separately that the column survives a real compaction swap.

Task-Id: T1
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
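The reserve-before-stamp idea described in this commit can be sketched as follows. This is not the code in src/core/cache/ingest-seq.js: the function name `createBlockAllocator`, the `blockSize` parameter, the temp-file naming, and the exact JSON layout are assumptions made for illustration, and the sketch only covers a single process, with no handling of concurrent allocators.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical sketch: function name, block size, and file layout are assumptions.
function createBlockAllocator(counterFile, blockSize = 1024n) {
  let next = 0n;     // next seq to hand out
  let reserved = 0n; // exclusive upper bound of the durably reserved block

  // Resume at the persisted bound. Every seq below it may already have been
  // stamped by a previous process, so none of them is handed out again.
  if (fs.existsSync(counterFile)) {
    const saved = JSON.parse(fs.readFileSync(counterFile, 'utf8'));
    next = BigInt(saved.nextSeq);
    reserved = next;
  }

  function reserveBlock() {
    const newReserved = reserved + blockSize;
    const tmp = `${counterFile}.${process.pid}.tmp`;
    // Write to a temp file, fsync it, then rename over the target so a reader
    // sees either the old bound or the new one, never a torn write.
    const fd = fs.openSync(tmp, 'w');
    try {
      fs.writeSync(fd, JSON.stringify({ nextSeq: newReserved.toString() }));
      fs.fsyncSync(fd);
    } finally {
      fs.closeSync(fd);
    }
    fs.renameSync(tmp, counterFile);
    reserved = newReserved;
  }

  return {
    next() {
      // Persist the new bound before any seq inside the block is handed out.
      if (next >= reserved) reserveBlock();
      return next++;
    },
  };
}
```

A restart after handing out only part of a block skips the unused remainder, which leaves a gap but never a regression. The trade-off in choosing a block size is between the number of durable writes and the size of the gap left after a crash.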