
tune: lower coalesce/settle step 40 → 10 ms#674

Merged
therealaleph merged 1 commit into therealaleph:main from yyoyoian-pixel:tune/coalesce-step-10ms-v2
May 3, 2026

Conversation

@yyoyoian-pixel
Contributor

Summary

Lower the adaptive coalesce step (client) and straggler settle step (tunnel-node) from 40 ms to 10 ms. Raise the tunnel-node settle max from 500 ms to 1000 ms.

Rationale

The batch pipeline has two phases where data can accumulate:

  1. Client coalesce — ops queue up before firing a batch to Apps Script
  2. Tunnel-node settle — upstream TCP replies trickle in before the batch response is sent back

The 40 ms step was conservative: it gave each phase a wide window to pack more data per batch, saving Apps Script round-trips. But this window is wasted on downloads — when the client is just waiting for data and has nothing new to send, each 40 ms step is pure dead air before the batch fires.
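The size of that dead-air saving can be put in back-of-envelope terms. The sketch below is illustrative Python; the round-trip figure is an assumption, not a measurement from this project:

```python
# Back-of-envelope model of download-path latency per batch.
# On a download-only batch nothing new arrives on the client side,
# so exactly one idle coalesce step elapses before the batch fires.

def batch_latency_ms(rtt_ms: int, step_ms: int) -> int:
    """Per-batch latency: upstream round-trip plus one idle coalesce step."""
    return rtt_ms + step_ms

RTT_MS = 300  # assumed Apps Script round-trip; varies in practice

old = batch_latency_ms(RTT_MS, step_ms=40)
new = batch_latency_ms(RTT_MS, step_ms=10)
assert old - new == 30  # ~30 ms less dead air on every download batch
```

The round-trip dominates either way, but the 30 ms step reduction is pure saving because it was never buying extra packing on the download path.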

The fix is asymmetric by design:

  • Step: 40 → 10 ms — When there's nothing else to pack, fire almost immediately. The 10 ms still catches ops that land in the same event-loop tick (e.g. a browser opening 6 parallel connections on page load), so we don't degenerate into single-op batches on a burst.

  • Max (client): stays at 1000 ms — When both sides do have data (uploads, bursty page loads), the adaptive reset keeps packing: each arriving op resets the 10 ms step timer, so a rapid burst naturally coalesces up to the 1 s cap. This saves quota by packing many ops into fewer round-trips.

  • Settle max (tunnel-node): 500 → 1000 ms — More room to pack straggler upstream replies when targets respond at different speeds. One slow CDN shouldn't force a premature flush that wastes a whole Apps Script call for the late reply.

In short: don't wait when there's nothing to wait for; batch aggressively when there is.
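The adaptive reset described above can be sketched as a small timing model. This is hypothetical Python, not the project's actual code; the constant names mirror the settings in the table below but are assumptions:

```python
# Sketch of the adaptive coalesce window: each arriving op resets the
# step timer, and the max cap bounds the total wait. Times are in ms
# relative to the moment the first op could have fired.

STEP_MS = 10    # new step (was 40)
MAX_MS = 1000   # unchanged hard cap

def coalesce_fire_time(arrivals_ms):
    """Return when the batch fires, given op arrival times (ms, sorted).
    Fires STEP_MS after the last arrival that lands inside the window,
    or at MAX_MS, whichever comes first."""
    deadline = STEP_MS
    for t in arrivals_ms:
        if t >= min(deadline, MAX_MS):
            break                    # batch already fired before this op
        deadline = t + STEP_MS       # adaptive reset on each arrival
    return min(deadline, MAX_MS)

# Idle download path: nothing queued, fire after one 10 ms step.
assert coalesce_fire_time([]) == 10
# Burst: 6 ops landing 1 ms apart keep resetting the step timer.
assert coalesce_fire_time([1, 2, 3, 4, 5, 6]) == 16
# A sustained flood of ops every 5 ms runs into the 1 s hard cap.
assert coalesce_fire_time(range(0, 2000, 5)) == 1000
```

The three cases mirror the asymmetric design: an empty queue fires almost immediately, a short burst coalesces into one batch, and a sustained stream is bounded by the cap rather than degenerating into many tiny batches.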

Changes

| Component | Setting | Old | New |
| --- | --- | --- | --- |
| Client | `coalesce_step_ms` | 40 ms | 10 ms |
| Client | `coalesce_max_ms` | 1000 ms | 1000 ms (unchanged) |
| Tunnel-node | `STRAGGLER_SETTLE_STEP` | 40 ms | 10 ms |
| Tunnel-node | `STRAGGLER_SETTLE_MAX` | 500 ms | 1000 ms |

Backwards compatibility

Users with "coalesce_step_ms": 40 in config.json keep the old behaviour — the compiled default only applies when the field is absent or 0.
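A minimal sketch of that resolution rule (the field name matches config.json; the helper function itself is assumed for illustration, not the project's code):

```python
# Hedged sketch: explicit non-zero config values win; the compiled
# default applies only when the field is absent or 0.

COMPILED_DEFAULT_STEP_MS = 10  # the new default shipped in this PR

def effective_step_ms(config: dict) -> int:
    """Resolve coalesce_step_ms from a parsed config.json dict."""
    value = config.get("coalesce_step_ms", 0)
    return value if value else COMPILED_DEFAULT_STEP_MS

assert effective_step_ms({}) == 10                        # absent -> new default
assert effective_step_ms({"coalesce_step_ms": 0}) == 10   # 0 -> new default
assert effective_step_ms({"coalesce_step_ms": 40}) == 40  # explicit 40 preserved
```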

Test plan

  • Full tunnel: load a heavy page, confirm downloads feel snappier (~30 ms less per batch)
  • Upload a large file via full tunnel, check logs for batch sizes (should still coalesce well)
  • Long-running session (Telegram, WebSocket), verify no regressions in push latency
  • Android: fresh install picks up 10 ms default; existing config with explicit 40 is preserved

🤖 Generated with Claude Code

@yyoyoian-pixel force-pushed the tune/coalesce-step-10ms-v2 branch from 3d89577 to 72ff23f on May 3, 2026 at 11:16
…ettle max to 1 s

The batch coalesce step controls how long the client (and the
tunnel-node's straggler settle) waits between checking for more ops
to pack into the same batch.  At 40 ms the wait was conservative —
good for packing uploads but needlessly slow on the download path
where the tunnel-node round-trip, not coalescing, is the bottleneck.

Lowering the step to 10 ms means we fire batches almost immediately
when there's nothing else queued, cutting ~30 ms of dead air on
every download-dominated round-trip.  When both sides DO have data
in flight (uploads, bursty page loads), the adaptive reset still
works: each arriving op resets the 10 ms step timer, so a rapid
burst naturally coalesces up to the 1 s hard cap without wasting
quota on many small batches.

In short: don't wait when there's nothing to wait for; batch
aggressively when there is.

Client side:
  - DEFAULT_COALESCE_STEP_MS  40 → 10 ms
  - DEFAULT_COALESCE_MAX_MS   unchanged at 1000 ms

Tunnel-node side:
  - STRAGGLER_SETTLE_STEP     40 → 10 ms  (matches client step)
  - STRAGGLER_SETTLE_MAX     500 → 1000 ms (more room to pack
    straggler responses when upstream targets reply at different
    speeds — saves Apps Script quota on the return leg)

Users who prefer the old behaviour can set "coalesce_step_ms": 40
in config.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yyoyoian-pixel force-pushed the tune/coalesce-step-10ms-v2 branch from 72ff23f to 71d774c on May 3, 2026 at 11:19
@therealaleph
Owner

@yyoyoian-pixel — reasoning is sound and I tested locally (179 lib + 33 tunnel-node tests green). The asymmetric design (small step, generous max) is the right framing — fast-fire when nothing else is queued, but adaptive coalesce on bursts. Merging.

Will ship in v1.9.8 with a note that timing-sensitive deployments can override via coalesce_step_ms in config.json (the env var fallback path you preserved). Thanks.


[reply via Anthropic Claude | reviewed by @therealaleph]

@therealaleph therealaleph merged commit 994dd0b into therealaleph:main May 3, 2026
1 check passed
therealaleph added a commit that referenced this pull request May 3, 2026
…apps_script modes

Android (#666 from @ilok67 with full root cause):
- MainActivity.onStop was sending ACTION_STOP via startService() and then immediately calling stopService() on the same service. ACTION_STOP runs teardown() on a background thread that calls stopSelf() at the end, so the redundant stopService() triggered onDestroy() in parallel, racing the lifecycle and crashing on every Disconnect tap. The fix removes the stopService() call: ACTION_STOP alone covers both the live-service and the zombie-after-process-death cases. The tornDown AtomicBoolean already guards against double teardown of native state, but it could not protect against the OS-level stopSelf() vs stopService() race.

UI (#665 from @cmptrnb):
- The Test Relay button showed a red "test result: fail" status when used in full or direct mode. The underlying test_cmd::run deliberately refuses in those modes, because probing Apps Script directly while the data plane goes via tunnel-node would give a misleading result, but that refusal was being translated into a generic "test failed". The UI now checks the mode before running and shows a mode-specific explainer for full/direct, pointing users at https://whatismyipaddress.com (opened in the browser via the proxy) as the right way to verify.

Includes already-merged PR #674 from @yyoyoian-pixel: drop client coalesce_step + tunnel-node straggler settle_step from 40 ms → 10 ms, raise tunnel-node settle max from 500 ms → 1000 ms. Asymmetric tuning: fast-fire when nothing else is queued, but adaptive coalesce on bursts. Backwards compatible — existing configs with explicit `coalesce_step_ms: 40` keep old behavior.

Tests: 179 lib + 33 tunnel-node green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>