
fix(tunnel): batch header read honors request_timeout_secs (#1088) #1108

Merged

therealaleph merged 1 commit into therealaleph:main from dazzling-no-more:fix/full-tunnel-batch-timeout-cascade on May 13, 2026

Conversation

@dazzling-no-more
Contributor

Problem

Issue #1088: under Full mode, a single slow Apps Script edge cascade-killed every in-flight tunnel session sharing its batch. Users on 1.9.21+ saw frequent 10 s "batch timeout" errors and lost download progress on Telegram / browser sessions.

Root cause: read_http_response in domain_fronter.rs had a hardcoded 10 s header-read timeout that ran inside tunnel_batch_request_to — independent of and shorter than the outer tokio::time::timeout(batch_timeout, ...) in fire_batch. Apps Script cold starts routinely land in the 8-12 s range (PR #1040's A/B test recorded 4/30 H1 batches timing out at exactly 10 s after the H2→H1 switch), so the inner cliff was firing as a false-positive batch timeout well before request_timeout_secs (default 30 s) could.
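For illustration, a minimal runnable sketch of that nesting (names, durations, and the sleep stand-in are simplified assumptions, not the actual fire_batch / read_http_response code; assumes tokio built with the test-util feature so the clock can be paused):

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    tokio::time::pause(); // virtual clock: the sleep below auto-advances instantly

    let batch_timeout = Duration::from_secs(30); // outer: from request_timeout_secs
    let inner_header_timeout = Duration::from_secs(10); // inner: hardcoded pre-patch

    let result = timeout(batch_timeout, async {
        // Stand-in for "edge sends its first header byte after a 12 s cold start".
        timeout(inner_header_timeout, sleep(Duration::from_secs(12))).await
    })
    .await;

    // The outer budget never gets a chance to matter: the inner 10 s cliff
    // fires first and surfaces as a false-positive batch timeout.
    assert!(matches!(result, Ok(Err(_))));
}
```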

Secondary issue: even with a parameterized timeout, the per-read timeout(d, stream.read(...)) form silently extends its effective budget when a peer drip-feeds bytes spaced just under d apart. A slow edge could keep the loop alive past the outer batch_timeout and defeat the whole wiring.
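A sketch of that per-read shape (simplified from the description above, not the actual read_http_response; single-byte reads are for illustration only):

```rust
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::timeout;

// Per-read timeout shape: `d` is re-armed for every read, so a peer delivering
// one byte every `d - epsilon` never trips it, no matter how long the headers
// take in total.
async fn read_headers_per_read<S: AsyncRead + Unpin>(
    stream: &mut S,
    d: Duration,
) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    let mut byte = [0u8; 1];
    loop {
        // Fresh budget on each iteration; this is the problem.
        let n = timeout(d, stream.read(&mut byte))
            .await
            .map_err(|_| std::io::Error::from(std::io::ErrorKind::TimedOut))??;
        if n == 0 {
            break; // peer closed before end of headers
        }
        buf.extend_from_slice(&byte[..n]);
        if buf.ends_with(b"\r\n\r\n") {
            break; // end of HTTP headers
        }
    }
    Ok(buf)
}
```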

Fix

Two changes, both in domain_fronter.rs:

  1. tunnel_batch_request_to passes batch_timeout to the header read via the new read_http_response_with_header_timeout helper. Config::request_timeout_secs is now the only knob controlling how long we wait for an Apps Script edge to start responding. Other callers (relay path, exit-node) keep the historical 10 s value.

  2. Header read uses a single absolute deadline (tokio::time::timeout_at(deadline, ...)) instead of per-read timeout(). Total elapsed across all header reads is bounded by header_read_timeout, regardless of read cadence.
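A hedged sketch of the absolute-deadline shape (the real helper is read_http_response_with_header_timeout in domain_fronter.rs; the signature, error handling, and byte-at-a-time reads here are simplified assumptions):

```rust
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::{timeout_at, Instant};

// Absolute deadline: computed once, shared by every read in the header loop,
// so total elapsed is bounded by `header_read_timeout` regardless of cadence.
async fn read_headers_with_deadline<S: AsyncRead + Unpin>(
    stream: &mut S,
    header_read_timeout: Duration, // batch_timeout on the tunnel-batch path
) -> std::io::Result<Vec<u8>> {
    let deadline = Instant::now() + header_read_timeout;
    let mut buf = Vec::new();
    let mut byte = [0u8; 1];
    loop {
        // Same deadline on every iteration; drip-feeding cannot extend it.
        let n = timeout_at(deadline, stream.read(&mut byte))
            .await
            .map_err(|_| std::io::Error::from(std::io::ErrorKind::TimedOut))??;
        if n == 0 {
            break; // peer closed before end of headers
        }
        buf.extend_from_slice(&byte[..n]);
        if buf.ends_with(b"\r\n\r\n") {
            break; // end of HTTP headers
        }
    }
    Ok(buf)
}
```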

Bonus in tunnel_client.rs:

  1. TunnelMux::reply_timeout co-varies with batch_timeout: computed at construction as fronter.batch_timeout() + 5 s slack instead of the fixed 35 s const. Operators raising request_timeout_secs no longer have sessions abandon reply_rx just before fire_batch's HTTP round-trip would complete.
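Roughly, the derivation looks like this (a sketch only; the real TunnelMux carries many more fields and is built via TunnelMux::start, so the constructor shown here is hypothetical):

```rust
use std::time::Duration;

// Slack added on top of the HTTP round-trip budget before reply_rx is abandoned.
const REPLY_TIMEOUT_SLACK: Duration = Duration::from_secs(5);

struct TunnelMux {
    reply_timeout: Duration,
}

impl TunnelMux {
    // Computed once at construction from the fronter's batch timeout
    // (itself derived from request_timeout_secs) instead of a fixed 35 s const.
    fn new(batch_timeout: Duration) -> Self {
        Self {
            reply_timeout: batch_timeout + REPLY_TIMEOUT_SLACK,
        }
    }

    fn reply_timeout(&self) -> Duration {
        self.reply_timeout
    }
}

fn main() {
    // request_timeout_secs = 60 means sessions wait 65 s before giving up on reply_rx.
    let mux = TunnelMux::new(Duration::from_secs(60));
    assert_eq!(mux.reply_timeout(), Duration::from_secs(65));
}
```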

What this does NOT do

Earlier iterations of this branch added client-side batch retry (re-fire the same payload against a different deployment on timeout). That code was removed before merge: tunnel-node's drain_now mutates the per-session buffer when building a poll response, so a lost response means lost bytes — replaying an empty poll silently skips bytes (gap on the client side), and replaying a payload op double-writes (corrupts TCP / double-delivers UDP). No op shape is safe to replay without server-side ack / sequence support. Sessions whose batch times out re-poll on the next tick, same recovery surface as pre-#1088.

Changes

  • domain_fronter.rs:
    • tunnel_batch_request_to: pass batch_timeout to header read (+9 lines).
    • Add read_http_response_with_header_timeout(stream, deadline) (+25 lines).
    • Replace per-iteration timeout() with timeout_at(deadline, ...) in the header loop.
  • tunnel_client.rs:
    • Add TunnelMux::reply_timeout field, derived from batch_timeout at construction.
    • tunnel_loop reads mux.reply_timeout() instead of the REPLY_TIMEOUT const.
    • Replace REPLY_TIMEOUT: Duration const with REPLY_TIMEOUT_SLACK: Duration = 5 s.

Test coverage

Three new tests, all using tokio::time::pause() so they run in < 1 s of wall time:

  • read_http_response_respects_configured_header_timeout — response starts at virtual T=15 s; must succeed under a 30 s budget. Catches any regression to the hardcoded 10 s inner cliff.
  • read_http_response_header_deadline_is_total_not_per_read — drip-feeds 38 header bytes at 3 s/byte. Each read returns within 3 s (< 10 s per-read budget) but total exceeds 10 s. Asserts FronterError::Timeout. Catches any regression to per-iteration timeout().
  • mux_reply_timeout_tracks_batch_timeout_plus_slack — constructs a real DomainFronter with request_timeout_secs: 60 via TunnelMux::start; asserts mux.reply_timeout() == 60 s + REPLY_TIMEOUT_SLACK. Catches any hardcoded-value regression in the runtime derivation.
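For context, the paused-clock style looks roughly like this (assumes tokio's test-util feature; the test name, durations, and body below are illustrative, not the exact tests added in this PR):

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout_at, Instant};

// start_paused freezes tokio's clock; sleeps auto-advance while the runtime is
// idle, so a "15 s" delay completes in microseconds of wall time.
#[tokio::test(start_paused = true)]
async fn slow_header_start_succeeds_under_configured_budget() {
    let deadline = Instant::now() + Duration::from_secs(30); // configured budget

    let slow_edge = async {
        // Stand-in for "first header byte arrives at virtual T = 15 s".
        sleep(Duration::from_secs(15)).await;
        "HTTP/1.1 200 OK"
    };

    // Must succeed: 15 s is past the old hardcoded 10 s cliff but well
    // inside the 30 s budget.
    let res = timeout_at(deadline, slow_edge).await;
    assert_eq!(res.unwrap(), "HTTP/1.1 200 OK");
}
```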

Verification

  • cargo test: 231/231 ✅ (was 228 pre-patch + 3 new)
  • cargo check: clean ✅
  • cargo clippy -W unused: clean ✅

Interaction with PR #1040 / #1041 (H2 → H1 switch)

PR #1040's A/B numbers showed 4/30 H1 batches timing out at 10 s after the H2→H1 switch — that was this same 10 s inner cliff. This PR addresses the residual timeout regression #1040 introduced without reverting the H2→H1 decision, which remains correct for tunnel-batch coalesced ops.

Closes #1088.

github-actions bot added the type: fix label (auto-applied by release-drafter) on May 12, 2026
@dazzling-no-more
Contributor Author

dazzling-no-more commented May 12, 2026

Confirmed by a fresh field report in #1106 — log shows rtt=10.301966975s followed by batch failed: timeout and a cascade of connect_data error: timeout across every session in the batch, on a 4-deployment config. Exact signature of the hardcoded 10s header-read cliff this PR removes.

Closes #1088, #1106, #1107.

@therealaleph therealaleph merged commit 4d135a4 into therealaleph:main May 13, 2026
1 check passed
therealaleph added a commit that referenced this pull request May 13, 2026
#1088, #620)

Bumps v1.9.23 → v1.9.24. Two PRs from @dazzling-no-more:
- #1108 (#1088): batch header read honors request_timeout_secs.
  Closes the 10s inner timeout cliff that was cascade-killing tunnel
  sessions under slow Apps Script edges. +3 regression tests (231 total).
- #1117 (#620): cargo-chef Dockerfile so tunnel-node builds without
  BuildKit. Cloud Run's gcloud-deploy path now works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dazzling-no-more dazzling-no-more deleted the fix/full-tunnel-batch-timeout-cascade branch May 13, 2026 11:49