
fix(tunnel): batch header read honors request_timeout_secs (#1088) #1108

Merged

therealaleph merged 1 commit into therealaleph:main from dazzling-no-more:fix/full-tunnel-batch-timeout-cascade on May 13, 2026

Conversation

@dazzling-no-more
Contributor

Problem

Issue #1088: under Full mode, a single slow Apps Script edge cascade-killed every in-flight tunnel session sharing its batch. Users on 1.9.21+ saw frequent 10 s "batch timeout" errors and lost download progress on Telegram / browser sessions.

Root cause: read_http_response in domain_fronter.rs had a hardcoded 10 s header-read timeout that ran inside tunnel_batch_request_to — independent of and shorter than the outer tokio::time::timeout(batch_timeout, ...) in fire_batch. Apps Script cold starts routinely land in the 8-12 s range (PR #1040's A/B test recorded 4/30 H1 batches timing out at exactly 10 s after the H2→H1 switch), so the inner cliff was firing as a false-positive batch timeout well before request_timeout_secs (default 30 s) could.
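For illustration, a minimal runnable sketch of that nesting (names, durations, and the sleep stand-in are simplified assumptions, not the actual fire_batch / read_http_response code; assumes tokio built with the test-util feature so the clock can be paused):

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    tokio::time::pause(); // virtual clock: the sleep below auto-advances instantly

    let batch_timeout = Duration::from_secs(30); // outer: from request_timeout_secs
    let inner_header_timeout = Duration::from_secs(10); // inner: hardcoded pre-patch

    let result = timeout(batch_timeout, async {
        // Stand-in for "edge sends its first header byte after a 12 s cold start".
        timeout(inner_header_timeout, sleep(Duration::from_secs(12))).await
    })
    .await;

    // The outer budget never gets a chance to matter: the inner 10 s cliff
    // fires first and surfaces as a false-positive batch timeout.
    assert!(matches!(result, Ok(Err(_))));
}
```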

Secondary issue: even with a parameterized timeout, the per-read timeout(d, stream.read(...)) form silently extends its effective budget when a peer drip-feeds bytes spaced just under d apart. A slow edge could keep the loop alive past the outer batch_timeout and defeat the whole wiring.
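A sketch of that per-read shape (simplified from the description above, not the actual read_http_response; single-byte reads are for illustration only):

```rust
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::timeout;

// Per-read timeout shape: `d` is re-armed for every read, so a peer delivering
// one byte every `d - epsilon` never trips it, no matter how long the headers
// take in total.
async fn read_headers_per_read<S: AsyncRead + Unpin>(
    stream: &mut S,
    d: Duration,
) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    let mut byte = [0u8; 1];
    loop {
        // Fresh budget on each iteration; this is the problem.
        let n = timeout(d, stream.read(&mut byte))
            .await
            .map_err(|_| std::io::Error::from(std::io::ErrorKind::TimedOut))??;
        if n == 0 {
            break; // peer closed before end of headers
        }
        buf.extend_from_slice(&byte[..n]);
        if buf.ends_with(b"\r\n\r\n") {
            break; // end of HTTP headers
        }
    }
    Ok(buf)
}
```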

Fix

Two changes, both in domain_fronter.rs:

  1. tunnel_batch_request_to passes batch_timeout to the header read via the new read_http_response_with_header_timeout helper. Config::request_timeout_secs is now the only knob controlling how long we wait for an Apps Script edge to start responding. Other callers (relay path, exit-node) keep the historical 10 s value.

  2. Header read uses a single absolute deadline (tokio::time::timeout_at(deadline, ...)) instead of per-read timeout(). Total elapsed across all header reads is bounded by header_read_timeout, regardless of read cadence.
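A hedged sketch of the absolute-deadline shape (the real helper is read_http_response_with_header_timeout in domain_fronter.rs; the signature, error handling, and byte-at-a-time reads here are simplified assumptions):

```rust
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::{timeout_at, Instant};

// Absolute deadline: computed once, shared by every read in the header loop,
// so total elapsed is bounded by `header_read_timeout` regardless of cadence.
async fn read_headers_with_deadline<S: AsyncRead + Unpin>(
    stream: &mut S,
    header_read_timeout: Duration, // batch_timeout on the tunnel-batch path
) -> std::io::Result<Vec<u8>> {
    let deadline = Instant::now() + header_read_timeout;
    let mut buf = Vec::new();
    let mut byte = [0u8; 1];
    loop {
        // Same deadline on every iteration; drip-feeding cannot extend it.
        let n = timeout_at(deadline, stream.read(&mut byte))
            .await
            .map_err(|_| std::io::Error::from(std::io::ErrorKind::TimedOut))??;
        if n == 0 {
            break; // peer closed before end of headers
        }
        buf.extend_from_slice(&byte[..n]);
        if buf.ends_with(b"\r\n\r\n") {
            break; // end of HTTP headers
        }
    }
    Ok(buf)
}
```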

Bonus in tunnel_client.rs:

  1. TunnelMux::reply_timeout co-varies with batch_timeout: computed at construction as fronter.batch_timeout() + 5 s slack instead of the fixed 35 s const. Operators raising request_timeout_secs no longer have sessions abandon reply_rx just before fire_batch's HTTP round-trip would complete.
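Roughly, the derivation looks like this (a sketch only; the real TunnelMux carries many more fields and is built via TunnelMux::start, so the constructor shown here is hypothetical):

```rust
use std::time::Duration;

// Slack added on top of the HTTP round-trip budget before reply_rx is abandoned.
const REPLY_TIMEOUT_SLACK: Duration = Duration::from_secs(5);

struct TunnelMux {
    reply_timeout: Duration,
}

impl TunnelMux {
    // Computed once at construction from the fronter's batch timeout
    // (itself derived from request_timeout_secs) instead of a fixed 35 s const.
    fn new(batch_timeout: Duration) -> Self {
        Self {
            reply_timeout: batch_timeout + REPLY_TIMEOUT_SLACK,
        }
    }

    fn reply_timeout(&self) -> Duration {
        self.reply_timeout
    }
}

fn main() {
    // request_timeout_secs = 60 means sessions wait 65 s before giving up on reply_rx.
    let mux = TunnelMux::new(Duration::from_secs(60));
    assert_eq!(mux.reply_timeout(), Duration::from_secs(65));
}
```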

What this does NOT do

Earlier iterations of this branch added client-side batch retry (re-fire the same payload against a different deployment on timeout). That code was removed before merge: tunnel-node's drain_now mutates the per-session buffer when building a poll response, so a lost response means lost bytes — replaying an empty poll silently skips bytes (gap on the client side), and replaying a payload op double-writes (corrupts TCP / double-delivers UDP). No op shape is safe to replay without server-side ack / sequence support. Sessions whose batch times out re-poll on the next tick, same recovery surface as pre-#1088.

Changes

  • domain_fronter.rs:
    • tunnel_batch_request_to: pass batch_timeout to header read (+9 lines).
    • Add read_http_response_with_header_timeout(stream, deadline) (+25 lines).
    • Replace per-iteration timeout() with timeout_at(deadline, ...) in the header loop.
  • tunnel_client.rs:
    • Add TunnelMux::reply_timeout field, derived from batch_timeout at construction.
    • tunnel_loop reads mux.reply_timeout() instead of the REPLY_TIMEOUT const.
    • Replace REPLY_TIMEOUT: Duration const with REPLY_TIMEOUT_SLACK: Duration = 5 s.

Test coverage

Three new tests, all using tokio::time::pause() so they run in < 1 s of wall time:

  • read_http_response_respects_configured_header_timeout — response starts at virtual T=15 s; must succeed under a 30 s budget. Catches any regression to the hardcoded 10 s inner cliff.
  • read_http_response_header_deadline_is_total_not_per_read — drip-feeds 38 header bytes at 3 s/byte. Each read returns within 3 s (< 10 s per-read budget) but total exceeds 10 s. Asserts FronterError::Timeout. Catches any regression to per-iteration timeout().
  • mux_reply_timeout_tracks_batch_timeout_plus_slack — constructs a real DomainFronter with request_timeout_secs: 60 via TunnelMux::start; asserts mux.reply_timeout() == 60 s + REPLY_TIMEOUT_SLACK. Catches any hardcoded-value regression in the runtime derivation.
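For context, the paused-clock style looks roughly like this (assumes tokio's test-util feature; the test name, durations, and body below are illustrative, not the exact tests added in this PR):

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout_at, Instant};

// start_paused freezes tokio's clock; sleeps auto-advance while the runtime is
// idle, so a "15 s" delay completes in microseconds of wall time.
#[tokio::test(start_paused = true)]
async fn slow_header_start_succeeds_under_configured_budget() {
    let deadline = Instant::now() + Duration::from_secs(30); // configured budget

    let slow_edge = async {
        // Stand-in for "first header byte arrives at virtual T = 15 s".
        sleep(Duration::from_secs(15)).await;
        "HTTP/1.1 200 OK"
    };

    // Must succeed: 15 s is past the old hardcoded 10 s cliff but well
    // inside the 30 s budget.
    let res = timeout_at(deadline, slow_edge).await;
    assert_eq!(res.unwrap(), "HTTP/1.1 200 OK");
}
```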

Verification

  • cargo test: 231/231 ✅ (was 228 pre-patch + 3 new)
  • cargo check: clean ✅
  • cargo clippy -W unused: clean ✅

Interaction with PR #1040 / #1041 (H2 → H1 switch)

PR #1040's A/B numbers showed 4/30 H1 batches timing out at 10 s after the H2→H1 switch — that was this same 10 s inner cliff. This PR addresses the residual timeout regression #1040 introduced without reverting the H2→H1 decision, which remains correct for tunnel-batch coalesced ops.

Closes #1088.

github-actions bot added the type: fix label (auto-applied by release-drafter) on May 12, 2026
@dazzling-no-more
Contributor Author

dazzling-no-more commented May 12, 2026

Confirmed by a fresh field report in #1106 — log shows rtt=10.301966975s followed by batch failed: timeout and a cascade of connect_data error: timeout across every session in the batch, on a 4-deployment config. Exact signature of the hardcoded 10s header-read cliff this PR removes.

Closes #1088, #1106, #1107.

@therealaleph therealaleph merged commit 4d135a4 into therealaleph:main May 13, 2026
1 check passed
therealaleph added a commit that referenced this pull request May 13, 2026
#1088, #620)

Bumps v1.9.23 → v1.9.24. Two PRs from @dazzling-no-more:
- #1108 (#1088): batch header read honors request_timeout_secs.
  Closes the 10s inner timeout cliff that was cascade-killing tunnel
  sessions under slow Apps Script edges. +3 regression tests (231 total).
- #1117 (#620): cargo-chef Dockerfile so tunnel-node builds without
  BuildKit. Cloud Run's gcloud-deploy path now works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dazzling-no-more dazzling-no-more deleted the fix/full-tunnel-batch-timeout-cascade branch May 13, 2026 11:49