Closed
20 commits
891e6a4
Plan retries for AWF reflect fetch
Copilot May 28, 2026
f38a880
Add retries for AWF reflect fetch
Copilot May 28, 2026
5d80782
Tune smoke-pi reflect retry coverage
Copilot May 28, 2026
6a47a28
Refine reflect retry behavior and docs
Copilot May 28, 2026
313f1b0
Handle transient fetch-failed reflect errors
Copilot May 28, 2026
9f0cc9b
Retry fetch-failed reflect errors by nested code
Copilot May 28, 2026
436a62d
Tighten reflect fetch-failed retry matching
Copilot May 28, 2026
65e380a
Clarify transient network retry detection
Copilot May 28, 2026
1ae8f5f
Add curl probe for reflect fetch-failed diagnostics
Copilot May 28, 2026
138e600
Merge branch 'main' into copilot/api-proxy-reflect-call-resilience
github-actions[bot] May 28, 2026
2c81a19
Plan pi fetch diagnostics work
Copilot May 28, 2026
300d503
Add Pi fetch diagnostics for Node 24 startup
Copilot May 28, 2026
5c73c3e
Limit Pi diagnostic fetch body reads and tighten tests
Copilot May 28, 2026
3f195c5
Merge branch 'main' into copilot/api-proxy-reflect-call-resilience
pelikhan May 28, 2026
7ffdcfd
Enable verbose Node.js debug logging in Smoke Pi workflow
Copilot May 28, 2026
86da635
Merge branch 'main' into copilot/api-proxy-reflect-call-resilience
github-actions[bot] May 28, 2026
5cb05d7
Log proxy env vars before reflect fetch
Copilot May 28, 2026
d60b4af
Merge branch 'main' into copilot/api-proxy-reflect-call-resilience
github-actions[bot] May 28, 2026
3b2be5a
Bypass proxy for api-proxy host in Pi provider runtime
Copilot May 29, 2026
b7861f7
Bypass proxy for internal api-proxy host in Pi runtime
Copilot May 29, 2026
36 changes: 25 additions & 11 deletions .github/workflows/smoke-pi.lock.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

7 changes: 7 additions & 0 deletions .github/workflows/smoke-pi.md
@@ -12,6 +12,13 @@ permissions:
   contents: read
   issues: read
   pull-requests: read
+env:
Review comment (Contributor):
Consider documenting the rationale for AWF_REFLECT_MAX_ATTEMPTS=7 so future maintainers know why this specific cap was chosen.

+  NODE_DEBUG: "http,https,net,tls,undici"
+  AWF_REFLECT_MAX_ATTEMPTS: "7"
Review comment (Contributor):
Consider documenting these AWF_REFLECT_* env vars in the workflow comment block for discoverability.

+  AWF_REFLECT_RETRY_BASE_MS: "1000"
Review comment (Contributor):
The 10s ceiling pairs well with the 1s base. Worth a short comment to call out the implied exponential backoff envelope.

Review comment (Contributor):
Nit: a base of 1000ms is fairly conservative — consider exposing the override path in a small README so contributors can tune for slow environments.

Review comment (Contributor):
Acknowledged by the smoke test run: review-comment reply path verified on the latest existing review comment.

Warning

Firewall blocked 6 domains

The following domains were blocked by the firewall during workflow execution:

  • accounts.google.com
  • android.clients.google.com
  • clients2.google.com
  • contentautofill.googleapis.com
  • safebrowsingohttpgateway.googleapis.com
  • www.google.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "accounts.google.com"
    - "android.clients.google.com"
    - "clients2.google.com"
    - "contentautofill.googleapis.com"
    - "safebrowsingohttpgateway.googleapis.com"
    - "www.google.com"

See Network Configuration for more information.

📰 BREAKING: Report filed by Smoke Copilot · gpt55 4.3M
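
The backoff envelope mentioned in this thread can be made concrete with a small calculation. The sketch below assumes the behavior described in the retry config comment in awf_reflect.cjs: the first sleep equals the base delay, each later sleep doubles, and every sleep is capped at the maximum. `backoffSchedule` is a hypothetical helper, not part of the PR, and the real `withRetry` may differ in detail.

```javascript
// Hypothetical helper (not in the PR): lists the sleep before each retry,
// assuming the delay starts at baseMs, doubles after every failure,
// and is capped at maxMs. No jitter is applied.
function backoffSchedule(maxAttempts, baseMs, maxMs) {
  const delays = [];
  let delay = baseMs;
  // maxAttempts attempts means at most maxAttempts - 1 sleeps.
  for (let retry = 1; retry < maxAttempts; retry++) {
    delays.push(Math.min(delay, maxMs));
    delay *= 2;
  }
  return delays;
}

// Values from smoke-pi.md: 7 attempts, 1000 ms base, 10000 ms cap.
const smokePiDelays = backoffSchedule(7, 1000, 10000);
const smokePiTotalMs = smokePiDelays.reduce((sum, ms) => sum + ms, 0);
console.log(smokePiDelays.join(", "), "total:", smokePiTotalMs);

// Code defaults from awf_reflect.cjs: 5 attempts, 500 ms base, 5000 ms cap.
const defaultDelays = backoffSchedule(5, 500, 5000);
console.log(defaultDelays.join(", "));
```

Under these assumptions the Smoke Pi values sleep for 35 seconds in total across six retries, while the code defaults sleep for 7.5 seconds across four retries. Per-attempt request time comes on top of that.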

+  AWF_REFLECT_RETRY_MAX_MS: "10000"
+  AWF_PI_FETCH_DIAGNOSTICS_ENABLED: "1"
+  AWF_PI_FETCH_DIAGNOSTIC_URLS: "http://api-proxy:10000/reflect,https://github.com,https://api.github.com/meta"
Review comment (Contributor):
Keeping the diagnostic probes to known public endpoints makes the smoke signal deterministic and avoids accidentally logging private service URLs in the default workflow configuration.

Review comment (Contributor):
Smoke review note: using fixed public diagnostic URLs keeps the probe deterministic and easier to compare across runs.

 name: Smoke Pi
 experiments:
   sub_agent_decomposition:
222 changes: 180 additions & 42 deletions actions/setup/js/awf_reflect.cjs
@@ -19,7 +19,22 @@ require("./shim.cjs");

 const fs = require("fs");
 const path = require("path");
-const { withRetry } = require("./error_recovery.cjs");
+const childProcess = require("child_process");
+const { withRetry, isTransientError } = require("./error_recovery.cjs");
+
+/**
+ * Parse a positive integer from an environment variable with fallback.
+ *
+ * @param {string} name
+ * @param {number} fallback
+ * @returns {number}
+ */
+function parsePositiveIntEnv(name, fallback) {
+  const raw = process.env[name];
+  if (!raw) return fallback;
+  const parsed = Number.parseInt(raw, 10);
+  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
+}
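
One property of `parsePositiveIntEnv` is worth knowing when tuning the `AWF_REFLECT_*` variables: `Number.parseInt` reads a leading numeric prefix and ignores whatever follows, so a value such as `7abc` is accepted as 7 and `1e3` is read as 1. The sketch below copies the function from this diff and compares it with a stricter variant. `parseStrictPositiveIntEnv` and the `AWF_DEMO_VALUE` variable are hypothetical and exist only for this illustration.

```javascript
// Copied from the diff so the example runs on its own.
function parsePositiveIntEnv(name, fallback) {
  const raw = process.env[name];
  if (!raw) return fallback;
  const parsed = Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Hypothetical stricter variant (not in the PR): only plain decimal digits are accepted.
function parseStrictPositiveIntEnv(name, fallback) {
  const raw = (process.env[name] || "").trim();
  if (!/^[0-9]+$/.test(raw)) return fallback;
  const parsed = Number.parseInt(raw, 10);
  return Number.isSafeInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// AWF_DEMO_VALUE is a made-up variable name used only for this comparison.
for (const raw of ["7", "7abc", "1e3", "0", "-3", ""]) {
  process.env.AWF_DEMO_VALUE = raw;
  const lenient = parsePositiveIntEnv("AWF_DEMO_VALUE", 5);
  const strict = parseStrictPositiveIntEnv("AWF_DEMO_VALUE", 5);
  console.log(`${JSON.stringify(raw)} lenient=${lenient} strict=${strict}`);
}
```

Whether the lenient behavior matters depends on how the variables are set. For values written by hand in workflow frontmatter it is probably harmless, and the stricter form is only one possible choice.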

 // AWF API proxy management endpoint for discovering configured LLM providers and available models.
 // The api-proxy sidecar exposes /reflect on its management port (port 10000) inside the AWF
@@ -30,6 +45,12 @@ const AWF_API_PROXY_REFLECT_URL = "http://api-proxy:10000/reflect";
 const AWF_REFLECT_OUTPUT_PATH = "/tmp/gh-aw/sandbox/firewall/awf-reflect.json";
 // Milliseconds to wait for the /reflect endpoint before giving up.
 const AWF_REFLECT_TIMEOUT_MS = 60000;
+// Maximum attempts for fetching /reflect when api-proxy startup is still in progress.
+const AWF_REFLECT_MAX_ATTEMPTS = parsePositiveIntEnv("AWF_REFLECT_MAX_ATTEMPTS", 5);
+// Base delay between /reflect retries. Uses exponential backoff.
+const AWF_REFLECT_RETRY_BASE_MS = parsePositiveIntEnv("AWF_REFLECT_RETRY_BASE_MS", 500);
+// Cap for exponential backoff delay between /reflect retries.
+const AWF_REFLECT_RETRY_MAX_MS = parsePositiveIntEnv("AWF_REFLECT_RETRY_MAX_MS", 5000);
 // Milliseconds to wait for each models_url fallback fetch (shorter than the main reflect timeout).
 const AWF_MODELS_URL_TIMEOUT_MS = 3000;
 // Maximum attempts for models_url fallback fetches when the proxy is not yet ready.
@@ -41,12 +62,71 @@ const AWF_MODELS_URL_RETRY_MAX_MS = 2000;
 // Gemini model name prefix stripped from model IDs in the Gemini models API response.
 // Example: { name: "models/gemini-1.5-pro" } → "gemini-1.5-pro"
 const GEMINI_MODEL_NAME_PREFIX = "models/";
+// HTTP statuses from api-proxy /reflect that are typically transient during startup.
+const RETRYABLE_REFLECT_STATUS_CODES = [502, 503, 504];
+const RETRYABLE_NETWORK_ERROR_CODES = new Set(["ECONNREFUSED", "ECONNRESET", "ENOTFOUND", "ETIMEDOUT", "EAI_AGAIN"]);
+const PROXY_ENV_VAR_NAMES = ["HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY", "ALL_PROXY", "http_proxy", "https_proxy", "no_proxy", "all_proxy"];

 // Default logger used by fetchAWFReflect when no logger is provided via options.
 // All lines are prefixed with "[awf-reflect]" for easy grepping in combined logs.
 // prettier-ignore
 const DEFAULT_REFLECT_LOGGER = /** @type {(msg: string) => void} */ (msg => process.stderr.write(`[awf-reflect] ${new Date().toISOString()} ${msg}\n`));

+/**
+ * Best-effort network probe for /reflect using curl.
+ * This helps distinguish Node.js fetch transport issues from endpoint reachability issues.
+ *
+ * @param {string} reflectUrl
+ * @param {number} timeoutMs
+ * @param {(msg: string) => void} logger
+ * @returns {void}
+ */
+function runReflectCurlProbe(reflectUrl, timeoutMs, logger) {
+  const timeoutSeconds = Math.max(1, Math.ceil(timeoutMs / 1000));
+  const args = ["--silent", "--show-error", "--location", "--output", "/dev/null", "--write-out", "%{http_code}", "--max-time", String(timeoutSeconds), reflectUrl];
+  logger(`awf-reflect: running curl probe for ${reflectUrl}`);
+  try {
+    const result = childProcess.spawnSync("curl", args, {
+      encoding: "utf8",
+      timeout: Math.max(1000, timeoutMs + 1000),
+      maxBuffer: 1024 * 1024,
+    });
+    if (result.error) {
+      logger(`awf-reflect: curl probe failed: ${result.error.message}`);
+      return;
+    }
+
+    const status = (result.stdout || "").trim() || "n/a";
+    const stderr = (result.stderr || "").trim() || "none";
+    const exitCode = typeof result.status === "number" ? result.status : -1;
+    logger(`awf-reflect: curl probe exit=${exitCode} http_status=${status} stderr=${JSON.stringify(stderr)}`);
+  } catch (error) {
+    const message = error instanceof Error ? error.message : String(error);
+    logger(`awf-reflect: curl probe threw: ${message}`);
+  }
+}
+
+/**
+ * Redact URL credentials for proxy environment variable logging.
+ *
+ * @param {string|undefined} value
+ * @returns {string}
+ */
+function redactProxyEnvValue(value) {
Review comment (Contributor):
Smoke review note: this redaction helper is a good guardrail before proxy-related diagnostics reach logs.

+  if (!value) return "<unset>";
+  try {
+    const parsed = new URL(value);
+    if (parsed.username || parsed.password) {
+      parsed.username = "***";
+      parsed.password = "***";
+      return parsed.toString();
+    }
+  } catch {
+    // Keep non-URL values unchanged (for example NO_PROXY host lists).
+  }
+  return value;
+}
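
A short usage sketch may help show what `redactProxyEnvValue` does with typical proxy settings. The function body is copied from this diff. The sample values are made up for illustration and do not come from any workflow run.

```javascript
// Copied from the diff so the example runs on its own.
function redactProxyEnvValue(value) {
  if (!value) return "<unset>";
  try {
    const parsed = new URL(value);
    if (parsed.username || parsed.password) {
      parsed.username = "***";
      parsed.password = "***";
      return parsed.toString();
    }
  } catch {
    // Keep non-URL values unchanged (for example NO_PROXY host lists).
  }
  return value;
}

// Illustrative values only (hypothetical hosts and credentials).
const samples = {
  HTTPS_PROXY: "http://user:secret@proxy.internal:3128",
  HTTP_PROXY: "http://proxy.internal:3128",
  NO_PROXY: "localhost,127.0.0.1,api-proxy",
  ALL_PROXY: undefined,
  // No scheme: the URL parser treats "user:" as the scheme, so no userinfo is detected.
  http_proxy: "user:secret@proxy.internal:3128",
};

for (const [name, value] of Object.entries(samples)) {
  console.log(`${name}=${JSON.stringify(redactProxyEnvValue(value))}`);
}
```

The last sample suggests one possible gap: a proxy value written without a scheme but with credentials appears to pass through unredacted, because the WHATWG URL parser does not see a username or password in it. Whether such values occur in practice in this runtime is not established by the diff.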

 /**
  * Extract model IDs from a provider API response body.
  *
@@ -217,6 +297,9 @@ async function enrichReflectModels(reflectData, timeoutMs, logger) {
  *   reflectUrl?: string,
  *   outputPath?: string,
  *   timeoutMs?: number,
+ *   maxAttempts?: number,
+ *   retryBaseMs?: number,
+ *   retryMaxMs?: number,
  *   modelsTimeoutMs?: number,
  *   logger?: (msg: string) => void,
  *   writeFileSync?: (path: string, data: string, options: object) => void,
@@ -235,68 +318,120 @@ async function fetchAWFReflect(options) {
   const reflectUrl = (options && options.reflectUrl) || AWF_API_PROXY_REFLECT_URL;
   const outputPath = (options && options.outputPath) || AWF_REFLECT_OUTPUT_PATH;
   const timeoutMs = options && options.timeoutMs != null ? options.timeoutMs : AWF_REFLECT_TIMEOUT_MS;
+  const configuredAttempts = options && options.maxAttempts != null ? options.maxAttempts : AWF_REFLECT_MAX_ATTEMPTS;
+  const maxAttempts = Math.max(1, configuredAttempts);
+  const retryBaseMs = options && options.retryBaseMs != null ? options.retryBaseMs : AWF_REFLECT_RETRY_BASE_MS;
+  const retryMaxMs = options && options.retryMaxMs != null ? options.retryMaxMs : AWF_REFLECT_RETRY_MAX_MS;
   const modelsTimeoutMs = options && options.modelsTimeoutMs != null ? options.modelsTimeoutMs : AWF_MODELS_URL_TIMEOUT_MS;
   const logger = (options && options.logger) || DEFAULT_REFLECT_LOGGER;
   const writeFile = (options && options.writeFileSync) || fs.writeFileSync;

-  logger(`awf-reflect: fetching ${reflectUrl} (timeout=${timeoutMs}ms)`);
-
-  const ac = new AbortController();
-  let timedOut = false;
-  const timer = setTimeout(() => {
-    timedOut = true;
-    logger(`awf-reflect: request timed out after ${timeoutMs}ms`);
-    ac.abort();
-  }, timeoutMs);
+  const retryConfig = {
+    maxRetries: Math.max(0, maxAttempts - 1),
+    // withRetry doubles the delay after each failure. We halve the initial value so
+    // the first retry sleep is exactly retryBaseMs (instead of 2x retryBaseMs).
+    initialDelayMs: Math.ceil(retryBaseMs / 2),
+    maxDelayMs: retryMaxMs,
+    backoffMultiplier: 2,
+    jitterMs: 0,
Review comment (Contributor):
jitterMs: 0 removes all retry jitter — fine for now, but leaves a thundering-herd trap if this path is hit by concurrent jobs.

With deterministic backoff and zero jitter, any N parallel jobs (e.g., a matrix run or multiple workflow instances) that encounter the same transient 502/503 will all retry at exactly the same intervals, concentrating load on a still-recovering api-proxy. The default retry config in error_recovery.cjs uses jitterMs: 100 precisely to avoid this.

💡 Suggested fix
jitterMs: Math.ceil(retryBaseMs / 4),  // spread retries without meaningfully increasing delays

Review comment (Contributor):
Smoke note: jitter is off here. Fine for deterministic tests; if many agents retry together later, jitter may be worth turning on.
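
The effect of the jitter setting discussed in this thread can be sketched as follows. This is a minimal illustration, assuming jitter is a uniform random number of milliseconds in `[0, jitterMs)` added to each capped backoff delay. `jitteredDelayMs` is a hypothetical helper, and the diff does not show how `withRetry` in error_recovery.cjs actually applies `jitterMs`.

```javascript
// Hypothetical helper (not the withRetry implementation): capped exponential
// backoff plus uniform jitter. `random` is injectable so tests stay deterministic.
function jitteredDelayMs(retryNumber, baseMs, maxMs, jitterMs, random = Math.random) {
  const backoff = Math.min(baseMs * 2 ** (retryNumber - 1), maxMs);
  // With jitterMs = 0 this term is always 0, which reproduces the deterministic schedule.
  return backoff + Math.floor(random() * jitterMs);
}

const demoBaseMs = 1000;
const demoMaxMs = 10000;
// The value suggested in the review comment above: a quarter of the base delay.
const demoJitterMs = Math.ceil(demoBaseMs / 4);

for (let retry = 1; retry <= 6; retry++) {
  console.log(`retry ${retry}: ${jitteredDelayMs(retry, demoBaseMs, demoMaxMs, demoJitterMs)}ms`);
}
```

In this sketch the jitter is added after the cap, so a delay can exceed `maxMs` by up to `jitterMs - 1` milliseconds. Tests that need exact delays can pass `jitterMs = 0` or a fixed `random` function.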

+    shouldRetry: error => {
+      const original = error?.originalError || error;
+      const status = original?.status ?? original?.response?.status ?? null;
+      const shouldRetryStatus = RETRYABLE_REFLECT_STATUS_CODES.includes(status);
+      const hasRetryableErrorCode = [original?.code, original?.cause?.code].some(code => typeof code === "string" && RETRYABLE_NETWORK_ERROR_CODES.has(code));
+      const errorMessage = (original?.message || "").toLowerCase();
+      const looksLikeUndiciFetchFailure = original?.name === "TypeError" || errorMessage.includes("fetch failed");
+      const shouldRetryFetchFailure = hasRetryableErrorCode && looksLikeUndiciFetchFailure;
+      const shouldRetry = shouldRetryStatus || isTransientError(original) || shouldRetryFetchFailure;
Review comment (Contributor):
Timeout errors are retried via implicit isTransientError string-match — fragile coupling that can cause 7-minute hangs.

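isTransientError appears to match timeout-related substrings in error messages. When the AbortController fires, the inner catch creates timeoutError with the message "request timed out after Xms". Note that this message contains "timed out" rather than the literal substring "timeout", so whether it matches depends on the exact patterns in isTransientError, which this diff does not show. If it does match, shouldRetry returns true for timeouts via isTransientError, not by intent. With the Smoke Pi config (AWF_REFLECT_MAX_ATTEMPTS=7, timeoutMs=60000), the worst case is 7 × 60 s = 7 minutes of request time, plus backoff sleeps, before the outer catch fires.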

💡 Suggested fix

Make the timeout-retry decision explicit so it survives future changes to isTransientError:

shouldRetry: error => {
  const original = error?.originalError || error;
  // Per-attempt timeout: opt-in to retry, don't rely on message substring.
  if (original?.reason === "timeout") return true; // or false, depending on intent
  const status = original?.status ?? original?.response?.status ?? null;
  // ... rest of checks
},

Either explicitly allow or disallow timeout retries here. Currently the behavior is accidental: if isTransientError is ever changed to use type checks instead of message substrings, timeout retries silently disappear with no failing test to catch it.
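
A self-contained sketch of the explicit approach is given below, together with a worst-case time estimate. It is a simplified stand-in, not the PR's code: `makeShouldRetry`, `retryOnTimeout`, and `worstCaseMs` are hypothetical names, the `isTransientError` check is omitted because its implementation is not shown here, and the backoff is assumed to start at the base delay, double per retry, and cap at the maximum.

```javascript
// Sketch only: simplified stand-ins for the constants in awf_reflect.cjs.
const RETRYABLE_STATUS_CODES = new Set([502, 503, 504]);
const RETRYABLE_NETWORK_CODES = new Set(["ECONNREFUSED", "ECONNRESET", "ENOTFOUND", "ETIMEDOUT", "EAI_AGAIN"]);

// Hypothetical factory: the timeout policy is an explicit option instead of
// a side effect of message matching.
function makeShouldRetry({ retryOnTimeout }) {
  return error => {
    const original = error?.originalError || error;
    // Per-attempt timeouts are tagged with reason "timeout" by the inner catch.
    if (original?.reason === "timeout") return retryOnTimeout;
    const status = original?.status ?? original?.response?.status ?? null;
    if (RETRYABLE_STATUS_CODES.has(status)) return true;
    return [original?.code, original?.cause?.code].some(code => typeof code === "string" && RETRYABLE_NETWORK_CODES.has(code));
  };
}

// Upper bound if every attempt times out: request time plus assumed backoff sleeps.
function worstCaseMs(maxAttempts, timeoutMs, baseMs, maxMs) {
  let total = maxAttempts * timeoutMs;
  let delay = baseMs;
  for (let retry = 1; retry < maxAttempts; retry++) {
    total += Math.min(delay, maxMs);
    delay *= 2;
  }
  return total;
}

const timeoutError = Object.assign(new Error("request timed out after 60000ms"), { reason: "timeout" });
console.log(makeShouldRetry({ retryOnTimeout: false })(timeoutError));
console.log(makeShouldRetry({ retryOnTimeout: true })(timeoutError));
console.log(worstCaseMs(7, 60000, 1000, 10000));
```

With the Smoke Pi values and timeouts retried, the bound under these assumptions is 455000 ms, roughly 7.6 minutes. With `retryOnTimeout: false` a timeout fails after a single 60 s attempt.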

+      if (shouldRetry) {
+        logger(`awf-reflect: transient failure for ${reflectUrl}; retrying`);
+      }
+      return shouldRetry;
+    },
+  };

   try {
-    const res = await fetch(reflectUrl, { signal: ac.signal });
-    if (!res.ok) {
-      logger(`awf-reflect: unexpected status ${res.status}, skipping`);
-      return {
-        ok: false,
-        reflectUrl,
-        outputPath,
-        reason: "unexpected_status",
-        status: res.status,
-      };
-    }
-    const reflectData = await res.json();
-    // Attempt to fill in null models for configured providers by fetching directly
-    // from each endpoint's models_url. The api-proxy injects auth headers when
-    // forwarding these requests, so this succeeds without needing the raw API keys.
-    await enrichReflectModels(reflectData, modelsTimeoutMs, logger);
-    const enrichedBody = JSON.stringify(reflectData);
-    fs.mkdirSync(path.dirname(outputPath), { recursive: true });
-    writeFile(outputPath, enrichedBody, { encoding: "utf8" });
-    logger(`awf-reflect: saved ${enrichedBody.length}B to ${outputPath}`);
-    return {
-      ok: true,
-      reflectUrl,
-      outputPath,
-      bytesWritten: enrichedBody.length,
-    };
+    const proxyEnvSummary = PROXY_ENV_VAR_NAMES.map(name => `${name}=${JSON.stringify(redactProxyEnvValue(process.env[name]))}`).join(" ");
+    logger(`awf-reflect: proxy env ${proxyEnvSummary}`);
+    logger(`awf-reflect: fetching ${reflectUrl} (timeout=${timeoutMs}ms, max_attempts=${maxAttempts})`);
+    return await withRetry(
+      async () => {
+        const ac = new AbortController();
+        let timedOut = false;
+        const timer = setTimeout(() => {
+          timedOut = true;
+          logger(`awf-reflect: request timed out after ${timeoutMs}ms`);
+          ac.abort();
+        }, timeoutMs);
+        try {
+          const res = await fetch(reflectUrl, { signal: ac.signal });
+          if (!res.ok) {
+            if (RETRYABLE_REFLECT_STATUS_CODES.includes(res.status)) {
+              const err = Object.assign(new Error(`reflect fetch returned ${res.status} for ${reflectUrl}`), { status: res.status });
Review comment (Contributor):
Good: retryable startup HTTP statuses become thrown errors, so the shared retry path can handle them instead of returning early.

+              throw err;
+            }
Review comment (Contributor):
Smoke review note: deterministic retry delay is good for tests. If this path fans out later, adding jitter would help avoid synchronized retries.

Review comment (Contributor):
Smoke agent reply test. Me acknowledge thread.

📰 BREAKING: Report filed by Smoke Copilot · gpt55 4.1M

+            logger(`awf-reflect: unexpected status ${res.status}, skipping`);
+            return {
+              ok: false,
+              reflectUrl,
+              outputPath,
+              reason: "unexpected_status",
+              status: res.status,
+            };
+          }
+          const reflectData = await res.json();
+          // Attempt to fill in null models for configured providers by fetching directly
+          // from each endpoint's models_url. The api-proxy injects auth headers when
+          // forwarding these requests, so this succeeds without needing the raw API keys.
+          await enrichReflectModels(reflectData, modelsTimeoutMs, logger);
+          const enrichedBody = JSON.stringify(reflectData);
+          fs.mkdirSync(path.dirname(outputPath), { recursive: true });
+          writeFile(outputPath, enrichedBody, { encoding: "utf8" });
+          logger(`awf-reflect: saved ${enrichedBody.length}B to ${outputPath}`);
+          return {
+            ok: true,
+            reflectUrl,
+            outputPath,
+            bytesWritten: enrichedBody.length,
+          };
+        } catch (err) {
+          const e = /** @type {Error} */ err;
+          if (e.name === "AbortError") {
+            const timeoutError = Object.assign(new Error(timedOut ? `request timed out after ${timeoutMs}ms` : e.message), { reason: "timeout" });
+            throw timeoutError;
+          }
+          throw e;
+        } finally {
+          clearTimeout(timer);
+        }
+      },
+      retryConfig,
+      `awf-reflect fetch for ${reflectUrl}`
+    );
   } catch (err) {
     const e = /** @type {Error} */ err;
-    if (e.name === "AbortError") {
+    const original = e?.originalError || e;
+    if (original?.reason === "timeout") {
       return {
         ok: false,
         reflectUrl,
         outputPath,
         reason: "timeout",
-        error: timedOut ? `request timed out after ${timeoutMs}ms` : e.message,
+        error: original.message,
       };
     }
-    logger(`awf-reflect: request failed: ${e.message}`);
+    const errorMessage = String(original?.message || e.message || "").toLowerCase();
+    if (original?.name === "TypeError" || errorMessage.includes("fetch failed")) {
+      runReflectCurlProbe(reflectUrl, timeoutMs, logger);
+    }
+    logger(`awf-reflect: request failed: ${original.message || e.message}`);
     return {
       ok: false,
       reflectUrl,
       outputPath,
       reason: "request_failed",
-      error: e.message,
+      error: original.message || e.message,
     };
-  } finally {
-    clearTimeout(timer);
   }
 }

@@ -305,6 +440,9 @@ if (typeof module !== "undefined" && module.exports) {
     AWF_API_PROXY_REFLECT_URL,
     AWF_REFLECT_OUTPUT_PATH,
     AWF_REFLECT_TIMEOUT_MS,
+    AWF_REFLECT_MAX_ATTEMPTS,
+    AWF_REFLECT_RETRY_BASE_MS,
+    AWF_REFLECT_RETRY_MAX_MS,
     AWF_MODELS_URL_TIMEOUT_MS,
     AWF_MODELS_URL_MAX_ATTEMPTS,
     AWF_MODELS_URL_RETRY_BASE_MS,