Skip to content

Parallelise skill file fetches and reuse HTTPS connections#5335

Merged
jamesbroadhead merged 3 commits into
mainfrom
jb/speed-up-skills-install
May 27, 2026
Merged

Parallelise skill file fetches and reuse HTTPS connections#5335
jamesbroadhead merged 3 commits into
mainfrom
jb/speed-up-skills-install

Conversation

@jamesbroadhead
Copy link
Copy Markdown
Contributor

@jamesbroadhead jamesbroadhead commented May 27, 2026

Summary

  • Hoist a single shared http.Client in libs/aitools/installer/installer.go so the transport pool reuses TCP+TLS connections across fetches. MaxIdleConnsPerHost is bumped from Go's default 2 → 16 so parallel fetches to raw.githubusercontent.com actually reuse connections instead of churning handshakes.
  • Fetch a skill's files concurrently in installSkillToDir via errgroup.WithContext with SetLimit(8). First error cancels in-flight peers, preserving the prior bail-on-first-error semantics.
  • Skills themselves are still installed serially across the outer loop. Most of the cold-start win therefore comes from the shared client keeping idle connections warm between sequential skills, rather than from intra-skill parallelism alone.

Why

databricks aitools install was sequential: every file constructed a fresh &http.Client{} and threw it away, paying the full TCP+TLS handshake per file. For --experimental (26 skills × ~5 files each = ~120 HTTPS GETs to GitHub raw) that meant wall-clock was dominated by handshake round-trips.

Benchmarked locally against https://raw.githubusercontent.com/databricks/databricks-agent-skills/v0.2.0 with a freshly built CLI on each side:

Condition Baseline This change Speedup
Cold (no DNS/CDN warm) 42.7 s 0.52 s ~80×
Warm 0.55–0.79 s 0.20–0.21 s ~3–4×

The cold-start number is what a first-time databricks aitools install --experimental user actually pays.

The fetchFileFn package var that tests stub remains unchanged, so no test mocks need updating.

Test plan

  • go test ./libs/aitools/... — all green, including -race -count=2.
  • gofmt -l libs/aitools/installer/installer.go — clean.
  • go vet ./libs/aitools/... — clean.
  • End-to-end timing against live GitHub raw — see table above.

This pull request and its description were written by Isaac.

@eng-dev-ecosystem-bot
Copy link
Copy Markdown
Collaborator

eng-dev-ecosystem-bot commented May 27, 2026

Commit: 48261fe

Run: 26502959179

Copy link
Copy Markdown
Member

@simonfaltum simonfaltum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Focused change, the two halves (parallel fetches + shared transport with bumped MaxIdleConnsPerHost) belong together. Tests verify the right invariants (both goroutines reach the fetcher before either releases; errgroup propagates cancellation to in-flight peers) and survived -race -count=50 locally.

Nits, none blocking:

  • The 8 concurrency limit and * 2 idle-conn slack would benefit from a one-line rationale.
  • sync.OnceValue is the codebase's lazy-init idiom (e.g. libs/clicompat/clicompat.go); the IIFE works fine since init is eager, just flagging for consistency.
  • Skills are still installed serially across the outer loop, so the cold-start win mostly comes from the shared client keeping idle conns warm between skills, not intra-skill parallelism. Worth a sentence in the PR body.
  • TestInstallSkillToDirCancelsInFlightFetchesOnError uses a 1s deadline; bumping to ~5s would be cheap CI insurance.

jamesbroadhead and others added 3 commits May 27, 2026 18:54
Each skill file was fetched sequentially with a fresh http.Client per call,
so every file paid the full TCP+TLS handshake to raw.githubusercontent.com
and `databricks aitools install --experimental` (26 skills, ~120 files) took
~40 s on a cold network. Hoist a shared http.Client with a tuned transport,
and fetch a skill's files concurrently with errgroup. Wall-clock drops to
under a second.

Co-authored-by: Isaac
Cover concurrent skill file fetches and errgroup cancellation so the installer performance path keeps its intended behavior.
- Switch the test's `sync.Once` to `sync.OnceFunc` to satisfy the forbidigo
  lint rule (the cause of the CI lint failure).
- Convert the package-level httpClient initialiser to `sync.OnceValue` for
  consistency with the codebase's lazy-init idiom (e.g. clicompat).
- Add a one-line rationale for the fetchConcurrency value (8) and the
  MaxIdleConnsPerHost * 2 idle-conn slack.
- Bump the cancellation test's outer deadline from 1s to 5s for CI slack.

Co-authored-by: Isaac
@jamesbroadhead jamesbroadhead force-pushed the jb/speed-up-skills-install branch from 48261fe to 5564300 Compare May 27, 2026 18:54
@jamesbroadhead jamesbroadhead added this pull request to the merge queue May 27, 2026
Merged via the queue into main with commit d62dcbb May 27, 2026
19 checks passed
@jamesbroadhead jamesbroadhead deleted the jb/speed-up-skills-install branch May 27, 2026 19:09
@eng-dev-ecosystem-bot
Copy link
Copy Markdown
Collaborator

Commit: d62dcbb

Run: 26532836091

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants