chore: extract main into an embeddable internal/app package #5259
siavashs wants to merge 2 commits into
Walkthrough
This PR extracts Alertmanager's runtime initialization from the main.go file into a new internal/app package. The refactoring introduces a composable
Changes
App Package Refactoring and Extraction
Sequence Diagram
sequenceDiagram
participant Run as Run(ctx, opts)
participant New as New(opts)
participant Setup as setup()
participant Subsystems as Core Subsystems
participant Coordinator as ConfigCoordinator
Run->>New: construct App
New->>Setup: wire subsystems
Setup->>Subsystems: initialize metrics, nflog, silences, alerts
Setup->>Subsystems: create dispatcher via pipeline
Setup->>Coordinator: create with reload callback
Setup->>Coordinator: apply initial config
Coordinator->>Subsystems: rebuild routes/inhibitor/dispatcher on reload
New-->>Run: return App
Run->>Run: start app + block serveLoop
Estimated code review effort: 4 (Complex), ~45 minutes
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/alertmanager/main.go`:
- Line 205: The shutdown log is hardcoded to "Received SIGTERM…" which
misreports SIGINT/Ctrl+C; change the logger.Info call in main.go (the location
that currently logs "Received SIGTERM, exiting gracefully...") to log the actual
signal received (use the signal variable from the signal.Notify/select or, if
you cancel via ctx, log ctx.Err() or a generic "shutting down" message) so the
message reflects the real cause; update the handler that calls app.Run and the
signal.Notify/select branch to pass the received os.Signal (or its String())
into logger.Info instead of the fixed "SIGTERM" text.
In `@internal/app/lifecycle.go`:
- Around line 142-163: The Stop method can block forever on the "for range
a.srvc" if Start's serve goroutine never closes a.srvc; change Stop to perform a
non-blocking drain of a.srvc instead of a blocking range so Stop returns safely
even if Start wasn't run. Specifically, update App.Stop to replace the for range
over a.srvc with a loop that repeatedly attempts a non-blocking receive from
a.srvc (e.g., select with a receive case and a default case) until the channel
is drained/closed or there is nothing to read; reference symbols: App.Stop,
a.srvc, Start, New.
Files selected for processing (9)
- cmd/alertmanager/main.go
- internal/app/app.go
- internal/app/cluster.go
- internal/app/lifecycle.go
- internal/app/lifecycle_test.go
- internal/app/metrics.go
- internal/app/options.go
- internal/app/url.go
- internal/app/url_test.go
Move the body of run() from cmd/alertmanager/main.go into a new internal/app package so Alertmanager can be embedded in tests and other binaries without shelling out to a compiled binary. Resolves the long-standing TODO from prometheus#406.

cmd/alertmanager/main.go shrinks from 724 to 196 lines and is now responsible only for kingpin flag parsing, logger construction, versioncollector registration, feature-flag / GOMEMLIMIT side effects, and translating OS signals into context cancellation (SIGINT/SIGTERM) plus reload events (SIGHUP) consumed by app.Run.

The new internal/app package is split into:

* options.go - Options struct, validate(), DefaultClusterAddr
* app.go - Run(ctx, opts) error
* metrics.go - per-instance Prometheus metrics struct
* cluster.go - clusterWait helper
* url.go - extURL helper (+ url_test.go for TestExternalURL)

The six previously package-level promauto.NewXxx variables in cmd/alertmanager/main.go are now constructed per Run() invocation against opts.Registerer. Combined with threading the registerer through every collaborator (versioncollector excepted, which stays in main.go as a process-global), this unblocks running multiple Alertmanager instances in the same process without duplicate-registration panics.

Behavioural notes:

* prometheus.DefaultRegisterer is no longer referenced inside app.Run; the binary still passes it in via Options.Registerer so on-disk behaviour is identical.
* app.Run defers srv.Shutdown(5s) on exit. Previously the deferred srv.Close lived inside the listen goroutine and never ran in practice because os.Exit killed the process first. Behaviour for the binary is unchanged; embedded callers now get clean HTTP teardown.
* --cluster.listen-address default moved from a const in cmd/alertmanager to the exported app.DefaultClusterAddr.

Known follow-ups intentionally out of scope:

* matcher/compat.InitFromFlags still mutates package-level state; multi-instance tests with different feature flags will collide.
* Richer App lifecycle (New/Start/Addr/Reload/Stop) for tests that need :0-port discovery or programmatic reload.
* Migrating the v2 acceptance harness to use app.Run directly instead of building and spawning the binary.

Verification: `go build ./...`, `go vet ./...`, and `go test -count=1 ./...` all pass, including the existing test/with_api_v2/acceptance suite which continues to build and spawn the binary end-to-end.

Closes prometheus#406

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Actionable comments posted: 1
Inline comments:
In `@internal/app/lifecycle.go`:
- Around line 96-113: The Start method can deadlock because the registered
/-/reload handler blocks on sending to the unbuffered a.webReload channel (and
errors from web.ServeMultiple are sent to a.srvc) while the only consumer
(serveLoop, invoked by Run) may not be running for embedders; fix by ensuring
Start spawns the reload-and-error drain loop so a.webReload and a.opts.Reload
are drained even when Run/serveLoop is not used, or alternatively make the
reload handler do a non-blocking send/fail-fast: add a goroutine in Start that
runs the same logic as serveLoop (draining a.webReload, a.opts.Reload and
forwarding errors to a.reload/ reload handler) and ensure web.ServeMultiple
errors sent to a.srvc are observed (do not close a.srvc before draining), or
change the handler to select { case a.webReload <- errc: default: respond with
an immediate error } so the handler never blocks when the drain loop is absent;
update Start, the /-/reload handler, and any use of a.srvc/a.webReload
accordingly.
Files skipped from review as they are similar to previous changes (7)
- internal/app/url_test.go
- internal/app/options.go
- internal/app/cluster.go
- internal/app/metrics.go
- internal/app/url.go
- internal/app/lifecycle_test.go
- internal/app/app.go
Actionable comments posted: 1
Nitpick comments (1)
internal/app/lifecycle.go (1)
53-55: Stale doc comments still point at `serveLoop`.
`webReload` is now consumed by `reloadRouter`, not `serveLoop`. The same staleness applies to the `Reload` docstring on Line 175 ("Safe to call concurrently with serveLoop"), since `serveLoop` no longer routes reloads. In code this deadlock-sensitive, accurate "who consumes this channel" comments matter for future maintainers.
Suggested doc fixes
  // webReload is the channel exposed by httpserver.Register for the
- // /-/reload HTTP endpoint. We read from it in serveLoop.
+ // /-/reload HTTP endpoint. We read from it in reloadRouter.
  webReload chan chan error

  // Reload triggers a configuration reload (the programmatic equivalent of
- // SIGHUP). Safe to call concurrently with serveLoop.
+ // SIGHUP). Safe to call concurrently with the running App.
  func (a *App) Reload(_ context.Context) error {
Inline comments:
In `@internal/app/lifecycle_test.go`:
- Around line 161-168: The tests TestApp_EmbeddedReloadDoesNotDeadlock and
TestApp_New_SetupFailureDoesNotDeadlock use require.NoError/Equal/Error inside
spawned goroutines (the anonymous go func that closes done), which can call
t.FailNow from a child goroutine; change these to not call require from the
goroutine: either (A) replace require.* with assert.* inside the goroutine
(e.g., assert.NoError/assert.Equal/assert.Error) or (B) capture the goroutine
results by sending error/status values down a channel (use the existing done
channel or a new result channel) and perform require.* assertions on those
results in the main test goroutine after <-done; update the anonymous functions
and their callers (the POST to "/-/reload" and the setup-failure goroutine) to
use one of these patterns so all require.* calls run on the main test goroutine.
---
Nitpick comments:
In `@internal/app/lifecycle.go`:
- Around line 53-55: Doc comments are stale: update the comment for the
webReload channel and the Reload docstring (mentions of "serveLoop") to reflect
that reloads are now consumed by reloadRouter, not serveLoop; locate the
declaration webReload and the Reload method/docstring and change references from
serveLoop to reloadRouter and adjust wording about concurrency to say "Safe to
call concurrently with reloadRouter" (or similar) so the consumer is accurate
and deadlock-sensitive guidance is preserved.
Files selected for processing (3)
- internal/app/app.go
- internal/app/lifecycle.go
- internal/app/lifecycle_test.go
Files skipped from review as they are similar to previous changes (1)
- internal/app/app.go
Introduce an App lifecycle so tests and embedders can drive Alertmanager without OS signals or os/exec, and discover the bound HTTP address even when listening on ":0".

API:

    New(opts) (*App, error)
    (*App).Start() error
    (*App).Addr() string      // first listener
    (*App).Addrs() []string   // all listeners
    (*App).Reload(ctx) error
    (*App).Stop(ctx) error

Run is preserved as a thin wrapper (New + Start + serveLoop + Stop) with a deferred Stop on a fresh 30s context so cleanup also runs on panic, matching the implicit panic-safety of the previous defer-based implementation.

Internally, setup uses a cleanup stack (a.onStop) that Stop drains in LIFO order, mirroring Go's defer semantics so the source order of the old `defer X` lines in Run is preserved verbatim and the shutdown ordering does not depend on hand-written reverse-order code.

Listeners are bound at New time via a new listenAll helper that calls net.Listen directly (so Addr is meaningful before Start); web.ServeMultiple is then invoked in Start. Systemd socket activation is not supported when embedding and returns an explicit error pointing callers back to cmd/alertmanager.

Stop honors its context parameter for the HTTP shutdown step, capped at 5s, so callers passing a tighter deadline get faster teardown and callers passing context.Background get the default.

Tests cover: single instance round-trip; two sequential instances in the same process (guards the Phase A metrics-per-Registerer fix against duplicate-registration panics); two concurrent instances on distinct ephemeral ports; and the Run wrapper end-to-end with ctx cancellation. All pass under -race.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Summary
Resolves the long-standing TODO from #406 by extracting the Alertmanager process logic out of `cmd/alertmanager/main.go` into a new `internal/app` package, and giving it a lifecycle API (`New`/`Start`/`Addr`/`Reload`/`Stop`) so tests and other binaries can embed Alertmanager in-process instead of building and shelling out to the compiled binary.
`cmd/alertmanager/main.go` shrinks from 724 → 196 lines and now owns only: kingpin flag parsing, logger construction, `versioncollector` registration, feature-flag / GOMEMLIMIT side effects, and translating OS signals into context cancellation (SIGINT/SIGTERM) plus reload events (SIGHUP) consumed by `app.Run`.
Commits
1. `cmd/alertmanager: extract main into internal/app package`
Mechanical extraction. The new package is split into focused files:
- `internal/app/options.go`: `Options` struct, `validate`, `DefaultClusterAddr`
- `internal/app/app.go`: `Run(ctx, opts) error` (orchestration body)
- `internal/app/metrics.go`: per-instance Prometheus metrics struct
- `internal/app/cluster.go`: `clusterWait` helper
- `internal/app/url.go`: `extURL` helper (+ `url_test.go` for `TestExternalURL`)
The six previously package-level `promauto.NewXxx` variables in `cmd/alertmanager/main.go` are now constructed per `Run()` invocation against `opts.Registerer`. Combined with threading the registerer through every collaborator (`versioncollector` excepted, which stays process-global in `main.go`), this unblocks running multiple Alertmanager instances in the same process without duplicate-registration panics.
2. `internal/app: add App lifecycle (New/Start/Addr/Reload/Stop)`
Adds a richer lifecycle API on top of the extraction:
- `New(opts) (*App, error)`
- `(*App).Start() error`
- `(*App).Addr() string` (first listener)
- `(*App).Addrs() []string` (all listeners)
- `(*App).Reload(ctx) error`
- `(*App).Stop(ctx) error`
`Run` is preserved as a thin wrapper (`New` + `Start` + `serveLoop` + `Stop`) with a deferred `Stop` on a fresh 30s context so cleanup also runs on panic, matching the implicit panic-safety of the previous defer-based implementation.
Internally, `setup` uses a cleanup stack (`a.onStop`) that `Stop` drains in LIFO order, mirroring Go's `defer` semantics so the source order of the old `defer X` lines in `Run` is preserved verbatim and the shutdown ordering does not depend on hand-written reverse-order code. Listeners are bound at `New` time via a new `listenAll` helper that calls `net.Listen` directly (so `Addr` is meaningful before `Start`); `web.ServeMultiple` is then invoked in `Start`. Systemd socket activation is not supported when embedding and returns an explicit error pointing callers back to `cmd/alertmanager`. `Stop` honors its context parameter for the HTTP shutdown step, capped at 5s, so callers passing a tighter deadline get faster teardown and callers passing `context.Background` get the default.
Behavioural notes
- `prometheus.DefaultRegisterer` is no longer referenced inside `app.Run`; the binary still passes it in via `Options.Registerer` so on-disk behaviour is identical.
- `srv.Shutdown` now actually runs on `Run` exit (previously the deferred `srv.Close` lived inside the listen goroutine and never ran in practice because `os.Exit` killed the process first). Behaviour for the binary is unchanged; embedded callers now get clean HTTP teardown.
- `tracingManager.Stop` is part of the cleanup stack and therefore always runs, not just on `ctx.Done()` (previously leaked on listen failure, but the leak was masked by `os.Exit`).
- `--cluster.listen-address` default moved from a const in `cmd/alertmanager` to the exported `app.DefaultClusterAddr`.
Known follow-ups (out of scope)
- `matcher/compat.InitFromFlags` still mutates package-level state; multi-instance tests with different feature flags will collide. Tracked separately.
- Migrating the v2 acceptance harness to use `app.Run` directly instead of building and spawning the binary. Now mechanically possible thanks to `Addr()`/`Stop()` on `*App`; left for a follow-up PR to keep this one reviewable.
Verification
New tests in `internal/app/lifecycle_test.go`:
- `TestApp_StartStop`: boot, probe `/-/healthy`, stop, stop again (idempotency).
- `TestApp_TwoSequentialInstances`: same process, two consecutive `New → Start → Stop` cycles. Guards the metrics-per-Registerer fix against duplicate-registration panics.
- `TestApp_TwoConcurrentInstances`: two live instances on different ephemeral ports simultaneously.
- `TestApp_Run_ContextCancel`: end-to-end `Run` wrapper with ctx cancellation.
Closes #406