Add restore network telemetry #265

Merged
sjmiller609 merged 1 commit into main from hypeship/restore-net-telemetry-v2 on Jun 1, 2026

sjmiller609 (Collaborator) commented Jun 1, 2026

Summary

  • adds nested OpenTelemetry spans for restore-time network work, so allocation, snapshot network preparation, TAP creation, bridge attachment, isolation, and rate-limit setup show up separately in traces
  • carries inherited hypervisor trace attributes into network-manager spans so restore traces can be correlated across instance restore and network setup
  • keeps the change telemetry-only: allocation, TAP, bridge, rate-limit, and restore control flow behavior is unchanged

Tests

  • git diff --check origin/main...HEAD
  • go test ./lib/network -count=1
  • go test -tags containers_image_openpgp ./lib/instances -run TestCreateInstanceClearsRetentionStateBeforeMetadataSave -count=1

sjmiller609 force-pushed the hypeship/restore-network-v2 branch from 0e63fd2 to 4d71b40 on June 1, 2026 14:16
Base automatically changed from hypeship/restore-network-v2 to main on June 1, 2026 14:33
sjmiller609 force-pushed the hypeship/restore-net-telemetry-v2 branch from 7eaed76 to e11b918 on June 1, 2026 14:37
sjmiller609 marked this pull request as ready for review on June 1, 2026 14:53
sjmiller609 requested review from hiroTamada and rgarcia on June 1, 2026 14:53
sjmiller609 merged commit 51ccad4 into main on June 1, 2026
15 of 17 checks passed
sjmiller609 deleted the hypeship/restore-net-telemetry-v2 branch on June 1, 2026 14:57
@firetiger-agent

Monitoring Plan: OTel Tracing for Network Allocation and TAP Creation

What this PR does: Adds OpenTelemetry tracing instrumentation to the hypeman network allocation layer — covering IP/MAC/TAP creation and standby restore network setup. Improves debuggability of network failures with no behavioral changes.

Intended effect:

  • New network.* and restore_network.* spans: baseline 0 spans visible (none existed pre-deploy); confirmed if spans named network.get_default_network, network.create_tap, restore_network.create_allocation appear in the trace pipeline post-deploy.
  • Instance spawn rate: baseline 330K–835K/hr active hours; confirmed if rate remains within this range after deploy.

Risks:

  • Context deadline propagation: if span contexts inadvertently shorten timeouts, TAP device creation could fail; alert if kernel_invocation_spawn_total drops below 300K/hr during active hours (Mon 09:00–19:00 UTC).
  • Network allocation errors: "failed to allocate network" / "failed to recreate network" ERROR logs on hypeman nodes; alert if any sustained increase appears post-deploy (baseline: negligible).
  • API error rate regression: API ERROR log count; alert if sustained above 80K/hr (baseline 47K–52K/hr).

Status updates will be posted automatically on this PR as monitoring progresses.
