When to use this runbook: a Powernode service is degraded or down, agent activity is misbehaving, or a security event needs containment. Use the Triage decision tree to find the right section quickly.
- Triage decision tree
- Severity definitions
- The kill switch
- Service-degradation incidents
- AI agent misbehavior incidents
- Security incidents
- Communications
- Post-mortem template
Is the platform doing something it should not be doing right now?
├── YES — agents are taking unwanted actions, secrets are leaking,
│ compromised account is acting
│ → flip THE KILL SWITCH (see below), then proceed
└── NO — platform is unable to do something it should do
(API down, requests failing, jobs stuck)
→ SERVICE-DEGRADATION INCIDENTS section
Is there a security dimension (auth bypass, data exfil, code
execution, key disclosure)?
├── YES → SECURITY INCIDENTS section (do not skip kill switch step)
└── NO → continue with above branch
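The two questions in the tree can be read as a small routing function. The sketch below is illustrative only: the function name and the returned labels are made up for this example, and sending the "unwanted activity, no security dimension" case to the AI agent misbehavior section is an assumption, since the tree itself only says "then proceed".

```python
def route(unwanted_activity: bool, security_dimension: bool) -> list:
    """Return the ordered runbook steps for a pair of triage answers.

    Hypothetical helper: the labels mirror the section names in this runbook.
    """
    steps = []
    # The kill switch comes first whenever the platform is actively doing
    # something it should not, and the security branch must not skip it.
    if unwanted_activity or security_dimension:
        steps.append("kill switch")
    if security_dimension:
        steps.append("security incidents")
    elif unwanted_activity:
        # Assumption: non-security unwanted activity is agent misbehavior.
        steps.append("ai agent misbehavior incidents")
    else:
        steps.append("service-degradation incidents")
    return steps
```

For example, `route(False, False)` yields only the service-degradation section, while any security answer puts the kill switch ahead of the security section.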
| Severity | Definition | Target response time |
|---|---|---|
| SEV1 | Customer-facing outage, data loss in progress, active security breach | Acknowledge in 5 min, all-hands |
| SEV2 | Significant degradation, agent autonomy malfunctioning, one-tenant impact | Acknowledge in 15 min, primary on-call |
| SEV3 | Recoverable degradation, single-user impact, internal-only outage | Acknowledge in 1 hour, normal hours OK |
| SEV4 | Minor regression, cosmetic issue, no customer impact | Next business day |
If unsure, escalate one level higher. Downgrading mid-incident is fine; upgrading after an under-call is not.
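As a rough illustration of the table and the escalation rule, the sketch below encodes the acknowledgement targets and the "escalate one level higher" adjustment. The names `ACK_TARGET` and `call_severity` are hypothetical and not part of Powernode.

```python
from datetime import timedelta

# Acknowledgement targets from the severity table. SEV4 ("next business day")
# has no fixed duration, so it is represented as None here.
ACK_TARGET = {
    1: timedelta(minutes=5),
    2: timedelta(minutes=15),
    3: timedelta(hours=1),
    4: None,
}


def call_severity(best_guess: int, unsure: bool) -> int:
    """Apply 'if unsure, escalate one level higher'; SEV1 is the ceiling."""
    if unsure:
        return max(1, best_guess - 1)
    return best_guess
```

An engineer who thinks an incident is probably SEV3 but is not sure would call it SEV2 and inherit the 15-minute acknowledgement target.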
Powernode has a global AI activity kill switch. Use it when AI agents are taking unwanted actions, an extraction-style attack is in progress, or you need a clean stop before triage.
Via MCP (preferred — auditable):
platform.emergency_halt(reason: "<one-sentence reason>")
Via API:
TOKEN=$(curl -s -X POST http://localhost:3000/api/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"admin@yourdomain","password":"..."}' \
| python3 -c "import sys,json; print(json.load(sys.stdin)['data']['access_token'])")
curl -X POST http://localhost:3000/api/v1/ai/kill_switch/halt \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"reason":"<one-sentence reason>"}'
What it does:
- Marks all AI activity as suspended in `ai_kill_switch_state`.
- Existing in-flight AI jobs (worker `ai_agents`, `ai_orchestration`, `ai_execution` queues) check the suspension flag and bail out at safe checkpoints. Tool calls that have already been emitted will complete; new calls are blocked.
- The chat surface refuses new prompt submissions.
- Non-AI workloads (webhook delivery, billing, report generation) are not affected.
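The "bail out at safe checkpoints" behavior can be pictured with a generic checkpoint loop. This is a minimal sketch of the pattern, not the code in `kill_switch_service.rb`; `run_with_checkpoints` and its arguments are hypothetical.

```python
def run_with_checkpoints(steps, is_halted):
    """Run callables in order, checking the halt flag before each new one.

    A step that has already started runs to completion (like a tool call
    that was already emitted); steps not yet started are skipped once the
    flag is set. Returns (results, finished).
    """
    results = []
    for step in steps:
        if is_halted():
            return results, False  # bail out at the checkpoint
        results.append(step())
    return results, True
```

If the flag flips while the second step is running, that step still completes and the third never starts, which matches the "already emitted calls complete; new calls are blocked" description.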
platform.kill_switch_status()
Returns { halted: bool, halted_at: ..., halted_by: ..., reason: ... }.
Only after root cause is understood and remediation is in place:
platform.emergency_resume(reason: "<remediation summary>")
Source: server/app/services/ai/autonomy/kill_switch_service.rb.
- Check service status: `sudo scripts/systemd/powernode-installer.sh status`, then `journalctl -u powernode-backend@default -n 100 --no-pager`.
- Look for recent deploys: `git -C /opt/powernode log --since="1 hour ago" --oneline`. Most outages correlate with a deploy.
- Check the database: `cd server && bundle exec rails db:migrate:status | tail -5`. A half-applied migration locks tables.
- Check Redis (Sidekiq queue, ActionCable): `redis-cli -n 1 PING`, then `redis-cli -n 1 LLEN reports` for queue depth.
- Rollback if recent deploy is suspect: see production-deployment.md#rollback.
Symptoms: `ReportRequest.where(status: 'processing').count` growing, agent executions never completing.
- Worker process up? `sudo systemctl status powernode-worker@default`
- Drain didn't complete cleanly? Per `feedback_service_restarts`, wait 30s after restart before troubleshooting; Sidekiq drain is normal.
- Stuck worker, full restart: `sudo systemctl stop powernode-worker@default && sleep 30 && sudo systemctl start powernode-worker@default`. Do NOT use `restart`: it skips drain. Stop + start gives a clean state.
- Backend HTTP API on port 4567 refused? Restart `powernode-worker-web@default`, NOT `powernode-worker@default` (per `feedback_worker_web_port`).
- Inspect dead jobs:
`redis-cli -n 1 ZCARD sidekiq:dead` (Sidekiq keeps dead jobs in a sorted set, so `LLEN` fails with a wrong-type error). If non-empty, jobs are exhausting retries; investigate the exception class.
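The stop, wait, start sequence for a stuck worker could be scripted roughly as follows. This is a sketch under assumptions: `stop_start` is a hypothetical helper, and the `run` and `sleep` parameters exist only so the sequence can be dry-run without touching systemd.

```python
import subprocess
import time


def stop_start(unit, drain_seconds=30, run=subprocess.run, sleep=time.sleep):
    """Stop a unit, wait out the drain window, then start it again.

    Deliberately avoids `systemctl restart`, which this runbook says
    skips the drain.
    """
    run(["sudo", "systemctl", "stop", unit], check=True)
    sleep(drain_seconds)
    run(["sudo", "systemctl", "start", unit], check=True)
```

Calling `stop_start("powernode-worker@default")` on a host would issue the same three steps as the one-line shell command in the list above.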
- Connection check: `psql -h localhost -U powernode -c "SELECT now();"`
- Connection pool exhausted: `SELECT count(*), state FROM pg_stat_activity GROUP BY state;`. Idle-in-transaction > 50% means a leaked connection.
- Disk full: `df -h /var/lib/postgresql/`. Postgres dies hard at 100% disk; clear old WAL or expand volume.
- Lock contention: `SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;`.
If unrecoverable, escalate to disaster scenario in postgres-backup.md#disaster-scenarios.
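To make the 50% idle-in-transaction rule from the connection-pool check concrete, the sketch below applies it to the per-state counts returned by the `pg_stat_activity` query. The function names are hypothetical; the state strings are the ones PostgreSQL reports in the `state` column.

```python
def idle_in_transaction_share(counts):
    """Fraction of connections whose state starts with 'idle in transaction'.

    `counts` maps pg_stat_activity state -> number of connections. Rows with
    a NULL state (background processes) can be passed under the key None.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    idle_tx = sum(
        n for state, n in counts.items()
        if state is not None and state.startswith("idle in transaction")
    )
    return idle_tx / total


def looks_leaked(counts, threshold=0.5):
    """Apply the runbook's rule of thumb: more than 50% suggests a leak."""
    return idle_in_transaction_share(counts) > threshold
```

With 6 of 10 connections idle in transaction the check trips; at exactly half it does not, since the rule is a strict "greater than".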
Symptoms include: agent loops, runaway costs, agents writing to wrong resources, agents leaking secrets in conversation, autonomy taking unapproved actions.
- Engage kill switch FIRST. Triage second. This costs nothing if the suspicion is wrong.
- Identify the agent: `platform.list_agents(status: "executing")` and `platform.recent_events(limit: 50)`.
- Inspect the most recent executions: `platform.agent_introspect(agent_id: "<id>")`
- Check intervention policies: `platform.list_intervention_policies(agent_id: "<id>")`. A missing or mis-configured policy is the most common cause.
- Pause the specific agent (preferred over kill switch once scope is known): `platform.update_agent(id: "<id>", status: "paused")`
- Cost spike? Check `platform.cost_analysis()` for the spending shape. Throttle via the agent's `max_cost_per_run` budget setting.
After remediation, lift the kill switch and monitor platform.recent_events() for 30 minutes before declaring the incident resolved.
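The 30-minute watch period can be expressed as a simple check over event timestamps. This is an illustrative sketch: `can_declare_resolved` is a hypothetical helper, and deciding which events count as anomalous is left to the operator.

```python
from datetime import datetime, timedelta


def can_declare_resolved(resumed_at, anomaly_times, now,
                         quiet=timedelta(minutes=30)):
    """True once `quiet` has elapsed since the kill switch was lifted
    and no anomalous event has been seen at or after that point."""
    if any(t >= resumed_at for t in anomaly_times):
        return False
    return now - resumed_at >= quiet
```

Any anomaly after resume keeps the incident open regardless of how much time has passed.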
- Engage kill switch to prevent further automated actions with the leaked credential.
- Rotate the credential:
  - User: force password reset via admin UI or `User.find(...).request_password_reset!`
  - API key: `User.find(...).api_keys.find(...).revoke!`
  - Worker JWT: `Worker.find(...).rotate_token!` (or restart the worker; JWT is short-lived)
  - Vault-backed secret: rotate via Vault, then `VaultCredential` re-fetches on next access
- Audit the trail: `AuditLog.where(user_id: ..., created_at: <leak window>).order(:created_at)`.
- Notify affected accounts if any non-self resources were accessed during the window.
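The audit-and-notify steps amount to filtering the audit trail by actor and time window, then collecting the owners of any resources that did not belong to the leaked account. The sketch below works on plain dictionaries; the field `resource_owner_id` is an assumption for illustration and may not match the real `AuditLog` schema.

```python
from datetime import datetime


def accounts_to_notify(entries, leaked_user_id, start, end):
    """Owners of non-self resources touched by the leaked credential
    between `start` and `end` (inclusive), sorted for stable output."""
    return sorted({
        e["resource_owner_id"]           # hypothetical field name
        for e in entries
        if e["user_id"] == leaked_user_id
        and start <= e["created_at"] <= end
        and e["resource_owner_id"] != leaked_user_id
    })
```

An empty result means only the account's own resources were accessed during the window, so no third-party notification is implied by the audit trail alone.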
- Kill switch.
- Identify the access pattern: query `audit_logs` for the actor and resource_type within the suspected window.
- Pull the access logs from the load balancer / reverse proxy (Traefik) for HTTP-level evidence.
- Snapshot the database before any remediation (`backup-database.sh "incident_$(date +%s)"`) to preserve evidence.
- Engage legal/compliance per your regulatory obligations.
- Take affected host(s) off the load balancer immediately — do not stop services first (running services preserve evidence in memory).
- Snapshot disk + memory (`virsh snapshot-create` if KVM, or `aws ec2 create-snapshot` if EC2).
- Engage external IR if you have a retainer. The cost of an outside firm is small compared to mishandled forensics.
- Rotate ALL credentials reachable from the compromised host — assume worst case.
Powernode does not ship a status page out of the box. If you don't have one, in-platform notifications via:
platform.send_proactive_notification(
account_id: "all" | "<specific>",
title: "...",
body: "...",
severity: "high"
)
reach logged-in users immediately via ActionCable.
| Severity | Update cadence | Channel |
|---|---|---|
| SEV1 | Every 15 min until resolution | War-room |
| SEV2 | Every 30 min | Incident channel |
| SEV3 | Hourly | Incident channel |
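One way to keep the cadence honest during an incident is to compute when the next update is due. The sketch below is illustrative; `UPDATE_CADENCE` and `next_update_due` are hypothetical names, and SEV4 returns `None` because the table defines no cadence for it.

```python
from datetime import datetime, timedelta

# Update cadence from the table above, keyed by severity number.
UPDATE_CADENCE = {
    1: timedelta(minutes=15),
    2: timedelta(minutes=30),
    3: timedelta(hours=1),
}


def next_update_due(severity, last_update):
    """Return when the next status update is due, or None if no cadence applies."""
    cadence = UPDATE_CADENCE.get(severity)
    if cadence is None:
        return None
    return last_update + cadence
```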
If user data is affected — even potentially — communicate proactively. Template:
Subject: Service incident — [brief description]
Hi,
At [start time UTC], we detected [symptom]. Our team engaged immediately and [restored service | confirmed contained] at [end time UTC].
Impact: [what users experienced]
Cause: [if known, plain language; if not, "investigation ongoing"]
What we're doing: [remediation steps + timeline for full post-mortem]
If you have questions, reply to this email.
Run within 5 business days of any SEV1/SEV2. Blameless format.
# Incident YYYY-MM-DD: [short title]
**Severity:** SEV[1|2|3|4]
**Duration:** [start UTC] — [end UTC] ([N] hours [M] minutes)
**Detected by:** [alert | customer report | engineer noticed]
**Resolved by:** [name / on-call rotation]
## Impact
- Number of accounts affected:
- User-visible symptoms:
- Data integrity impact (none / data loss / corruption):
- SLA breach: yes/no, by [N] minutes
## Timeline (all times UTC)
| Time | Event |
|------|-------|
| HH:MM | First symptom (in logs / metrics / user report) |
| HH:MM | Alert fired / report received |
| HH:MM | Engineer acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Remediation applied |
| HH:MM | All-clear |
## What happened (technical)
[Narrative: state of system before, change that triggered, propagation, observable effects.]
## Root cause
[The actual underlying cause, not the proximate trigger. Use 5-whys if the cause is non-obvious.]
## What went well
- [What allowed faster detection / shorter MTTR]
## What went poorly
- [Gaps in monitoring, runbooks, tooling]
## Action items
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| [Fix root cause permanently] | | | open |
| [Add monitoring/alert that would have caught this] | | | open |
| [Update runbook section X] | | | open |
## Lessons (for future on-call rotations)
[1-2 bullet points worth remembering. Add to `platform.create_learning` so other operators benefit.]
- production-deployment.md: deploy / rollback procedures
- postgres-backup.md: disaster recovery
- worker-operations.md: worker queue/job operations
- docker-swarm.md: Swarm-specific operations
- ai-operations.md: AI agent operations
- observability.md: log aggregation + monitoring