A collection of hard-won systems engineering knowledge -- the traps you walk into when you don't think about cardinality, idempotency, cache invalidation, and the other concepts that only matter at scale. Designed for Claude Code, Cursor, and any agent that writes infrastructure code on your behalf.
npx skills add triggerdotdev/staff-engineering-skills

AI coding agents write code that works on day one and breaks on day ninety. They generate solutions that pass tests with small datasets, handle the happy path gracefully, and completely fall apart under real production conditions.
These skills inject systems engineering knowledge directly into the agent's context at the moment it matters -- when it's making an architectural decision, writing a database query, designing a queue consumer, or choosing how to cache something.
Each skill activates contextually based on what the agent is doing. No slash commands, no manual invocation. The agent writes code that touches a cache, and the cache invalidation skill is there. The agent adds retry logic, and the retry storms skill is there.
Each skill covers:
- Detection Heuristics -- patterns to watch for in code ("if you see X, stop and think")
- Correct Patterns -- what to do instead, with code examples
- Anti-Patterns -- what not to do, with explanations of why
- Related Traps -- links to other skills that interact with this one
| # | Trap | One-liner |
|---|---|---|
| 01 | Cardinality | You stored one thing per user. Now you have a million users. |
| 02 | Denormalization | You optimized for reads. Now your writes are a nightmare. |
| 03 | Streams vs Batch | You processed everything in a loop. Now you need it in real-time. |
| 04 | Object Store as Database | S3 is a database now. But only if you use conditional writes. |
| 05 | Race Conditions | It works every time. Except when two things happen at once. |
| 06 | Idempotency | The request was sent twice. Now the customer was charged twice. |
| 07 | Sharding | One database was fine. Now you need twelve and a routing layer. |
| 08 | Consistency Models | You read your own write. Except you didn't. |
| 09 | Distributed System Fallacies | The network is reliable. Except it isn't. |
| 10 | Cache Invalidation | You cached it for performance. Now it's stale and no one knows. |
| 11 | Memory Leaks | It runs fine for a day. On day seven the OOM killer visits. |
| 12 | Backpressure | You can produce faster than you can consume. Now what? |
| 13 | Thundering Herd | The cache expired. Ten thousand requests hit the database at once. |
| 14 | Hot Partitions | You sharded perfectly. One shard is on fire. |
| 15 | Retry Storms | The service is slow. Your retries made it slower. |
| 16 | Clock Skew | You used timestamps for ordering. Time disagreed. |
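To make one row of the table concrete, here is a minimal sketch of trap 05 (Race Conditions) as a lost update on a shared balance. It is not taken from the skill files: `store`, `withdrawUnsafe`, `withdrawSafe`, and `demo` are hypothetical names, the in-memory object stands in for a database row, and the promise-chain lock only serializes callers inside a single process. Across multiple instances you would more likely reach for an atomic conditional update or a transaction.

```typescript
// Hypothetical in-memory stand-in for a database row.
const store = { balance: 100 };

// Read, yield (as real I/O would), then write based on the stale read.
async function withdrawUnsafe(amount: number): Promise<boolean> {
  const current = store.balance;      // read
  await Promise.resolve();            // simulated I/O: other callers run here
  if (current < amount) return false; // check against a possibly stale value
  store.balance = current - amount;   // write overwrites concurrent updates
  return true;
}

// Serialize the read-check-write section by chaining each call onto the previous one.
let tail: Promise<unknown> = Promise.resolve();

function withdrawSafe(amount: number): Promise<boolean> {
  const run = tail.then(async () => {
    const current = store.balance;
    await Promise.resolve();
    if (current < amount) return false;
    store.balance = current - amount;
    return true;
  });
  tail = run; // the next caller waits for this one to finish
  return run;
}

// Runs two concurrent withdrawals of 30 against a balance of 100, both ways.
async function demo(): Promise<{ unsafe: number; safe: number }> {
  store.balance = 100;
  await Promise.all([withdrawUnsafe(30), withdrawUnsafe(30)]);
  const unsafe = store.balance;

  store.balance = 100;
  await Promise.all([withdrawSafe(30), withdrawSafe(30)]);
  const safe = store.balance;

  return { unsafe, safe };
}
```

In the unsafe version both callers read 100 before either writes, so the final balance is 70 even though 60 was withdrawn. The serialized version ends at 40. With a single caller the two behave identically, which is why this kind of bug tends to pass tests.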
Each skill is a SKILL.md file with YAML frontmatter containing a name and description. The description tells the agent runtime when to activate the skill -- no manual invocation needed.
For example, the race conditions skill activates when the agent writes code that "reads then writes shared state, checks a condition then acts on it, creates records that should be unique, updates counters or balances, transitions status fields, or handles webhook/queue retries."
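As a rough illustration of that file shape, the sketch below embeds a hypothetical SKILL.md and pulls out the two frontmatter fields. The `name` value is assumed to mirror the directory name, the description wording is paraphrased rather than copied from the real file, and `parseFrontmatter` and `SkillMeta` are illustrative names, not the agent runtime's API. It handles only flat `key: value` lines; a real loader would presumably use a full YAML parser.

```typescript
// Hypothetical SKILL.md contents; the real files may carry different wording.
const skillFile = [
  "---",
  "name: staff-engineering-skills-race-conditions",
  "description: Use when writing code that reads then writes shared state, checks a condition then acts on it, or updates counters or balances.",
  "---",
  "",
  "Skill body goes here.",
].join("\n");

interface SkillMeta {
  name: string;
  description: string;
}

// Parses flat `key: value` pairs between the opening and closing `---` lines.
function parseFrontmatter(source: string): SkillMeta {
  const lines = source.split("\n");
  if (lines[0]?.trim() !== "---") throw new Error("missing frontmatter");
  const end = lines.indexOf("---", 1);
  if (end === -1) throw new Error("unterminated frontmatter");

  const fields = new Map<string, string>();
  for (const line of lines.slice(1, end)) {
    const sep = line.indexOf(":"); // split on the first colon only
    if (sep === -1) continue;
    fields.set(line.slice(0, sep).trim(), line.slice(sep + 1).trim());
  }

  const name = fields.get("name");
  const description = fields.get("description");
  if (!name || !description) throw new Error("name and description are required");
  return { name, description };
}

const meta = parseFrontmatter(skillFile);
```

The `description` string is the part doing the work here: it is the only signal the runtime has for deciding whether the skill is relevant to the code being written.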
Skills reference each other. Sharding links to hot partitions. Cache invalidation links to thundering herd. Retry storms links to idempotency. The skills form a graph, not a list.
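One way to picture that graph is as an adjacency list. The sketch below encodes only the three links named above; the keys are shortened, hypothetical identifiers, and `relatedFrom` is an illustrative helper rather than anything the skills ship with.

```typescript
// Only the three links mentioned in the text; the real graph likely has more edges.
const relatedTraps: Record<string, string[]> = {
  "sharding": ["hot-partitions"],
  "cache-invalidation": ["thundering-herd"],
  "retry-storms": ["idempotency"],
};

// Breadth-first walk collecting every skill reachable from `start`, excluding `start` itself.
function relatedFrom(start: string, graph: Record<string, string[]>): string[] {
  const seen = new Set<string>([start]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const next of graph[current] ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  seen.delete(start);
  return Array.from(seen);
}

const fromSharding = relatedFrom("sharding", relatedTraps);
```

With just these edges, `relatedFrom("sharding", relatedTraps)` yields only `hot-partitions`. If further links were added, for example from hot partitions onward, the walk would follow them transitively.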
Candidates that didn't make the initial cut:
- Connection Pool Exhaustion -- running out of database connections under load
- Schema Migrations at Scale -- ALTER TABLE on a billion-row table
- Observability Cardinality -- your metrics have more unique label combinations than data points
- Poison Pill Messages -- one bad message in a queue blocks everything behind it
- Lease/Lock Expiry -- your distributed lock expired while you were still holding it
staff-engineering-skills/
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── .github/ # Issue and PR templates
└── skills/ # Skill files (what gets installed)
    ├── staff-engineering-skills-cardinality/
    │   └── SKILL.md
    ├── staff-engineering-skills-denormalization/
    │   └── SKILL.md
    ├── staff-engineering-skills-streams-vs-batch/
    │   └── SKILL.md
    ├── staff-engineering-skills-object-store-as-database/
    │   └── SKILL.md
    ├── staff-engineering-skills-race-conditions/
    │   └── SKILL.md
    ├── staff-engineering-skills-idempotency/
    │   └── SKILL.md
    ├── staff-engineering-skills-sharding/
    │   └── SKILL.md
    ├── staff-engineering-skills-consistency-models/
    │   └── SKILL.md
    ├── staff-engineering-skills-distributed-system-fallacies/
    │   └── SKILL.md
    ├── staff-engineering-skills-cache-invalidation/
    │   └── SKILL.md
    ├── staff-engineering-skills-memory-leaks/
    │   └── SKILL.md
    ├── staff-engineering-skills-backpressure/
    │   └── SKILL.md
    ├── staff-engineering-skills-thundering-herd/
    │   └── SKILL.md
    ├── staff-engineering-skills-hot-partitions/
    │   └── SKILL.md
    ├── staff-engineering-skills-retry-storms/
    │   └── SKILL.md
    └── staff-engineering-skills-clock-skew/
        └── SKILL.md
- Detection over prevention. Skills teach agents to recognize when they're about to walk into a trap. "If you see X, stop and think about Y."
- Concrete over abstract. Real code patterns, real failure modes, real numbers. No hand-waving about "consider scalability."
- Composable. Traps reference each other. The skills form a dependency graph that mirrors how these problems interact in production.
- Agent-native. Written for LLM context windows. Concise, pattern-matchable, with clear "if you see X, do Y" heuristics.
- Tradeoffs, not rules. Every pattern has a tradeoff callout. There are no silver bullets, only informed decisions.