GEDD — find what your AI agent gets wrong

You shipped a product powered by AI agents. Now you have to tell your CEO whether it's good enough and, if it isn't, tell engineering exactly what to fix. The agent fails in ways no rubric anticipated, and the eval tools your team installed expect you to know what to measure before you've seen what breaks.

GEDD is the tool for before you have a rubric.

The eval pipeline is the product. The agent is just the thing it produces.

📖 Read the why: Why Grounded Theory? for reliable AI Agents — the long-form argument behind this repo.

What you do in GEDD

Five steps, with a conversational coach guiding you through each one. No YAML, no SDK, no Python.

Define your agent. What it's for, who uses it, what it should do.
Write a system prompt with the coach's help.
Generate golden test queries — happy path, edge cases, adversarial, ambiguous. The coach proposes; you keep what fits.
Run them, watch what breaks. Side-by-side across up to 3 models. Mark each response ✓ / ⚠ / ✗.
Name the failure patterns in your own words — "policy hallucination," "missed escalation," "tone collapse under hostility." GEDD turns those names into a deployable judge that engineering can wire into CI.

That's it. The whole flow takes about 90 minutes for a real agent with 8–12 golden queries. The first 30 minutes get you to "I now know my agent's top 3 failure modes." Most teams stop there and ship.
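
As a rough illustration of step 5, the sketch below assembles named failure patterns into a judge prompt. It is not GEDD's actual generator: `FailureCode`, `build_judge_prompt`, the `hard_fail` flag, and the prompt wording are all hypothetical, chosen only to show how codes in your own vocabulary can become explicit judging criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureCode:
    """A failure pattern named by the annotator (hypothetical structure)."""
    name: str
    description: str
    hard_fail: bool = False  # assumption: some codes fail a response outright


def build_judge_prompt(agent_name: str, codes: list[FailureCode]) -> str:
    """Render failure codes as a plain-text rubric for an LLM judge."""
    lines = [
        f"You are evaluating responses from {agent_name}.",
        "Check the response against each failure pattern below.",
        "",
    ]
    for index, code in enumerate(codes, start=1):
        marker = " [HARD FAIL]" if code.hard_fail else ""
        lines.append(f"{index}. {code.name}{marker}: {code.description}")
    lines += [
        "",
        "Reply with PASS or FAIL, then list the names of any patterns you found.",
    ]
    return "\n".join(lines)


codes = [
    FailureCode("policy hallucination", "States a policy that does not exist.", hard_fail=True),
    FailureCode("missed escalation", "Fails to hand off when a human is needed."),
    FailureCode("tone collapse under hostility", "Turns curt or defensive with an angry user."),
]
print(build_judge_prompt("TravelBot", codes))
```

The point of the design is that the rubric lines come straight from the annotator's names, so the judge and the people reviewing its verdicts share one vocabulary.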

Quick start

cd grounded-evals
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python -m grounded_evals.app

Open http://localhost:8080 — TravelBot loads automatically, no login required. Click through the tabs to explore the full pipeline.

To run against your own agent you'll need AWS credentials (Bedrock) or an ANTHROPIC_API_KEY. Set ADMIN_PASSWORD=your-password to enable the login wall for shared deployments.

Try it before you commit to it

The home page has 17 one-click demo scenarios — no LLM calls needed. Each is pre-loaded with golden queries, human annotations, error codes, a paradigm model, and a generated judge. Walk the entire pipeline in 5 minutes.

Demo	Domain	Key failure modes
TravelBot	Flight booking (SkyLink Travel)	Hallucinated entities, fabricated booking data, confident confabulation
ClinicalBot	Clinical triage (MedPulse Health)	Missed escalation, contraindication miss, overconfident diagnosis
LexBot	Legal assistant (Lexara Law Suite)	Jurisdiction error, unauthorized legal advice, statute misquote
WealthBot	Financial planning (PrimeWealth)	Unlicensed advice, projection hallucination, risk misclassification
HRBot	HR policy Q&A (TalentPulse)	Policy misquote, confidentiality breach, discriminatory guidance
EduBot	Student learning (Athena Learning)	Answer reveal, grade inflation, curriculum mismatch
VaultEx AI	Crypto exchange (VaultEx)	Regulatory misguidance, fee hallucination, wallet security gaps
PixelGuard	Gaming moderation (NexusGames)	False positive bans, harassment miss, appeals mishandling
InsureBot	Insurance claims (ShieldPoint)	Bad-faith denial, coverage hallucination, state regulation miss
PropBot	Real estate (NestKey Realty)	Fair Housing steering, fabricated comps, disclosure miss
RxBot	Pharmacy (PharmaLink)	Drug interaction miss, dosage unit confusion (mg vs mcg), off-label promotion
TaxBot	Tax/accounting (FileSmart)	Deduction hallucination, entity misguidance, Circular 230 violation
ClaimsBot	Defense contracting (AeroGuard)	ITAR violation, CUI spillage, foreign national access error
FoodBot	Food safety (SafePlate)	Allergen cross-contact miss, HACCP temp error, anaphylaxis delay
AutoBot	Automotive (DrivePulse Motors)	Lemon law omission, FTC CARS Rule violation, odometer fraud miss
MigrateBot	Immigration (PathForward Legal)	Asylum deadline miss, unauthorized practice, bar misapplication
EnergyBot	Energy/utilities (GridSync)	Solar ITC outdated (§25D terminated), NEM 3.0 confusion, DC voltage safety

Load any scenario and explore every tab — Eval, Tag, Root Causes, Build Judge, Report — all pre-populated.

Why this works

Most eval tools ask: what should we measure? — then build rubrics from assumptions. GEDD asks: what is actually happening? — then builds the rubric from evidence.

You can't evaluate what you haven't observed. Pre-baked rubrics miss the failures unique to your agent.
Criteria should be weighted by evidence. A bereavement-handling failure doesn't carry the same severity as a tone slip.
Your evaluation evolves with the agent. New patterns surface as you ship; the methodology absorbs them naturally.
Your work becomes load-bearing. The judge GEDD generates is in your domain vocabulary, not a generic "helpfulness 1-5."
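
One way to picture "weighted by evidence" is to score each failure code by how often it was observed, multiplied by a severity weight. This is only a sketch under stated assumptions: the annotation shape (`{"codes": [...]}`), the severity numbers, the example code names, and the `rank_failure_codes` helper are hypothetical, and GEDD's own weighting may differ.

```python
from collections import Counter


def rank_failure_codes(annotations, severity):
    """Rank codes by (share of responses showing the code) x (severity weight)."""
    total = len(annotations)
    if total == 0:
        return []
    # set() so a code tagged twice on one response is counted once
    counts = Counter(code for ann in annotations for code in set(ann["codes"]))
    scores = {
        code: (count / total) * severity.get(code, 1.0)  # unknown codes weigh 1.0
        for code, count in counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical annotations: four responses, three of them with a failure code.
annotations = [
    {"codes": ["tone slip"]},
    {"codes": ["tone slip"]},
    {"codes": ["bereavement mishandled"]},
    {"codes": []},
]
severity = {"tone slip": 1.0, "bereavement mishandled": 5.0}

for code, score in rank_failure_codes(annotations, severity):
    print(f"{code}: {score:.2f}")
```

Here the tone slip appears twice as often, yet the bereavement failure ranks first (1/4 × 5 = 1.25 against 2/4 × 1 = 0.50) because its severity weight dominates.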

The methodology under the hood is grounded theory — the same discipline social scientists use to find patterns in human data. We use it to find patterns in agent failures. The full mapping lives in METHODOLOGY.md.

What it's not

Not a tracing or observability tool. It doesn't ingest live production traces. Bring your traces (paste them in, or run queries through GEDD itself).
Not a metric library. No pre-built "faithfulness," "hallucination index," or 20-evaluator zoo. You discover your metrics; the tool makes them deployable.
Not a one-shot rubric generator. It's a workflow, not a button. Plan ~90 minutes the first time.

For engineers: CLI and Claude Code skills

The web UI is built for PMs. If you'd rather stay in the terminal, there are two engineer-native paths that produce the same outputs — a Claude Code skill that runs the full pipeline conversationally, and a standalone CLI for scripting and CI.

Both read and write the same session.json file, so you can switch between them freely mid-session.

Claude Code skills

`/gedd-chat` — full pipeline in one conversation

cd grounded-evals
claude        # opens Claude Code CLI

/gedd-chat

Claude reads session.json if it exists and resumes where you left off. The full 6-step pipeline runs inside the conversation:

Step 1  Define Agent        Name, capabilities, target users, domain → saved to session.json
Step 2  System Prompt       Draft and refine collaboratively → saved to session.json
Step 3  Golden Queries      Open Coding: fracture domain → generate queries in batches
                            Live coverage table shown after every approved batch:

                            Saturation: happy_path 3/3 ✓ | edge_case 2/3 ~ | adversarial 1/3 ✗
                            Overall: 1/6 categories saturated (17%)

Step 4  Eval                Runs queries against your model inline — no CLI switch needed
Step 5  Annotation          Shows each Q/A pair in conversation, collects ✓/⚠/✗ + error codes,
                            writes annotations to session.json in real time
Step 6  Export              Summarizes failure modes, offers export/web UI/judge generation

Type quit at any point — state is saved after every turn.
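
The coverage line shown in Step 3 can be approximated with a few lines of Python. This is a sketch, not the skill's implementation: the target of 3 queries per category is read off the `3/3` in the example above, and the rule for the `~` mark (at least half the target) is an assumption, as is the `saturation_report` name.

```python
def saturation_report(counts, target=3):
    """Summarize per-category query coverage against a per-category target."""
    parts = []
    saturated = 0
    for category, count in counts.items():
        if count >= target:
            mark = "✓"
            saturated += 1
        elif count * 2 >= target:
            mark = "~"  # assumption: at least half the target counts as partial
        else:
            mark = "✗"
        parts.append(f"{category} {min(count, target)}/{target} {mark}")
    percent = round(100 * saturated / len(counts)) if counts else 0
    header = "Saturation: " + " | ".join(parts)
    overall = f"Overall: {saturated}/{len(counts)} categories saturated ({percent}%)"
    return header + "\n" + overall


# Category names follow the query types listed earlier in this README.
counts = {"happy_path": 3, "edge_case": 2, "adversarial": 1, "ambiguous": 0}
print(saturation_report(counts))
```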

`/gedd-status` — session dashboard

/gedd-status

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  GEDD Session Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Agent      : TravelBot
  Step       : 4 / 6  (Eval)
  Session    : session.json

  ── Golden Queries ──────────────────────────
  happy_path        █████   5   ✓ saturated
  edge_case         ███░░   3   ✓ saturated
  adversarial       ███░░   3   ✓ saturated

  ── What's next ─────────────────────────────
  Run evaluation → grounded-evals eval
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

CLI commands

Command	What it does
`chat`	Conversational coaching — Steps 1-4, saves to `session.json`
`eval`	Run golden queries against a model, save responses
`annotate`	Interactively mark responses correct / partial / incorrect
`status`	Terminal dashboard — agent, step, saturation, annotations, error codes
`judge`	Generate a deployable LLM-as-a-Judge prompt from your error codes
`export`	Write golden dataset as JSONL, CSV, or JSON
`fracture`	Fracture an agent spec YAML into test categories (Open Coding)
`check-saturation`	Check whether a dataset has reached theoretical saturation
`coverage`	Show a bar-chart coverage breakdown by category
`compare`	Check whether a new prompt adds unique coverage to a dataset
`serve`	Start the web UI

grounded-evals --help          # all commands
grounded-evals chat --help     # options for a specific command

Multiple agents

Both paths support --session to keep separate files per agent:

grounded-evals chat     --session travelbot.json
grounded-evals eval     --session travelbot.json
grounded-evals annotate --session travelbot.json
grounded-evals export   --session travelbot.json --format jsonl
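
Once a dataset has been exported as JSONL, it can be inspected with the standard library alone. The sketch below relies only on the JSON Lines convention (one JSON object per line) and deliberately assumes nothing about GEDD's field names, which this README does not document; `load_jsonl` and `summarize` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file: one JSON value per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error
    return records


def summarize(records):
    """Report the record count and the union of top-level keys."""
    keys = sorted(
        {key for record in records if isinstance(record, dict) for key in record}
    )
    return {"records": len(records), "keys": keys}
```

Calling `summarize(load_jsonl(path))` on an exported file, with `path` pointing at wherever your export was written, shows how many golden queries it holds and which fields each record carries before you wire it into anything else.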

How it actually feels

[ Home ]            One-click demos + your saved sessions
   ↓
[ Coach ]           Conversational. Define agent, system prompt,
                    golden queries — guided by an AI coach.
   ↓
[ Eval ]            Run queries against models. Mark ✓ / ⚠ / ✗.
   ↓
[ Tag Failures ]    Annotate what failed and why, in your own words.
                    Codes accumulate in a sidebar.
   ↓
[ Map Root Causes ] Drag your codes onto a paradigm canvas: causes,
                    contexts, consequences. Optional but useful.
   ↓
[ Build Judge ]     Generate a deployable judge prompt. Calibrate it
                    against your own scoring (κ ≥ 0.80). Export.
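
The calibration target in the Build Judge step is Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between your labels and the judge's, and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below computes it from two label lists. The `cohens_kappa` function and the sample labels are illustrative and are not taken from GEDD's code.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    n = len(human)
    if n == 0:
        raise ValueError("need at least one labelled response")

    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(h == j for h, j in zip(human, judge)) / n

    # Expected chance agreement from each rater's own label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    p_e = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in human_counts.keys() | judge_counts.keys()
    )

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten responses.
human = ["✓", "✓", "✓", "✓", "✓", "✓", "✗", "✗", "✗", "✗"]
judge = ["✓", "✓", "✓", "✓", "✓", "✗", "✗", "✗", "✗", "✗"]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")
```

In this example the raters agree on 9 of 10 responses (p_o = 0.9) and chance agreement is 0.6 × 0.5 + 0.4 × 0.5 = 0.5, so κ = (0.9 − 0.5) / (1 − 0.5) = 0.8, which sits right at the threshold. Raw agreement of 90% overstates how well the judge tracks your scoring, which is why the threshold is stated in κ.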

Guides and further reading

Guide	What it covers
Cohen's Kappa for LLM Judges	What κ is, how to compute it, how to interpret it, and how to iterate your rubric until κ ≥ 0.80
Building an LLM-as-a-Judge	Full rubric design, weighting, hard-fail rules, few-shot calibration, and export
Domain Expert Guide	End-to-end walkthrough of all 5 steps for PMs and SMEs
PM Artifacts → Production Judge	Step-by-step guide for ML engineers: turn golden queries, annotations, and codebook into a calibrated CI judge

For AWS Bedrock setup, environment variables, deployment, project structure, and contribution guidelines, see SETUP.md.

⭐ Found this useful?

If GEDD helped you find what your agent gets wrong, a star helps others find it too.

Support and license

Security issues: see CONTRIBUTING.

License: MIT-0. See LICENSE.
