Cross-LLM consistency — model-awareness, evaluation pipeline, and prompt hardening #127

@abeltrano

Issue: Enhancement: Cross-LLM consistency — model-awareness, evaluation pipeline, and prompt hardening

Summary

PromptKit is designed as a model-agnostic prompt composition system, but the prompts it assembles are ultimately executed by LLMs — and different LLMs interpret the same instructions with measurably different fidelity. A prompt that reliably produces a well-structured investigation-report on GPT-4o may produce an incomplete or re-ordered output on a smaller model, or may omit epistemic labels on a model that wasn't trained to follow that convention strictly.

This issue proposes a plan to make PromptKit model-aware and to drive toward predictable, deterministic outputs regardless of which LLM executes the assembled prompt.


Problem Statement

PromptKit's value proposition is that composing the right persona + protocols + format + template produces a reliable, high-quality output. But "reliable" is currently an implicit assumption — there is no mechanism to:

  1. Measure how much output quality and structure vary across LLMs for a given prompt
  2. Identify which prompt components or phrasings are fragile across model families
  3. Harden components against known model-specific failure modes
  4. Signal to users which templates have been validated on which models

The practical effect is that a PromptKit user running review-code on Claude Sonnet gets a different experience than one running it on GPT-4.1 or Gemini 2.0 Flash, not because the task differs, but because the prompt is inadvertently tuned to whichever model its author tested against.


Dimensions of Variation (Known Risk Areas)

| Dimension | Example Failure Mode |
| --- | --- |
| Format adherence | Model re-orders sections, omits required fields, invents section names |
| Protocol compliance | Model skips phases (e.g., hypothesis generation), treats multi-phase protocol as a checklist |
| Epistemic labeling | Model omits KNOWN/INFERRED/ASSUMED tags or uses them inconsistently |
| Section completeness | Model writes "None" instead of "None identified" or omits empty sections entirely |
| Instruction following precision | Model ignores quantitative constraints (e.g., "re-verify 3–5 specific claims") |
| Non-goal enforcement | Model expands scope beyond stated non-goals |
| Self-verification depth | Model produces shallow verification ("I have reviewed the above") vs. genuine re-checking |

Proposed Enhancement Plan

Phase 1 — Evaluation Framework

Define a prompt portability evaluation methodology:

  • Select a representative set of PromptKit templates (covering each category)
  • Define golden inputs: deterministic, minimal input fixtures (e.g., a known buggy C snippet for investigate-bug)
  • Define a scoring rubric for each template covering: section presence, field completeness, protocol phase coverage, epistemic label usage, non-goal adherence
  • Execute each template × input pair against a matrix of target LLMs
  • Record structured results (pass/fail per rubric criterion, plus qualitative notes)
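
The evaluation loop described above could be sketched roughly as follows. This is a minimal illustration, not PromptKit code: `Criterion`, `run_matrix`, the `## Findings` / `## Verification` section names, and the bracketed label format are all hypothetical, and the stub functions stand in for real LLM API calls.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List

# Hypothetical types for illustration; PromptKit does not define these.
ModelFn = Callable[[str], str]   # assembled prompt -> model output
Check = Callable[[str], bool]    # model output -> pass/fail


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Check


def run_matrix(
    prompts: Dict[str, str],     # template name -> assembled prompt incl. golden input
    models: Dict[str, ModelFn],  # model name -> callable
    rubric: List[Criterion],
) -> List[dict]:
    """Run every template x model pair and record pass/fail per criterion."""
    results = []
    for (template, prompt), (model, call) in product(prompts.items(), models.items()):
        output = call(prompt)
        results.append({
            "template": template,
            "model": model,
            "criteria": {c.name: c.check(output) for c in rubric},
        })
    return results


# Stub "models" standing in for real API calls (assumed outputs).
def compliant_model(prompt: str) -> str:
    return "## Findings\n[KNOWN] Off-by-one in loop bound.\n## Verification\nRe-checked 3 claims."


def terse_model(prompt: str) -> str:
    return "## Findings\nOff-by-one in loop bound."


LABELS = ("[KNOWN]", "[INFERRED]", "[ASSUMED]")  # assumed label syntax
rubric = [
    Criterion("has_findings_section", lambda out: "## Findings" in out),
    Criterion("has_verification_section", lambda out: "## Verification" in out),
    Criterion("uses_epistemic_labels", lambda out: any(tag in out for tag in LABELS)),
]

results = run_matrix(
    prompts={"investigate-bug": "<assembled prompt + golden input>"},
    models={"stub-compliant": compliant_model, "stub-terse": terse_model},
    rubric=rubric,
)
for row in results:
    print(row["template"], row["model"], row["criteria"])
```

Keeping each criterion as a named boolean check makes the recorded results easy to diff between runs; qualitative notes would need a separate free-text field.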

New component candidates:

  • Protocol: model-portability — authoring guidelines that make PromptKit components robust across model families (e.g., prefer numbered phases over bullet lists, always use imperative mood, avoid ambiguous pronouns, bound instruction scope explicitly)
  • Template: evaluate-prompt-portability — systematic evaluation of a PromptKit prompt against multiple LLMs using a scoring rubric
  • Format: portability-report — structured output capturing per-model, per-criterion scores and recommended prompt changes

Phase 2 — CI Pipeline Integration

Integrate evaluation into CI/CD:

  • Add a GitHub Actions workflow that runs a selected subset of golden-input × template pairs against configurable LLM endpoints (using GitHub Models or another API)
  • The workflow compares structured output against rubric expectations (regex/schema checks for required fields, section headers, epistemic label presence)
  • Failures surface as PR checks — a protocol change that breaks format adherence on a target model is caught before merge
  • Results are stored as workflow artifacts for trend analysis over time
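
As a rough sketch of the regex-style checks mentioned above, the following script validates section headers and label presence and maps the result to an exit code a CI step could use. The section names, the Markdown `## ` heading convention, and the function names are assumptions made for illustration, not an existing PromptKit rubric.

```python
import re
from typing import List

# Assumed conventions for this sketch: sections are Markdown "## " headings and
# epistemic labels appear as bare uppercase words.
REQUIRED_SECTIONS = ["Summary", "Findings", "Verification"]  # hypothetical names
LABEL_PATTERN = re.compile(r"\b(KNOWN|INFERRED|ASSUMED)\b")
HEADING_PATTERN = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def check_output(text: str) -> List[str]:
    """Return human-readable rubric failures; an empty list means pass."""
    failures = []
    headings = HEADING_PATTERN.findall(text)

    # Required sections that never appear.
    failures += [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in headings]

    # Required sections that do appear must follow the rubric order.
    present = list(dict.fromkeys(h for h in headings if h in REQUIRED_SECTIONS))
    expected = [s for s in REQUIRED_SECTIONS if s in present]
    if present != expected:
        failures.append(f"sections out of order: {present}")

    # Headings the rubric does not know about (invented section names).
    failures += [f"unexpected section: {h}" for h in headings if h not in REQUIRED_SECTIONS]

    if not LABEL_PATTERN.search(text):
        failures.append("no epistemic labels found")
    return failures


def exit_code(failures: List[str]) -> int:
    """A non-zero exit code fails the CI step."""
    return 1 if failures else 0


sample = "## Findings\nINFERRED: race in init path.\n## Summary\nOne defect.\n## Notes\nn/a\n"
failures = check_output(sample)
for failure in failures:
    print("FAIL:", failure)
print("exit code:", exit_code(failures))
```

Checks like these only cover structure. Whether a verification section contains genuine re-checking rather than a shallow statement would still need human or model-assisted review.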

Open questions:

  • Which LLMs to target in CI? (Cost, API availability, model stability — suggest: Claude Sonnet, GPT-4o, Gemini 2.0 Flash, Llama 3 as a baseline)
  • Should evaluation run per PR (expensive) or nightly (cheaper, but failures are harder to attribute to a specific change)?
  • How to handle non-determinism: temperature=0 where supported, fixed sampling seeds?
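
One possible way to approach the non-determinism question, offered only as an illustration and not as something this issue has settled on, is to sample each prompt several times and require a minimum pass rate instead of a single pass. The `pass_rate` helper, the threshold value, and the stub model below are hypothetical.

```python
from itertools import cycle
from typing import Callable


def pass_rate(call: Callable[[], str], check: Callable[[str], bool], runs: int = 5) -> float:
    """Fraction of `runs` independent samples whose output passes `check`."""
    passed = sum(1 for _ in range(runs) if check(call()))
    return passed / runs


# Stub standing in for a non-deterministic model: one of every five outputs
# drops the epistemic label. A real run would call an LLM endpoint here.
_outputs = cycle(["[KNOWN] a", "[KNOWN] b", "b", "[KNOWN] c", "[KNOWN] d"])


def flaky_model() -> str:
    return next(_outputs)


THRESHOLD = 0.8  # assumed tolerance; would need tuning per criterion

rate = pass_rate(flaky_model, lambda out: "[KNOWN]" in out, runs=5)
verdict = "pass" if rate >= THRESHOLD else "fail"
print(rate, verdict)
```

Repeated sampling multiplies API cost by `runs`, so it trades directly against the per-PR versus nightly question above.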

Phase 3 — Prompt Hardening Feedback Loop

Use evaluation data to improve PromptKit components:

  • For each discovered fragility, trace it to the responsible component (persona, protocol, format, or template)
  • Apply targeted rewrites following the model-portability protocol
  • Re-evaluate after the rewrite to confirm that the fragility is resolved and nothing else regressed
  • Add model_notes to template frontmatter recording known limitations and validated models:
model_notes:
  validated_on: [claude-sonnet-4, gpt-4o, gemini-2.0-flash]
  known_issues:
    - model: gpt-4.1-mini
      issue: "Omits Phase 3 self-verification step; adds a shallow summary instead"
      workaround: "Add explicit 'You MUST execute Phase 3...' reminder at end of protocol"
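
A shape check for this frontmatter block might look like the sketch below. It operates on an already-parsed mapping so that it does not depend on any particular YAML library. Treating `model` and `issue` as required and `workaround` as optional is an assumption, as is the function name.

```python
from typing import Any, Dict, List


def validate_model_notes(notes: Dict[str, Any]) -> List[str]:
    """Return a list of shape errors for a parsed model_notes mapping."""
    errors = []

    validated = notes.get("validated_on", [])
    if not isinstance(validated, list) or not all(isinstance(m, str) for m in validated):
        errors.append("validated_on must be a list of model names")

    for i, entry in enumerate(notes.get("known_issues", [])):
        if not isinstance(entry, dict):
            errors.append(f"known_issues[{i}] must be a mapping")
            continue
        # Assumption: 'model' and 'issue' are required, 'workaround' is optional.
        for key in ("model", "issue"):
            if not isinstance(entry.get(key), str) or not entry[key].strip():
                errors.append(f"known_issues[{i}] is missing '{key}'")
    return errors


# The example from this issue, as it would look after YAML parsing.
notes = {
    "validated_on": ["claude-sonnet-4", "gpt-4o", "gemini-2.0-flash"],
    "known_issues": [
        {
            "model": "gpt-4.1-mini",
            "issue": "Omits Phase 3 self-verification step; adds a shallow summary instead",
            "workaround": "Add explicit 'You MUST execute Phase 3...' reminder at end of protocol",
        }
    ],
}
print(validate_model_notes(notes))
```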

Phase 4 — Model Compatibility Matrix (Documentation)

Publish a model compatibility matrix in the docs:

  • Per-template, per-model compatibility scores (Verified ✅ / Partial ⚠️ / Known Issues ❌ / Not Tested ?)
  • Guidance for users on which models to prefer for high-stakes tasks
  • Link to evaluation run artifacts for auditability
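
To show how per-criterion results could roll up into the four statuses above, here is a minimal sketch. The mapping rule (all criteria pass means Verified, none pass means Known Issues, anything in between means Partial) is an assumption, as are the function names and the pipe-table output format.

```python
from typing import Dict, List, Optional, Tuple

Results = Dict[Tuple[str, str], Dict[str, bool]]  # (template, model) -> criterion results


def status(criteria: Optional[Dict[str, bool]]) -> str:
    """Map rubric results to a matrix cell (thresholds are assumed, not specified)."""
    if not criteria:
        return "Not Tested ?"
    passed = sum(criteria.values())
    if passed == len(criteria):
        return "Verified ✅"
    if passed == 0:
        return "Known Issues ❌"
    return "Partial ⚠️"


def render_matrix(results: Results, templates: List[str], models: List[str]) -> str:
    """Render a template x model table as pipe-delimited text."""
    header = "| Template | " + " | ".join(models) + " |"
    divider = "|" + " --- |" * (len(models) + 1)
    rows = [header, divider]
    for template in templates:
        cells = [status(results.get((template, model))) for model in models]
        rows.append(f"| {template} | " + " | ".join(cells) + " |")
    return "\n".join(rows)


# Hypothetical results; model names are placeholders.
results = {
    ("review-code", "model-a"): {"sections": True, "labels": True},
    ("review-code", "model-b"): {"sections": True, "labels": False},
}
matrix = render_matrix(results, ["review-code", "investigate-bug"], ["model-a", "model-b"])
print(matrix)
```

Generating the matrix from stored evaluation results, instead of editing it by hand, would keep the published docs tied to the auditable run artifacts.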

Scope of Changes

| Area | Change |
| --- | --- |
| protocols/guardrails/ | New model-portability.md protocol |
| templates/ | New evaluate-prompt-portability.md template |
| formats/ | New portability-report.md format |
| manifest.yaml | Register new components |
| .github/workflows/ | New evaluate-portability.yml CI workflow |
| docs/ | Model compatibility matrix |
| tests/ | Golden input fixtures + rubric definitions |

Success Criteria

  • At least 10 representative templates evaluated against ≥ 3 LLMs
  • Evaluation results are reproducible (deterministic inputs, recorded outputs)
  • At least one prompt hardening cycle completed and verified
  • CI workflow runs evaluations and reports pass/fail per template × model
  • model_notes frontmatter populated for all evaluated templates
  • Model compatibility matrix published in docs

Related

  • This enhancement extends the existing self-verification and anti-hallucination guardrails — those protocols assume the model will follow instructions, but don't harden the instructions against model-specific failure modes.
  • The extend-library interactive template is the recommended entry point for designing the new components (model-portability, evaluate-prompt-portability, portability-report).
  • The profile-session template (session log analysis) is complementary — it can help identify which protocol phases are being skipped in practice.
