SynSmith

Multi-Objective Prompt Debugging for Realistic, Diverse, and Attribute-Controlled Synthetic Data Generation

An open-source framework that reframes synthetic data generation as iterative, critic-guided prompt optimization.

Paper: SynSmith: Adversarial Multi-Critic Prompt Debugging for Synthetic Data Generation (Apartsin & Aperstein) - HTML with KaTeX math, or download .docx

Paper · Overview · Method · Quickstart · Architecture · Research questions · Citation

Overview

Large language models are increasingly used to bootstrap labeled data when real examples are scarce, expensive, private, or hard to annotate. In practice, naive prompting produces datasets that are repetitive, over-polished, and only shallowly tied to the requested labels. The standard fix is to hand-tune the prompt, which scales poorly and loses information about why a given sample went wrong.

SynSmith treats synthetic data generation as a closed-loop optimization problem. A generator LLM produces samples conditioned on explicit target attribute vectors. Seven LLM critics, three baseline (attribute verifier, realism discriminator, diversity auditor) and four GAN-style adversaries (Pack Discriminator, Mode-Seeking, Mode Hunter with persistent memory, Coverage Hole Finder via density-ratio estimation), score the batch along independent axes. A prompt updater consumes their structured feedback and rewrites the generator prompt for the next round.

The result is a GAN-style process in which the optimized variable is the prompt rather than the weights, with four simultaneous objectives:

attribute fidelity · realism · diversity · batch-level coverage

The framework ships three example datasets across difficulty regimes: customer-support intent classification (5 classes, 40 real examples), Banking77 cards-and-payments (10 classes, 300 real-train, 400 held-out test), and TREC question-type classification (6 classes, 60 real-train, 89 held-out test). The released artifacts include all per-seed runs, raw critic outputs, aggregation scripts, and the cross-condition classifier ensemble harness.

Headline results

Customer-support (N=10 seeds). Cross-condition classifier ensembling reaches macro F1 $0.947 \pm 0.056$, which is $+0.073$ over the three-critic baseline run on its own (BCa 95% CI $[+0.009, +0.141]$, excludes zero), and gains $+0.233$ on worst-class F1 (BCa 95% CI $[+0.067, +0.500]$, excludes zero), with $1.65\times$ lower seed variance than any individual condition.

Cross-task synth-only headline (N=5 seeds, sentence-transformer + LR on full canonical held-out splits).

Dataset	n_test	Real-only	SynSmith synth-only	Δ vs real	σ reduction
SST-2	872	0.704	0.731 ± 0.029 ✓	+0.027	1.7×
Banking77	400	0.950	0.876 ± 0.012	-0.074	5.8×
TREC	89	0.607	0.609 ± 0.056	+0.002	0.7×

The synthetic batch matches or exceeds the real-only baseline on SST-2 and TREC, and closes most of the gap on Banking77 while cutting seed variance by 5.8× (the class-balanced planner and regen-on-rejection eliminate per-class starvation). Iteration yields a roughly 2× gain in semantic diversity over non-iterated baselines (Vendi 19.3 vs 10.1).

Method

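The system implements a prompt-optimization loop driven by multiple, independent critic objectives. Each iteration runs the following procedure, shown here for the three baseline critics; the four adversarial critics feed the updater through the same structured-feedback channel: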

P_t  : current generator prompt
A    : attribute schema  (e.g. label × difficulty × ambiguity × style × noise)
R    : small real dataset (50–200 examples)

# 1. plan
targets ← AttributePlanner(A, history) ........... target attribute vectors
# 2. generate
S_t ← Generator(P_t, R, targets) ................. synthetic samples
# 3. critique
V_t ← AttributeVerifier(S_t, A) .................. per-sample attribute audits
D_t ← RealismDiscriminator(R ∪ S_t) .............. real-vs-synthetic verdicts
C_t ← DiversityAuditor(S_t, A) ................... batch-level coverage report
# 4. update
P_{t+1} ← PromptUpdater(P_t, V_t, D_t, C_t)

The discriminator's accuracy on the mixed batch is the realism signal: a healthy run drives it toward chance level. The auditor reports per-attribute coverage, near-duplicate rate, and named missing modes. The verifier flags specific failed attributes per sample. All three feedback signals are serialized into the updater's prompt template, so the rewrite is grounded in named failures rather than free-form self-critique.
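The procedure above can be sketched as a small, offline Python loop. This is only an illustration of the control flow: every name in it (`Feedback`, `plan`, `generate`, `verifier`, `discriminator`, `update_prompt`, `run_loop`) is a hypothetical stand-in rather than SynSmith's actual API, and the generator and critics are trivial stubs instead of LLM calls.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for illustration only; not SynSmith's real classes.

@dataclass
class Feedback:
    critic: str
    metrics: dict
    complaints: list = field(default_factory=list)

def plan(schema, n):
    """Cycle through label values to produce n target attribute vectors."""
    labels = schema["label"]
    return [{"label": labels[i % len(labels)]} for i in range(n)]

def generate(prompt, targets):
    """Stub generator; a real one would call an LLM with the prompt."""
    return [{"text": f"{prompt} :: {t['label']}", "target": t} for t in targets]

def verifier(samples):
    """Per-sample audit: does the text mention the requested label?"""
    failed = [s for s in samples if s["target"]["label"] not in s["text"]]
    rate = 1 - len(failed) / len(samples)
    return Feedback("verifier", {"attribute_match_rate": rate})

def discriminator(samples, real):
    """Stub judge: calls a text synthetic if it contains the '::' tell."""
    flagged = sum("::" in s["text"] for s in samples)
    correct = flagged + sum("::" not in r for r in real)
    acc = correct / (len(samples) + len(real))
    complaints = ["telltale '::' separator"] if flagged else []
    return Feedback("discriminator", {"discriminator_accuracy": acc}, complaints)

def update_prompt(prompt, feedback):
    """Append named complaints to the prompt; a real updater would rewrite it."""
    notes = [c for f in feedback for c in f.complaints]
    return prompt if not notes else prompt + " | avoid: " + "; ".join(notes)

def run_loop(prompt, schema, real, iterations=2, n=4):
    history = []
    for _ in range(iterations):
        targets = plan(schema, n)                      # 1. plan
        samples = generate(prompt, targets)            # 2. generate
        feedback = [verifier(samples),                 # 3. critique
                    discriminator(samples, real)]
        history.append({k: v for f in feedback for k, v in f.metrics.items()})
        prompt = update_prompt(prompt, feedback)       # 4. update
    return prompt, history
```

Because the stub generator ignores the complaints, the discriminator accuracy in this toy never moves toward chance; the point is only to show how named complaints, rather than a bare scalar, flow into the next prompt.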

Why seven critics

Critic	Question it answers	Failure mode it prevents	Class
Attribute Verifier	Does the text reflect the requested vector?	Metadata-only labels: the right attribute string with mismatched text	baseline
Realism Discriminator	Can a judge separate synthetic from real?	Over-polished, template-y, telltale LLM phrasing	baseline
Diversity Auditor	Does the batch cover the attribute space?	Mode collapse, shallow paraphrases, missing rare/edge cases	baseline
Pack Discriminator	Can a judge separate k-sample packs of real vs synthetic?	Batch-level homogeneity invisible to per-sample realism judges	PacGAN analog
Mode-Seeking	Does attribute variation produce surface variation?	Attribute-deaf generation: same text for different attribute vectors	MSGAN analog
Mode Hunter	Which LLM tics appear in synth but not real?	Recurring banned phrasings, opener tics, structural templates	ban-list training
Coverage Hole Finder	Which real examples does the synthetic batch fail to cover?	Distributional coverage holes the discriminator alone misses	density-ratio coverage

Removing any one critic produces a measurably degraded distribution along its axis; the seven-critic loop is referenced in the paper as full_attrforge and is the default configuration when synsmith run is invoked without an ablation flag.

Quickstart

Install

git clone https://github.com/ApartsinProjects/SynSmith.git
cd SynSmith
pip install -e ".[openai]"        # or .[anthropic], or .[all]

Dry run (no API key required)

The echo backend exercises the full pipeline offline against a stubbed model. Useful for CI, smoke tests, and reading the on-disk run layout.

synsmith run examples/customer_support/config.echo.yaml

Real run with OpenAI

export OPENAI_API_KEY=sk-...
synsmith run examples/customer_support/config.yaml --iterations 3
synsmith inspect runs/<run_id>

Run the seven-condition paper experiment (customer-support or Banking77)

# Customer-support, 10 seeds, 7 conditions, 3 iterations, 16 samples/iter
python scripts/run_experiments.py \
  --config examples/customer_support/config.yaml \
  --conditions naive few_shot self_critique realism_only diversity_only full_classic full_attrforge \
  --seeds 17 23 41 53 89 101 109 127 137 149 \
  --iterations 3 --samples-per-iteration 16 \
  --run-id main_run_002

# Banking77 cards-and-payments, 5 seeds
python scripts/run_experiments.py \
  --config examples/banking77/config.yaml \
  --conditions naive few_shot self_critique realism_only diversity_only full_classic full_attrforge \
  --seeds 17 23 41 53 89 \
  --iterations 3 --samples-per-iteration 16 \
  --run-id banking77_run_001

# Cross-condition classifier ensembling (the headline analysis)
python scripts/ensemble_deep.py --base main_run_002
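As a rough illustration of what a cross-condition logit-average ensemble does, the sketch below averages per-class logits from classifiers trained on different conditions and takes the argmax. It is not the code in scripts/ensemble_deep.py; the function name `ensemble_predict` and the input layout (one list of logits per condition for a single test example) are assumptions made for this example.

```python
def ensemble_predict(per_condition_logits):
    """Average per-class logits across conditions, then pick the top class.

    per_condition_logits: {condition_name: [logit for each class]}
    (a hypothetical layout for one test example).
    """
    rows = list(per_condition_logits.values())
    n_classes = len(rows[0])
    mean = [sum(row[c] for row in rows) / len(rows) for c in range(n_classes)]
    predicted = max(range(n_classes), key=mean.__getitem__)
    return predicted, mean

# Illustrative numbers only: one classifier disagrees and is outvoted.
logits = {
    "naive":          [2.0, 0.0, 0.0],
    "full_classic":   [0.0, 1.5, 0.0],
    "full_attrforge": [0.0, 2.0, 0.0],
}
predicted_class, mean_logits = ensemble_predict(logits)
```

Averaging in logit space lets a confident classifier outweigh an uncertain one, which is one plausible reason an ensemble across conditions would show lower seed variance than any single condition.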

Programmatic API

from synsmith import SynSmith

forge = SynSmith.from_config("examples/customer_support/config.yaml")
result = forge.run(iterations=3)

print(result.final_prompt)
print(result.metric_history[-1])
# {'attribute_match_rate': 0.92,
#  'discriminator_accuracy': 0.58,
#  'pack_accuracy': 0.53,
#  'mode_seeking_ratio': 0.18,
#  'hunter_library_size': 11,
#  'coverage_auroc': 0.99,
#  'near_duplicate_rate': 0.04,
#  'combination_coverage': 0.83, ...}

Adding your own critic

Every critic implements the same protocol (name, evaluate(batch, real, attrs) -> StructuredFeedback). To add a fifth GAN-style adversary or a domain-specific verifier:

# synsmith/critics/my_critic.py
from synsmith.schema import Critic, StructuredFeedback, NamedComplaint

class MyCritic(Critic):
    name = "my_critic"
    def evaluate(self, batch, real, attrs):
        return StructuredFeedback(
            critic=self.name,
            metrics={"my_score": 0.42},
            complaints=[NamedComplaint(tag="opener-tic", reason="every sample opens 'Hi team'")],
        )

Then wire it into synsmith/baselines.py and add a flag in the ablation table. The updater template will render its complaints alongside the existing critics automatically; no change to the loop or the prompt-update logic is required.

Architecture

                       ┌─────────────────────────┐
                       │   Attribute Schema A    │
                       │  + small real set R     │
                       └────────────┬────────────┘
                                    │
                                    ▼
                       ┌─────────────────────────┐
                       │   Attribute Planner     │      (stratified or
                       │                         │       coverage-gap)
                       └────────────┬────────────┘
                                    │  target attribute vectors
                                    ▼
                       ┌─────────────────────────┐
                       │      Generator G        │ ◄────────────────────────┐
                       │  (current prompt P_t)   │                          │
                       └────────────┬────────────┘                          │
                                    │  synthetic samples S_t                │
              ┌─────────────────────┼─────────────────────┐                 │
              ▼                     ▼                     ▼                 │
   ┌────────────────────┐ ┌────────────────────┐ ┌─────────────────────┐    │
   │ Attribute Verifier │ │  Realism           │ │  Diversity Auditor  │    │
   │  per-sample        │ │  Discriminator     │ │  batch-level        │    │
   │  failed-attr list  │ │  acc, conf, reason │ │  coverage, modes    │    │
   └─────────┬──────────┘ └─────────┬──────────┘ └──────────┬──────────┘    │
             │                      │                       │                │
             └──────────────┬───────┴──────────────┬────────┘                │
                            ▼                      ▼                         │
                       ┌─────────────────────────────────┐                   │
                       │       Prompt Updater U          │                   │
                       │   P_{t+1} = U(P_t, feedback)    │ ──────────────────┘
                       └─────────────────────────────────┘

Every component lives behind a small typed interface (synsmith/schema.py). Each can be swapped without touching the loop, which is what makes the ablations cheap. The whole run, prompts, targets, samples, verdicts, reports, and metrics, is persisted under runs/<id>/ so experiments are reproducible.

runs/<id>/
  config.yaml
  schema.yaml
  real_examples.jsonl
  manifest.json                 ← metric_history, prompt_history
  iter_000/
    prompt.txt
    targets.jsonl
    samples.jsonl
    attribute_verdicts.jsonl
    realism_verdicts.jsonl
    diversity_report.json
    metrics.json
  iter_001/ ...
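Since each run is persisted on disk, per-iteration trends can be inspected with a few lines of standard-library Python. The sketch below assumes that manifest.json stores metric_history as a list of per-iteration dictionaries; that shape, and the helper names `load_metric_history` and `metric_trend`, are assumptions for illustration rather than a documented format.

```python
import json
from pathlib import Path

def load_metric_history(run_dir):
    """Read metric_history from runs/<id>/manifest.json (assumed: list of dicts)."""
    manifest_path = Path(run_dir) / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return manifest.get("metric_history", [])

def metric_trend(history, key):
    """Collect one metric across iterations, skipping iterations that lack it."""
    return [metrics[key] for metrics in history if key in metrics]
```

For example, `metric_trend(load_metric_history("runs/<run_id>"), "discriminator_accuracy")` would give the realism signal per iteration, if the manifest follows the assumed shape.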

Research questions

This codebase is designed to make each of the following questions answerable with an ablation flag, not a re-implementation.

RQ1. Attribute fidelity. Does iterative prompt debugging improve the rate at which generated samples satisfy their requested attribute vector? Metrics: per-attribute precision, recall, F1, and total attribute-match accuracy.

RQ2. Realism. Does discriminator-guided prompt debugging drive an LLM judge's real-vs-synthetic accuracy toward chance? Metrics: discriminator accuracy, synthetic detection rate, calibration. A successful loop produces near-50% accuracy in balanced settings, with named artifacts disappearing over iterations.
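The two headline realism metrics can be computed from a list of judge verdicts. The sketch below is a minimal illustration; the verdict fields `is_synthetic` (ground truth) and `predicted_synthetic` (the judge's call) are hypothetical names and may not match the contents of realism_verdicts.jsonl.

```python
def realism_metrics(verdicts):
    """Compute discriminator accuracy and synthetic detection rate.

    verdicts: list of dicts with hypothetical boolean fields
    'is_synthetic' and 'predicted_synthetic'.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    correct = sum(v["predicted_synthetic"] == v["is_synthetic"] for v in verdicts)
    synthetic = [v for v in verdicts if v["is_synthetic"]]
    detected = sum(v["predicted_synthetic"] for v in synthetic)
    accuracy = correct / len(verdicts)
    return {
        "discriminator_accuracy": accuracy,
        # Share of synthetic samples the judge caught (recall on the synthetic class).
        "synthetic_detection_rate": detected / len(synthetic) if synthetic else 0.0,
        # Distance from chance; only meaningful when real and synthetic are balanced.
        "gap_to_chance": abs(accuracy - 0.5),
    }
```

On a balanced batch, a `gap_to_chance` near zero corresponds to the near-50% accuracy described above.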

RQ3. Diversity. Does coverage-guided debugging produce broader semantic coverage? Metrics: per-attribute entropy (normalized), combination coverage across pairs of attributes, embedding/TF-IDF near-duplicate rate, and a qualitative audit of missing modes.
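Two of these diversity metrics are simple enough to sketch with the standard library. The code below is an illustration under assumptions: samples are plain dictionaries of attribute values, the schema maps each attribute to its allowed values, and the function names are hypothetical rather than taken from synsmith/metrics.py.

```python
import math
from collections import Counter
from itertools import combinations, product

def normalized_entropy(values, n_possible):
    """Shannon entropy of observed values divided by log(n_possible), in [0, 1]."""
    if n_possible <= 1 or not values:
        return 0.0
    total = len(values)
    counts = Counter(values)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_possible)

def combination_coverage(samples, schema):
    """Fraction of all attribute-value pairs (over attribute pairs) seen in the batch."""
    covered = possible = 0
    for a, b in combinations(sorted(schema), 2):
        wanted = set(product(schema[a], schema[b]))
        seen = {(s[a], s[b]) for s in samples}
        covered += len(seen & wanted)
        possible += len(wanted)
    return covered / possible if possible else 0.0

# Illustrative batch: labels are balanced, but only one style ever appears.
schema = {"label": ["billing", "login"], "style": ["terse", "chatty"]}
batch = [
    {"label": "billing", "style": "terse"},
    {"label": "login", "style": "terse"},
]
label_entropy = normalized_entropy([s["label"] for s in batch], len(schema["label"]))
style_entropy = normalized_entropy([s["style"] for s in batch], len(schema["style"]))
pair_coverage = combination_coverage(batch, schema)
```

In this toy batch the label entropy is maximal while the style entropy is zero and only half of the label × style pairs are covered, which is the kind of gap a coverage-guided critic would report as a missing mode.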

RQ4. Downstream usefulness. Does SynSmith-generated data improve held-out real-test performance for a downstream classifier, especially on rare, hard, or ambiguous slices? Baselines include naive prompting, few-shot prompting, self-critique, diversity-only, realism-only, and human prompt refinement.

Baselines included

Baseline	`--conditions` flag	Description
Naive prompting	`naive`	One manually written prompt, no critic loop
Few-shot prompting	`few_shot`	8-exemplar few-shot, no iterative refinement
Self-critique	`self_critique`	Only the deterministic diversity-auditor; no LLM judges
Diversity-only	`diversity_only`	Coverage-guided refinement, no realism / verifier critics
Realism-only	`realism_only`	Realism discriminator + auditor; no verifier / GAN adversaries
Full classic (3-critic)	`full_classic`	Verifier + realism + auditor (3 baseline critics)
Full SynSmith (7-critic)	`full_attrforge`	All 7 critics: 3 baseline + 4 GAN-style adversaries (default)

Every baseline runs through the same harness and writes the same artifacts, so results are directly comparable.

Repository layout

synsmith/
├── schema.py            typed data models (Pydantic)
├── llm.py               backend-agnostic LLM client (OpenAI, Anthropic, echo)
├── planner.py           attribute planner (stratified, coverage-gap)
├── generator.py         per-target synthetic sample generation
├── critics/
│   ├── verifier.py      per-sample attribute audit
│   ├── discriminator.py mixed-batch real-vs-synthetic judge
│   ├── auditor.py       batch-level coverage and near-duplicate audit
│   ├── pack.py          Pack Discriminator (PacGAN analog)
│   ├── mode_seeking.py  attribute-distance / text-distance ratio (MSGAN)
│   ├── mode_hunter.py   persistent banned-phrasings library
│   └── coverage_hole.py density-ratio-based coverage finder
├── updater.py           prompt rewriter and versioned history
├── baselines.py         ablation builders for every named baseline
├── loop.py              orchestrator, persistence, run manifests
├── metrics.py           per-iteration scalar metrics
├── prompts/templates.py canonical prompt strings for every component
├── eval/downstream.py   sentence-transformer + LR downstream evaluator
└── cli.py               synsmith run | inspect | schema
examples/
├── customer_support/    5-class intent, 40 real seeds (30 train + 10 test)
└── banking77/           10-class card/payment subset, 300 train + 400 test
scripts/
├── run_experiments.py        per-condition runs across seeds
├── ensemble_deep.py          cross-condition logit-average ensemble
├── augmentation_eval.py      real + synthetic downstream eval
├── per_class_aug_eval.py     per-class F1 augmentation analysis
├── scarce_real_eval.py       n_real sweep for augmentation
├── reaudit_fixed.py          Vendi + MS-emb + 5-fold AUROC re-audit
├── diversity_metrics.py      distinct-n + self-BLEU-4
├── mmd_per_feature_space.py  MMD with TF-IDF word/char + sentence-transformer
└── worst_class_eval.py       worst-class F1 sweep
tests/                        schema, planner, end-to-end offline loop

Design notes

Backends are pluggable. OpenAI and Anthropic ship in-tree. An offline echo backend lets the full pipeline run without any API key, which is what the test suite and the dry-run config use.
Critics never see the planner's intent directly. They evaluate the resulting samples, then the planner uses their output indirectly through the updated prompt. This avoids the verifier becoming a noisy oracle.
The discriminator measures progress, not the loss. Realism feedback is surfaced to the updater as named artifacts ("too polished, follows a predictable structure"), not as a scalar to minimize, which empirically reduces mode chasing.
Prompt history is first-class. Every rewrite is appended to PromptHistory with the feedback bundle that motivated it. Comparing prompts side-by-side across iterations is the primary way to debug a run.
Determinism where it matters. The planner and generator are seeded; the critics use temperature 0.

Roadmap

Citation

If you use SynSmith in academic work, please cite:

@misc{apartsin2026synsmith,
  title  = {Adversarial Prompt Debugging for LLM Synthetic Data Generation},
  author = {Apartsin, Alexander and Aperstein, Yehudit},
  year   = {2026},
  url    = {https://github.com/ApartsinProjects/SynSmith},
  note   = {Holon Institute of Technology and Afeka College of Engineering, Israel.
            Paper: \url{https://apartsinprojects.github.io/SynSmith/}.}
}

A full project description, including the formal problem definition, attribute schema examples, evaluation phases, and the taxonomy of synthetic-data artifacts, is available in synsmith_project_description.md.

License

MIT. See LICENSE.
