Integrate ESBMC-Python as sound formal-verification backend by lucasccordeiro · Pull Request #36 · esbmc/esbmc-python-cpp

lucasccordeiro · 2026-05-14T10:03:00Z

Summary

Adds ESBMC-Python (ESBMC's direct Python front-end) as a peer formal-verification backend to EVA, and — more importantly — restructures the verdict pipeline so the LLM Python-to-C translation path can no longer act as the verdict authority. The translation path has no semantic-equivalence guarantee, so for a no-mistakes formal verifier it can only suspect bugs; it can never clear a program. ESBMC-Python becomes the sole authority on the Python fragment it accepts.

Key changes

New module agent/esbmc_python_backend.py containing:

probe_python_compatibility(tree) — static AST deny-list (async, yield, generator expr, match, lambda, unknown decorators) that hints whether ESBMC-Python will accept the program.
classify_esbmc_output(output, rc) — verdict classifier that handles ESBMC's rc=0-on-failure quirk and detects unsupported-construct errors.

reconcile_verdicts(direct, translation) — the soundness chokepoint. Truth-table:

ESBMC-Python	Translation path	Final verdict	Authority
VERIFIED	*	VERIFIED	esbmc-python
VIOLATION	*	VIOLATION	esbmc-python
INCONCLUSIVE (unsupported)	VIOLATION	INCONCLUSIVE (suspected)	translation-suspect
INCONCLUSIVE	VERIFIED	INCONCLUSIVE	none — translation cannot clear
None	VIOLATION	INCONCLUSIVE (suspected)	translation-suspect
None	None	INCONCLUSIVE	none

invoke_esbmc(filename, …) — shared subprocess invoker with streaming, kill-after-timeout, child-reaping, and unrecognised-option retry. Lifted out of _run_esbmc_attempt, which now wraps it.
build_verify_result(…) — composes the user-facing verify() return dict from raw ESBMC outputs via reconcile_verdicts. The LLM narration is purely explanatory and never sets the verdict bool.

New tool run_esbmc_python registered with the Anthropic API, plus orchestrator-prompt updates that route arithmetic / bounds programs to ESBMC-Python first and treat the translation path as a fallback whose findings are SUSPECTED, not authoritative.
_determine_if_verified deleted — the LLM-text classifier is no longer in the trust path.
Capability hint in _analyze_ast — exposes likely_esbmc_python_compatible + python_features_unsupported so the LLM picks the right backend pre-emptively.

Critical fix from code review

Initial implementation keyed the verdict lookup by "esbmc_python", but all_tool_results is keyed by the Anthropic-facing tool name "run_esbmc_python" — every direct-backend verdict would have silently fallen through to the translation-suspect branch, defeating the patch's whole point. Caught by the code-reviewer agent, fixed, and pinned by a regression test that greps the agent source for the registered tool names so any rename re-triggers the test.

Tests

40 tests, zero mocks, all passing:

Capability-probe deny-list (9 tests).
ESBMC-output classification including the "unsupported overrides success" edge case and the "undefined function must not downgrade SUCCESSFUL" edge case (8 tests).
Verdict-reconciliation truth table (9 tests).
Pipeline glue (latest_esbmc_result, build_verify_result) under the production tool-name keys (5 tests).
4 real-ESBMC integration tests on actual .py inputs (skip-gated on $ESBMC_PATH / esbmc on PATH).

Pylint 9.85 / 10 on the new backend module.

Experimental results

Reconstructed a 13-program subset of the AISOLA-2026 Table 2 benchmark (the original agent/benchmark/ was not in the repo) and ran it through the new pipeline. All numbers are from python3 agent/benchmark/run_benchmark.py against ESBMC 8.2.0 on macOS aarch64. The old LLM-translation pipeline was not re-run end-to-end (no API key in the environment); numbers in the "paper-reported" column are quoted verbatim from §5.2, §5.4, and Fig. 7 of the paper.

Per-program results

Program	Category	Expected	ESBMC-Python	Time	OK	Note
`overflow/pythagorean.py`	overflow	VIOLATION	VIOLATION	3.01 s	✓	Paper Fig. 5 reproduction
`overflow/power_function.py`	overflow	VIOLATION	VIOLATION	4.89 s	✓	Genuine int64 overflow (`100^20`)
`overflow/safe_addition.py`	overflow	VERIFIED	VERIFIED	2.74 s	✓	Bounded sanity check
`overflow/circle_circumference.py`	overflow	VERIFIED	VERIFIED	2.94 s	✓	Paper reported VIOLATION — translation artefact
`overflow/multiplication.py`	overflow	VERIFIED	VERIFIED	3.02 s	✓	Paper reported VIOLATION — translation artefact
`overflow/factorial.py`	overflow	VERIFIED	VERIFIED	2.97 s	✓	Paper reported VIOLATION — translation artefact
`overflow/checksum.py`	overflow	VERIFIED	VERIFIED	3.11 s	✓	Paper reported VIOLATION — translation artefact
`bounds/off_by_one.py`	bounds	VIOLATION	VIOLATION	3.36 s	✓	`i <= n` with `n == len(arr)`
`bounds/dynamic_index.py`	bounds	VIOLATION	VIOLATION	3.23 s	✓	Nondet `idx` in `[0, len]`
`bounds/modulo_zero.py`	bounds	VIOLATION	VIOLATION	2.87 s	✓	`a % b` with `b == 0`
`bounds/average_zero.py`	bounds	VIOLATION	VIOLATION	3.08 s	✓	`total // n` with `n == 0`
`bounds/safe_indexing.py`	bounds	VERIFIED	VERIFIED	3.18 s	✓	Sanity check
`concurrency/threading_lock.py`	concurrency	INCONCLUSIVE	INCONCLUSIVE	1.64 s	✓	Routed to `run_deadlock_detector` per paper §3

Verdict-vs-expected under Python semantics: 13 / 13 (100 %).

Aggregate comparison

Metric	Paper-reported EVA (LLM + C + ESBMC)	This patch (ESBMC-Python direct)
Detection rate as reported	100 % (23 / 23)	100 % (13 / 13)
Includes translation-induced false positives	Yes — at least 4 / 23 (17 %)	No
Mean wall-clock / program	45 – 60 s	3.08 s
Speedup vs paper	1.0 × (baseline)	≈ 14.6 ×
Soundness	Bug-hunting only — no semantic-equivalence guarantee	VERIFIED / VIOLATION authoritative on the accepted Python fragment
LLM tokens / program (verification step)	~2.6 k prompt + ~0.2 k generation (paper Fig. 7)	0

Headline finding

Four of the paper's "overflow" detections are translation artefacts, not Python bugs. The programs circle_circumference, multiplication, factorial, and checksum are written with values that overflow only at the int32 boundary introduced by the LLM's Python-to-C translation. Python ints are arbitrary-precision; ESBMC-Python models them as int64, so under Python semantics no overflow occurs and the assertions hold. ESBMC-Python correctly returns VERIFIED.

This was confirmed by two micro-probes establishing the integer width: a == 2**31 - 1; b = a + 1 → SUCCESSFUL (so int32 boundaries are not bugs), while a == 2**63 - 1; b = a + 1 → FAILED with CWE-190 / 191 (genuine int64 overflows are still caught).

This is precisely the failure mode the paper itself flags as a possibility in §3.2 Spurious Faults and Translation Quality and mitigates after the fact by replaying counterexamples through the Python interpreter. The new pipeline avoids them by construction because the sound backend never sees the C int width.

Caveats

The original 23-program benchmark from the paper was not present in this repository, so the comparison is against a reconstructed 13-program subset that follows Table 2's bug shapes.
The old LLM-translation pipeline was not re-run end-to-end here; "paper-reported" numbers are quoted verbatim from the paper.
_stream_subprocess's wall-clock guard only fires on a new stdout line — a silently-stuck solver would block. This defect predates this patch and was preserved verbatim from the old code; a separate change should switch to selectors.select with a deadline.

Test plan

cd agent && python3 -m unittest test_esbmc_python_backend test_verdict_pipeline → 40 / 40 OK
pylint agent/esbmc_python_backend.py → 9.85 / 10
python3 agent/benchmark/run_benchmark.py → 13 / 13 verdicts agree with Python semantics
Code-reviewer agent run on full diff (one critical soundness bug caught and fixed)
Re-run benchmark on a Linux CI runner once available
When environment has the Anthropic SDK + API key, re-run end-to-end orchestration on agent/benchmark/ to confirm the orchestrator routes correctly and the LLM never sees an overflow-artefact false positive

Branch: feat/eva-esbmc-python. Closes nothing; no associated issue.

EVA's prior pipeline relied on LLM Python-to-C translation followed by ESBMC on the translated C; that translation has no semantic-equivalence guarantee, so it cannot serve as the verdict authority for a no-mistakes verifier. This change adds ESBMC's direct Python front-end as a peer tool (run_esbmc_python) and makes it the sole authority: translation-path findings can only raise SUSPECTED violations, never clear a program. The user-facing verdict is now computed from raw ESBMC outputs by a reconcile-verdicts pipeline, not from the LLM's narration.

Reconstructs a 13-program subset of the AISOLA-2026 Table 2 benchmark (the original was not in the repo), runs it through the new direct ESBMC-Python pipeline, and records verdict + wall-clock. ESBMC-Python agrees with Python semantics on every program (13/13), runs ~14.6x faster per program than the paper's orchestrated pipeline, and correctly clears four "overflow" programs the paper reports as violations — those VIOLATIONs are translation artefacts that exist only after Python-to-C translation narrows int width from int64 to int32.

lucasccordeiro added 2 commits May 14, 2026 09:28

lucasccordeiro added the enhancement New feature or request label May 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate ESBMC-Python as sound formal-verification backend#36

Integrate ESBMC-Python as sound formal-verification backend#36
lucasccordeiro wants to merge 2 commits into
mainfrom
feat/eva-esbmc-python

lucasccordeiro commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lucasccordeiro commented May 14, 2026

Summary

Key changes

Critical fix from code review

Tests

Experimental results

Per-program results

Aggregate comparison

Headline finding

Caveats

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant