Integrate ESBMC-Python as sound formal-verification backend#36
Open
lucasccordeiro wants to merge 2 commits into
EVA's prior pipeline relied on LLM Python-to-C translation followed by ESBMC on the translated C; that translation has no semantic-equivalence guarantee, so it cannot serve as the verdict authority for a no-mistakes verifier. This change adds ESBMC's direct Python front-end as a peer tool (run_esbmc_python) and makes it the sole authority: translation-path findings can only raise SUSPECTED violations, never clear a program. The user-facing verdict is now computed from raw ESBMC outputs by a reconcile-verdicts pipeline, not from the LLM's narration.
Reconstructs a 13-program subset of the AISOLA-2026 Table 2 benchmark (the original was not in the repo), runs it through the new direct ESBMC-Python pipeline, and records verdict + wall-clock. ESBMC-Python agrees with Python semantics on every program (13/13), runs ~14.6x faster per program than the paper's orchestrated pipeline, and correctly clears four "overflow" programs the paper reports as violations — those VIOLATIONs are translation artefacts that exist only after Python-to-C translation narrows int width from int64 to int32.
Summary
Adds ESBMC-Python (ESBMC's direct Python front-end) as a peer formal-verification backend to EVA, and — more importantly — restructures the verdict pipeline so the LLM Python-to-C translation path can no longer act as the verdict authority. The translation path has no semantic-equivalence guarantee, so for a no-mistakes formal verifier it can only suspect bugs; it can never clear a program. ESBMC-Python becomes the sole authority on the Python fragment it accepts.
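The authority rule described above can be illustrated with a minimal sketch. It is not the PR's `reconcile_verdicts`: the `Verdict` enum, the `UNKNOWN` label, and the function name `reconcile_sketch` are hypothetical, and the sketch encodes only the stated rule that the direct backend decides and the translation path can at most raise a suspicion.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    """Hypothetical verdict labels; the PR's real labels may differ."""
    VERIFIED = "VERIFIED"
    VIOLATION = "VIOLATION"
    SUSPECTED = "SUSPECTED"
    UNKNOWN = "UNKNOWN"


def reconcile_sketch(direct: Optional[Verdict],
                     translation: Optional[Verdict]) -> Verdict:
    """Combine a direct (ESBMC-Python) verdict with a translation-path verdict.

    Assumption: `direct` is None or UNKNOWN when the direct backend did not
    produce a usable answer (for example, an unsupported construct).
    """
    # The direct backend is the sole authority whenever it gives an answer.
    if direct in (Verdict.VERIFIED, Verdict.VIOLATION):
        return direct
    # Without a direct answer, a translation-path violation is only a suspicion.
    if translation is Verdict.VIOLATION:
        return Verdict.SUSPECTED
    # A translation-path "verified" can never clear the program.
    return Verdict.UNKNOWN
```

Under these assumptions, a direct VERIFIED overrides a translation-path VIOLATION, which matches the behaviour reported for the four overflow-artefact programs.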
Key changes
New module agent/esbmc_python_backend.py containing:
- probe_python_compatibility(tree): static AST deny-list (async, yield, generator expr, match, lambda, unknown decorators) that hints whether ESBMC-Python will accept the program.
- classify_esbmc_output(output, rc): verdict classifier that handles ESBMC's rc=0-on-failure quirk and detects unsupported-construct errors.
- reconcile_verdicts(direct, translation): the soundness chokepoint. Truth-table:
- invoke_esbmc(filename, …): shared subprocess invoker with streaming, kill-after-timeout, child-reaping, and unrecognised-option retry. Lifted out of _run_esbmc_attempt, which now wraps it.
- build_verify_result(…): composes the user-facing verify() return dict from raw ESBMC outputs via reconcile_verdicts. The LLM narration is purely explanatory and never sets the verdict bool.
New tool run_esbmc_python registered with the Anthropic API, plus orchestrator-prompt updates that route arithmetic / bounds programs to ESBMC-Python first and treat the translation path as a fallback whose findings are SUSPECTED, not authoritative.
_determine_if_verified deleted: the LLM-text classifier is no longer in the trust path.
Capability hint in _analyze_ast: exposes likely_esbmc_python_compatible + python_features_unsupported so the LLM picks the right backend pre-emptively.
Critical fix from code review
Initial implementation keyed the verdict lookup by "esbmc_python", but all_tool_results is keyed by the Anthropic-facing tool name "run_esbmc_python". Every direct-backend verdict would have silently fallen through to the translation-suspect branch, defeating the patch's whole point. Caught by the code-reviewer agent, fixed, and pinned by a regression test that greps the agent source for the registered tool names so any rename re-triggers the test.
Tests
40 tests, zero mocks, all passing:
- latest_esbmc_result, build_verify_result) under the production tool-name keys (5 tests).
- .py inputs (skip-gated on $ESBMC_PATH / esbmc on PATH).
Pylint 9.85 / 10 on the new backend module.
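As an illustration of the static AST deny-list idea behind probe_python_compatibility, here is a minimal sketch. It is not the PR's implementation: the function name `probe_sketch`, the `(compatible, unsupported_features)` return shape, and the `KNOWN_DECORATORS` allow-list are assumptions made for the example.

```python
import ast

# Node types treated as unsupported, following the deny-list named above.
DENIED_NODES = {
    ast.AsyncFunctionDef: "async def",
    ast.AsyncFor: "async for",
    ast.AsyncWith: "async with",
    ast.Await: "await",
    ast.Yield: "yield",
    ast.YieldFrom: "yield from",
    ast.GeneratorExp: "generator expression",
    ast.Lambda: "lambda",
}
# ast.Match only exists on Python 3.10 and later.
if hasattr(ast, "Match"):
    DENIED_NODES[ast.Match] = "match"

# Hypothetical allow-list; the decorators the real backend accepts are not stated.
KNOWN_DECORATORS = frozenset({"staticmethod"})


def probe_sketch(tree):
    """Return (compatible, sorted list of unsupported feature labels)."""
    unsupported = set()
    for node in ast.walk(tree):
        label = DENIED_NODES.get(type(node))
        if label is not None:
            unsupported.add(label)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            for dec in node.decorator_list:
                # Only bare-name decorators on the allow-list are accepted.
                name = dec.id if isinstance(dec, ast.Name) else None
                if name not in KNOWN_DECORATORS:
                    unsupported.add("unknown decorator")
    return (not unsupported, sorted(unsupported))
```

Usage: `probe_sketch(ast.parse(source))` gives a hint only. A program that passes the probe can still be rejected by the verifier itself, which is why the result is exposed as "likely" compatible.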
Experimental results
Reconstructed a 13-program subset of the AISOLA-2026 Table 2 benchmark (the original agent/benchmark/ was not in the repo) and ran it through the new pipeline. All numbers are from python3 agent/benchmark/run_benchmark.py against ESBMC 8.2.0 on macOS aarch64. The old LLM-translation pipeline was not re-run end-to-end (no API key in the environment); numbers in the "paper-reported" column are quoted verbatim from §5.2, §5.4, and Fig. 7 of the paper.
Per-program results
- overflow/pythagorean.py
- overflow/power_function.py: 100^20)
- overflow/safe_addition.py
- overflow/circle_circumference.py
- overflow/multiplication.py
- overflow/factorial.py
- overflow/checksum.py
- bounds/off_by_one.py: i <= n with n == len(arr)
- bounds/dynamic_index.py: idx in [0, len]
- bounds/modulo_zero.py: a % b with b == 0
- bounds/average_zero.py: total // n with n == 0
- bounds/safe_indexing.py
- concurrency/threading_lock.py: run_deadlock_detector per paper §3
Verdict-vs-expected under Python semantics: 13 / 13 (100 %).
Aggregate comparison
Headline finding
Four of the paper's "overflow" detections are translation artefacts, not Python bugs. The programs circle_circumference, multiplication, factorial, and checksum are written with values that overflow only at the int32 boundary introduced by the LLM's Python-to-C translation. Python ints are arbitrary-precision; ESBMC-Python models them as int64, so under Python semantics no overflow occurs and the assertions hold. ESBMC-Python correctly returns VERIFIED.
This was confirmed by two micro-probes establishing the integer width:
- a == 2**31 - 1; b = a + 1 → SUCCESSFUL (so int32 boundaries are not bugs)
- a == 2**63 - 1; b = a + 1 → FAILED with CWE-190 / 191 (genuine int64 overflows are still caught)
This is precisely the failure mode the paper itself flags as a possibility in §3.2 Spurious Faults and Translation Quality and mitigates after the fact by replaying counterexamples through the Python interpreter. The new pipeline avoids them by construction because the sound backend never sees the C int width.
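The arithmetic behind the two micro-probes can be checked in plain Python. The sketch below is illustrative only: the helpers `fits` and `wrap` are hypothetical names, and `wrap` models the two's-complement wraparound that a 32-bit C int typically exhibits (signed overflow is formally undefined behaviour in C).

```python
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def fits(value, bits):
    """True if value is representable as a signed integer of the given width."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


def wrap(value, bits):
    """Reduce value to a signed two's-complement integer of the given width."""
    value &= (1 << bits) - 1
    # If the sign bit is set, the value represents a negative number.
    return value - (1 << bits) if value >> (bits - 1) else value


# First probe: one past the int32 maximum.
b = INT32_MAX + 1
print(fits(b, 32), fits(b, 64))  # False True: overflow only at 32 bits
print(wrap(b, 32))               # -2147483648: what a wrapped C int would hold

# Second probe: one past the int64 maximum.
c = INT64_MAX + 1
print(fits(c, 64))               # False: overflow under the int64 model
```

The first value overflows only under the narrowed 32-bit width, which is the artefact; the second overflows under the 64-bit model as well, which is why it is still reported.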
Caveats
- _stream_subprocess's wall-clock guard only fires on a new stdout line, so a silently-stuck solver would block. This defect predates this patch and was preserved verbatim from the old code; a separate change should switch to selectors.select with a deadline.
Test plan
- cd agent && python3 -m unittest test_esbmc_python_backend test_verdict_pipeline → 40 / 40 OK
- pylint agent/esbmc_python_backend.py → 9.85 / 10
- python3 agent/benchmark/run_benchmark.py → 13 / 13 verdicts agree with Python semantics
- agent/benchmark/ to confirm the orchestrator routes correctly and the LLM never sees an overflow-artefact false positive
Branch: feat/eva-esbmc-python. Closes nothing; no associated issue.
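For the _stream_subprocess caveat listed under Caveats, a deadline-based read might look roughly like the sketch below. It is an assumption-laden illustration, not the planned fix: `run_with_deadline` is a hypothetical name, it uses the standard-library `selectors.DefaultSelector().select(timeout)` call, and it is POSIX-only because selectors cannot wait on pipes on Windows.

```python
import os
import selectors
import subprocess
import sys
import time


def run_with_deadline(cmd, timeout_s):
    """Run cmd, collect its output, and kill it once timeout_s has elapsed.

    Returns (output_text, timed_out). The deadline is enforced even when the
    child prints nothing, because select() wakes up on the timeout as well.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ)
    deadline = time.monotonic() + timeout_s
    chunks = []
    timed_out = False
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                timed_out = True
                break
            # Blocks until output is readable or the remaining time runs out.
            if not sel.select(timeout=remaining):
                continue  # nothing readable; the loop re-checks the deadline
            data = os.read(proc.stdout.fileno(), 4096)
            if not data:
                break  # EOF: the child closed its stdout
            chunks.append(data)
    finally:
        sel.close()
        if timed_out:
            proc.kill()
        proc.stdout.close()
        proc.wait()  # reap the child in every case
    return b"".join(chunks).decode(errors="replace"), timed_out
```

For example, `run_with_deadline([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)` returns with `timed_out` set after roughly half a second, even though the child never writes a line.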