Skip to content

Integrate ESBMC-Python as sound formal-verification backend#36

Open
lucasccordeiro wants to merge 2 commits into
mainfrom
feat/eva-esbmc-python
Open

Integrate ESBMC-Python as sound formal-verification backend#36
lucasccordeiro wants to merge 2 commits into
mainfrom
feat/eva-esbmc-python

Conversation

@lucasccordeiro
Copy link
Copy Markdown
Contributor

Summary

Adds ESBMC-Python (ESBMC's direct Python front-end) as a peer formal-verification backend to EVA, and — more importantly — restructures the verdict pipeline so the LLM Python-to-C translation path can no longer act as the verdict authority. The translation path has no semantic-equivalence guarantee, so for a no-mistakes formal verifier it can only suspect bugs; it can never clear a program. ESBMC-Python becomes the sole authority on the Python fragment it accepts.

Key changes

  • New module agent/esbmc_python_backend.py containing:

    • probe_python_compatibility(tree) — static AST deny-list (async, yield, generator expr, match, lambda, unknown decorators) that hints whether ESBMC-Python will accept the program.

    • classify_esbmc_output(output, rc) — verdict classifier that handles ESBMC's rc=0-on-failure quirk and detects unsupported-construct errors.

    • reconcile_verdicts(direct, translation) — the soundness chokepoint. Truth-table:

      ESBMC-Python Translation path Final verdict Authority
      VERIFIED * VERIFIED esbmc-python
      VIOLATION * VIOLATION esbmc-python
      INCONCLUSIVE (unsupported) VIOLATION INCONCLUSIVE (suspected) translation-suspect
      INCONCLUSIVE VERIFIED INCONCLUSIVE none — translation cannot clear
      None VIOLATION INCONCLUSIVE (suspected) translation-suspect
      None None INCONCLUSIVE none
    • invoke_esbmc(filename, …) — shared subprocess invoker with streaming, kill-after-timeout, child-reaping, and unrecognised-option retry. Lifted out of _run_esbmc_attempt, which now wraps it.

    • build_verify_result(…) — composes the user-facing verify() return dict from raw ESBMC outputs via reconcile_verdicts. The LLM narration is purely explanatory and never sets the verdict bool.

  • New tool run_esbmc_python registered with the Anthropic API, plus orchestrator-prompt updates that route arithmetic / bounds programs to ESBMC-Python first and treat the translation path as a fallback whose findings are SUSPECTED, not authoritative.

  • _determine_if_verified deleted — the LLM-text classifier is no longer in the trust path.

  • Capability hint in _analyze_ast — exposes likely_esbmc_python_compatible + python_features_unsupported so the LLM picks the right backend pre-emptively.

Critical fix from code review

Initial implementation keyed the verdict lookup by "esbmc_python", but all_tool_results is keyed by the Anthropic-facing tool name "run_esbmc_python" — every direct-backend verdict would have silently fallen through to the translation-suspect branch, defeating the patch's whole point. Caught by the code-reviewer agent, fixed, and pinned by a regression test that greps the agent source for the registered tool names so any rename re-triggers the test.

Tests

40 tests, zero mocks, all passing:

  • Capability-probe deny-list (9 tests).
  • ESBMC-output classification including the "unsupported overrides success" edge case and the "undefined function must not downgrade SUCCESSFUL" edge case (8 tests).
  • Verdict-reconciliation truth table (9 tests).
  • Pipeline glue (latest_esbmc_result, build_verify_result) under the production tool-name keys (5 tests).
  • 4 real-ESBMC integration tests on actual .py inputs (skip-gated on $ESBMC_PATH / esbmc on PATH).

Pylint 9.85 / 10 on the new backend module.

Experimental results

Reconstructed a 13-program subset of the AISOLA-2026 Table 2 benchmark (the original agent/benchmark/ was not in the repo) and ran it through the new pipeline. All numbers are from python3 agent/benchmark/run_benchmark.py against ESBMC 8.2.0 on macOS aarch64. The old LLM-translation pipeline was not re-run end-to-end (no API key in the environment); numbers in the "paper-reported" column are quoted verbatim from §5.2, §5.4, and Fig. 7 of the paper.

Per-program results

Program Category Expected ESBMC-Python Time OK Note
overflow/pythagorean.py overflow VIOLATION VIOLATION 3.01 s Paper Fig. 5 reproduction
overflow/power_function.py overflow VIOLATION VIOLATION 4.89 s Genuine int64 overflow (100^20)
overflow/safe_addition.py overflow VERIFIED VERIFIED 2.74 s Bounded sanity check
overflow/circle_circumference.py overflow VERIFIED VERIFIED 2.94 s Paper reported VIOLATION — translation artefact
overflow/multiplication.py overflow VERIFIED VERIFIED 3.02 s Paper reported VIOLATION — translation artefact
overflow/factorial.py overflow VERIFIED VERIFIED 2.97 s Paper reported VIOLATION — translation artefact
overflow/checksum.py overflow VERIFIED VERIFIED 3.11 s Paper reported VIOLATION — translation artefact
bounds/off_by_one.py bounds VIOLATION VIOLATION 3.36 s i <= n with n == len(arr)
bounds/dynamic_index.py bounds VIOLATION VIOLATION 3.23 s Nondet idx in [0, len]
bounds/modulo_zero.py bounds VIOLATION VIOLATION 2.87 s a % b with b == 0
bounds/average_zero.py bounds VIOLATION VIOLATION 3.08 s total // n with n == 0
bounds/safe_indexing.py bounds VERIFIED VERIFIED 3.18 s Sanity check
concurrency/threading_lock.py concurrency INCONCLUSIVE INCONCLUSIVE 1.64 s Routed to run_deadlock_detector per paper §3

Verdict-vs-expected under Python semantics: 13 / 13 (100 %).

Aggregate comparison

Metric Paper-reported EVA (LLM + C + ESBMC) This patch (ESBMC-Python direct)
Detection rate as reported 100 % (23 / 23) 100 % (13 / 13)
Includes translation-induced false positives Yes — at least 4 / 23 (17 %) No
Mean wall-clock / program 45 – 60 s 3.08 s
Speedup vs paper 1.0 × (baseline) ≈ 14.6 ×
Soundness Bug-hunting only — no semantic-equivalence guarantee VERIFIED / VIOLATION authoritative on the accepted Python fragment
LLM tokens / program (verification step) ~2.6 k prompt + ~0.2 k generation (paper Fig. 7) 0

Headline finding

Four of the paper's "overflow" detections are translation artefacts, not Python bugs. The programs circle_circumference, multiplication, factorial, and checksum are written with values that overflow only at the int32 boundary introduced by the LLM's Python-to-C translation. Python ints are arbitrary-precision; ESBMC-Python models them as int64, so under Python semantics no overflow occurs and the assertions hold. ESBMC-Python correctly returns VERIFIED.

This was confirmed by two micro-probes establishing the integer width: a == 2**31 - 1; b = a + 1 → SUCCESSFUL (so int32 boundaries are not bugs), while a == 2**63 - 1; b = a + 1 → FAILED with CWE-190 / 191 (genuine int64 overflows are still caught).

This is precisely the failure mode the paper itself flags as a possibility in §3.2 Spurious Faults and Translation Quality and mitigates after the fact by replaying counterexamples through the Python interpreter. The new pipeline avoids them by construction because the sound backend never sees the C int width.

Caveats

  • The original 23-program benchmark from the paper was not present in this repository, so the comparison is against a reconstructed 13-program subset that follows Table 2's bug shapes.
  • The old LLM-translation pipeline was not re-run end-to-end here; "paper-reported" numbers are quoted verbatim from the paper.
  • _stream_subprocess's wall-clock guard only fires on a new stdout line — a silently-stuck solver would block. This defect predates this patch and was preserved verbatim from the old code; a separate change should switch to selectors.select with a deadline.

Test plan

  • cd agent && python3 -m unittest test_esbmc_python_backend test_verdict_pipeline → 40 / 40 OK
  • pylint agent/esbmc_python_backend.py → 9.85 / 10
  • python3 agent/benchmark/run_benchmark.py → 13 / 13 verdicts agree with Python semantics
  • Code-reviewer agent run on full diff (one critical soundness bug caught and fixed)
  • Re-run benchmark on a Linux CI runner once available
  • When environment has the Anthropic SDK + API key, re-run end-to-end orchestration on agent/benchmark/ to confirm the orchestrator routes correctly and the LLM never sees an overflow-artefact false positive

Branch: feat/eva-esbmc-python. Closes nothing; no associated issue.

EVA's prior pipeline relied on LLM Python-to-C translation followed by
ESBMC on the translated C; that translation has no semantic-equivalence
guarantee, so it cannot serve as the verdict authority for a no-mistakes
verifier. This change adds ESBMC's direct Python front-end as a peer tool
(run_esbmc_python) and makes it the sole authority: translation-path
findings can only raise SUSPECTED violations, never clear a program. The
user-facing verdict is now computed from raw ESBMC outputs by a
reconcile-verdicts pipeline, not from the LLM's narration.
Reconstructs a 13-program subset of the AISOLA-2026 Table 2 benchmark
(the original was not in the repo), runs it through the new direct
ESBMC-Python pipeline, and records verdict + wall-clock. ESBMC-Python
agrees with Python semantics on every program (13/13), runs ~14.6x
faster per program than the paper's orchestrated pipeline, and
correctly clears four "overflow" programs the paper reports as
violations — those VIOLATIONs are translation artefacts that exist
only after Python-to-C translation narrows int width from int64 to
int32.
@lucasccordeiro lucasccordeiro added the enhancement New feature or request label May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant