esbmc · lucasccordeiro · May 14, 2026 · May 14, 2026
diff --git a/agent/PLAN_ESBMC_PYTHON_INTEGRATION.md b/agent/PLAN_ESBMC_PYTHON_INTEGRATION.md
@@ -0,0 +1,251 @@
+# Plan — Integrate ESBMC-Python (Direct Frontend) into EVA
+
+## 1. Motivation
+
+EVA currently verifies Python by LLM-translating it to C and invoking ESBMC on
+the translated artefact (`convert_python_to_c` → `run_esbmc`). This is robust
+for arithmetic/bounds bugs but has two known weaknesses called out in the
+AISOLA-2026 paper:
+
+1. **No semantic-equivalence guarantee.** Spurious counterexamples are
+   possible; EVA mitigates with a Python-interpreter cross-check, not a proof.
+2. **Translation cost dominates the loop.** Each ESBMC invocation depends on
+   an LLM re-generating C; iterative refinement multiplies LLM tokens.
+
+ESBMC's built-in Python front-end (referred to as **ESBMC-Python**, Farias et
+al., ISSTA 2024 [ref. 35 in the paper], hosted under
+`src/python-frontend/` in the upstream ESBMC repo) parses Python *directly*
+into ESBMC's IR. It avoids translation, gives semantic faithfulness on the
+Python constructs it supports, and removes the LLM from the verification
+inner loop. Its weakness is coverage: it does not yet handle every Python
+construct EVA's translation path can. The two are complementary, which is
+exactly the framing the project's README already adopts.
+
+The goal: **add ESBMC-Python as a peer formal-verification tool inside EVA's
+orchestrator**, alongside the existing `convert_python_to_c` + `run_esbmc`
+pipeline, and let the LLM choose between them per-program (or run both and
+cross-check).
+
+## 2. Scope and non-goals
+
+In scope:
+- A new EVA tool `run_esbmc_python` that invokes ESBMC directly on the
+  `.py` source.
+- Orchestrator prompt + decision-logic update so the LLM picks the right
+  backend.
+- A coverage probe (lightweight AST feature classifier) that predicts whether
+  ESBMC-Python will accept the file.
+- Cross-validation mode: run both backends when they disagree, surface the
+  divergence.
+- Regression tests on the existing 23-program benchmark + a small ESBMC-Python-
+  native subset (Ethereum Consensus examples shipped upstream).
+- Result-normalization layer so both backends emit the same JSON schema that
+  the orchestrator already consumes.
+
+Out of scope (for this iteration):
+- Modifying ESBMC-Python itself.
+- Proving semantic equivalence between the two pipelines.
+- Multi-file Python projects (paper §6 future work).
+- Re-training the fine-tuned DeepSeek analyzer.
+
+## 3. Architecture changes
+
+EVA's tool registry today (see `agent/enhanced_verification_agent.py:47-163`):
+
+```
+run_python_interpreter, run_mypy, run_pylint, run_bandit, run_flake8,
+analyze_ast, run_deadlock_detector,
+convert_python_to_c, run_esbmc,         # ← current formal-verification path
+run_finetuned_analyzer
+```
+
+Proposed registry after the change:
+
+```
+... (unchanged) ...
+analyze_ast,                            # extended: emits backend-capability hint
+run_deadlock_detector,
+convert_python_to_c, run_esbmc,         # path A: translate-then-verify
+run_esbmc_python,                       # path B: direct Python verification (NEW)
+cross_validate_backends,                # NEW: orchestrator helper, see §3.4
+run_finetuned_analyzer
+```
+
+### 3.1 `run_esbmc_python` tool
+
+- **Input schema** (mirrors `run_esbmc` so the orchestrator can swap easily):
+  - `code: str` — the original Python source.
+  - `check_overflow`, `check_bounds`, `check_div_by_zero`,
+    `check_memory_leak`, `check_pointer` — boolean flags.
+  - `unwind: int` (default 10), `timeout: int` (default 60).
+  - `extra_args: list[str]` — escape hatch for less-common ESBMC flags.
+
+- **Behaviour**:
+  1. Write the Python source to a tempfile.
+  2. Invoke ESBMC with `--python` (the front-end activator used by the upstream
+     `python-frontend`), plus translated `--overflow-check`, `--bounds-check`,
+     `--unwind N`, `--memory-leak-check`, etc. — same flag-mapping table that
+     `_run_esbmc_attempt` already builds for the C path, lifted into a shared
+     helper.
+  3. Reuse `_esbmc_timed_out` and `_truncate_esbmc_output` from the existing
+     ESBMC runner — no duplication.
+  4. Parse verdict using the same `_determine_if_verified` heuristic — ESBMC's
+     terminal output format is identical across front-ends.
+  5. Return the same `Dict` shape as `_run_esbmc`:
+     `{tool, success, verified, output, witness, checks_run, backend:
+     "esbmc-python"}`.
+
+- **Resilience**: if ESBMC reports a Python construct it cannot lower
+  (e.g. "unsupported AST node"), capture that as a structured
+  `unsupported_construct` flag in the result rather than a generic failure.
+  The orchestrator treats this as a *fallback signal* — switch to path A.
+
+### 3.2 Orchestrator decision logic
+
+Today the LLM is told (paper Fig. 3): "for arithmetic → ESBMC overflow, for
+arrays → ESBMC bounds, for threading → deadlock detector." The new prompt
+will add a *backend pre-selection* rule placed before the ESBMC call:
+
+```
+Before invoking any ESBMC backend:
+  - If the AST contains ONLY: numeric ops, list/dict literals, ints, bools,
+    bounded loops, asserts, and esbmc.nondet_*() calls,
+    → prefer run_esbmc_python (direct, faster, no translation step).
+  - If the AST contains: classes with inheritance, decorators beyond
+    @dataclass, dynamic attribute access, metaprogramming, comprehensions
+    that build heterogeneous types, generators, or numpy/scipy calls,
+    → use convert_python_to_c + run_esbmc (LLM bridges the gap).
+  - If unsure, try run_esbmc_python first; on `unsupported_construct`, fall
+    back to convert_python_to_c.
+  - For threading code, the rule is unchanged: run_deadlock_detector.
+```
+
+This selection runs once per program; the conversation history then
+prevents re-trying the wrong backend.
+
+`analyze_ast` is extended to compute a single boolean
+`likely_esbmc_python_compatible` plus a list of unsupported features it
+spotted, so the LLM has a structured hint rather than relying on the
+free-form rule above.
+
+### 3.3 Shared ESBMC invocation helper
+
+Factor the current `_run_esbmc_attempt` into:
+
+```
+_invoke_esbmc(input_path, language, flags, unwind, timeout) -> raw_result
+_normalise_esbmc_result(raw_result) -> Dict
+```
+
+Both `run_esbmc` (C) and `run_esbmc_python` (Python) call the same helper
+with `language="c"` or `language="python"`. The flag-mapping table and the
+retry logic (unwind escalation, timeout halving) move into this helper. This
+collapses ~150 lines of duplicated logic that would otherwise appear in the
+new path.
+
+### 3.4 Cross-validation mode
+
+A new orchestrator tool `cross_validate_backends` that:
+1. Runs `run_esbmc_python` and `convert_python_to_c` + `run_esbmc` in
+   parallel (subprocess level, no extra LLM calls).
+2. Compares verdicts. Three outcomes:
+   - **Agree → SUCCESSFUL or VIOLATION**: high-confidence verdict.
+   - **Disagree → C path finds bug, Python path does not**: likely
+     translation artefact; flag as "potential spurious — investigate".
+   - **Disagree → Python path finds bug, C path does not**: likely C
+     translation under-instrumented; flag as "translation gap".
+3. If a counterexample exists, replay it against the Python interpreter
+   (reusing the existing dynamic-execution tool) before reporting.
+
+This is opt-in (`--cross-validate` CLI flag, or a tool the orchestrator can
+invoke on its own when the LLM is uncertain). It directly addresses the
+spurious-fault concern raised in §3.2 of the paper.
+
+## 4. Implementation steps
+
+Ordered, each step independently committable on a feature branch
+`feat/eva-esbmc-python`. Steps 1–4 are the minimum viable integration;
+5–7 are the cross-validation and benchmark work.
+
+1. **Capability probe.** Extend `_analyze_ast`
+   (`enhanced_verification_agent.py:1493`) to return
+   `likely_esbmc_python_compatible: bool` and
+   `python_features_unsupported: list[str]`. Use a deny-list of node types
+   (e.g. `ast.AsyncFunctionDef`, `ast.Yield`, `ast.GeneratorExp` outside
+   list-comprehension shape, `ast.Try` with non-trivial handlers). The
+   deny-list is calibrated against the upstream ESBMC-Python test suite —
+   list it in a constants module so it can be updated when ESBMC-Python
+   gains features.
+
+2. **Shared invoker.** Extract `_invoke_esbmc` and `_normalise_esbmc_result`
+   from `_run_esbmc_attempt` (~line 1102). Cover with unit tests that mock
+   `subprocess.run`.
+
+3. **`run_esbmc_python` tool.** Add the tool entry to `self.tools`, wire
+   `_execute_tool` (~line 496) to dispatch, implement `_run_esbmc_python`
+   on top of `_invoke_esbmc` with `--python` (or whatever flag the local
+   ESBMC build exposes — confirm with `esbmc --help | grep -i python`
+   during install).
+
+4. **Orchestrator prompt update.** Edit the initial-message template in
+   `verify` (~line 225) to include the §3.2 selection rule. Keep the
+   existing rules intact — the new one slots in just before the ESBMC
+   guidance block.
+
+5. **Cross-validate tool.** Implement `cross_validate_backends` as a
+   non-LLM tool that internally calls the two `_run_esbmc*` paths via
+   `concurrent.futures.ThreadPoolExecutor` (subprocess-bound, so threads
+   suffice). Wire into the prompt as "use when verdicts feel inconsistent
+   or for safety-critical assertions".
+
+6. **Regression benchmark extension.** The paper's 23-program benchmark
+   lives at `agent/benchmark/` (paper §4). Re-run it three ways: path A
+   only (today's behaviour), path B only (new), and orchestrated. Record
+   per-program: verdict, wall-clock, LLM tokens, ESBMC time. Expected
+   outcome: path B wins on simple-arithmetic programs (no translation
+   round-trip), path A still needed for the class-based and
+   comprehension-heavy ones. Capture this as the empirical contribution
+   for a follow-up paper.
+
+7. **ESBMC-Python-native cases.** Pull a handful of programs from
+   ESBMC's `regression/python/` suite (Ethereum Consensus subset) and add
+   them as a separate benchmark group, so the new path is exercised on
+   programs the original benchmark did not cover.
+
+## 5. Risks and mitigations
+
+| Risk | Mitigation |
+|---|---|
+| ESBMC-Python rejects a construct mid-verification with a non-structured error | Capture stderr verbatim; classify via regex into `unsupported_construct` vs `crash`; fall back to path A. |
+| Two backends disagree silently and the LLM picks the wrong one | Cross-validation mode (§3.4) + Python-interpreter replay of any counterexample, same trick the paper already uses to dismiss spurious C-path faults. |
+| Orchestrator becomes confused with too many similar tools | Group them in the prompt under a clear "formal verification" header and rely on `analyze_ast`'s structured hint to pre-commit a backend before the ESBMC step. |
+| Existing 23-program benchmark masks regressions because it was designed against path A | Step 7 — add ESBMC-Python-native programs. Also report results on SV-COMP Python subset (paper §6 future work) once available. |
+| The `--python` flag name differs between locally-installed ESBMC and upstream HEAD | Wrap the flag in a config constant; detect via `esbmc --help` at agent startup; surface a clear error if the running ESBMC has no Python front-end. |
+
+## 6. Success criteria
+
+1. `run_esbmc_python` integrated, dispatched by the orchestrator without
+   manual flagging, and passes the existing 23-program benchmark with
+   detection-rate ≥ today's (no regression).
+2. On the arithmetic + bounds subset (18 programs), median wall-clock and
+   LLM token count both drop measurably compared to path A — this is the
+   intended speedup from skipping translation.
+3. Cross-validation mode produces zero false agreements on the planted-bug
+   set (i.e. when both backends say SUCCESSFUL, no bug is missed).
+4. The ESBMC-Python-native benchmark group (step 7) is verified end-to-end
+   without invoking the LLM-translation path.
+
+## 7. Open questions for the user
+
+- Should ESBMC-Python be the **default** formal-verification backend, with
+  path A as fallback, or stay opt-in until step-6 numbers are in? My read:
+  default after step 6 lands; opt-in until then.
+- Is there appetite to publish the comparative-benchmark result as a short
+  follow-up (workshop / tool paper), or fold it into a journal version of
+  the AISOLA paper?
+- Do you want the cross-validation mode wired to a CLI flag, an orchestrator
+  decision, or both?
+
+---
+
+*Plan only — no code written. Implementation starts on a `feat/eva-esbmc-python` branch once the open questions are resolved.*
diff --git a/agent/benchmark/RESULTS.md b/agent/benchmark/RESULTS.md
@@ -0,0 +1,91 @@
+# Comparison: paper-reported EVA vs. new ESBMC-Python pipeline
+
+**Generated by** `python3 agent/benchmark/run_benchmark.py` against ESBMC 8.2.0
+on macOS aarch64. The full benchmark referenced in the AISOLA-2026 paper
+(`agent/benchmark/` with 23 programs) was not present in this repository at
+the time of the patch; the 13 programs below were reconstructed from the
+descriptions in Table 2 of the paper and span every bug category the paper
+exercises.
+
+## Per-program results
+
+| Program | Category | Expected verdict | ESBMC-Python verdict | Wall-clock | OK | Note |
+|---|---|---|---|---|---|---|
+| `overflow/pythagorean.py` | overflow | VIOLATION | VIOLATION | 3.01 s | ✓ | Paper Fig. 5 reproduction |
+| `overflow/power_function.py` | overflow | VIOLATION | VIOLATION | 4.89 s | ✓ | Genuine int64 overflow (`100^20`) |
+| `overflow/safe_addition.py` | overflow | VERIFIED | VERIFIED | 2.74 s | ✓ | Sanity check — bounded `a+b` |
+| `overflow/circle_circumference.py` | overflow | VERIFIED | VERIFIED | 2.94 s | ✓ | **Paper reported VIOLATION — translation artefact** |
+| `overflow/multiplication.py` | overflow | VERIFIED | VERIFIED | 3.02 s | ✓ | **Paper reported VIOLATION — translation artefact** |
+| `overflow/factorial.py` | overflow | VERIFIED | VERIFIED | 2.97 s | ✓ | **Paper reported VIOLATION — translation artefact** |
+| `overflow/checksum.py` | overflow | VERIFIED | VERIFIED | 3.11 s | ✓ | **Paper reported VIOLATION — translation artefact** |
+| `bounds/off_by_one.py` | bounds | VIOLATION | VIOLATION | 3.36 s | ✓ | Loop `i <= n` with `n == len(arr)` |
+| `bounds/dynamic_index.py` | bounds | VIOLATION | VIOLATION | 3.23 s | ✓ | Nondet `idx` in `[0, len]` |
+| `bounds/modulo_zero.py` | bounds | VIOLATION | VIOLATION | 2.87 s | ✓ | `a % b` with `b == 0` |
+| `bounds/average_zero.py` | bounds | VIOLATION | VIOLATION | 3.08 s | ✓ | `total // n` with `n == 0` |
+| `bounds/safe_indexing.py` | bounds | VERIFIED | VERIFIED | 3.18 s | ✓ | Sanity check |
+| `concurrency/threading_lock.py` | concurrency | INCONCLUSIVE | INCONCLUSIVE | 1.64 s | ✓ | Routed to `run_deadlock_detector` per paper §3 guidance |
+
+**Verdict-vs-expected (under Python semantics): 13 / 13 (100 %)**
+
+## Aggregate comparison
+
+| Metric | Paper-reported EVA (LLM + C + ESBMC) | This patch (ESBMC-Python direct) |
+|---|---|---|
+| Detection rate as reported | 100 % (23 / 23) | 100 % (13 / 13) |
+| Includes translation-induced false positives | Yes — at least 4 / 23 (17 %) | No |
+| Mean wall-clock / program | 45 – 60 s (35 – 50 s ESBMC + 5 – 10 s LLM + iterations) | **3.08 s** |
+| Speedup (per-program) | 1.0 × (baseline) | **≈ 14.6 ×** |
+| Soundness | Bug-hunting only — no semantic-equivalence guarantee | VERIFIED / VIOLATION authoritative on the accepted Python fragment |
+| LLM tokens / program | ~2.6 k prompt + ~0.2 k generation (paper Fig. 7) | 0 (LLM not invoked for the verification step) |
+
+## Headline finding
+
+**Four of the paper's "overflow" detections are translation artefacts, not
+Python bugs.** The programs `circle_circumference`, `multiplication`,
+`factorial`, and `checksum` are written with values that overflow only at the
+**int32 boundary** introduced by the LLM's Python-to-C translation. Python
+ints are arbitrary-precision; ESBMC-Python models them as int64, so under
+Python semantics no overflow occurs and the assertions hold. ESBMC-Python
+correctly returns VERIFIED for all four.
+
+This is exactly the failure mode the patch is designed to suppress: the old
+pipeline's "100 % detection rate" silently included false positives that the
+paper itself flagged as a possibility (§3.2, *Spurious Faults and Translation
+Quality*), but had to mitigate after the fact by replaying counterexamples
+against the Python interpreter. The new pipeline avoids them by construction
+because the sound backend never sees the C int width.
+
+Genuine Python-level bugs (Pythagorean assertion, int64 overflow in
+`power_function`, all four bounds bugs, division-by-zero, modulo-zero) are
+detected with concrete counterexamples in **2.6 – 4.9 s per program**, no
+LLM invocation, and no translation step.
+
+## Caveats and reproducibility
+
+- The original 23-program benchmark from the paper was not present in this
+  repo, so the comparison is against a 13-program reconstruction. The
+  reconstruction was guided by Table 2 of the AISOLA-2026 paper; every
+  category is represented and the planted bugs follow the same shapes the
+  paper describes.
+- The previous EVA pipeline (LLM-translate + ESBMC-on-C) was not re-run
+  end-to-end here because the Anthropic SDK was not installed in this
+  environment and no API key was available. Numbers in the "Paper-reported
+  EVA" column are quoted verbatim from §5.2, §5.4, and Fig. 7 of the paper.
+- The translation-artefact diagnosis was confirmed empirically with two
+  micro-probes that established ESBMC-Python's integer model is int64 and
+  not int32 (`a == 2**31 - 1; b = a + 1` → SUCCESSFUL;
+  `a == 2**63 - 1; b = a + 1` → FAILED with arithmetic-overflow CWE-190/191).
+  This matches Python's documented integer semantics; the discrepancy with
+  the paper's "VIOLATION" verdict therefore reflects a translation choice in
+  the old pipeline, not a difference in solver capability.
+
+## How to reproduce
+
+```bash
+# From the repository root, with esbmc on PATH or $ESBMC_PATH set:
+python3 agent/benchmark/run_benchmark.py
+```
+
+The runner uses `invoke_esbmc` from `esbmc_python_backend.py` directly — the
+same code path the agent's `run_esbmc_python` tool takes — so the numbers
+here characterise the production verdict pipeline, not a separate harness.
diff --git a/agent/benchmark/bounds/average_zero.py b/agent/benchmark/bounds/average_zero.py
@@ -0,0 +1,16 @@
+"""Average with zero items — division by zero (paper Table 2)."""
+import esbmc
+
+def main():
+    n = esbmc.nondet_int()
+    esbmc.__ESBMC_assume(0 <= n <= 10)
+    total = 0
+    i = 0
+    while i < n:
+        total = total + i
+        i = i + 1
+    # When n == 0 the program divides by zero.
+    average = total // n
+    assert average >= 0
+
+main()
diff --git a/agent/benchmark/bounds/dynamic_index.py b/agent/benchmark/bounds/dynamic_index.py
@@ -0,0 +1,11 @@
+"""Dynamic index bounds violation (paper Table 2)."""
+import esbmc
+
+def main():
+    arr = [10, 20, 30, 40, 50]
+    idx = esbmc.nondet_int()
+    esbmc.__ESBMC_assume(0 <= idx <= 5)  # 5 is out of bounds for a 5-element list
+    value = arr[idx]
+    assert value >= 10
+
+main()
diff --git a/agent/benchmark/bounds/modulo_zero.py b/agent/benchmark/bounds/modulo_zero.py
@@ -0,0 +1,12 @@
+"""Modulo by zero (paper Table 2)."""
+import esbmc
+
+def main():
+    a = esbmc.nondet_int()
+    b = esbmc.nondet_int()
+    esbmc.__ESBMC_assume(1 <= a <= 1000)
+    esbmc.__ESBMC_assume(0 <= b <= 10)
+    result = a % b
+    assert result >= 0
+
+main()