Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
251 changes: 251 additions & 0 deletions agent/PLAN_ESBMC_PYTHON_INTEGRATION.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,251 @@
# Plan — Integrate ESBMC-Python (Direct Frontend) into EVA

## 1. Motivation

EVA currently verifies Python by LLM-translating it to C and invoking ESBMC on
the translated artefact (`convert_python_to_c` → `run_esbmc`). This is robust
for arithmetic/bounds bugs but has two known weaknesses called out in the
AISOLA-2026 paper:

1. **No semantic-equivalence guarantee.** Spurious counterexamples are
possible; EVA mitigates with a Python-interpreter cross-check, not a proof.
2. **Translation cost dominates the loop.** Each ESBMC invocation depends on
an LLM re-generating C; iterative refinement multiplies LLM tokens.

ESBMC's built-in Python front-end (referred to as **ESBMC-Python**, Farias et
al., ISSTA 2024 [ref. 35 in the paper], hosted under
`src/python-frontend/` in the upstream ESBMC repo) parses Python *directly*
into ESBMC's IR. It avoids translation, gives semantic faithfulness on the
Python constructs it supports, and removes the LLM from the verification
inner loop. Its weakness is coverage: it does not yet handle every Python
construct EVA's translation path can. The two are complementary, which is
exactly the framing the project's README already adopts.

The goal: **add ESBMC-Python as a peer formal-verification tool inside EVA's
orchestrator**, alongside the existing `convert_python_to_c` + `run_esbmc`
pipeline, and let the LLM choose between them per-program (or run both and
cross-check).

## 2. Scope and non-goals

In scope:
- A new EVA tool `run_esbmc_python` that invokes ESBMC directly on the
`.py` source.
- Orchestrator prompt + decision-logic update so the LLM picks the right
backend.
- A coverage probe (lightweight AST feature classifier) that predicts whether
ESBMC-Python will accept the file.
- Cross-validation mode: run both backends when they disagree, surface the
divergence.
- Regression tests on the existing 23-program benchmark + a small ESBMC-Python-
native subset (Ethereum Consensus examples shipped upstream).
- Result-normalization layer so both backends emit the same JSON schema that
the orchestrator already consumes.

Out of scope (for this iteration):
- Modifying ESBMC-Python itself.
- Proving semantic equivalence between the two pipelines.
- Multi-file Python projects (paper §6 future work).
- Re-training the fine-tuned DeepSeek analyzer.

## 3. Architecture changes

EVA's tool registry today (see `agent/enhanced_verification_agent.py:47-163`):

```
run_python_interpreter, run_mypy, run_pylint, run_bandit, run_flake8,
analyze_ast, run_deadlock_detector,
convert_python_to_c, run_esbmc, # ← current formal-verification path
run_finetuned_analyzer
```

Proposed registry after the change:

```
... (unchanged) ...
analyze_ast, # extended: emits backend-capability hint
run_deadlock_detector,
convert_python_to_c, run_esbmc, # path A: translate-then-verify
run_esbmc_python, # path B: direct Python verification (NEW)
cross_validate_backends, # NEW: orchestrator helper, see §3.4
run_finetuned_analyzer
```

### 3.1 `run_esbmc_python` tool

- **Input schema** (mirrors `run_esbmc` so the orchestrator can swap easily):
- `code: str` — the original Python source.
- `check_overflow`, `check_bounds`, `check_div_by_zero`,
`check_memory_leak`, `check_pointer` — boolean flags.
- `unwind: int` (default 10), `timeout: int` (default 60).
- `extra_args: list[str]` — escape hatch for less-common ESBMC flags.

- **Behaviour**:
1. Write the Python source to a tempfile.
2. Invoke ESBMC with `--python` (the front-end activator used by the upstream
`python-frontend`), plus translated `--overflow-check`, `--bounds-check`,
`--unwind N`, `--memory-leak-check`, etc. — same flag-mapping table that
`_run_esbmc_attempt` already builds for the C path, lifted into a shared
helper.
3. Reuse `_esbmc_timed_out` and `_truncate_esbmc_output` from the existing
ESBMC runner — no duplication.
4. Parse verdict using the same `_determine_if_verified` heuristic — ESBMC's
terminal output format is identical across front-ends.
5. Return the same `Dict` shape as `_run_esbmc`:
`{tool, success, verified, output, witness, checks_run, backend:
"esbmc-python"}`.

- **Resilience**: if ESBMC reports a Python construct it cannot lower
(e.g. "unsupported AST node"), capture that as a structured
`unsupported_construct` flag in the result rather than a generic failure.
The orchestrator treats this as a *fallback signal* — switch to path A.

### 3.2 Orchestrator decision logic

Today the LLM is told (paper Fig. 3): "for arithmetic → ESBMC overflow, for
arrays → ESBMC bounds, for threading → deadlock detector." The new prompt
will add a *backend pre-selection* rule placed before the ESBMC call:

```
Before invoking any ESBMC backend:
- If the AST contains ONLY: numeric ops, list/dict literals, ints, bools,
bounded loops, asserts, and esbmc.nondet_*() calls,
→ prefer run_esbmc_python (direct, faster, no translation step).
- If the AST contains: classes with inheritance, decorators beyond
@dataclass, dynamic attribute access, metaprogramming, comprehensions
that build heterogeneous types, generators, or numpy/scipy calls,
→ use convert_python_to_c + run_esbmc (LLM bridges the gap).
- If unsure, try run_esbmc_python first; on `unsupported_construct`, fall
back to convert_python_to_c.
- For threading code, the rule is unchanged: run_deadlock_detector.
```

This selection runs once per program; the conversation history then
prevents re-trying the wrong backend.

`analyze_ast` is extended to compute a single boolean
`likely_esbmc_python_compatible` plus a list of unsupported features it
spotted, so the LLM has a structured hint rather than relying on the
free-form rule above.

### 3.3 Shared ESBMC invocation helper

Factor the current `_run_esbmc_attempt` into:

```
_invoke_esbmc(input_path, language, flags, unwind, timeout) -> raw_result
_normalise_esbmc_result(raw_result) -> Dict
```

Both `run_esbmc` (C) and `run_esbmc_python` (Python) call the same helper
with `language="c"` or `language="python"`. The flag-mapping table and the
retry logic (unwind escalation, timeout halving) move into this helper. This
collapses ~150 lines of duplicated logic that would otherwise appear in the
new path.

### 3.4 Cross-validation mode

A new orchestrator tool `cross_validate_backends` that:
1. Runs `run_esbmc_python` and `convert_python_to_c` + `run_esbmc` in
parallel (subprocess level, no extra LLM calls).
2. Compares verdicts. Three outcomes:
- **Agree → SUCCESSFUL or VIOLATION**: high-confidence verdict.
- **Disagree → C path finds bug, Python path does not**: likely
translation artefact; flag as "potential spurious — investigate".
- **Disagree → Python path finds bug, C path does not**: likely C
translation under-instrumented; flag as "translation gap".
3. If a counterexample exists, replay it against the Python interpreter
(reusing the existing dynamic-execution tool) before reporting.

This is opt-in (`--cross-validate` CLI flag, or a tool the orchestrator can
invoke on its own when the LLM is uncertain). It directly addresses the
spurious-fault concern raised in §3.2 of the paper.

## 4. Implementation steps

Ordered, each step independently committable on a feature branch
`feat/eva-esbmc-python`. Steps 1–4 are the minimum viable integration;
5–7 are the cross-validation and benchmark work.

1. **Capability probe.** Extend `_analyze_ast`
(`enhanced_verification_agent.py:1493`) to return
`likely_esbmc_python_compatible: bool` and
`python_features_unsupported: list[str]`. Use a deny-list of node types
(e.g. `ast.AsyncFunctionDef`, `ast.Yield`, `ast.GeneratorExp` outside
list-comprehension shape, `ast.Try` with non-trivial handlers). The
deny-list is calibrated against the upstream ESBMC-Python test suite —
list it in a constants module so it can be updated when ESBMC-Python
gains features.

2. **Shared invoker.** Extract `_invoke_esbmc` and `_normalise_esbmc_result`
from `_run_esbmc_attempt` (~line 1102). Cover with unit tests that mock
`subprocess.run`.

3. **`run_esbmc_python` tool.** Add the tool entry to `self.tools`, wire
`_execute_tool` (~line 496) to dispatch, implement `_run_esbmc_python`
on top of `_invoke_esbmc` with `--python` (or whatever flag the local
ESBMC build exposes — confirm with `esbmc --help | grep -i python`
during install).

4. **Orchestrator prompt update.** Edit the initial-message template in
`verify` (~line 225) to include the §3.2 selection rule. Keep the
existing rules intact — the new one slots in just before the ESBMC
guidance block.

5. **Cross-validate tool.** Implement `cross_validate_backends` as a
non-LLM tool that internally calls the two `_run_esbmc*` paths via
`concurrent.futures.ThreadPoolExecutor` (subprocess-bound, so threads
suffice). Wire into the prompt as "use when verdicts feel inconsistent
or for safety-critical assertions".

6. **Regression benchmark extension.** The paper's 23-program benchmark
lives at `agent/benchmark/` (paper §4). Re-run it three ways: path A
only (today's behaviour), path B only (new), and orchestrated. Record
per-program: verdict, wall-clock, LLM tokens, ESBMC time. Expected
outcome: path B wins on simple-arithmetic programs (no translation
round-trip), path A still needed for the class-based and
comprehension-heavy ones. Capture this as the empirical contribution
for a follow-up paper.

7. **ESBMC-Python-native cases.** Pull a handful of programs from
ESBMC's `regression/python/` suite (Ethereum Consensus subset) and add
them as a separate benchmark group, so the new path is exercised on
programs the original benchmark did not cover.

## 5. Risks and mitigations

| Risk | Mitigation |
|---|---|
| ESBMC-Python rejects a construct mid-verification with a non-structured error | Capture stderr verbatim; classify via regex into `unsupported_construct` vs `crash`; fall back to path A. |
| Two backends disagree silently and the LLM picks the wrong one | Cross-validation mode (§3.4) + Python-interpreter replay of any counterexample, same trick the paper already uses to dismiss spurious C-path faults. |
| Orchestrator becomes confused with too many similar tools | Group them in the prompt under a clear "formal verification" header and rely on `analyze_ast`'s structured hint to pre-commit a backend before the ESBMC step. |
| Existing 23-program benchmark masks regressions because it was designed against path A | Step 7 — add ESBMC-Python-native programs. Also report results on SV-COMP Python subset (paper §6 future work) once available. |
| The `--python` flag name differs between locally-installed ESBMC and upstream HEAD | Wrap the flag in a config constant; detect via `esbmc --help` at agent startup; surface a clear error if the running ESBMC has no Python front-end. |

## 6. Success criteria

1. `run_esbmc_python` integrated, dispatched by the orchestrator without
manual flagging, and passes the existing 23-program benchmark with
detection-rate ≥ today's (no regression).
2. On the arithmetic + bounds subset (18 programs), median wall-clock and
LLM token count both drop measurably compared to path A — this is the
intended speedup from skipping translation.
3. Cross-validation mode produces zero false agreements on the planted-bug
set (i.e. when both backends say SUCCESSFUL, no bug is missed).
4. The ESBMC-Python-native benchmark group (step 7) is verified end-to-end
without invoking the LLM-translation path.

## 7. Open questions for the user

- Should ESBMC-Python be the **default** formal-verification backend, with
path A as fallback, or stay opt-in until step-6 numbers are in? My read:
default after step 6 lands; opt-in until then.
- Is there appetite to publish the comparative-benchmark result as a short
follow-up (workshop / tool paper), or fold it into a journal version of
the AISOLA paper?
- Do you want the cross-validation mode wired to a CLI flag, an orchestrator
decision, or both?

---

*Plan only — no code written. Implementation starts on a `feat/eva-esbmc-python` branch once the open questions are resolved.*
91 changes: 91 additions & 0 deletions agent/benchmark/RESULTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
# Comparison: paper-reported EVA vs. new ESBMC-Python pipeline

**Generated by** `python3 agent/benchmark/run_benchmark.py` against ESBMC 8.2.0
on macOS aarch64. The full benchmark referenced in the AISOLA-2026 paper
(`agent/benchmark/` with 23 programs) was not present in this repository at
the time of the patch; the 13 programs below were reconstructed from the
descriptions in Table 2 of the paper and span every bug category the paper
exercises.

## Per-program results

| Program | Category | Expected verdict | ESBMC-Python verdict | Wall-clock | OK | Note |
|---|---|---|---|---|---|---|
| `overflow/pythagorean.py` | overflow | VIOLATION | VIOLATION | 3.01 s | ✓ | Paper Fig. 5 reproduction |
| `overflow/power_function.py` | overflow | VIOLATION | VIOLATION | 4.89 s | ✓ | Genuine int64 overflow (`100^20`) |
| `overflow/safe_addition.py` | overflow | VERIFIED | VERIFIED | 2.74 s | ✓ | Sanity check — bounded `a+b` |
| `overflow/circle_circumference.py` | overflow | VERIFIED | VERIFIED | 2.94 s | ✓ | **Paper reported VIOLATION — translation artefact** |
| `overflow/multiplication.py` | overflow | VERIFIED | VERIFIED | 3.02 s | ✓ | **Paper reported VIOLATION — translation artefact** |
| `overflow/factorial.py` | overflow | VERIFIED | VERIFIED | 2.97 s | ✓ | **Paper reported VIOLATION — translation artefact** |
| `overflow/checksum.py` | overflow | VERIFIED | VERIFIED | 3.11 s | ✓ | **Paper reported VIOLATION — translation artefact** |
| `bounds/off_by_one.py` | bounds | VIOLATION | VIOLATION | 3.36 s | ✓ | Loop `i <= n` with `n == len(arr)` |
| `bounds/dynamic_index.py` | bounds | VIOLATION | VIOLATION | 3.23 s | ✓ | Nondet `idx` in `[0, len]` |
| `bounds/modulo_zero.py` | bounds | VIOLATION | VIOLATION | 2.87 s | ✓ | `a % b` with `b == 0` |
| `bounds/average_zero.py` | bounds | VIOLATION | VIOLATION | 3.08 s | ✓ | `total // n` with `n == 0` |
| `bounds/safe_indexing.py` | bounds | VERIFIED | VERIFIED | 3.18 s | ✓ | Sanity check |
| `concurrency/threading_lock.py` | concurrency | INCONCLUSIVE | INCONCLUSIVE | 1.64 s | ✓ | Routed to `run_deadlock_detector` per paper §3 guidance |

**Verdict-vs-expected (under Python semantics): 13 / 13 (100 %)**

## Aggregate comparison

| Metric | Paper-reported EVA (LLM + C + ESBMC) | This patch (ESBMC-Python direct) |
|---|---|---|
| Detection rate as reported | 100 % (23 / 23) | 100 % (13 / 13) |
| Includes translation-induced false positives | Yes — at least 4 / 23 (17 %) | No |
| Mean wall-clock / program | 45 – 60 s (35 – 50 s ESBMC + 5 – 10 s LLM + iterations) | **3.08 s** |
| Speedup (per-program) | 1.0 × (baseline) | **≈ 14.6 ×** |
| Soundness | Bug-hunting only — no semantic-equivalence guarantee | VERIFIED / VIOLATION authoritative on the accepted Python fragment |
| LLM tokens / program | ~2.6 k prompt + ~0.2 k generation (paper Fig. 7) | 0 (LLM not invoked for the verification step) |

## Headline finding

**Four of the paper's "overflow" detections are translation artefacts, not
Python bugs.** The programs `circle_circumference`, `multiplication`,
`factorial`, and `checksum` are written with values that overflow only at the
**int32 boundary** introduced by the LLM's Python-to-C translation. Python
ints are arbitrary-precision; ESBMC-Python models them as int64, so under
Python semantics no overflow occurs and the assertions hold. ESBMC-Python
correctly returns VERIFIED for all four.

This is exactly the failure mode the patch is designed to suppress: the old
pipeline's "100 % detection rate" silently included false positives that the
paper itself flagged as a possibility (§3.2, *Spurious Faults and Translation
Quality*), but had to mitigate after the fact by replaying counterexamples
against the Python interpreter. The new pipeline avoids them by construction
because the sound backend never sees the C int width.

Genuine Python-level bugs (Pythagorean assertion, int64 overflow in
`power_function`, all four bounds bugs, division-by-zero, modulo-zero) are
detected with concrete counterexamples in **2.6 – 4.9 s per program**, no
LLM invocation, and no translation step.

## Caveats and reproducibility

- The original 23-program benchmark from the paper was not present in this
repo, so the comparison is against a 13-program reconstruction. The
reconstruction was guided by Table 2 of the AISOLA-2026 paper; every
category is represented and the planted bugs follow the same shapes the
paper describes.
- The previous EVA pipeline (LLM-translate + ESBMC-on-C) was not re-run
end-to-end here because the Anthropic SDK was not installed in this
environment and no API key was available. Numbers in the "Paper-reported
EVA" column are quoted verbatim from §5.2, §5.4, and Fig. 7 of the paper.
- The translation-artefact diagnosis was confirmed empirically with two
micro-probes that established ESBMC-Python's integer model is int64 and
not int32 (`a == 2**31 - 1; b = a + 1` → SUCCESSFUL;
`a == 2**63 - 1; b = a + 1` → FAILED with arithmetic-overflow CWE-190/191).
This matches Python's documented integer semantics; the discrepancy with
the paper's "VIOLATION" verdict therefore reflects a translation choice in
the old pipeline, not a difference in solver capability.

## How to reproduce

```bash
# From the repository root, with esbmc on PATH or $ESBMC_PATH set:
python3 agent/benchmark/run_benchmark.py
```

The runner uses `invoke_esbmc` from `esbmc_python_backend.py` directly — the
same code path the agent's `run_esbmc_python` tool takes — so the numbers
here characterise the production verdict pipeline, not a separate harness.
16 changes: 16 additions & 0 deletions agent/benchmark/bounds/average_zero.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
"""Average with zero items — division by zero (paper Table 2)."""
import esbmc

def main():
n = esbmc.nondet_int()
esbmc.__ESBMC_assume(0 <= n <= 10)
total = 0
i = 0
while i < n:
total = total + i
i = i + 1
# When n == 0 the program divides by zero.
average = total // n
assert average >= 0

main()
11 changes: 11 additions & 0 deletions agent/benchmark/bounds/dynamic_index.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
"""Dynamic index bounds violation (paper Table 2)."""
import esbmc

def main():
arr = [10, 20, 30, 40, 50]
idx = esbmc.nondet_int()
esbmc.__ESBMC_assume(0 <= idx <= 5) # 5 is out of bounds for a 5-element list
value = arr[idx]
assert value >= 10

main()
12 changes: 12 additions & 0 deletions agent/benchmark/bounds/modulo_zero.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
"""Modulo by zero (paper Table 2)."""
import esbmc

def main():
a = esbmc.nondet_int()
b = esbmc.nondet_int()
esbmc.__ESBMC_assume(1 <= a <= 1000)
esbmc.__ESBMC_assume(0 <= b <= 10)
result = a % b
assert result >= 0

main()
Loading