αgεηt SWE

Real-code software engineering benchmarks for Platform agents

Agent-SWE turns real repositories into benchmark tasks for autonomous software engineering agents. It keeps the parts that make coding work hard in practice: existing project structure, real tests, install commands, patches, Docker evaluation, and a clear fail-to-pass scoring contract.

The synthetic task pipeline is inspired by Cursor's public writing on Composer, Composer 2, and Composer 2.5. Cursor described training coding agents on tasks grounded in real codebases, including a feature-deletion style setup: remove a testable behavior, ask the agent to restore it, and use tests as the reward signal. Agent-SWE adapts that idea into an open benchmark-generation workflow for Platform agents.

This project is not affiliated with Cursor. It is an implementation inspired by the public methodology described in their posts and reports.

Why Agent-SWE Exists

Most coding benchmarks are either real but scarce, or synthetic but too detached from real development. Agent-SWE aims for the middle ground: tasks are synthetic enough to scale, but grounded enough that agents still need to inspect a real repository, understand context, edit code, and run tests.

A good Agent-SWE task should answer three questions:

Can the agent understand the existing codebase?
Can it restore the intended behavior without seeing the oracle patch?
Can the result pass both targeted reward tests and regression tests?
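
The third question can be read as a fail-to-pass contract. The sketch below is one way to express it, not Agent-SWE's actual evaluator: `CommandOutcome` and `is_resolved` are hypothetical names, and it assumes each test command is run once on the mutated repository and once after the agent's patch is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandOutcome:
    """Result of one test command before and after the agent's patch (hypothetical)."""

    command: str
    passed_before: bool
    passed_after: bool


def is_resolved(fail_to_pass, pass_to_pass):
    """Return True when the patch restores behavior without regressions."""
    # Reward tests must fail on the mutated repo and pass once patched.
    restored = all(not t.passed_before and t.passed_after for t in fail_to_pass)
    # Regression tests must pass both before and after the patch.
    stable = all(t.passed_before and t.passed_after for t in pass_to_pass)
    # A task with no reward tests carries no signal, so it is never resolved.
    return bool(fail_to_pass) and restored and stable
```

Requiring the reward tests to fail before the patch guards against tasks where the deleted behavior was never exercised by the tests in the first place.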

Inspired by Cursor Composer

Cursor's Composer work is the main public inspiration for the synthetic path in Agent-SWE.

The important idea is simple: instead of only collecting issues and pull requests, generate new tasks from real repositories. In the feature-deletion variant, a known behavior is removed from the codebase, the inverse patch becomes the oracle solution, and tests define whether the agent recovered the behavior.

Agent-SWE currently implements this idea for Python functions and methods. It keeps the public signature, replaces the body with a synthetic failure, writes that mutation to deletion_patch.diff, and stores the inverse repair as patch.diff.
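
As a rough illustration of that mutation step, the sketch below uses only the standard library (`ast` and `difflib`). It is not the project's implementation: `delete_feature` is a hypothetical helper, the failure it injects (`NotImplementedError`) is an assumed choice, and it only handles functions whose body starts on its own line.

```python
import ast
import difflib


def delete_feature(source, symbol, path="module.py"):
    """Replace the body of `symbol` with a synthetic failure.

    Returns (mutated_source, deletion_patch, oracle_patch).
    """
    tree = ast.parse(source)
    # First function or method with a matching name, top-level or nested.
    target = next(
        (
            node
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            and node.name == symbol
        ),
        None,
    )
    if target is None:
        raise ValueError(f"symbol not found: {symbol}")
    first = target.body[0]
    if first.lineno == target.lineno:
        raise ValueError("one-line definitions are not handled in this sketch")

    lines = source.splitlines(keepends=True)
    start = first.lineno - 1          # 0-based index of the first body line
    end = target.body[-1].end_lineno  # exclusive index just past the body
    indent = " " * first.col_offset
    stub = f'{indent}raise NotImplementedError("{symbol} was removed")\n'
    # The signature (and any decorators) above `start` stay untouched.
    mutated_lines = lines[:start] + [stub] + lines[end:]

    def unified(old, new):
        return "".join(
            difflib.unified_diff(old, new, fromfile=f"a/{path}", tofile=f"b/{path}")
        )

    deletion_patch = unified(lines, mutated_lines)  # original -> mutated
    oracle_patch = unified(mutated_lines, lines)    # mutated -> original
    return "".join(mutated_lines), deletion_patch, oracle_patch
```

In this sketch the two diffs are exact inverses, which mirrors the relationship between deletion_patch.diff and patch.diff described above.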

What Agent-SWE Does

Agent-SWE supports two sources of benchmark tasks:

Real pull requests mined from GitHub and converted into SWE-style workspaces.
Synthetic feature-deletion tasks generated from real repositories, inspired by the public Composer 2.5 training method.

Both flows export a workspace that can be evaluated in Docker. The agent being tested should never see the oracle patch or hidden benchmark files.

flowchart LR
    Repo[Real repo] --> Build[Build task]
    Build --> Export[Export workspace]
    Export --> Run[Docker eval]
    Run --> Score[Task score]
    Score --> Plat[Platform]

Install

git clone https://github.com/PlatformNetwork/Agent-SWE.git
cd Agent-SWE
pip install -e ".[dev]"

Set the tokens used by the mining and LLM-assisted parts of the pipeline:

export GITHUB_TOKEN="ghp_..."
export OPENROUTER_API_KEY="************"

Commands

Mine real PR tasks

Use this when you want SWE-bench style tasks from GitHub pull requests.

swe-forge mine mine \
  --target 10 \
  --output ./tasks.jsonl \
  --output-folder ./tasks \
  --parallel 8

Verify one pull request end-to-end

Use this for a known repository and PR number.

swe-forge mine complete \
  --repo owner/repo \
  --pr 12345 \
  --output ./tasks.jsonl \
  --model openai/gpt-5.4

Generate a synthetic feature-deletion task

Use this when you already have a local checkout and know which Python function or method should be removed.

git clone https://github.com/owner/repo.git ./target-repo

swe-forge synthetic generate \
  --repo-path ./target-repo \
  --repo owner/repo \
  --source-file src/package/module.py \
  --symbol target_function \
  --fail-to-pass "pytest tests/test_target.py -v" \
  --pass-to-pass "pytest tests/ -v" \
  --install-command "pip install -e ." \
  --output-folder ./synthetic_tasks \
  --output-jsonl ./synthetic_tasks.jsonl \
  --overwrite

Evaluate the oracle patch

Use this to confirm that a generated task is valid with its gold solution.

python3 scripts/run_evaluation.py \
  --predictions_path gold \
  --instance_ids owner-repo-1234 \
  --max_workers 4

Evaluate model predictions

Use this after an agent has produced patches.

python3 scripts/run_evaluation.py \
  --predictions_path predictions.jsonl \
  --max_workers 4

predictions.jsonl contains one prediction per line:

{"instance_id": "owner-repo-1234", "model_patch": "diff --git a/..."}
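
A small helper pair like the following can produce and sanity-check that file before handing it to the evaluation script. Only the two keys `instance_id` and `model_patch` come from the format above; the function names are hypothetical and not part of Agent-SWE.

```python
import json
from pathlib import Path


def write_predictions(path, predictions):
    """Write a {instance_id: model_patch} mapping as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for instance_id, model_patch in predictions.items():
            record = {"instance_id": instance_id, "model_patch": model_patch}
            handle.write(json.dumps(record) + "\n")


def read_predictions(path):
    """Load predictions and reject lines that lack a required key."""
    predictions = {}
    with Path(path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"instance_id", "model_patch"} - record.keys()
            if missing:
                raise ValueError(f"line {number}: missing {sorted(missing)}")
            predictions[record["instance_id"]] = record["model_patch"]
    return predictions
```

Because `json.dumps` escapes newlines, a multi-line diff still occupies exactly one line of the file.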

Workspace Format

A task workspace is the portable benchmark unit:

tasks/
└── owner-repo-1234/
    ├── workspace.yaml
    ├── patch.diff
    ├── deletion_patch.diff
    ├── test_patch.diff
    ├── tests/
    ├── run_tests.sh
    └── evaluate.sh

The files have different audiences:

workspace.yaml describes the task, repo, install commands, tests, and synthetic metadata.
patch.diff is the oracle solution and must be hidden from the evaluated agent.
deletion_patch.diff is the synthetic mutation applied before evaluation.
tests/ contains generated or extracted benchmark tests.
evaluate.sh is a simple local scoring script.

For details, read docs/architecture/workspace-format.md.
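
One way to respect those audiences is to copy a filtered view of the workspace before handing it to an agent. This is a sketch under stated assumptions: `export_agent_view` is a hypothetical helper, and the default hidden set contains only patch.diff because that is the only file the list above names as hidden. Callers would extend the set with whatever else their setup treats as hidden benchmark files.

```python
import shutil
from pathlib import Path


def export_agent_view(task_dir, dest_dir, hidden=frozenset({"patch.diff"})):
    """Copy a task workspace, skipping top-level entries the agent must not see.

    Returns the names of the entries that were copied, in sorted order.
    """
    task_dir, dest_dir = Path(task_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(task_dir.iterdir()):
        if entry.name in hidden:
            continue  # oracle material stays with the evaluator
        target = dest_dir / entry.name
        if entry.is_dir():
            shutil.copytree(entry, target, dirs_exist_ok=True)
        else:
            shutil.copy2(entry, target)
        copied.append(entry.name)
    return copied
```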

Documentation

The architecture docs under docs/architecture/ explain how the pieces fit together.

Development

ruff format src/ tests/
ruff check src/ tests/
mypy src/
pytest tests/ -v

Repository Layout

Agent-SWE/
├── assets/
├── datasets/
├── docs/
│   └── architecture/
├── scripts/
├── src/swe_forge/
│   ├── cli/
│   ├── docker_test/
│   ├── export/
│   ├── swe/
│   └── synthetic/
└── tests/

Platform Integration

Agent-SWE is designed to feed Platform challenge validators with deterministic repository-repair tasks. Validators can sample tasks, run agent patches in isolated workspaces, and turn task completion rates into raw challenge scores for Platform.
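
The sampling and scoring steps might look like the sketch below. It is illustrative only: `sample_tasks` and `completion_rate` are hypothetical names, the seeded sampler is an assumed way to keep validators deterministic, and the actual Platform scoring formula is not specified here.

```python
import random


def sample_tasks(task_ids, k, seed):
    """Pick up to k task ids reproducibly, independent of input order."""
    pool = sorted(task_ids)  # sort first so the same seed gives the same sample
    return random.Random(seed).sample(pool, min(k, len(pool)))


def completion_rate(results):
    """Fraction of tasks resolved; `results` maps instance_id -> bool."""
    if not results:
        return 0.0
    return sum(1 for resolved in results.values() if resolved) / len(results)
```

Two validators that share a seed and a task pool would then score an agent on the same subset of tasks.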

License

Apache-2.0
