Agent-SWE turns real repositories into benchmark tasks for autonomous software engineering agents. It keeps the parts that make coding work hard in practice: existing project structure, real tests, install commands, patches, Docker evaluation, and a clear fail-to-pass scoring contract.
The synthetic task pipeline is inspired by Cursor's public writing on Composer, Composer 2, and Composer 2.5. Cursor described training coding agents on tasks grounded in real codebases, including a feature-deletion style setup: remove a testable behavior, ask the agent to restore it, and use tests as the reward signal. Agent-SWE adapts that idea into an open benchmark-generation workflow for Platform agents.
This project is not affiliated with Cursor. It is an implementation inspired by the public methodology described in their posts and reports.
Most coding benchmarks are either real but scarce, or synthetic but too detached from real development. Agent-SWE aims for the middle ground: tasks are synthetic enough to scale, but grounded enough that agents still need to inspect a real repository, understand context, edit code, and run tests.
A good Agent-SWE task should answer three questions:
- Can the agent understand the existing codebase?
- Can it restore the intended behavior without seeing the oracle patch?
- Can the result pass both targeted reward tests and regression tests?
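The three questions above reduce to a fail-to-pass check plus a regression check. The following is a minimal sketch of that contract, not Agent-SWE's actual scorer: `CommandOutcome` and `is_resolved` are hypothetical names, and the sketch assumes each test command has been run once on the mutated repository and once after the agent's patch.

```python
from dataclasses import dataclass


@dataclass
class CommandOutcome:
    """Hypothetical record of one test command (not an Agent-SWE type)."""

    command: str
    passed_before: bool  # result on the mutated repo, before the agent's patch
    passed_after: bool   # result after the agent's patch is applied


def is_resolved(fail_to_pass: list, pass_to_pass: list) -> bool:
    """Assumed contract: targeted tests flip from fail to pass, regressions stay green."""
    targeted_ok = all(
        (not outcome.passed_before) and outcome.passed_after
        for outcome in fail_to_pass
    )
    regression_ok = all(outcome.passed_after for outcome in pass_to_pass)
    # A task with no targeted tests carries no reward signal, so it is not resolved.
    return bool(fail_to_pass) and targeted_ok and regression_ok
```

Requiring `passed_before` to be false guards against tasks whose reward tests never failed in the first place, which would make the deletion meaningless.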
Cursor's Composer work is the main public inspiration for the synthetic path in Agent-SWE:
- Composer: Building a fast frontier model with RL
- Introducing Composer 2
- A technical report on Composer 2
- Composer 2 Technical Report PDF
- Introducing Composer 2.5
The important idea is simple: instead of only collecting issues and pull requests, generate new tasks from real repositories. In the feature-deletion variant, a known behavior is removed from the codebase, the inverse patch becomes the oracle solution, and tests define whether the agent recovered the behavior.
Agent-SWE currently implements this idea for Python functions and methods. It keeps the public signature, replaces the body with a synthetic failure, writes that mutation to deletion_patch.diff, and stores the inverse repair as patch.diff.
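The mutation described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in src/swe_forge/synthetic/: the function name `delete_function_body` is hypothetical, `NotImplementedError` stands in for whatever synthetic failure Agent-SWE actually writes, and the sketch assumes a multi-line function in a file that ends with a newline.

```python
import ast
import difflib


def delete_function_body(source: str, symbol: str, path: str = "module.py"):
    """Return (mutated_source, deletion_patch, oracle_patch) for one function.

    Hypothetical helper: keeps the signature and swaps the body for a stub.
    """
    tree = ast.parse(source)
    target = next(
        (
            node
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            and node.name == symbol
        ),
        None,
    )
    if target is None:
        raise ValueError(f"symbol not found: {symbol}")

    lines = source.splitlines(keepends=True)
    first_stmt = target.body[0]
    body_start = first_stmt.lineno - 1  # 0-based index of the first body line
    body_end = target.end_lineno        # exclusive index just past the body
    indent = " " * first_stmt.col_offset
    # Assumed failure style; the real pipeline may raise something else.
    stub = f'{indent}raise NotImplementedError("{symbol} was removed")\n'
    mutated = lines[:body_start] + [stub] + lines[body_end:]

    # The deletion patch removes the behavior; the inverse diff restores it.
    deletion_patch = "".join(
        difflib.unified_diff(lines, mutated, f"a/{path}", f"b/{path}")
    )
    oracle_patch = "".join(
        difflib.unified_diff(mutated, lines, f"a/{path}", f"b/{path}")
    )
    return "".join(mutated), deletion_patch, oracle_patch
```

In the terms used above, `deletion_patch` plays the role of deletion_patch.diff and `oracle_patch` plays the role of patch.diff. Because one is the exact inverse of the other, applying both in sequence returns the original file.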
Agent-SWE supports two sources of benchmark tasks:
- Real pull requests mined from GitHub and converted into SWE-style workspaces.
- Synthetic feature-deletion tasks generated from real repositories, inspired by the public Composer 2.5 training method.
Both flows export a workspace that can be evaluated in Docker. The agent being tested should never see the oracle patch or hidden benchmark files.
flowchart LR
Repo[Real repo] --> Build[Build task]
Build --> Export[Export workspace]
Export --> Run[Docker eval]
Run --> Score[Task score]
Score --> Plat[Platform]
git clone https://github.com/PlatformNetwork/Agent-SWE.git
cd Agent-SWE
pip install -e ".[dev]"
Set the tokens used by the mining and LLM-assisted parts of the pipeline:
export GITHUB_TOKEN="ghp_..."
export OPENROUTER_API_KEY="************"
Use this when you want SWE-bench style tasks from GitHub pull requests.
swe-forge mine mine \
--target 10 \
--output ./tasks.jsonl \
--output-folder ./tasks \
--parallel 8
Use this for a known repository and PR number.
swe-forge mine complete \
--repo owner/repo \
--pr 12345 \
--output ./tasks.jsonl \
--model openai/gpt-5.4
Use this when you already have a local checkout and know which Python function or method should be removed.
git clone https://github.com/owner/repo.git ./target-repo
swe-forge synthetic generate \
--repo-path ./target-repo \
--repo owner/repo \
--source-file src/package/module.py \
--symbol target_function \
--fail-to-pass "pytest tests/test_target.py -v" \
--pass-to-pass "pytest tests/ -v" \
--install-command "pip install -e ." \
--output-folder ./synthetic_tasks \
--output-jsonl ./synthetic_tasks.jsonl \
--overwrite
Use this to confirm that a generated task is valid with its gold solution.
python3 scripts/run_evaluation.py \
--predictions_path gold \
--instance_ids owner-repo-1234 \
--max_workers 4
Use this after an agent has produced patches.
python3 scripts/run_evaluation.py \
--predictions_path predictions.jsonl \
--max_workers 4
predictions.jsonl contains one prediction per line:
{"instance_id": "owner-repo-1234", "model_patch": "diff --git a/..."}
A task workspace is the portable benchmark unit:
tasks/
└── owner-repo-1234/
    ├── workspace.yaml
    ├── patch.diff
    ├── deletion_patch.diff
    ├── test_patch.diff
    ├── tests/
    ├── run_tests.sh
    └── evaluate.sh
The files have different audiences:
- workspace.yaml describes the task, repo, install commands, tests, and synthetic metadata.
- patch.diff is the oracle solution and must be hidden from the evaluated agent.
- deletion_patch.diff is the synthetic mutation applied before evaluation.
- tests/ contains generated or extracted benchmark tests.
- evaluate.sh is a simple local scoring script.
For details, read docs/architecture/workspace-format.md.
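Since patch.diff must stay hidden from the evaluated agent, one way to hand over a workspace is to copy it without the oracle file. The sketch below is an assumption about how that could be done, not the project's export code: `export_agent_view` and `HIDDEN_FROM_AGENT` are hypothetical names, and only patch.diff is treated as hidden because it is the only file the text names explicitly.

```python
import shutil
from pathlib import Path

# Assumption: only the oracle patch is hidden. Extend this tuple if other
# benchmark files must also be withheld from the agent.
HIDDEN_FROM_AGENT = ("patch.diff",)


def export_agent_view(
    workspace: Path, destination: Path, hidden=HIDDEN_FROM_AGENT
) -> Path:
    """Copy a task workspace, skipping files the agent must not see."""
    # ignore_patterns matches whole file names, so "patch.diff" does not
    # also exclude deletion_patch.diff or test_patch.diff.
    shutil.copytree(
        workspace, destination, ignore=shutil.ignore_patterns(*hidden)
    )
    return destination
```

Copying to a fresh directory, rather than deleting files in place, keeps the original workspace intact for the later gold-solution check.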
The architecture docs in docs/architecture/ explain how the pieces fit together.
ruff format src/ tests/
ruff check src/ tests/
mypy src/
pytest tests/ -v
Agent-SWE/
├── assets/
├── datasets/
├── docs/
│   └── architecture/
├── scripts/
├── src/swe_forge/
│   ├── cli/
│   ├── docker_test/
│   ├── export/
│   ├── swe/
│   └── synthetic/
└── tests/
Agent-SWE is designed to feed Platform challenge validators with deterministic repository-repair tasks. Validators can sample tasks, run agent patches in isolated workspaces, and turn task completion rates into raw challenge scores for Platform.
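As a rough illustration of the validator loop described above, the sketch below samples tasks reproducibly and turns per-task results into a completion rate. Both function names are hypothetical, and the way Platform actually samples tasks or weights scores is not specified here; the plain fraction is an assumption.

```python
import random


def sample_tasks(task_ids: list, k: int, seed: int) -> list:
    """Pick k task ids reproducibly; the same seed yields the same sample."""
    # Sorting first makes the result independent of the input order.
    return random.Random(seed).sample(sorted(task_ids), k)


def completion_rate(resolved_by_task: dict) -> float:
    """Fraction of sampled tasks the agent resolved (assumed raw score)."""
    if not resolved_by_task:
        return 0.0
    return sum(resolved_by_task.values()) / len(resolved_by_task)
```

Seeding the sampler lets two validators that share a seed evaluate the same task subset, which keeps their scores comparable.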
Apache-2.0
