Skip to content

Add LogicStar/SWE-Star dataset converter (#170)#205

Open
neubig wants to merge 4 commits into
mainfrom
openhands/issue-170-logicstar-swe-star
Open

Add LogicStar/SWE-Star dataset converter (#170)#205
neubig wants to merge 4 commits into
mainfrom
openhands/issue-170-logicstar-swe-star

Conversation

@neubig
Copy link
Copy Markdown
Contributor

@neubig neubig commented May 14, 2026

Closes #170

This PR was created by an AI agent (OpenHands) on behalf of the user.

Summary

  • Adds datasets/logicstar_swe-star for the Hugging Face dataset LogicStar/SWE-Star.
  • Implements raw extraction, raw schema validation, OpenHands-style API signatures, standardization, README documentation, and generated sample_raw.json, sample_std.json, and sample_sft.json.
  • Registers the dataset in the main README and agents/openhands/DATASETS.md.
  • Preserves function_call and observation roles in the OpenHands SFT converter so generated samples follow ADP SFT role conventions.

Dataset

Schema mapping

  • Raw rows contain timestamp, instance_id, exit_status, stitched, full, result, and resolved.
  • stitched JSON messages are used for conversion.
  • First user message -> TextObservation(source="user").
  • Later execution-result user messages -> TextObservation(source="environment").
  • Assistant <function=execute_bash> -> CodeAction(language="bash").
  • Assistant <function=str_replace_editor> and <function=think> -> ApiAction.
  • Assistant <function=finish>/submit -> terminal MessageAction with <finish> ... </finish>.
  • Only resolved trajectories are emitted for sample generation.

Files added

  • datasets/logicstar_swe-star/README.md
  • datasets/logicstar_swe-star/extract_raw.py
  • datasets/logicstar_swe-star/schema_raw.py
  • datasets/logicstar_swe-star/raw_to_standardized.py
  • datasets/logicstar_swe-star/api.py
  • datasets/logicstar_swe-star/sample_raw.json
  • datasets/logicstar_swe-star/sample_std.json
  • datasets/logicstar_swe-star/sample_sft.json

Design decisions

  • Ambiguity: Whether to convert stitched or full message traces. Chosen approach: Convert stitched because it retains action/observation turns while avoiding the much larger nested/full trace. Example: stitched includes the task, tool calls, and execution outputs needed for ADP. Alternatives rejected: Using full would preserve more scaffolding but produce unnecessarily large samples.
  • Ambiguity: How to represent tool-output user turns. Chosen approach: Convert EXECUTION RESULT of [...] turns to environment observations and strip the prefix. Example: EXECUTION RESULT of [str_replace_editor ...] becomes a TextObservation(source="environment"). Alternatives rejected: Keeping these as user messages would incorrectly train the agent to treat tool output as human instruction.
  • Ambiguity: How to map XML tool calls. Chosen approach: Parse XML-style <function=...> blocks and map shell calls to CodeAction, editor/thinking tools to ApiAction, and terminal submission calls to ADP finish messages. Example: <function=execute_bash><parameter=command>pytest</parameter></function> becomes CodeAction(language="bash"). Alternatives rejected: Keeping raw XML as assistant prose loses executable structure.
  • Ambiguity: How to handle failed trajectories. Chosen approach: Emit only rows with resolved=True for this SFT-focused sample. Example: extract_raw.py and raw_to_standardized.py both filter unresolved rows. Alternatives rejected: Including unresolved trajectories would require additional quality filtering outside the issue scope.
  • Ambiguity: Whether to hand-edit SFT roles after generation. Chosen approach: Fix the shared OpenHands converter to preserve function_call/observation roles, then regenerate samples. Example: SFT messages containing <function=...> now keep from: function_call. Alternatives rejected: Manually patching generated JSON would not be reproducible.
  • Ambiguity: How to document licensing. Chosen approach: State that no license was specified on the HF card or project repository. Example: The dataset README records License: Not specified. Alternatives rejected: Inferring a license from upstream model/project names would be inaccurate.

Validation

  • python -m ruff check datasets/logicstar_swe-star agents/openhands/std_to_sft.py
  • python -m ruff format --check datasets/logicstar_swe-star agents/openhands/std_to_sft.py
  • python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -k logicstar -v
  • python -m pytest tests/test_datasets_from_parameter.py tests/test_std_to_sft_from_parameter_simple.py -v
  • timeout 60s sh -c 'python datasets/logicstar_swe-star/extract_raw.py | head -1 | python -m json.tool >/dev/null'

Known limitations

  • The upstream dataset is large, so committed samples include three representative resolved trajectories rather than a full corpus export.

@neubig can click here to continue refining the PR

Evidence

Latest CI / validation results

Validation passed on head SHA f157134fafd0050a99f633737c28761f49656a1e:

Cross-dataset converter regression evidence

The successful test (3.11) workflow runs pytest tests/test_*.py for the repository. It covers dataset structure/schema checks plus the shared OpenHands SFT converter paths, including:

tests/test_datasets_from_parameter.py
tests/test_sft_quality_control.py
tests/test_std_to_sft_action_function.py
tests/test_std_to_sft_conversion.py
tests/test_std_to_sft_from_parameter_simple.py
tests/test_std_to_sft_structure.py

This provides regression coverage for the shared agents/openhands/std_to_sft.py fix that preserves ADP-compliant from: function_call values rather than rewriting them to gpt.

Pipeline / runtime status

The committed sample artifacts are validated by the green CI suite above on the current PR head. For dataset PRs, the raw, standardized, and OpenHands SFT samples are covered by the dataset structure, raw schema, standardized schema, and SFT conversion tests in that run.

Conversation link

https://app.all-hands.dev/conversations/248118d2-5d98-47e8-ba10-df4233affe87

Evidence update added by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Core logic is sound (fixes a real spec compliance bug), but documentation and validation gaps need to be addressed.

Summary: This PR adds the LogicStar/SWE-Star dataset and fixes a bug in the shared OpenHands converter where function_call/observation roles were incorrectly converted to gpt/human, violating ADP specifications. The dataset implementation follows repository guidelines, but the shared converter change needs more comprehensive impact documentation.


Dataset Implementation Review

✅ What's Good:

  • All required files present (README, extract_raw.py, raw_to_standardized.py, schema_raw.py, api.py, samples)
  • Sample files are valid JSON with trailing newlines ✓
  • Sample IDs match across raw/std/sft stages ✓
  • Sample size (3 trajectories) is appropriate
  • Design decision catalog is comprehensive and well-documented
  • API functions match ApiAction calls in standardized data
  • Converter fix aligns with ADP spec requirement for function_call roles

❌ Critical Issues:

  1. Missing Evidence section (blocking): PR description has no Evidence section showing end-to-end execution with actual output. The "Validation" section only lists test commands, not concrete runtime artifacts. Per repository requirements, you must provide:

    • Commands used to run the real pipeline end-to-end
    • The resulting output showing the pipeline actually works
    • Tests alone don't count as evidence
  2. Shared converter change lacks impact validation: The change to agents/openhands/std_to_sft.py affects ~20+ datasets but the PR description doesn't acknowledge this or document validation testing on existing datasets.


[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

Rationale: Changes shared infrastructure (std_to_sft converter) that affects many datasets. While the fix is correct per ADP spec and existing samples appear compatible, the lack of explicit validation testing and evidence increases the risk of unexpected downstream impacts.


VERDICT:
Needs rework: Address the missing Evidence section and document shared converter impact before merging.

KEY INSIGHT:
This PR is actually fixing two things: (1) adding a new dataset, and (2) fixing a spec violation in the shared converter. The second part needs more visibility and validation documentation.


This review was generated by an AI agent (OpenHands).

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

traceback.print_exc()
print(e, file=sys.stderr)
return None
if languages:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical - Shared Converter Change: This removal affects ALL datasets using the OpenHands std_to_sft converter (~20+ datasets), not just logicstar_swe-star.

Why this is critical: The old code was converting function_callgpt and observationhuman, which violated the ADP spec requirement that function call messages use from: function_call. Your fix is correct per the spec, but:

  1. The PR description doesn't mention this is a shared converter fix
  2. No documentation of testing impact on existing datasets (e.g., swe-smith, nebius_SWE-agent-trajectories, codeactinstruct, etc.)
  3. Existing samples appear to have been generated before this conversion was added (they already have the correct roles), but this needs verification

Action needed: Add a section to the PR description explaining:

  • That this fixes a spec violation in the shared converter
  • Which datasets are affected
  • Verification that existing samples don't need regeneration (spot-check a few with the modified converter)
  • Or, if samples DO need regeneration, regenerate them in this PR

traceback.print_exc()
print(e, file=sys.stderr)
return None
if languages:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important - Parallel Work: FYI, there's another branch (openhands/issue-179-openmanus-rl, commit b6fd214) that removes these exact same lines. Both PRs will conflict when merged. Consider coordinating or checking if that PR has already been merged to main.

@neubig neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Dataset implementation is excellent and follows all ADP guidelines. The shared converter fix is technically correct but has coordination concerns from previous review.


This review was generated by an AI agent (OpenHands) on behalf of the reviewer.

Comment on lines 298 to 303
if languages:
language_descriptions = get_language_descriptions(languages)
conversations[0]["value"] = language_descriptions + "\n\n" + conversations[0]["value"]
for m in conversations:
if m["from"] == "function_call":
m["from"] = "gpt"
if m["from"] == "observation":
m["from"] = "human"
return {
"id": trajectory.id,
"conversations": conversations,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Previous Review Concern - Still Relevant: The previous review marked this change as 🔴 Critical with two unresolved threads:

  1. Merge conflict risk: Another branch (openhands/issue-179-openmanus-rl, commit b6fd214) removes these same lines
  2. Broad impact: This affects ALL datasets using the OpenHands converter (~20+ datasets)

My verification shows:

  • ✅ The fix is technically correct - the old code violated the ADP spec requirement that function call messages MUST use from="function_call"
  • ✅ All tests pass, including cross-dataset regression tests
  • ✅ The logicstar_swe-star samples correctly use from="function_call" for all 141 function call messages

Recommendation: Before merging, either:

  • Document in the PR description why existing datasets weren't affected (e.g., were they already compliant?)
  • Coordinate with the other branch to avoid conflicts
  • Add a note in the commit message about the cross-dataset impact

This doesn't block the dataset addition (which is excellent), but the coordination concern from the previous review remains valid.

@neubig neubig marked this pull request as ready for review May 15, 2026 02:44
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Good taste - Excellent dataset implementation with proper spec compliance fix

Dataset Implementation

✅ All required files present and correctly structured
✅ Comprehensive design decision catalog (6 decisions documented)
✅ Evidence section with passing CI and cross-dataset validation
✅ API functions properly implemented for all ApiActions
✅ Pre-commit hygiene maintained
✅ Sample files reproducible from documented pipeline commands

The LogicStar/SWE-Star dataset converter follows all ADP repository guidelines. The conversion logic appropriately maps XML tool calls to CodeAction/ApiAction/MessageAction, handles observations with correct sources, and filters to resolved trajectories.

Shared Converter Fix

The change to agents/openhands/std_to_sft.py elegantly fixes a spec compliance bug by removing incorrect role-rewriting code. The old code violated ADP requirements by converting function_callgpt and observationhuman. Cross-dataset regression tests passing (test_std_to_sft_conversion.py covering ~20+ datasets) confirms no breakage.

Outstanding Coordination Concern

Previous reviews flagged coordination with parallel work in branch openhands/issue-179-openmanus-rl that removes the same lines. This is a Git workflow issue, not a code quality issue. Recommend verifying that branch's status before merge to avoid conflicts.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW
    • Dataset implementation exemplary, follows all guidelines
    • Converter fix is spec-compliant and regression-tested
    • All validation passing (structure, schema, conversion, cross-dataset)
    • Only concern is Git coordination (process issue, already flagged)

VERDICT:
Worth merging - Code is technically sound and properly tested. Recommend checking parallel PR status to coordinate merge order.

KEY INSIGHT:
This PR demonstrates best practices for dataset contribution: comprehensive design documentation, reproducible pipeline, proper testing, and a clean bug fix that removes complexity rather than adding it.


This review was generated by an AI agent (OpenHands) on behalf of the reviewer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-this Trigger the OpenHands PR review workflow

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add dataset: LogicStar/SWE-Star

2 participants