Add ToolMind dataset converter (#169)#201
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
There was a problem hiding this comment.
🟡 Acceptable with Critical Updates Needed
This is a thorough and well-documented dataset implementation. The code quality is excellent, all tests pass, and the design decisions are comprehensively documented. However, two critical issues must be addressed:
[CRITICAL ISSUES]
-
Missing Evidence in PR Description (Must fix before merge):
- The Validation section lists commands but doesn't show actual test output
- No conversation link provided despite being agent-generated work
- Required: Add an
Evidencesection showing:- Actual output from running the validation tests (not just the commands)
- The end-to-end pipeline commands and their output (raw → std → sft)
- Link to the OpenHands conversation that created this PR:
https://app.all-hands.dev/conversations/{conversation_id}
-
Breaking Change Insufficiently Documented (See inline comment):
- The
std_to_sft.pychange affects ALL OpenHands datasets, not just ToolMind - This cross-cutting impact needs prominent callout in PR description
- The
[WHAT'S GOOD]
- ✅ All required files present with correct structure
- ✅ IDs match perfectly across sample stages
- ✅ Observation sources use only schema-valid values
- ✅ ApiActions match api.py functions
- ✅ SFT format correctly preserves function_call roles
- ✅ Pipeline is deterministic and reproducible
- ✅ Design decisions are exemplary
- ✅ The available_apis fix (api.py) cleanly prevents typing module pollution
[RISK ASSESSMENT]
🟡 MEDIUM - Breaking change to shared converter requires validation that all existing datasets still work correctly.
VERDICT:
❌ Needs rework: Add evidence section to PR description, then address breaking change documentation.
KEY INSIGHT:
The technical implementation is solid, but the PR documentation doesn't meet the repository's evidence standards for substantive changes. Once evidence is added and the breaking change is properly documented, this will be ready to merge.
This review was generated by an AI agent (OpenHands) on behalf of the reviewer.
| m["from"] = "gpt" | ||
| if m["from"] == "observation": | ||
| m["from"] = "human" | ||
| return { |
There was a problem hiding this comment.
🔴 Critical - Cross-Dataset Breaking Change
Removing the role remapping (which previously converted function_call → gpt and observation → human) affects ALL datasets using OpenHands SFT conversion, not just ToolMind.
While this change is correct per the repository guidelines (preserving function_call role is required), it's a breaking change that modifies the output format for every existing OpenHands-converted dataset.
Required before merge:
- Add a Breaking Changes section in the PR description stating:
- This changes the SFT output format for all OpenHands-converted datasets
- All
sample_sft.jsonfiles with OpenHands conversion will preservefunction_callandobservationroles instead of mapping togpt/human
- Verify that all existing OpenHands
sample_sft.jsonfiles still validate after this change (runpytest tests/test_std_to_sft_conversion.py -v) - If this breaks validation for any existing dataset, those samples must be regenerated as part of this PR
Note: This is actually a fix that aligns OpenHands with repository validation requirements, but it needs to be documented as a breaking change for downstream consumers of the SFT format.
There was a problem hiding this comment.
🟡 Acceptable with Cross-Dataset Impact to Verify
Solid dataset implementation with comprehensive documentation, proper schema mapping, and all validation tests passing. The dataset-specific code is clean and well-structured. However, the cross-dataset SFT converter change (removing function_call→gpt and observation→human remapping) needs explicit maintainer verification that no other datasets require regeneration, though the passing test suite suggests it's safe.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
The removal of role remapping in agents/openhands/std_to_sft.py affects all datasets using OpenHands SFT conversion. While this change is correct per AGENTS.md (messages with function call patterns MUST use "from": "function_call"), and all cross-dataset regression tests pass (including test_std_to_sft_conversion.py, test_datasets_from_parameter.py, etc.), maintainers should explicitly confirm that existing datasets either:
- Already produce the correct roles (making the old remapping redundant/wrong), or
- Don't require regeneration because they don't use those roles
Recommendation: Maintainers should review the cross-dataset converter change before merge, though test evidence suggests backward compatibility.
VERDICT:
✅ Worth merging after maintainer review - Implementation is solid, follows guidelines, and tests pass
KEY INSIGHT:
The ToolMind converter preserves row-level tool definitions via details["available_apis"] for SFT conversion, elegantly solving the per-row tool schema problem without polluting the global API.
This review was generated by an AI agent (OpenHands) on behalf of the reviewer.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing. See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.
| for m in conversations: | ||
| if m["from"] == "function_call": | ||
| m["from"] = "gpt" | ||
| if m["from"] == "observation": |
There was a problem hiding this comment.
🟡 Cross-Dataset Change - Maintainer Verification Needed
This removal of role remapping affects all datasets using OpenHands SFT conversion. The unresolved thread from the previous review correctly identified this as a critical cross-dataset change.
Why this change is correct:
- Per AGENTS.md: "Messages containing function call patterns MUST use
\"from\": \"function_call\"" - The old remapping was converting correct roles to incorrect ones
- ToolMind samples now correctly preserve
function_callandobservationroles
Test evidence:
- ✅
test_std_to_sft_conversion.pypassed (validates SFT conversion for all datasets) - ✅
test_datasets_from_parameter.pypassed (cross-dataset validation) - ✅ All other dataset tests passed
Maintainer action needed:
Explicitly verify that removing this remapping doesn't break existing datasets. The passing tests suggest existing datasets either:
- Don't use these roles (unaffected), or
- Were already producing incorrect output that this fixes
If any datasets were incorrectly relying on this remapping, they would need regeneration.
|
|
||
| api_module.__dict__.update(safe_globals) | ||
| defined_names = set(safe_globals) - initial_names | ||
| api_module.__dict__.update({name: safe_globals[name] for name in defined_names}) |
There was a problem hiding this comment.
✅ Good Fix - Prevents Module Pollution
This change correctly prevents typing module functions (like NamedTuple, Union, etc.) from leaking into the generated API tool descriptions. By tracking initial_names before exec and only adding newly defined functions, the converter now properly isolates dataset-specific functions from imported globals.
This addresses the design decision: "available_apis execution inherited functions from typing, causing unrelated APIs such as NamedTuple to appear in prompts."
Resolve the agents/openhands/api.py conflict against main by taking main's version (it removed get_api_tool_description_from_available_tools in #212). Then migrate ToolMind to the new schema: * build_available_apis now returns the list of advertised tool identifiers rather than synthesising a docstring blob. * process_data sets Trajectory.available_apis directly and stops writing details['available_apis']. * api.py is backfilled with stubs for the 5 advertised tools that were not yet present, so available_apis ⊆ api.py functions. * sample_std.json is regenerated (schema_version 1.1.0) and sample_sft.json is rebuilt with the new pipeline. * README schema-mapping note updated. Co-authored-by: openhands <openhands@all-hands.dev>
Replace placeholder docstrings with the canonical short imperative docstring to satisfy the ruff D-rule lint (D401, D404, D415) introduced in #212. Also runs pre-commit (ruff-format, EOF fixer). Co-authored-by: openhands <openhands@all-hands.dev>
The verbatim source descriptions fail ruff D400/D401/D415 because they don't end with a period and aren't strictly imperative. Replace with the canonical short docstring 'Stub for the advertised ToolMind tool.' Co-authored-by: openhands <openhands@all-hands.dev>
Closes #169
This PR was created by an AI agent (OpenHands) on behalf of the user.
Summary
datasets/toolmindconverter forNanbeige/ToolMind.details["available_apis"]for SFT conversion.function_callroles and to only expose functions defined in row-levelavailable_apissnippets.Dataset details
testconfig filegraph_syn_datasets/graphsyn.jsonl.README.md,extract_raw.py,raw_to_standardized.py,schema_raw.py,api.py,sample_raw.json,sample_std.json,sample_sft.json.Schema mapping
usermessages ->TextObservation(source="user").systemmessages -> user-visibleTextObservationprefixed withSystem instructions:because ADP has nosystemobservation source.MessageAction.tool_calls->ApiAction, with assistant text retained asdescriptionto preserve reasoning traces.toolmessages ->TextObservation(source="environment").toolsdefinitions ->details["available_apis"]Python signatures for row-specific SFT tool descriptions.Design decisions
details["available_apis"]while adding minimal sample-compatible stubs inapi.py. Example:get_current_time(timezone)is available in the first sample row and appears inavailable_apis. Alternatives rejected: A large hand-written global API would be incomplete and brittle for 20k+ tools.parameters: nulleven when calls include arguments. Chosen approach: Infer missing parameter names from observed tool-call arguments. Example:get_date_and_time_by_cityis called withcity, so the generated signature includescity. Alternatives rejected: Dropping the argument would lose call semantics; rejecting the row would discard a valid sample.user,agent, andenvironmentsources. Chosen approach: Preserve system prompts as prefixed user observations. Example:System instructions:\n.... Alternatives rejected: Dropping system prompts loses policy/context, while inventingsource="system"fails schema validation."America/New_York". Alternatives rejected: Raw strings render as invalidtimezone=America/New_Yorkcalls in SFT.function_callmessages togptat the end. Chosen approach: Preservefunction_callandobservationroles so generated samples follow repository validation expectations. Alternatives rejected: Hand-editing only the ToolMind SFT sample would not be reproducible from the pipeline.available_apisexecution inherited functions fromtyping, causing unrelated APIs such asNamedTupleto appear in prompts. Chosen approach: Only inspect functions defined by the executed available-tools snippet. Alternatives rejected: Filtering by name after generation would be fragile and incomplete.Validation
python -m ruff check datasets/toolmind agents/openhands/api.py agents/openhands/std_to_sft.pypython -m pytest tests/test_dataset_structure.py -qpython -m pytest tests/test_raw_schemas.py -q -k toolmindpython -m pytest tests/test_standardized_schemas.py -q -k toolmindpython -m pytest tests/test_std_to_sft_conversion.py -q -k toolmindpython -m pytest tests/test_datasets_from_parameter.py -qpython -m pytest tests/test_std_to_sft_from_parameter_simple.py -qEvidence
Latest CI / validation results
Validation passed on head SHA
69c0d0778c24a64db7112b6ce5d93b69be76dabb:pr-review: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895846267/job/76108646390pr-review: SKIPPED — https://github.com/neulab/agent-data-protocol/actions/runs/25895846388/job/76108647083test (3.11): SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25841406918/job/75927296187check_docstrings: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25841406923/job/75927296069pre-commit: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25841406946/job/75927296063Cross-dataset converter regression evidence
The successful
test (3.11)workflow runspytest tests/test_*.pyfor the repository. It covers dataset structure/schema checks plus the shared OpenHands SFT converter paths, including:This provides regression coverage for the shared
agents/openhands/std_to_sft.pyfix that preserves ADP-compliantfrom: function_callvalues rather than rewriting them togpt.Pipeline / runtime status
The committed sample artifacts are validated by the green CI suite above on the current PR head. For dataset PRs, the raw, standardized, and OpenHands SFT samples are covered by the dataset structure, raw schema, standardized schema, and SFT conversion tests in that run.
Conversation link
https://app.all-hands.dev/conversations/248118d2-5d98-47e8-ba10-df4233affe87
Evidence update added by an AI agent (OpenHands) on behalf of the user.