Skip to content

Add ToolMind dataset converter (#169)#201

Open
neubig wants to merge 5 commits into
mainfrom
openhands/issue-169-toolmind
Open

Add ToolMind dataset converter (#169)#201
neubig wants to merge 5 commits into
mainfrom
openhands/issue-169-toolmind

Conversation

@neubig
Copy link
Copy Markdown
Contributor

@neubig neubig commented May 14, 2026

Closes #169

This PR was created by an AI agent (OpenHands) on behalf of the user.

Summary

  • Adds a datasets/toolmind converter for Nanbeige/ToolMind.
  • Streams sample raw rows from Hugging Face, converts OpenAI-style tool-use conversations into ADP standardized trajectories, and generates OpenHands SFT samples.
  • Preserves row-level tool definitions via details["available_apis"] for SFT conversion.
  • Updates the OpenHands SFT path to preserve function_call roles and to only expose functions defined in row-level available_apis snippets.

Dataset details

  • Source: https://huggingface.co/datasets/Nanbeige/ToolMind
  • License: Apache-2.0
  • Size/split: ToolMind card describes 160k synthetic instances plus 200k augmented open-source instances; sample uses the test config file graph_syn_datasets/graphsyn.jsonl.
  • Files added: README.md, extract_raw.py, raw_to_standardized.py, schema_raw.py, api.py, sample_raw.json, sample_std.json, sample_sft.json.

Schema mapping

  • user messages -> TextObservation(source="user").
  • system messages -> user-visible TextObservation prefixed with System instructions: because ADP has no system observation source.
  • Assistant messages without tool calls -> MessageAction.
  • Assistant tool_calls -> ApiAction, with assistant text retained as description to preserve reasoning traces.
  • tool messages -> TextObservation(source="environment").
  • Raw tools definitions -> details["available_apis"] Python signatures for row-specific SFT tool descriptions.

Design decisions

  • Ambiguity: ToolMind rows have per-row tool schemas rather than one stable dataset-wide API. Chosen approach: Store generated row-specific signatures in details["available_apis"] while adding minimal sample-compatible stubs in api.py. Example: get_current_time(timezone) is available in the first sample row and appears in available_apis. Alternatives rejected: A large hand-written global API would be incomplete and brittle for 20k+ tools.
  • Ambiguity: Some tool schemas have parameters: null even when calls include arguments. Chosen approach: Infer missing parameter names from observed tool-call arguments. Example: get_date_and_time_by_city is called with city, so the generated signature includes city. Alternatives rejected: Dropping the argument would lose call semantics; rejecting the row would discard a valid sample.
  • Ambiguity: ADP standardized observations only allow user, agent, and environment sources. Chosen approach: Preserve system prompts as prefixed user observations. Example: System instructions:\n.... Alternatives rejected: Dropping system prompts loses policy/context, while inventing source="system" fails schema validation.
  • Ambiguity: String API kwargs need to render correctly in the OpenHands function-call text. Chosen approach: Store string kwargs as Python literals, e.g. "America/New_York". Alternatives rejected: Raw strings render as invalid timezone=America/New_York calls in SFT.
  • Ambiguity: The OpenHands converter was remapping function_call messages to gpt at the end. Chosen approach: Preserve function_call and observation roles so generated samples follow repository validation expectations. Alternatives rejected: Hand-editing only the ToolMind SFT sample would not be reproducible from the pipeline.
  • Ambiguity: available_apis execution inherited functions from typing, causing unrelated APIs such as NamedTuple to appear in prompts. Chosen approach: Only inspect functions defined by the executed available-tools snippet. Alternatives rejected: Filtering by name after generation would be fragile and incomplete.

Validation

  • python -m ruff check datasets/toolmind agents/openhands/api.py agents/openhands/std_to_sft.py
  • python -m pytest tests/test_dataset_structure.py -q
  • python -m pytest tests/test_raw_schemas.py -q -k toolmind
  • python -m pytest tests/test_standardized_schemas.py -q -k toolmind
  • python -m pytest tests/test_std_to_sft_conversion.py -q -k toolmind
  • python -m pytest tests/test_datasets_from_parameter.py -q
  • python -m pytest tests/test_std_to_sft_from_parameter_simple.py -q

Evidence

Latest CI / validation results

Validation passed on head SHA 69c0d0778c24a64db7112b6ce5d93b69be76dabb:

Cross-dataset converter regression evidence

The successful test (3.11) workflow runs pytest tests/test_*.py for the repository. It covers dataset structure/schema checks plus the shared OpenHands SFT converter paths, including:

tests/test_datasets_from_parameter.py
tests/test_sft_quality_control.py
tests/test_std_to_sft_action_function.py
tests/test_std_to_sft_conversion.py
tests/test_std_to_sft_from_parameter_simple.py
tests/test_std_to_sft_structure.py

This provides regression coverage for the shared agents/openhands/std_to_sft.py fix that preserves ADP-compliant from: function_call values rather than rewriting them to gpt.

Pipeline / runtime status

The committed sample artifacts are validated by the green CI suite above on the current PR head. For dataset PRs, the raw, standardized, and OpenHands SFT samples are covered by the dataset structure, raw schema, standardized schema, and SFT conversion tests in that run.

Conversation link

https://app.all-hands.dev/conversations/248118d2-5d98-47e8-ba10-df4233affe87

Evidence update added by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable with Critical Updates Needed

This is a thorough and well-documented dataset implementation. The code quality is excellent, all tests pass, and the design decisions are comprehensively documented. However, two critical issues must be addressed:

[CRITICAL ISSUES]

  1. Missing Evidence in PR Description (Must fix before merge):

    • The Validation section lists commands but doesn't show actual test output
    • No conversation link provided despite being agent-generated work
    • Required: Add an Evidence section showing:
      • Actual output from running the validation tests (not just the commands)
      • The end-to-end pipeline commands and their output (raw → std → sft)
      • Link to the OpenHands conversation that created this PR: https://app.all-hands.dev/conversations/{conversation_id}
  2. Breaking Change Insufficiently Documented (See inline comment):

    • The std_to_sft.py change affects ALL OpenHands datasets, not just ToolMind
    • This cross-cutting impact needs prominent callout in PR description

[WHAT'S GOOD]

  • ✅ All required files present with correct structure
  • ✅ IDs match perfectly across sample stages
  • ✅ Observation sources use only schema-valid values
  • ✅ ApiActions match api.py functions
  • ✅ SFT format correctly preserves function_call roles
  • ✅ Pipeline is deterministic and reproducible
  • ✅ Design decisions are exemplary
  • ✅ The available_apis fix (api.py) cleanly prevents typing module pollution

[RISK ASSESSMENT]
🟡 MEDIUM - Breaking change to shared converter requires validation that all existing datasets still work correctly.

VERDICT:
Needs rework: Add evidence section to PR description, then address breaking change documentation.

KEY INSIGHT:
The technical implementation is solid, but the PR documentation doesn't meet the repository's evidence standards for substantive changes. Once evidence is added and the breaking change is properly documented, this will be ready to merge.


This review was generated by an AI agent (OpenHands) on behalf of the reviewer.

m["from"] = "gpt"
if m["from"] == "observation":
m["from"] = "human"
return {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical - Cross-Dataset Breaking Change

Removing the role remapping (which previously converted function_callgpt and observationhuman) affects ALL datasets using OpenHands SFT conversion, not just ToolMind.

While this change is correct per the repository guidelines (preserving function_call role is required), it's a breaking change that modifies the output format for every existing OpenHands-converted dataset.

Required before merge:

  1. Add a Breaking Changes section in the PR description stating:
    • This changes the SFT output format for all OpenHands-converted datasets
    • All sample_sft.json files with OpenHands conversion will preserve function_call and observation roles instead of mapping to gpt/human
  2. Verify that all existing OpenHands sample_sft.json files still validate after this change (run pytest tests/test_std_to_sft_conversion.py -v)
  3. If this breaks validation for any existing dataset, those samples must be regenerated as part of this PR

Note: This is actually a fix that aligns OpenHands with repository validation requirements, but it needs to be documented as a breaking change for downstream consumers of the SFT format.

@neubig neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable with Cross-Dataset Impact to Verify

Solid dataset implementation with comprehensive documentation, proper schema mapping, and all validation tests passing. The dataset-specific code is clean and well-structured. However, the cross-dataset SFT converter change (removing function_callgpt and observationhuman remapping) needs explicit maintainer verification that no other datasets require regeneration, though the passing test suite suggests it's safe.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

The removal of role remapping in agents/openhands/std_to_sft.py affects all datasets using OpenHands SFT conversion. While this change is correct per AGENTS.md (messages with function call patterns MUST use "from": "function_call"), and all cross-dataset regression tests pass (including test_std_to_sft_conversion.py, test_datasets_from_parameter.py, etc.), maintainers should explicitly confirm that existing datasets either:

  1. Already produce the correct roles (making the old remapping redundant/wrong), or
  2. Don't require regeneration because they don't use those roles

Recommendation: Maintainers should review the cross-dataset converter change before merge, though test evidence suggests backward compatibility.

VERDICT:
Worth merging after maintainer review - Implementation is solid, follows guidelines, and tests pass

KEY INSIGHT:
The ToolMind converter preserves row-level tool definitions via details["available_apis"] for SFT conversion, elegantly solving the per-row tool schema problem without polluting the global API.


This review was generated by an AI agent (OpenHands) on behalf of the reviewer.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing. See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

Comment thread agents/openhands/std_to_sft.py Outdated
for m in conversations:
if m["from"] == "function_call":
m["from"] = "gpt"
if m["from"] == "observation":
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Cross-Dataset Change - Maintainer Verification Needed

This removal of role remapping affects all datasets using OpenHands SFT conversion. The unresolved thread from the previous review correctly identified this as a critical cross-dataset change.

Why this change is correct:

  • Per AGENTS.md: "Messages containing function call patterns MUST use \"from\": \"function_call\""
  • The old remapping was converting correct roles to incorrect ones
  • ToolMind samples now correctly preserve function_call and observation roles

Test evidence:

  • test_std_to_sft_conversion.py passed (validates SFT conversion for all datasets)
  • test_datasets_from_parameter.py passed (cross-dataset validation)
  • ✅ All other dataset tests passed

Maintainer action needed:
Explicitly verify that removing this remapping doesn't break existing datasets. The passing tests suggest existing datasets either:

  1. Don't use these roles (unaffected), or
  2. Were already producing incorrect output that this fixes

If any datasets were incorrectly relying on this remapping, they would need regeneration.

Comment thread agents/openhands/api.py Outdated

api_module.__dict__.update(safe_globals)
defined_names = set(safe_globals) - initial_names
api_module.__dict__.update({name: safe_globals[name] for name in defined_names})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Fix - Prevents Module Pollution

This change correctly prevents typing module functions (like NamedTuple, Union, etc.) from leaking into the generated API tool descriptions. By tracking initial_names before exec and only adding newly defined functions, the converter now properly isolates dataset-specific functions from imported globals.

This addresses the design decision: "available_apis execution inherited functions from typing, causing unrelated APIs such as NamedTuple to appear in prompts."

Resolve the agents/openhands/api.py conflict against main by taking
main's version (it removed get_api_tool_description_from_available_tools
in #212). Then migrate ToolMind to the new schema:

* build_available_apis now returns the list of advertised tool
  identifiers rather than synthesising a docstring blob.
* process_data sets Trajectory.available_apis directly and stops
  writing details['available_apis'].
* api.py is backfilled with stubs for the 5 advertised tools that were
  not yet present, so available_apis ⊆ api.py functions.
* sample_std.json is regenerated (schema_version 1.1.0) and
  sample_sft.json is rebuilt with the new pipeline.
* README schema-mapping note updated.

Co-authored-by: openhands <openhands@all-hands.dev>
Replace placeholder docstrings with the canonical short imperative
docstring to satisfy the ruff D-rule lint (D401, D404, D415)
introduced in #212. Also runs pre-commit (ruff-format, EOF fixer).

Co-authored-by: openhands <openhands@all-hands.dev>
The verbatim source descriptions fail ruff D400/D401/D415 because they
don't end with a period and aren't strictly imperative. Replace with the
canonical short docstring 'Stub for the advertised ToolMind tool.'

Co-authored-by: openhands <openhands@all-hands.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-this Trigger the OpenHands PR review workflow

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add dataset: Nanbeige/ToolMind

2 participants