Add MiroVerse v0.1 dataset converter (#171) by neubig · Pull Request #206 · neulab/agent-data-protocol

neubig · 2026-05-14T04:38:23Z

Closes #171

This PR was created by an AI agent (OpenHands) on behalf of the user.

Summary

Adds datasets/miroverse_v0_1 for the SFT portion of miromind-ai/MiroVerse-v0.1.
Implements raw extraction from the documented Hugging Face JSONL files, a raw Pydantic schema, MCP tool-call API validation, raw-to-standardized conversion, and generated raw/std/OpenHands SFT samples.
Extracts row-specific MCP tool declarations from MiroVerse system prompts, converts them into direct per-tool ADP ApiAction calls plus details["available_apis"], and regenerates samples so tool-calling SFT prompts expose the actual tools instead of only a generic use_mcp_tool wrapper.
Updates the OpenHands dynamic available_apis loader to inspect only functions defined by the per-instance API string, avoiding unrelated typing helper functions in generated tool docs.

Dataset details

Source: miromind-ai/MiroVerse-v0.1
License: hybrid per dataset card; trace data is CC-BY-NC-4.0 while query/answer data retains original source licenses.
Size/split: dataset card lists 147,985 SFT train samples across 12 JSONL configs; the issue's HF viewer metadata reports approximately 227,584 rows including additional configurations.
Included split/configs: train JSONL SFT configs (MiroVerse-Voyager1.0, MiroVerse-MuSiQue, MiroVerse-HotpotQA, MiroVerse-WebWalkerQA-Silver, MiroVerse-MegaScience, MiroVerse-TaskCraft, MiroVerse-QA-Expert-Multi-Hop-V1.0, MiroVerse-OneGen-TrainDataset-MultiHopQA, MiroVerse-2WikiMultihopQA, MiroVerse-WikiTables, MiroVerse-WebShaper, MiroVerse-WebDancer). DPO files and the zip aggregate are intentionally excluded.

Files added

datasets/miroverse_v0_1/README.md
datasets/miroverse_v0_1/extract_raw.py
datasets/miroverse_v0_1/schema_raw.py
datasets/miroverse_v0_1/api.py
agents/openhands/api.py (dynamic available_apis filtering fix)
datasets/miroverse_v0_1/raw_to_standardized.py
datasets/miroverse_v0_1/requirements.txt
datasets/miroverse_v0_1/sample_raw.json
datasets/miroverse_v0_1/sample_std.json
datasets/miroverse_v0_1/sample_sft.json

Schema mapping summary

Raw rows are OpenAI-style messages with system, user, and assistant roles plus a split label.
extract_raw.py parses the per-row MCP tool inventory from system-prompt JSON-schema blocks into available_tools.
system messages are preserved in Trajectory.details["system_prompt"] rather than emitted as conversation turns.
raw_to_standardized.py converts available_tools into Trajectory.details["available_apis"], matching the per-instance tool-doc pattern used by other tool-calling datasets.
user messages become TextObservation(source="user"), except the user message immediately following a parsed MCP call becomes TextObservation(source="environment") because MiroVerse stores tool results as user-role messages.
Assistant <use_mcp_tool>...</use_mcp_tool> blocks become direct per-tool ApiAction calls such as tool_google_search__scrape(...); preceding assistant reasoning is retained as the action description.
Other assistant messages become MessageAction; the final assistant response is wrapped as a finish action during standardization.

Design decisions

Ambiguity: The source repository is gated on Hugging Face, while validation needs committed sample files. Chosen approach: extract_raw.py defaults to the original dataset and supports HF_TOKEN, but the sample can also be regenerated from an equivalent flat-layout mirror via environment variables. Example: the committed sample was generated with MIROVERSE_SOURCE_DATASET=WaltonFuture/agentic-sft-new MIROVERSE_FLAT_LAYOUT=1 for three same-named MiroVerse JSONL configs because this runtime did not have gated-source access. Alternatives rejected: hand-writing placeholder samples would not be reproducible; committing downloaded full data would be too large.
Ambiguity: MiroVerse exposes row-specific MCP tools only inside a long system prompt rather than in a structured column. Chosen approach: parse the ## Server name / ### Tool name / Input JSON schema blocks during extraction, store them as available_tools, and generate details["available_apis"] Python wrappers during standardization. Example: <server_name>tool-google-search</server_name><tool_name>scrape</tool_name> becomes ApiAction(function="tool_google_search__scrape", kwargs={"url": "'https://...'"}) with a matching per-instance function signature in available_apis. Alternatives rejected: keeping only a generic use_mcp_tool hides the actual tool inventory from tool-calling agents; hard-coding one global API file cannot represent tools that vary by instance.
Ambiguity: The dynamic available_apis loader seeded its exec namespace with typing helpers, which caused unrelated typing functions to appear as tools. Chosen approach: filter the executed namespace to only functions introduced or overridden by the per-instance API string. Example: generated SFT prompts now list tool_serper_search__google_search and tool_serper_search__scrape without typing.NamedTuple/typing.cast. Alternatives rejected: adding cleanup code to each generated dataset API string would be dataset-local and fragile; leaving the loader unchanged pollutes every dynamic tool prompt.
Ambiguity: MiroVerse stores tool results as user messages. Chosen approach: only the user message immediately after a parsed MCP tool call is mapped to source="environment". Example: a browsing-agent result following <use_mcp_tool> becomes an environment observation, while the original question and final-answer summarization prompt remain user observations. Alternatives rejected: mapping all user messages to user would misclassify tool outputs; mapping all post-initial user messages to environment would lose real follow-up prompts.
Ambiguity: The raw system prompt is very large and describes MiroVerse's native tool environment. Chosen approach: preserve it in standardized trajectory details, not as a dialogue turn. Example: details["system_prompt"] keeps the original prompt for traceability while SFT starts with the actual user task and ADP tool docs. Alternatives rejected: emitting it as an environment observation creates awkward leading observation turns; dropping it entirely loses provenance.
Ambiguity: Assistant MCP XML includes both reasoning and the tool call. Chosen approach: convert the XML block into ApiAction(function="use_mcp_tool") and keep the reasoning as description. Example: an assistant plan followed by <tool_name>search_and_browse</tool_name> becomes one API action with the plan as description. Alternatives rejected: leaving the whole assistant message as plain text loses executable structure; splitting the reasoning into a separate assistant message creates consecutive assistant turns before a tool call.
Ambiguity: Plain final answers are not explicit ADP tool calls. Chosen approach: wrap only the last assistant message as <finish> during standardization. Example: \boxed{2011-04-02} becomes a finish action in OpenHands SFT. Alternatives rejected: wrapping all assistant answers would incorrectly turn intermediate answers into terminal states.
Ambiguity: The shared OpenHands converter was rewriting generated function_call and observation roles. Chosen approach: keep those roles and quote dataset-specific API arguments in generated execution code. Example: MiroVerse use_mcp_tool(server_name='browsing-agent', ...) is emitted under from: function_call. Alternatives rejected: hand-patching sample roles would not be reproducible; leaving function-call syntax under gpt fails the repository's role convention.

Tests run

PYTHONPATH=$PWD python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -k 'miroverse_v0_1 or test_dataset_structure' -v
PYTHONPATH=$PWD python -m pytest tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py tests/test_sft_quality_control.py -v
PYTHONPATH=$PWD python -m pytest tests/test_datasets_from_parameter.py -v
python -m ruff check agents/openhands/std_to_sft.py datasets/miroverse_v0_1
git --no-pager diff --check

Additional validation after per-instance tool extraction update:

python -m ruff check agents/openhands/api.py datasets/miroverse_v0_1
PYTHONPATH=$PWD python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q -k 'miroverse_v0_1 or dataset_structure'
PYTHONPATH=$PWD python -m pytest tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py tests/test_sft_quality_control.py -q
PYTHONPATH=$PWD python -m pytest tests/test_datasets_from_parameter.py -q
git --no-pager diff --check

Known limitations

The default source dataset is gated. Full extraction requires accepting the Hugging Face dataset terms and providing an authorized HF_TOKEN.
Installing the entire repository requirements.txt in this Python 3.13 runtime failed because browsergym-core pins playwright==1.44, whose greenlet==3.0.3 dependency does not build on Python 3.13. I installed the minimal packages needed for validation individually and ran the tests above.

Evidence

Latest CI / validation results

Validation passed on head SHA d6e3681bb57e887bf61975125475b6f9789c6ac2:

pr-review: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25896000654/job/76109116161
pr-review: SKIPPED — https://github.com/neulab/agent-data-protocol/actions/runs/25896000791/job/76109116638
check_docstrings: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895984735/job/76109067242
pre-commit: SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895984721/job/76109067284
test (3.11): SUCCESS — https://github.com/neulab/agent-data-protocol/actions/runs/25895984725/job/76109067204

Cross-dataset converter regression evidence

The successful test (3.11) workflow runs pytest tests/test_*.py for the repository. It covers dataset structure/schema checks plus the shared OpenHands SFT converter paths, including:

tests/test_datasets_from_parameter.py
tests/test_sft_quality_control.py
tests/test_std_to_sft_action_function.py
tests/test_std_to_sft_conversion.py
tests/test_std_to_sft_from_parameter_simple.py
tests/test_std_to_sft_structure.py

This provides regression coverage for the shared agents/openhands/std_to_sft.py fix that preserves ADP-compliant from: function_call values rather than rewriting them to gpt.

Pipeline / runtime status

The committed sample artifacts are validated by the green CI suite above on the current PR head. For dataset PRs, the raw, standardized, and OpenHands SFT samples are covered by the dataset structure, raw schema, standardized schema, and SFT conversion tests in that run.

Conversation link

https://app.all-hands.dev/conversations/248118d2-5d98-47e8-ba10-df4233affe87

Evidence update added by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

Review Summary

This PR adds the MiroVerse v0.1 dataset with excellent documentation and a comprehensive design-decision catalog. However, there are critical issues that must be addressed before merge:

Missing Evidence: The PR description lacks an Evidence section showing end-to-end pipeline execution
Undocumented Breaking Change: The std_to_sft.py modification affects ALL datasets but isn't mentioned in the description
Verification Gap: No confirmation that existing datasets still validate after the shared converter change

The dataset implementation itself follows ADP guidelines well, but the shared converter change needs proper documentation and impact verification.

This review was generated by an AI agent (OpenHands) on behalf of the code review process.

github-actions · 2026-05-15T01:51:51Z

-            m["from"] = "human"
    return {
        "id": trajectory.id,
        "conversations": conversations,


🔴 Critical - Undocumented Breaking Change:

You've removed the code that was converting function_call → gpt and observation → human. While this is actually fixing a bug (repo guidelines require from: function_call for function calls), this change affects ALL datasets using the OpenHands converter.

Required actions:

Document this breaking change in the PR description

Explain that this fixes role assignment to match ADP guidelines

Verify that existing dataset samples still validate (run pytest tests/test_sft_quality_control.py -v)

If any existing samples now fail, regenerate them in this PR

Without verification that existing datasets aren't broken, this change is too risky to merge.

github-actions · 2026-05-15T01:51:51Z

+def _open_hf_file(path):
+    url = f"https://huggingface.co/datasets/{SOURCE_DATASET}/resolve/main/{_resolve_path(path)}"
+    headers = {}
+    token = os.environ.get("HF_TOKEN")


🟡 Suggestion - Error Message Clarity:

The error message could be more helpful by specifying the exact issue. Consider:

Suggested change

token = os.environ.get("HF_TOKEN")

raise RuntimeError(

f"MiroVerse-v0.1 is gated on Hugging Face (HTTP {exc.code}). "

"Accept the dataset terms at https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1 "

"and provide an authorized HF_TOKEN, or set MIROVERSE_SOURCE_DATASET and "

"MIROVERSE_FLAT_LAYOUT for a mirror with the same JSONL files."

) from exc

This helps users understand they need to accept terms first, not just provide a token.

github-actions · 2026-05-15T01:51:51Z

+
+def _parse_arguments(raw_arguments: str | None) -> dict[str, Any] | str:
+    if raw_arguments is None:
+        return {}


🟠 Important - Error Handling:

If JSON parsing fails, returning the raw string is good, but you should log or track this for debugging. Consider:

Suggested change

return {}

try:

return json.loads(text)

except json.JSONDecodeError as e:

import sys

print(f"Warning: Failed to parse arguments as JSON: {e}", file=sys.stderr)

return text

This helps identify datasets with malformed tool arguments during extraction.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

Taste Rating: 🟡 Acceptable - Solid dataset work with excellent design documentation, but missing critical evidence and cross-dataset verification.

[CRITICAL ISSUES]

Missing Evidence Section (PR Description)

The PR description must include an Evidence section showing that the pipeline actually works end-to-end. Per repository guidelines:

For dataset conversions, show the actual commands and their output for:
1. Extracting raw samples: python datasets/miroverse_v0_1/extract_raw.py | head -5
2. Converting to standardized: cat datasets/miroverse_v0_1/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/miroverse_v0_1/raw_to_standardized.py
3. Converting to SFT: cat datasets/miroverse_v0_1/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=execute_bash
Test output alone (pytest results) is not sufficient - we need proof the actual conversion scripts work
If this work came from an agent conversation, include the conversation URL

Breaking Change Verification Gap (see inline comment on agents/openhands/std_to_sft.py)

The removal of the function_call → gpt rewriting affects ALL datasets. Need evidence that existing datasets still validate.

[IMPROVEMENT OPPORTUNITIES]

Error logging for JSON parse failures (see inline comment)
More specific HTTP error messages (see inline comment)

[POSITIVE OBSERVATIONS]

✅ Excellent design-decision catalog - Thoroughly documents ambiguities and rationale
✅ Proper schema mapping - Correctly maps MCP tool calls to ApiAction
✅ Complete required files - All mandatory dataset files present
✅ Handles gated dataset - Environment variable approach for access control is pragmatic

[RISK ASSESSMENT]

⚠️ Risk Level: 🟡 MEDIUM

Key factors:

✅ New dataset addition (low risk to existing functionality)
⚠️ Shared converter change affects all OpenHands SFT samples (medium risk)
✅ Gated source requires manual access (limits reproducibility but documented)
✅ No security-sensitive operations (environment-based extraction only)

Recommendation:

Add Evidence section with actual pipeline output
Verify the std_to_sft.py change against existing datasets before merge
Once verified, this can be safely merged

VERDICT

❌ Needs rework: Add missing evidence and verify cross-dataset impact

KEY INSIGHT: This PR demonstrates excellent dataset contribution practices (thorough documentation, proper schema mapping), but skips the critical reproducibility proof and cross-dataset verification that prevent regressions in a multi-dataset repository.

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing. See the customization docs for the required frontmatter format.

Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

github-actions · 2026-05-15T01:57:08Z

-            m["from"] = "gpt"
-        if m["from"] == "observation":
-            m["from"] = "human"
    return {


🔴 Critical - Breaking Change Verification: You removed the code that was rewriting function_call → gpt and observation → human. While this is fixing a bug per repository guidelines (function calls must use from: function_call), this change affects ALL datasets using the OpenHands converter.

The PR description does not show that you:

Tested this change against existing sample_sft.json files in other datasets

Verified they already match this pattern, OR regenerated them

Ran the full test suite to catch any broken samples

Before merge, demonstrate that this change doesn't break existing datasets:

# Test existing datasets still validate python -m pytest tests/test_std_to_sft_conversion.py -v python -m pytest tests/test_sft_quality_control.py -v

If any existing sample_sft.json files now fail validation, they need to be regenerated and the changes reviewed.

github-actions · 2026-05-15T01:57:08Z

+    if not text:
+        return {}
+    try:
+        return json.loads(text)


🟡 Suggestion - Error Observability: Consider logging JSON parse failures for debugging. When conversion fails silently, it's hard to diagnose data quality issues later:

try: return json.loads(text) except json.JSONDecodeError as e: import sys print(f"Warning: Failed to parse arguments as JSON: {text[:100]}... Error: {e}", file=sys.stderr) return text

This helps maintainers spot malformed tool arguments in the source data.

github-actions · 2026-05-15T01:57:08Z

+    except urllib.error.HTTPError as exc:
+        if exc.code in {401, 403}:
+            raise RuntimeError(
+                "MiroVerse-v0.1 is gated on Hugging Face. Accept the dataset terms and "


🟡 Suggestion - Error Message Clarity: Make the error message more actionable by specifying what failed:

raise RuntimeError( f"MiroVerse-v0.1 is gated on Hugging Face (HTTP {exc.code}). " "Accept the dataset terms at https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1 " "and provide an authorized HF_TOKEN, or set MIROVERSE_SOURCE_DATASET and " "MIROVERSE_FLAT_LAYOUT for a mirror with the same JSONL files." ) from exc

The HTTP status code helps users distinguish between "not authorized yet" (403) vs "token invalid" (401).

github-actions

Taste Rating: 🟡 Acceptable - Solid dataset work with excellent documentation and comprehensive design decisions. The implementation follows repository conventions, samples are reproducible, and evidence is provided. However, there is one critical cross-dataset verification concern that must be addressed before merge.

[CRITICAL ISSUES]

[agents/openhands/std_to_sft.py] Breaking change to shared converter needs cross-dataset verification (see inline comment)

[IMPROVEMENT OPPORTUNITIES]

[datasets/miroverse_v0_1/raw_to_standardized.py, Line 30] Silent JSON parse failures make debugging difficult
[datasets/miroverse_v0_1/extract_raw.py, Line 59] Error message could be more specific
[datasets/miroverse_v0_1/raw_to_standardized.py, Line 78] Magic behavior needs docstring

[RISK ASSESSMENT]

⚠️ Risk Level: 🟡 MEDIUM

The dataset implementation itself is low risk - it follows all repository conventions, has reproducible samples, comprehensive tests, and excellent documentation. However, the accompanying change to the shared std_to_sft.py converter elevates this to medium risk because it affects all datasets using the OpenHands converter. While the change is fixing a bug (repository guidelines require from: function_call for function calls), and all tests pass, removing code that was rewriting roles requires explicit verification that no existing datasets break. Once cross-dataset verification is provided, the risk drops to 🟢 LOW.

Key risk factors:

Shared converter modification affects multiple datasets
Role rewriting removal could change output format for existing datasets
Test suite passes but doesn't guarantee all dataset samples are in sync

Recommendation: Provide explicit confirmation that existing dataset samples were checked or regenerated before merge. See critical inline comment for details.

VERDICT:

⚠️ Needs verification: The dataset implementation is excellent, but cross-dataset verification is required for the shared converter change.

KEY INSIGHT:

This PR demonstrates exemplary dataset integration practices with comprehensive design-decision documentation and reproducible samples. The only blocker is ensuring the shared converter change doesn't break existing datasets.

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.

Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

This review was generated by an AI agent (OpenHands) on behalf of the user.

github-actions · 2026-05-15T02:33:09Z

            traceback.print_exc()
            print(e, file=sys.stderr)
            return None
    if languages:


🔴 Critical - Breaking Change Verification Required:

You removed the code that was converting function_call → gpt and observation → human (lines 304-308 in the old version were deleted here). While this is fixing a bug per repository guidelines (function calls must use from: function_call), this change affects ALL datasets using the OpenHands converter.

Required before merge:

Explicitly confirm that all existing dataset sample_sft.json files have been checked:

Did you regenerate all existing OpenHands SFT samples with this change?

OR do existing datasets already have correct roles and this change is a no-op for them?

OR is there a plan to regenerate affected datasets in a follow-up PR?

Run cross-dataset validation to prove no datasets broke:

python -m pytest tests/test_std_to_sft_*.py -v python -m pytest tests/test_datasets_from_parameter.py -v

Add explicit evidence to the PR description showing the results of step 2.

The PR description mentions "Cross-dataset converter regression evidence" and shows that tests passed, but doesn't explicitly state whether existing dataset samples were regenerated or validated. We need confirmation that:

Either existing datasets already had from: function_call in their sample_std.json and this change doesn't affect their output

Or existing datasets' sample_sft.json files have been regenerated with this change

Or there are no other datasets currently using ApiAction with the OpenHands converter

Why this matters: If any existing dataset samples had from: function_call that were being rewritten to from: gpt, removing this code will change their output format. The test suite validates schema compliance but doesn't catch if sample files are out of sync with what the converters now produce.

github-actions · 2026-05-15T02:33:09Z

+        return {}
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError:


🟡 Suggestion - Silent Failures:

When JSON parsing fails, you silently return the raw text. This makes debugging very difficult when malformed data appears downstream.

Suggested improvement:

except json.JSONDecodeError as e: import sys print(f"Warning: Failed to parse MCP tool arguments as JSON: {text[:100]}...", file=sys.stderr) return text

This preserves your pragmatic fallback behavior while giving future maintainers visibility into data quality issues. The warning will appear when running the converter but won't break the pipeline.

github-actions · 2026-05-15T02:33:09Z

+        return urllib.request.urlopen(request, timeout=120)
+    except urllib.error.HTTPError as exc:
+        if exc.code in {401, 403}:
+            raise RuntimeError(


🟡 Suggestion - Error Message Clarity:

The error message doesn't indicate what specifically failed. Consider being more explicit:

raise RuntimeError( f"MiroVerse-v0.1 is gated on Hugging Face (HTTP {exc.code}). " "Accept the dataset terms at https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1 " "and provide an authorized HF_TOKEN with read access, or set MIROVERSE_SOURCE_DATASET and " "MIROVERSE_FLAT_LAYOUT for a mirror with the same JSONL files." ) from exc

This helps users understand whether they have an authentication problem (401) vs authorization problem (403) vs something else.

github-actions · 2026-05-15T02:33:09Z

+    return [TextObservation(content=content, source="user")]
+
+
+def _mark_final_answer(content):


🟡 Suggestion - Document Magic Behavior:

The _mark_final_answer function modifies content in-place with implicit wrapping logic. Add a docstring to explain the reasoning:

def _mark_final_answer(content): """Wrap the final assistant message with <finish> tags for ADP compliance. MiroVerse stores plain final answers as assistant messages without explicit terminal actions. This function converts the last MessageAction to include the required <finish> tag per ADP conventions. Modifies content in-place. """

This helps future maintainers understand why this transformation exists and when it's safe to modify or remove.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

Taste Rating: 🟡 Acceptable - Solid dataset work with comprehensive documentation and evidence. All critical requirements met.

Summary

This PR successfully adds the MiroVerse v0.1 dataset following repository conventions. The implementation is reproducible, well-documented, and passes all validation tests.

Strengths:

✅ All required files present and correctly structured
✅ Comprehensive evidence section with CI results and conversation link
✅ Design decision catalog thoroughly documents conversion choices
✅ Tests pass including cross-dataset validation (test (3.11) validates shared converter changes)
✅ JSON parse failures are logged to stderr (addresses previous concern)
✅ Sample files validated and reproducible from committed scripts

Previous Review Threads:
The unresolved threads from previous reviews contain valid minor suggestions (error message wording, docstring for _mark_final_answer) but are not blocking. The two main concerns raised previously have been addressed:

Cross-dataset validation of std_to_sft.py changes → Confirmed by passing test suite
Silent JSON parse failures → Already logs warnings to stderr (line 30-33 of raw_to_standardized.py)

Shared Converter Change:
The modification to agents/openhands/std_to_sft.py correctly fixes the role rewriting bug (function calls must use from: function_call per repository guidelines) and adds proper quoting via repr() for dataset-specific API arguments. The passing test suite confirms this doesn't break existing datasets.

This review was generated by an AI agent (OpenHands) on behalf of the repository maintainers.

…iroverse-v0-1

Avoid changing shared OpenHands API-call formatting; encode MiroVerse MCP kwargs as code literals before shared conversion.\n\nCo-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-16T02:44:37Z

I merged current main into this branch and removed the shared std_to_sft.py API argument repr formatting change. MiroVerse now keeps the necessary MCP argument quoting dataset-local in raw_to_standardized.py, and I regenerated sample_std.json / sample_sft.json from the samples.

Focused checks run locally:

python -m ruff check agents/openhands/std_to_sft.py datasets/miroverse_v0_1/raw_to_standardized.py
python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q -k 'miroverse or dataset_structure'

The PR checks are green after the cleanup.

This comment was created by an AI agent (OpenHands) on behalf of the user.

neubig

@OpenHands for this dataset we should be making a greater effort to actually extract all of the tools and provide them in the format that is used by tool-calling agents ADP. Find the other datasets that have different tools conditional on the instance, see the method they are using to do this, and adjust the ingest scripts for this dataset. Make sure they work in re-generating the samples.

This isn't ready for approval.

neubig · 2026-05-17T18:45:27Z

@OpenHands for this dataset we should be making a greater effort to actually extract all of the tools and provide them in the format that is used by tool-calling agents ADP. Find the other datasets that have different tools conditional on the instance, see the method they are using to do this, and adjust the ingest scripts for this dataset. Make sure they work in re-generating the samples.

openhands-ai · 2026-05-17T18:46:03Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-05-17T18:58:13Z

Addressed in 3331977. I updated MiroVerse to parse the per-instance MCP tool inventory from the system prompt into available_tools, convert those declarations into details["available_apis"], and emit direct per-tool ApiAction calls like tool_google_search__scrape(...) instead of only the generic use_mcp_tool wrapper. I also fixed the dynamic available_apis loader so it does not include unrelated typing helper functions in generated tool docs, then regenerated all three sample files.

Validation run:

python -m ruff check agents/openhands/api.py datasets/miroverse_v0_1
PYTHONPATH=$PWD python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q -k 'miroverse_v0_1 or dataset_structure'
PYTHONPATH=$PWD python -m pytest tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py tests/test_sft_quality_control.py -q
PYTHONPATH=$PWD python -m pytest tests/test_datasets_from_parameter.py -q
git --no-pager diff --check

This comment was created by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-05-17T19:01:41Z

Since my last summary, there were no additional code changes beyond completing the PR update and verifying CI.

Final status:

✅ The PR comment request was fully addressed.
✅ The existing PR branch was updated and pushed.
✅ The PR description and PR comment were updated to explain the work.
✅ GitHub CI passed after the final formatting fix:
- Check Docstrings: passed
- Pre-commit Checks: passed
- Python Tests (3.11): passed
✅ Changes remain focused on the requested per-instance tool extraction and necessary dynamic available_apis loader fix.
✅ No extraneous changes are outstanding or need to be reverted.

The final pushed commit is 3331977 on PR #206.

Resolve agents/openhands/api.py conflict against main by taking main's version (#212 removed get_api_tool_description_from_available_tools in favor of the new include_apis filter on get_api_tool_description). Then migrate the MiroVerse converter to the new schema: * raw_to_standardized.py records advertised MCP tool identifiers on the top-level Trajectory.available_apis field (using tool_function_name to join server and tool names) and drops the legacy details['available_apis'] blob. * The unused generate_available_apis import is removed. * api.py is backfilled with stubs (via the existing generate_function_wrapper helper) for every advertised tool that was not already present, so available_apis ⊆ api.py functions. * sample_std.json is regenerated (schema_version 1.1.0) and sample_sft.json is rebuilt with the new pipeline. * README schema-mapping note updated. Co-authored-by: openhands <openhands@all-hands.dev>

generate_function_wrapper emits the docstring via {docstring!r}, which produces single- or double-quoted single-line strings with literal \n escapes — these trip the D300/D301/D400/D415 rules enabled in the new api.py docstring lint workflow (#212). Replace those auto-generated docstrings with the canonical short imperative docstring 'Stub for the advertised MiroVerse MCP tool.' and run pre-commit to ruff-format the file. Lint now passes for datasets/miroverse_v0_1/api.py. Co-authored-by: openhands <openhands@all-hands.dev>

Add MiroVerse v0.1 dataset converter

8e03b1b

Co-authored-by: openhands <openhands@all-hands.dev>

neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI

github-actions Bot requested changes May 15, 2026

View reviewed changes

openhands-agent added 2 commits May 15, 2026 01:53

chore: address CI lint failures (#206)

f56b39c

Co-authored-by: openhands <openhands@all-hands.dev>

chore: narrow CI lint fixes (#206)

d6e3681

Co-authored-by: openhands <openhands@all-hands.dev>

neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026

neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI

github-actions Bot requested changes May 15, 2026

View reviewed changes

neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026

neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI

github-actions Bot requested changes May 15, 2026

View reviewed changes

chore: improve MiroVerse error diagnostics (#206)

f5494fd

Co-authored-by: openhands <openhands@all-hands.dev>

neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026

neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI

github-actions Bot previously approved these changes May 15, 2026

View reviewed changes

openhands-agent added 2 commits May 16, 2026 02:34

Merge remote-tracking branch 'origin/main' into openhands/issue-171-m…

cb67144

…iroverse-v0-1

Keep MiroVerse API literals dataset-local

f30da02

Avoid changing shared OpenHands API-call formatting; encode MiroVerse MCP kwargs as code literals before shared conversion.\n\nCo-authored-by: openhands <openhands@all-hands.dev>

neubig commented May 17, 2026

View reviewed changes

Extract MiroVerse per-instance tools

3331977

Co-authored-by: openhands <openhands@all-hands.dev>

neubig force-pushed the openhands/issue-171-miroverse-v0-1 branch from ff96c83 to 3331977 Compare May 17, 2026 18:59

openhands-agent added 2 commits May 17, 2026 20:20

-    token = os.environ.get("HF_TOKEN")
+        raise RuntimeError(
+            f"MiroVerse-v0.1 is gated on Hugging Face (HTTP {exc.code}). "
+            "Accept the dataset terms at https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1 "
+            "and provide an authorized HF_TOKEN, or set MIROVERSE_SOURCE_DATASET and "
+            "MIROVERSE_FLAT_LAYOUT for a mirror with the same JSONL files."
+        ) from exc

-        return {}
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError as e:
+        import sys
+        print(f"Warning: Failed to parse arguments as JSON: {e}", file=sys.stderr)
+        return text

		return [TextObservation(content=content, source="user")]


		def _mark_final_answer(content):

Conversation

neubig commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Dataset details

Files added

Schema mapping summary

Design decisions

Tests run

Known limitations

Evidence

Latest CI / validation results

Cross-dataset converter regression evidence

Pipeline / runtime status

Conversation link

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Review Summary

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

[CRITICAL ISSUES]

[IMPROVEMENT OPPORTUNITIES]

[POSITIVE OBSERVATIONS]

[RISK ASSESSMENT]

VERDICT

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Summary

Uh oh!

neubig commented May 16, 2026

Uh oh!

neubig left a comment

Choose a reason for hiding this comment

Uh oh!

neubig commented May 17, 2026

Uh oh!

openhands-ai Bot commented May 17, 2026

Uh oh!

neubig commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openhands-ai Bot commented May 17, 2026

Uh oh!

Reviewers

neubig commented May 14, 2026 •

edited

Loading

neubig commented May 17, 2026 •

edited

Loading