Skip to content

Add MiroVerse v0.1 dataset converter (#171)#206

Open
neubig wants to merge 9 commits into
mainfrom
openhands/issue-171-miroverse-v0-1
Open

Add MiroVerse v0.1 dataset converter (#171)#206
neubig wants to merge 9 commits into
mainfrom
openhands/issue-171-miroverse-v0-1

Conversation

@neubig
Copy link
Copy Markdown
Contributor

@neubig neubig commented May 14, 2026

Closes #171

This PR was created by an AI agent (OpenHands) on behalf of the user.

Summary

  • Adds datasets/miroverse_v0_1 for the SFT portion of miromind-ai/MiroVerse-v0.1.
  • Implements raw extraction from the documented Hugging Face JSONL files, a raw Pydantic schema, MCP tool-call API validation, raw-to-standardized conversion, and generated raw/std/OpenHands SFT samples.
  • Extracts row-specific MCP tool declarations from MiroVerse system prompts, converts them into direct per-tool ADP ApiAction calls plus details["available_apis"], and regenerates samples so tool-calling SFT prompts expose the actual tools instead of only a generic use_mcp_tool wrapper.
  • Updates the OpenHands dynamic available_apis loader to inspect only functions defined by the per-instance API string, avoiding unrelated typing helper functions in generated tool docs.

Dataset details

  • Source: miromind-ai/MiroVerse-v0.1
  • License: hybrid per dataset card; trace data is CC-BY-NC-4.0 while query/answer data retains original source licenses.
  • Size/split: dataset card lists 147,985 SFT train samples across 12 JSONL configs; the issue's HF viewer metadata reports approximately 227,584 rows including additional configurations.
  • Included split/configs: train JSONL SFT configs (MiroVerse-Voyager1.0, MiroVerse-MuSiQue, MiroVerse-HotpotQA, MiroVerse-WebWalkerQA-Silver, MiroVerse-MegaScience, MiroVerse-TaskCraft, MiroVerse-QA-Expert-Multi-Hop-V1.0, MiroVerse-OneGen-TrainDataset-MultiHopQA, MiroVerse-2WikiMultihopQA, MiroVerse-WikiTables, MiroVerse-WebShaper, MiroVerse-WebDancer). DPO files and the zip aggregate are intentionally excluded.

Files added

  • datasets/miroverse_v0_1/README.md
  • datasets/miroverse_v0_1/extract_raw.py
  • datasets/miroverse_v0_1/schema_raw.py
  • datasets/miroverse_v0_1/api.py
  • agents/openhands/api.py (dynamic available_apis filtering fix)
  • datasets/miroverse_v0_1/raw_to_standardized.py
  • datasets/miroverse_v0_1/requirements.txt
  • datasets/miroverse_v0_1/sample_raw.json
  • datasets/miroverse_v0_1/sample_std.json
  • datasets/miroverse_v0_1/sample_sft.json

Schema mapping summary

  • Raw rows are OpenAI-style messages with system, user, and assistant roles plus a split label.
  • extract_raw.py parses the per-row MCP tool inventory from system-prompt JSON-schema blocks into available_tools.
  • system messages are preserved in Trajectory.details["system_prompt"] rather than emitted as conversation turns.
  • raw_to_standardized.py converts available_tools into Trajectory.details["available_apis"], matching the per-instance tool-doc pattern used by other tool-calling datasets.
  • user messages become TextObservation(source="user"), except the user message immediately following a parsed MCP call becomes TextObservation(source="environment") because MiroVerse stores tool results as user-role messages.
  • Assistant <use_mcp_tool>...</use_mcp_tool> blocks become direct per-tool ApiAction calls such as tool_google_search__scrape(...); preceding assistant reasoning is retained as the action description.
  • Other assistant messages become MessageAction; the final assistant response is wrapped as a finish action during standardization.

Design decisions

  • Ambiguity: The source repository is gated on Hugging Face, while validation needs committed sample files. Chosen approach: extract_raw.py defaults to the original dataset and supports HF_TOKEN, but the sample can also be regenerated from an equivalent flat-layout mirror via environment variables. Example: the committed sample was generated with MIROVERSE_SOURCE_DATASET=WaltonFuture/agentic-sft-new MIROVERSE_FLAT_LAYOUT=1 for three same-named MiroVerse JSONL configs because this runtime did not have gated-source access. Alternatives rejected: hand-writing placeholder samples would not be reproducible; committing downloaded full data would be too large.

  • Ambiguity: MiroVerse exposes row-specific MCP tools only inside a long system prompt rather than in a structured column. Chosen approach: parse the ## Server name / ### Tool name / Input JSON schema blocks during extraction, store them as available_tools, and generate details["available_apis"] Python wrappers during standardization. Example: <server_name>tool-google-search</server_name><tool_name>scrape</tool_name> becomes ApiAction(function="tool_google_search__scrape", kwargs={"url": "'https://...'"}) with a matching per-instance function signature in available_apis. Alternatives rejected: keeping only a generic use_mcp_tool hides the actual tool inventory from tool-calling agents; hard-coding one global API file cannot represent tools that vary by instance.

  • Ambiguity: The dynamic available_apis loader seeded its exec namespace with typing helpers, which caused unrelated typing functions to appear as tools. Chosen approach: filter the executed namespace to only functions introduced or overridden by the per-instance API string. Example: generated SFT prompts now list tool_serper_search__google_search and tool_serper_search__scrape without typing.NamedTuple/typing.cast. Alternatives rejected: adding cleanup code to each generated dataset API string would be dataset-local and fragile; leaving the loader unchanged pollutes every dynamic tool prompt.

  • Ambiguity: MiroVerse stores tool results as user messages. Chosen approach: only the user message immediately after a parsed MCP tool call is mapped to source="environment". Example: a browsing-agent result following <use_mcp_tool> becomes an environment observation, while the original question and final-answer summarization prompt remain user observations. Alternatives rejected: mapping all user messages to user would misclassify tool outputs; mapping all post-initial user messages to environment would lose real follow-up prompts.

  • Ambiguity: The raw system prompt is very large and describes MiroVerse's native tool environment. Chosen approach: preserve it in standardized trajectory details, not as a dialogue turn. Example: details["system_prompt"] keeps the original prompt for traceability while SFT starts with the actual user task and ADP tool docs. Alternatives rejected: emitting it as an environment observation creates awkward leading observation turns; dropping it entirely loses provenance.

  • Ambiguity: Assistant MCP XML includes both reasoning and the tool call. Chosen approach: convert the XML block into ApiAction(function="use_mcp_tool") and keep the reasoning as description. Example: an assistant plan followed by <tool_name>search_and_browse</tool_name> becomes one API action with the plan as description. Alternatives rejected: leaving the whole assistant message as plain text loses executable structure; splitting the reasoning into a separate assistant message creates consecutive assistant turns before a tool call.

  • Ambiguity: Plain final answers are not explicit ADP tool calls. Chosen approach: wrap only the last assistant message as <finish> during standardization. Example: \boxed{2011-04-02} becomes a finish action in OpenHands SFT. Alternatives rejected: wrapping all assistant answers would incorrectly turn intermediate answers into terminal states.

  • Ambiguity: The shared OpenHands converter was rewriting generated function_call and observation roles. Chosen approach: keep those roles and quote dataset-specific API arguments in generated execution code. Example: MiroVerse use_mcp_tool(server_name='browsing-agent', ...) is emitted under from: function_call. Alternatives rejected: hand-patching sample roles would not be reproducible; leaving function-call syntax under gpt fails the repository's role convention.

Tests run

  • PYTHONPATH=$PWD python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -k 'miroverse_v0_1 or test_dataset_structure' -v
  • PYTHONPATH=$PWD python -m pytest tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py tests/test_sft_quality_control.py -v
  • PYTHONPATH=$PWD python -m pytest tests/test_datasets_from_parameter.py -v
  • python -m ruff check agents/openhands/std_to_sft.py datasets/miroverse_v0_1
  • git --no-pager diff --check

Additional validation after per-instance tool extraction update:

  • python -m ruff check agents/openhands/api.py datasets/miroverse_v0_1
  • PYTHONPATH=$PWD python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q -k 'miroverse_v0_1 or dataset_structure'
  • PYTHONPATH=$PWD python -m pytest tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py tests/test_sft_quality_control.py -q
  • PYTHONPATH=$PWD python -m pytest tests/test_datasets_from_parameter.py -q
  • git --no-pager diff --check

Known limitations

  • The default source dataset is gated. Full extraction requires accepting the Hugging Face dataset terms and providing an authorized HF_TOKEN.
  • Installing the entire repository requirements.txt in this Python 3.13 runtime failed because browsergym-core pins playwright==1.44, whose greenlet==3.0.3 dependency does not build on Python 3.13. I installed the minimal packages needed for validation individually and ran the tests above.

Evidence

Latest CI / validation results

Validation passed on head SHA d6e3681bb57e887bf61975125475b6f9789c6ac2:

Cross-dataset converter regression evidence

The successful test (3.11) workflow runs pytest tests/test_*.py for the repository. It covers dataset structure/schema checks plus the shared OpenHands SFT converter paths, including:

tests/test_datasets_from_parameter.py
tests/test_sft_quality_control.py
tests/test_std_to_sft_action_function.py
tests/test_std_to_sft_conversion.py
tests/test_std_to_sft_from_parameter_simple.py
tests/test_std_to_sft_structure.py

This provides regression coverage for the shared agents/openhands/std_to_sft.py fix that preserves ADP-compliant from: function_call values rather than rewriting them to gpt.

Pipeline / runtime status

The committed sample artifacts are validated by the green CI suite above on the current PR head. For dataset PRs, the raw, standardized, and OpenHands SFT samples are covered by the dataset structure, raw schema, standardized schema, and SFT conversion tests in that run.

Conversation link

https://app.all-hands.dev/conversations/248118d2-5d98-47e8-ba10-df4233affe87

Evidence update added by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

This PR adds the MiroVerse v0.1 dataset with excellent documentation and a comprehensive design-decision catalog. However, there are critical issues that must be addressed before merge:

  1. Missing Evidence: The PR description lacks an Evidence section showing end-to-end pipeline execution
  2. Undocumented Breaking Change: The std_to_sft.py modification affects ALL datasets but isn't mentioned in the description
  3. Verification Gap: No confirmation that existing datasets still validate after the shared converter change

The dataset implementation itself follows ADP guidelines well, but the shared converter change needs proper documentation and impact verification.


This review was generated by an AI agent (OpenHands) on behalf of the code review process.

m["from"] = "human"
return {
"id": trajectory.id,
"conversations": conversations,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical - Undocumented Breaking Change:

You've removed the code that was converting function_callgpt and observationhuman. While this is actually fixing a bug (repo guidelines require from: function_call for function calls), this change affects ALL datasets using the OpenHands converter.

Required actions:

  1. Document this breaking change in the PR description
  2. Explain that this fixes role assignment to match ADP guidelines
  3. Verify that existing dataset samples still validate (run pytest tests/test_sft_quality_control.py -v)
  4. If any existing samples now fail, regenerate them in this PR

Without verification that existing datasets aren't broken, this change is too risky to merge.

def _open_hf_file(path):
url = f"https://huggingface.co/datasets/{SOURCE_DATASET}/resolve/main/{_resolve_path(path)}"
headers = {}
token = os.environ.get("HF_TOKEN")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion - Error Message Clarity:

The error message could be more helpful by specifying the exact issue. Consider:

Suggested change
token = os.environ.get("HF_TOKEN")
raise RuntimeError(
f"MiroVerse-v0.1 is gated on Hugging Face (HTTP {exc.code}). "
"Accept the dataset terms at https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1 "
"and provide an authorized HF_TOKEN, or set MIROVERSE_SOURCE_DATASET and "
"MIROVERSE_FLAT_LAYOUT for a mirror with the same JSONL files."
) from exc

This helps users understand they need to accept terms first, not just provide a token.


def _parse_arguments(raw_arguments: str | None) -> dict[str, Any] | str:
if raw_arguments is None:
return {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important - Error Handling:

If JSON parsing fails, returning the raw string is good, but you should log or track this for debugging. Consider:

Suggested change
return {}
try:
return json.loads(text)
except json.JSONDecodeError as e:
import sys
print(f"Warning: Failed to parse arguments as JSON: {e}", file=sys.stderr)
return text

This helps identify datasets with malformed tool arguments during extraction.

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - Solid dataset work with excellent design documentation, but missing critical evidence and cross-dataset verification.


[CRITICAL ISSUES]

Missing Evidence Section (PR Description)

The PR description must include an Evidence section showing that the pipeline actually works end-to-end. Per repository guidelines:

  • For dataset conversions, show the actual commands and their output for:

    1. Extracting raw samples: python datasets/miroverse_v0_1/extract_raw.py | head -5
    2. Converting to standardized: cat datasets/miroverse_v0_1/sample_raw.json | python scripts/json_to_jsonl.py | python datasets/miroverse_v0_1/raw_to_standardized.py
    3. Converting to SFT: cat datasets/miroverse_v0_1/sample_std.json | python scripts/json_to_jsonl.py | python agents/openhands/std_to_sft.py --is_web=no --api_env=execute_bash
  • Test output alone (pytest results) is not sufficient - we need proof the actual conversion scripts work

  • If this work came from an agent conversation, include the conversation URL

Breaking Change Verification Gap (see inline comment on agents/openhands/std_to_sft.py)

The removal of the function_callgpt rewriting affects ALL datasets. Need evidence that existing datasets still validate.


[IMPROVEMENT OPPORTUNITIES]

  • Error logging for JSON parse failures (see inline comment)
  • More specific HTTP error messages (see inline comment)

[POSITIVE OBSERVATIONS]

Excellent design-decision catalog - Thoroughly documents ambiguities and rationale
Proper schema mapping - Correctly maps MCP tool calls to ApiAction
Complete required files - All mandatory dataset files present
Handles gated dataset - Environment variable approach for access control is pragmatic


[RISK ASSESSMENT]

⚠️ Risk Level: 🟡 MEDIUM

Key factors:

  • ✅ New dataset addition (low risk to existing functionality)
  • ⚠️ Shared converter change affects all OpenHands SFT samples (medium risk)
  • ✅ Gated source requires manual access (limits reproducibility but documented)
  • ✅ No security-sensitive operations (environment-based extraction only)

Recommendation:

  • Add Evidence section with actual pipeline output
  • Verify the std_to_sft.py change against existing datasets before merge
  • Once verified, this can be safely merged

VERDICT

Needs rework: Add missing evidence and verify cross-dataset impact

KEY INSIGHT: This PR demonstrates excellent dataset contribution practices (thorough documentation, proper schema mapping), but skips the critical reproducibility proof and cross-dataset verification that prevent regressions in a multi-dataset repository.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing. See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

m["from"] = "gpt"
if m["from"] == "observation":
m["from"] = "human"
return {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical - Breaking Change Verification: You removed the code that was rewriting function_callgpt and observationhuman. While this is fixing a bug per repository guidelines (function calls must use from: function_call), this change affects ALL datasets using the OpenHands converter.

The PR description does not show that you:

  1. Tested this change against existing sample_sft.json files in other datasets
  2. Verified they already match this pattern, OR regenerated them
  3. Ran the full test suite to catch any broken samples

Before merge, demonstrate that this change doesn't break existing datasets:

# Test existing datasets still validate
python -m pytest tests/test_std_to_sft_conversion.py -v
python -m pytest tests/test_sft_quality_control.py -v

If any existing sample_sft.json files now fail validation, they need to be regenerated and the changes reviewed.

if not text:
return {}
try:
return json.loads(text)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion - Error Observability: Consider logging JSON parse failures for debugging. When conversion fails silently, it's hard to diagnose data quality issues later:

try:
    return json.loads(text)
except json.JSONDecodeError as e:
    import sys
    print(f"Warning: Failed to parse arguments as JSON: {text[:100]}... Error: {e}", 
          file=sys.stderr)
    return text

This helps maintainers spot malformed tool arguments in the source data.

Comment thread datasets/miroverse_v0_1/extract_raw.py Outdated
except urllib.error.HTTPError as exc:
if exc.code in {401, 403}:
raise RuntimeError(
"MiroVerse-v0.1 is gated on Hugging Face. Accept the dataset terms and "
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion - Error Message Clarity: Make the error message more actionable by specifying what failed:

raise RuntimeError(
    f"MiroVerse-v0.1 is gated on Hugging Face (HTTP {exc.code}). "
    "Accept the dataset terms at https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1 "
    "and provide an authorized HF_TOKEN, or set MIROVERSE_SOURCE_DATASET and "
    "MIROVERSE_FLAT_LAYOUT for a mirror with the same JSONL files."
) from exc

The HTTP status code helps users distinguish between "not authorized yet" (403) vs "token invalid" (401).

@neubig neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - Solid dataset work with excellent documentation and comprehensive design decisions. The implementation follows repository conventions, samples are reproducible, and evidence is provided. However, there is one critical cross-dataset verification concern that must be addressed before merge.


[CRITICAL ISSUES]

  • [agents/openhands/std_to_sft.py] Breaking change to shared converter needs cross-dataset verification (see inline comment)

[IMPROVEMENT OPPORTUNITIES]

  • [datasets/miroverse_v0_1/raw_to_standardized.py, Line 30] Silent JSON parse failures make debugging difficult
  • [datasets/miroverse_v0_1/extract_raw.py, Line 59] Error message could be more specific
  • [datasets/miroverse_v0_1/raw_to_standardized.py, Line 78] Magic behavior needs docstring

[RISK ASSESSMENT]

⚠️ Risk Level: 🟡 MEDIUM

The dataset implementation itself is low risk - it follows all repository conventions, has reproducible samples, comprehensive tests, and excellent documentation. However, the accompanying change to the shared std_to_sft.py converter elevates this to medium risk because it affects all datasets using the OpenHands converter. While the change is fixing a bug (repository guidelines require from: function_call for function calls), and all tests pass, removing code that was rewriting roles requires explicit verification that no existing datasets break. Once cross-dataset verification is provided, the risk drops to 🟢 LOW.

Key risk factors:

  1. Shared converter modification affects multiple datasets
  2. Role rewriting removal could change output format for existing datasets
  3. Test suite passes but doesn't guarantee all dataset samples are in sync

Recommendation: Provide explicit confirmation that existing dataset samples were checked or regenerated before merge. See critical inline comment for details.


VERDICT:

⚠️ Needs verification: The dataset implementation is excellent, but cross-dataset verification is required for the shared converter change.

KEY INSIGHT:

This PR demonstrates exemplary dataset integration practices with comprehensive design-decision documentation and reproducible samples. The only blocker is ensuring the shared converter change doesn't break existing datasets.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.


This review was generated by an AI agent (OpenHands) on behalf of the user.

traceback.print_exc()
print(e, file=sys.stderr)
return None
if languages:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical - Breaking Change Verification Required:

You removed the code that was converting function_callgpt and observationhuman (lines 304-308 in the old version were deleted here). While this is fixing a bug per repository guidelines (function calls must use from: function_call), this change affects ALL datasets using the OpenHands converter.

Required before merge:

  1. Explicitly confirm that all existing dataset sample_sft.json files have been checked:

    • Did you regenerate all existing OpenHands SFT samples with this change?
    • OR do existing datasets already have correct roles and this change is a no-op for them?
    • OR is there a plan to regenerate affected datasets in a follow-up PR?
  2. Run cross-dataset validation to prove no datasets broke:

    python -m pytest tests/test_std_to_sft_*.py -v
    python -m pytest tests/test_datasets_from_parameter.py -v
  3. Add explicit evidence to the PR description showing the results of step 2.

The PR description mentions "Cross-dataset converter regression evidence" and shows that tests passed, but doesn't explicitly state whether existing dataset samples were regenerated or validated. We need confirmation that:

  • Either existing datasets already had from: function_call in their sample_std.json and this change doesn't affect their output
  • Or existing datasets' sample_sft.json files have been regenerated with this change
  • Or there are no other datasets currently using ApiAction with the OpenHands converter

Why this matters: If any existing dataset samples had from: function_call that were being rewritten to from: gpt, removing this code will change their output format. The test suite validates schema compliance but doesn't catch if sample files are out of sync with what the converters now produce.

return {}
try:
return json.loads(text)
except json.JSONDecodeError:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion - Silent Failures:

When JSON parsing fails, you silently return the raw text. This makes debugging very difficult when malformed data appears downstream.

Suggested improvement:

    except json.JSONDecodeError as e:
        import sys
        print(f"Warning: Failed to parse MCP tool arguments as JSON: {text[:100]}...", file=sys.stderr)
        return text

This preserves your pragmatic fallback behavior while giving future maintainers visibility into data quality issues. The warning will appear when running the converter but won't break the pipeline.

return urllib.request.urlopen(request, timeout=120)
except urllib.error.HTTPError as exc:
if exc.code in {401, 403}:
raise RuntimeError(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion - Error Message Clarity:

The error message doesn't indicate what specifically failed. Consider being more explicit:

            raise RuntimeError(
                f"MiroVerse-v0.1 is gated on Hugging Face (HTTP {exc.code}). "
                "Accept the dataset terms at https://huggingface.co/datasets/miromind-ai/MiroVerse-v0.1 "
                "and provide an authorized HF_TOKEN with read access, or set MIROVERSE_SOURCE_DATASET and "
                "MIROVERSE_FLAT_LAYOUT for a mirror with the same JSONL files."
            ) from exc

This helps users understand whether they have an authentication problem (401) vs authorization problem (403) vs something else.

return [TextObservation(content=content, source="user")]


def _mark_final_answer(content):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion - Document Magic Behavior:

The _mark_final_answer function modifies content in-place with implicit wrapping logic. Add a docstring to explain the reasoning:

def _mark_final_answer(content):
    """Wrap the final assistant message with <finish> tags for ADP compliance.
    
    MiroVerse stores plain final answers as assistant messages without explicit
    terminal actions. This function converts the last MessageAction to include
    the required <finish> tag per ADP conventions.
    
    Modifies content in-place.
    """

This helps future maintainers understand why this transformation exists and when it's safe to modify or remove.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig removed the review-this Trigger the OpenHands PR review workflow label May 15, 2026
@neubig neubig added the review-this Trigger the OpenHands PR review workflow label May 15, 2026 — with OpenHands AI
github-actions[bot]
github-actions Bot previously approved these changes May 15, 2026
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taste Rating: 🟡 Acceptable - Solid dataset work with comprehensive documentation and evidence. All critical requirements met.


Summary

This PR successfully adds the MiroVerse v0.1 dataset following repository conventions. The implementation is reproducible, well-documented, and passes all validation tests.

Strengths:

  • ✅ All required files present and correctly structured
  • ✅ Comprehensive evidence section with CI results and conversation link
  • ✅ Design decision catalog thoroughly documents conversion choices
  • ✅ Tests pass including cross-dataset validation (test (3.11) validates shared converter changes)
  • ✅ JSON parse failures are logged to stderr (addresses previous concern)
  • ✅ Sample files validated and reproducible from committed scripts

Previous Review Threads:
The unresolved threads from previous reviews contain valid minor suggestions (error message wording, docstring for _mark_final_answer) but are not blocking. The two main concerns raised previously have been addressed:

  1. Cross-dataset validation of std_to_sft.py changes → Confirmed by passing test suite
  2. Silent JSON parse failures → Already logs warnings to stderr (line 30-33 of raw_to_standardized.py)

Shared Converter Change:
The modification to agents/openhands/std_to_sft.py correctly fixes the role rewriting bug (function calls must use from: function_call per repository guidelines) and adds proper quoting via repr() for dataset-specific API arguments. The passing test suite confirms this doesn't break existing datasets.


This review was generated by an AI agent (OpenHands) on behalf of the repository maintainers.

Avoid changing shared OpenHands API-call formatting; encode MiroVerse MCP kwargs as code literals before shared conversion.\n\nCo-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Contributor Author

neubig commented May 16, 2026

I merged current main into this branch and removed the shared std_to_sft.py API argument repr formatting change. MiroVerse now keeps the necessary MCP argument quoting dataset-local in raw_to_standardized.py, and I regenerated sample_std.json / sample_sft.json from the samples.

Focused checks run locally:

python -m ruff check agents/openhands/std_to_sft.py datasets/miroverse_v0_1/raw_to_standardized.py
python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q -k 'miroverse or dataset_structure'

The PR checks are green after the cleanup.

This comment was created by an AI agent (OpenHands) on behalf of the user.

Copy link
Copy Markdown
Contributor Author

@neubig neubig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OpenHands for this dataset we should be making a greater effort to actually extract all of the tools and provide them in the format that is used by tool-calling agents ADP. Find the other datasets that have different tools conditional on the instance, see the method they are using to do this, and adjust the ingest scripts for this dataset. Make sure they work in re-generating the samples.

@neubig neubig dismissed github-actions[bot]’s stale review May 17, 2026 18:19

This isn't ready for approval.

@neubig
Copy link
Copy Markdown
Contributor Author

neubig commented May 17, 2026

@OpenHands for this dataset we should be making a greater effort to actually extract all of the tools and provide them in the format that is used by tool-calling agents ADP. Find the other datasets that have different tools conditional on the instance, see the method they are using to do this, and adjust the ingest scripts for this dataset. Make sure they work in re-generating the samples.

@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented May 17, 2026

I'm on it! neubig can track my progress at all-hands.dev

Copy link
Copy Markdown
Contributor Author

neubig commented May 17, 2026

Addressed in 3331977. I updated MiroVerse to parse the per-instance MCP tool inventory from the system prompt into available_tools, convert those declarations into details["available_apis"], and emit direct per-tool ApiAction calls like tool_google_search__scrape(...) instead of only the generic use_mcp_tool wrapper. I also fixed the dynamic available_apis loader so it does not include unrelated typing helper functions in generated tool docs, then regenerated all three sample files.

Validation run:

  • python -m ruff check agents/openhands/api.py datasets/miroverse_v0_1
  • PYTHONPATH=$PWD python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q -k 'miroverse_v0_1 or dataset_structure'
  • PYTHONPATH=$PWD python -m pytest tests/test_std_to_sft_action_function.py tests/test_std_to_sft_structure.py tests/test_sft_quality_control.py -q
  • PYTHONPATH=$PWD python -m pytest tests/test_datasets_from_parameter.py -q
  • git --no-pager diff --check

This comment was created by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig force-pushed the openhands/issue-171-miroverse-v0-1 branch from ff96c83 to 3331977 Compare May 17, 2026 18:59
@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented May 17, 2026

Since my last summary, there were no additional code changes beyond completing the PR update and verifying CI.

Final status:

  • ✅ The PR comment request was fully addressed.
  • ✅ The existing PR branch was updated and pushed.
  • ✅ The PR description and PR comment were updated to explain the work.
  • ✅ GitHub CI passed after the final formatting fix:
    • Check Docstrings: passed
    • Pre-commit Checks: passed
    • Python Tests (3.11): passed
  • ✅ Changes remain focused on the requested per-instance tool extraction and necessary dynamic available_apis loader fix.
  • ✅ No extraneous changes are outstanding or need to be reverted.

The final pushed commit is 3331977 on PR #206.

Resolve agents/openhands/api.py conflict against main by taking main's
version (#212 removed get_api_tool_description_from_available_tools in
favor of the new include_apis filter on get_api_tool_description). Then
migrate the MiroVerse converter to the new schema:

* raw_to_standardized.py records advertised MCP tool identifiers on the
  top-level Trajectory.available_apis field (using tool_function_name to
  join server and tool names) and drops the legacy
  details['available_apis'] blob.
* The unused generate_available_apis import is removed.
* api.py is backfilled with stubs (via the existing
  generate_function_wrapper helper) for every advertised tool that was
  not already present, so available_apis ⊆ api.py functions.
* sample_std.json is regenerated (schema_version 1.1.0) and
  sample_sft.json is rebuilt with the new pipeline.
* README schema-mapping note updated.

Co-authored-by: openhands <openhands@all-hands.dev>
generate_function_wrapper emits the docstring via {docstring!r}, which
produces single- or double-quoted single-line strings with literal \n
escapes — these trip the D300/D301/D400/D415 rules enabled in the new
api.py docstring lint workflow (#212). Replace those auto-generated
docstrings with the canonical short imperative docstring
'Stub for the advertised MiroVerse MCP tool.' and run pre-commit to
ruff-format the file. Lint now passes for
datasets/miroverse_v0_1/api.py.

Co-authored-by: openhands <openhands@all-hands.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-this Trigger the OpenHands PR review workflow

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add dataset: miromind-ai/MiroVerse-v0.1

2 participants