
Add jupyter-agent dataset converter (#175) #197

Open
neubig wants to merge 6 commits into main from openhands/issue-175-jupyter-agent-dataset

Conversation

Contributor

neubig commented May 14, 2026

Closes #175

This PR was created by an AI agent (OpenHands) on behalf of the user.

Summary

Adds an Agent Data Protocol converter for jupyter-agent/jupyter-agent-dataset, including generated sample files for the raw, standardized, and OpenHands SFT stages.

Dataset Source

Files Added/Updated

  • Added datasets/jupyter-agent-dataset/README.md
  • Added datasets/jupyter-agent-dataset/extract_raw.py
  • Added datasets/jupyter-agent-dataset/schema_raw.py
  • Added datasets/jupyter-agent-dataset/raw_to_standardized.py
  • Added datasets/jupyter-agent-dataset/requirements.txt
  • Added generated sample_raw.json, sample_std.json, and sample_sft.json
  • Updated the shared OpenHands SFT converter to preserve function_call/observation roles and to avoid importing browser dependencies for non-web conversion.

Schema Mapping Summary

  • Raw user messages -> TextObservation(source="user")
  • Raw assistant messages with add_and_execute_jupyter_code_cell -> Python CodeAction, preserving the assistant prose as description
  • Raw tool messages -> TextObservation(source="environment")
  • Raw assistant messages with final_answer -> MessageAction with <finish> ... </finish> so the OpenHands SFT converter produces a finish tool call
  • Metadata such as question, answer, executor_type, files_used, and packages_used is stored in trajectory details
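The mapping above can be sketched roughly as follows. This is a minimal illustration, not the converter in raw_to_standardized.py: the dataclasses are hypothetical stand-ins for the ADP schema classes, and the raw message shape (OpenAI-style `tool_calls` whose `arguments` are either a dict or a JSON string) is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional, Union


# Hypothetical stand-ins for the ADP schema classes named in the mapping;
# the real classes in the repository may carry different fields.
@dataclass
class TextObservation:
    content: str
    source: str


@dataclass
class CodeAction:
    language: str
    content: str
    description: Optional[str] = None


@dataclass
class MessageAction:
    content: str


Event = Union[TextObservation, CodeAction, MessageAction]


def convert_message(message: dict) -> list[Event]:
    """Map one raw chat message to standardized events (assumed raw shape)."""
    role = message["role"]
    if role == "user":
        return [TextObservation(content=message["content"], source="user")]
    if role == "tool":
        return [TextObservation(content=message["content"], source="environment")]
    if role != "assistant":
        raise ValueError(f"unexpected role: {role!r}")

    events: list[Event] = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        args = function["arguments"]
        if isinstance(args, str):  # assumption: arguments may be JSON-encoded
            args = json.loads(args)
        if function["name"] == "add_and_execute_jupyter_code_cell":
            # Keep the assistant prose as the action description.
            events.append(
                CodeAction("python", args["code"], message.get("content") or None)
            )
        elif function["name"] == "final_answer":
            events.append(MessageAction(f"<finish> {args['answer']} </finish>"))
        else:
            raise ValueError(f"unmapped tool: {function['name']!r}")
    if not events:
        # Assistant text without a tool call is not covered by the mapping.
        raise ValueError("assistant message without a mapped tool call")
    return events
```

Raising on unmapped input is a choice made for this sketch so that gaps in the mapping surface early; the actual converter may handle such cases differently.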

Design Decisions

  • Ambiguity: The source dataset provides two splits, thinking and non_thinking.

    • Chosen approach: Use non_thinking by default for ADP samples.
    • Example: extract_raw.py defaults JUPYTER_AGENT_SPLIT to non_thinking.
    • Alternatives rejected: Sampling thinking would preserve extra thinking-tag formatting that is not needed to validate the core tool-call mapping.
  • Ambiguity: Jupyter execution appears as an OpenAI-style assistant tool call named add_and_execute_jupyter_code_cell.

    • Chosen approach: Convert it to a Python CodeAction so the shared OpenHands converter maps it to execute_ipython_cell.
    • Example: A tool call with arguments.code = "import pandas as pd" becomes CodeAction(language="python", content="import pandas as pd", description=...).
    • Alternatives rejected: Adding a dataset-specific API would duplicate the existing notebook execution path in the shared OpenHands SFT converter.
  • Ambiguity: Final answers are represented as a source tool call named final_answer rather than plain assistant text.

    • Chosen approach: Convert final_answer.answer to a <finish> MessageAction.
    • Example: final_answer(answer="453") becomes <finish> 453 </finish> in standardized data and a finish function call in SFT.
    • Alternatives rejected: Preserving final_answer as a custom ApiAction would require an unnecessary dataset-local API and would not align with OpenHands completion conventions.
  • Ambiguity: The raw original_notebook field can be very large.

    • Chosen approach: Keep the field in raw samples for fidelity, but use MAX_ORIGINAL_NOTEBOOK_CHARS=500000 during sample generation to avoid oversized fixtures.
    • Example: The committed sample keeps three source rows with notebook metadata under that threshold.
    • Alternatives rejected: Dropping original_notebook would make the raw samples less faithful to the source schema.
  • Ambiguity: The shared SFT converter rewrote function_call messages to gpt, which conflicts with the repository convention that function-call syntax must use from: function_call.

    • Chosen approach: Preserve function_call and observation roles in the shared converter output.
    • Example: Jupyter code execution now appears as from: function_call followed by from: observation in sample_sft.json.
    • Alternatives rejected: Hand-editing the generated SFT sample would not be reproducible from the pipeline.
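To illustrate the last decision, here is a small sketch of role preservation. The function names and the `role`/`value` turn shape are hypothetical and do not reflect the actual code in agents/openhands/std_to_sft.py; only the role names function_call, observation, gpt, and human come from this PR.

```python
# Roles that must pass through unchanged, per the repository convention
# described above.
PRESERVED_ROLES = {"function_call", "observation"}

# Assumed mapping for the remaining roles in this sketch.
CHAT_ROLE_MAP = {"user": "human", "assistant": "gpt"}


def output_role(role: str) -> str:
    """Return the SFT 'from' value for an internal role name."""
    if role in PRESERVED_ROLES:
        return role
    return CHAT_ROLE_MAP.get(role, role)


def to_sft_turns(turns: list[dict]) -> list[dict]:
    """Convert internal turns to SFT turns without collapsing tool roles."""
    return [{"from": output_role(t["role"]), "value": t["value"]} for t in turns]
```

With this shape, a code execution step stays a function_call turn followed by an observation turn instead of being folded into gpt and human.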

Tests Run

  • python -m ruff check agents/openhands/std_to_sft.py agents/openhands/system_prompt/system.py agents/openhands/system_prompt/tools/__init__.py datasets/jupyter-agent-dataset
  • python -m ruff format --check agents/openhands/std_to_sft.py agents/openhands/system_prompt/system.py agents/openhands/system_prompt/tools/__init__.py datasets/jupyter-agent-dataset
  • python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py::test_sample_raw_against_schema[jupyter-agent-dataset] 'tests/test_standardized_schemas.py::test_sample_standardized_against_schema[/workspace/project/agent-data-protocol/datasets/jupyter-agent-dataset/sample_std.json]' tests/test_std_to_sft_conversion.py::test_std_to_sft_conversion[jupyter-agent-dataset] tests/test_datasets_from_parameter.py -q
  • python -m pytest tests/test_std_to_sft_action_function.py tests/test_std_to_sft_from_parameter_simple.py tests/test_std_to_sft_structure.py -q

Known Limitations

  • The full source dataset is large; extract_raw.py streams with the Hugging Face datasets package when available and falls back to the Hugging Face dataset viewer API for small samples.
  • The original Kaggle input files referenced by the trajectories are not redistributed in this repository.
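The streaming-with-fallback behavior and the notebook size cap could look roughly like this. It is a sketch under assumptions, not the contents of extract_raw.py: reading MAX_ORIGINAL_NOTEBOOK_CHARS from the environment, the `load_rows` and `select_sample_rows` helpers, and the injected `fallback` callable (standing in for the dataset viewer API client) are all hypothetical.

```python
import os
from typing import Callable, Iterable, Optional

DATASET_ID = "jupyter-agent/jupyter-agent-dataset"
DEFAULT_SPLIT = os.environ.get("JUPYTER_AGENT_SPLIT", "non_thinking")
# Assumption: the threshold is configurable through an environment variable.
MAX_ORIGINAL_NOTEBOOK_CHARS = int(
    os.environ.get("MAX_ORIGINAL_NOTEBOOK_CHARS", "500000")
)

RowSource = Callable[[str], Iterable[dict]]


def stream_with_datasets(split: str) -> Iterable[dict]:
    """Stream rows lazily; raises ImportError if `datasets` is not installed."""
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)


def load_rows(
    split: str,
    primary: RowSource = stream_with_datasets,
    fallback: Optional[RowSource] = None,
) -> Iterable[dict]:
    """Use the primary source, switching to the fallback on ImportError only."""
    try:
        return primary(split)
    except ImportError:
        if fallback is None:
            raise
        return fallback(split)


def select_sample_rows(
    rows: Iterable[dict],
    limit: int = 3,
    max_notebook_chars: int = MAX_ORIGINAL_NOTEBOOK_CHARS,
) -> list[dict]:
    """Keep the first `limit` rows whose original_notebook fits the cap."""
    picked: list[dict] = []
    for row in rows:
        if len(row.get("original_notebook") or "") > max_notebook_chars:
            continue  # skip oversized notebooks instead of truncating them
        picked.append(row)
        if len(picked) == limit:
            break
    return picked
```

Skipping oversized rows, rather than truncating the field, keeps each retained raw sample identical to its source row.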

Co-authored-by: openhands <openhands@all-hands.dev>
neubig added the review-this label May 15, 2026 (with OpenHands AI)

github-actions bot left a comment

🟢 Good taste - Exemplary dataset addition with comprehensive documentation and justified shared converter improvements.

Strengths:

  • ✅ Complete file structure: all required files present (README, extract_raw.py, raw_to_standardized.py, schema_raw.py, all sample JSONs)
  • ✅ Thorough design decision catalog: 5 decisions documented with ambiguity, chosen approach, examples, and rejected alternatives
  • ✅ Proper schema mapping: final_answer → MessageAction with <finish> tags (no api.py needed), tool → TextObservation(source="environment"), correct observation sources
  • ✅ Justified shared changes: Fixes real bug where function_call/observation roles were incorrectly rewritten to gpt/human; adds lazy-loading of browser dependencies
  • ✅ All validation tests pass: dataset_structure, raw_schemas, standardized_schemas, std_to_sft_conversion, datasets_from_parameter
  • ✅ Deterministic pipeline with reproducible commands shown
  • ✅ No extraneous files (no full_*.json, temp files, or scratch JSON)

Minor documentation note: For AI-generated PRs, consider adding the conversation URL (e.g., https://app.all-hands.dev/conversations/{id}) to help reviewers trace the work.

[RISK ASSESSMENT]
🟢 LOW - New dataset addition with bug fixes to shared converter. Changes improve correctness (role preservation aligns with repository conventions) and robustness (lazy browser loading prevents import errors). Backward compatibility verified via test_datasets_from_parameter.py passing.

VERDICT: ✅ Ready to merge

KEY INSIGHT: The converter changes fix a subtle but important bug where the repository's stated conventions (function_call/observation roles) were being violated by the implementation - excellent alignment of code with design intent.


This review was generated by an AI agent (OpenHands).

Use a precise optional-browser import fallback and keep the role-preservation test compatible with the branch's browsing flag.

Co-authored-by: openhands <openhands@all-hands.dev>
Contributor Author

neubig commented May 16, 2026

I merged current main into this branch, resolved the shared converter update, and tightened the optional browser import fallback to avoid masking unrelated import errors. I also made the shared role-preservation test compatible with this branch's codeact_enable_browsing call.

The PR checks are green after the cleanup.

This comment was created by an AI agent (OpenHands) on behalf of the user.

The previous tip of this PR carried a copy of the
'make OpenHands browser tools optional' refactor in
agents/openhands/system_prompt/tools/__init__.py,
agents/openhands/system_prompt/system.py,
agents/openhands/std_to_sft.py, and the
tests/test_openhands_sft_role_preservation.py fake. The same diff was
duplicated on #193 (CodeScout). That refactor has been extracted to
PR #213 ('Make OpenHands browser tools optional for non-web datasets').

Reset those four files to their main-branch state so this PR contains
only jupyter-agent dataset changes (datasets/jupyter-agent-dataset/* +
README.md + agents/openhands/DATASETS.md catalog row). Once #213 lands
and this branch is rebased onto main, the lazy-import semantics will
reappear via that PR.

Co-authored-by: openhands <openhands@all-hands.dev>
neubig pushed a commit that referenced this pull request May 18, 2026
The previous tip of this PR carried a copy of the
'make OpenHands browser tools optional' refactor in
agents/openhands/system_prompt/tools/__init__.py,
agents/openhands/system_prompt/system.py,
agents/openhands/std_to_sft.py, and the
tests/test_openhands_sft_role_preservation.py fake. The same diff was
duplicated on #197 (jupyter-agent). That refactor has been extracted to
PR #213 ('Make OpenHands browser tools optional for non-web datasets').

Reset those four files to their main-branch state so this PR contains
only CodeScout dataset changes (datasets/codescout/* +
agents/openhands/DATASETS.md catalog row). Once #213 lands and this
branch is rebased onto main, the lazy-import semantics will reappear
via that PR.

Co-authored-by: openhands <openhands@all-hands.dev>
Contributor Author

neubig commented May 18, 2026

I've extracted the lazy-import refactor that was sharing this PR's diff with #193 into its own PR #213 ("Make OpenHands browser tools optional for non-web datasets"). This PR has been force-pushed to drop agents/openhands/system_prompt/tools/__init__.py, agents/openhands/system_prompt/system.py, agents/openhands/std_to_sft.py, and the tests/test_openhands_sft_role_preservation.py fake — those files are now back at their main-branch state here, and the new diff is dataset-only (plus the agents/openhands/DATASETS.md catalog row).

Please merge #213 first; this branch should then rebase cleanly on main with the lazy-import semantics reappearing via that PR.

This comment was posted by an AI agent (OpenHands) on behalf of the user.

neubig added a commit that referenced this pull request May 18, 2026
* Make OpenHands browser tools optional for non-web datasets

Three changes to the OpenHands agent pipeline, accompanied by two test
changes, let non-web dataset converters run on machines that do not
have browsergym installed:

1. agents/openhands/system_prompt/tools/__init__.py wraps the
   'from .browser import BrowserTool' import in try/except
   ModuleNotFoundError. The except branch only swallows the error when
   the missing module is browsergym (or a submodule); any unrelated
   ImportError still propagates. The BrowserTool name is bound to None
   when browsergym is unavailable.

2. agents/openhands/system_prompt/system.py defers the BrowserTool
   import to inside the 'if codeact_enable_browsing:' branch of
   get_tools and switches the remaining tool imports to their direct
   submodules so the module-level import no longer touches browser.py.

3. agents/openhands/std_to_sft.py lazy-loads
   scripts.html_to_axtree.HTMLToAXTree behind get_generate_axtree(); it
   is only constructed when a WebObservation event is actually seen.
   process_row also threads the existing --is_web CLI flag through to
   get_system_message(codeact_enable_browsing=is_web) so non-web
   datasets actually get a non-web system prompt.

4. tests/test_openhands_sft_role_preservation.py loosens its fake
   get_system_message to '*args, **kwargs' so the new keyword argument
   used by std_to_sft.py does not break the fake.

5. A new regression test tests/test_optional_browser.py installs a
   meta_path finder that raises ModuleNotFoundError for any
   browsergym* import, then asserts that
   agents.openhands.system_prompt.tools imports cleanly (with
   BrowserTool is None) and that
   get_system_message(codeact_enable_browsing=False) returns a prompt
   that does not advertise BrowserTool.
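
The narrow fallback described in item 1 can be sketched as below. This is an illustration of the pattern only: `import_optional` is a hypothetical helper, whereas the commit describes an inline try/except around 'from .browser import BrowserTool'.

```python
import importlib
from types import ModuleType
from typing import Optional


def import_optional(module_name: str, optional_root: str) -> Optional[ModuleType]:
    """Import module_name, returning None only if optional_root is what is missing.

    ModuleNotFoundError.name holds the module that could not be found, which
    may be a transitive dependency rather than module_name itself.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        missing = exc.name or ""
        if missing == optional_root or missing.startswith(optional_root + "."):
            return None  # the optional dependency is absent: degrade gracefully
        raise  # any other missing module is a real error and must surface
```

Checking `exc.name` rather than catching every ImportError is what keeps an unrelated broken import from being silently masked.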

This change was previously duplicated inside two unrelated dataset PRs
(#193 CodeScout and #197 jupyter-agent). Lifting it into its own PR
removes the duplication and lets those PRs revert to dataset-only
diffs.

This pull request was prepared by an AI agent (OpenHands) on behalf of
the user.

Co-authored-by: openhands <openhands@all-hands.dev>

* Skip optional-browser test if litellm is unavailable

The previous version of tests/test_optional_browser.py reloaded
agents.openhands.system_prompt.tools by monkeypatching sys.meta_path
in-process. That fails in CI because the workflow's requirements.txt
does not install litellm, and reloading the tools package triggers
litellm imports from each tool module's top level (bash.py, finish.py,
etc.).

Two changes:

1. Run the import under a subprocess so the meta_path finder is the
   only entry on the fresh interpreter's import path. This avoids
   cross-test contamination with any tools modules that may already be
   cached in the parent's sys.modules.

2. Add a pytest.importorskip('litellm') guard. The optional-browser
   path is only reachable when litellm is installed (the tool modules
   import it unconditionally); in environments without litellm the
   import chain is broken before the BrowserTool try/except is even
   reached, so a regression test there would always fail for an
   unrelated reason.

Co-authored-by: openhands <openhands@all-hands.dev>

* Use PEP 451 find_spec/exec_module in optional-browser test finder

Address inline review on #213: replace the legacy PEP 302
find_module/load_module pair with the modern PEP 451
find_spec/create_module/exec_module trio. The legacy interface has been
deprecated since Python 3.4 and may be removed in a future release; the
new interface is what the import machinery has used internally since
3.4 and is forward-compatible. Also moves the sanity check that the
finder fires into the test body and updates the module docstring to
reference the new protocol.

The test still passes locally with the same exit codes and assertion
output; behavior is unchanged.
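
A PEP 451 style blocking finder along the lines described above might look like this. It is a sketch, not the finder in tests/test_optional_browser.py: the class name, the `seen` list, and the `blocked` context manager are hypothetical, and the real test runs in a subprocess rather than in-process.

```python
import contextlib
import importlib
import importlib.abc
import importlib.machinery
import sys


class BlockingFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Make every import under `root` fail with ModuleNotFoundError."""

    def __init__(self, root: str) -> None:
        self.root = root
        self.seen: list[str] = []  # records which imports the finder intercepted

    def find_spec(self, fullname, path=None, target=None):
        if fullname == self.root or fullname.startswith(self.root + "."):
            self.seen.append(fullname)
            return importlib.machinery.ModuleSpec(fullname, self)
        return None  # defer to the remaining finders for everything else

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        name = module.__name__
        raise ModuleNotFoundError(f"No module named {name!r}", name=name)


@contextlib.contextmanager
def blocked(root: str):
    """Install the finder at the front of sys.meta_path for the block's duration."""
    finder = BlockingFinder(root)
    sys.meta_path.insert(0, finder)
    try:
        yield finder
    finally:
        sys.meta_path.remove(finder)
```

Note that a module already cached in sys.modules never reaches the finder, which is one reason to run such a check in a fresh interpreter.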

Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>

Labels

review-this Trigger the OpenHands PR review workflow


Development

Successfully merging this pull request may close these issues.

Add dataset: jupyter-agent/jupyter-agent-dataset

2 participants