Make OpenHands browser tools optional for non-web datasets#213
Conversation
Two changes to the OpenHands agent pipeline let non-web dataset converters run on machines that do not have browsergym installed: 1. agents/openhands/system_prompt/tools/__init__.py wraps the 'from .browser import BrowserTool' import in try/except ModuleNotFoundError. The except branch only swallows the error when the missing module is browsergym (or a submodule); any unrelated ImportError still propagates. The BrowserTool name is bound to None when browsergym is unavailable. 2. agents/openhands/system_prompt/system.py defers the BrowserTool import to inside the 'if codeact_enable_browsing:' branch of get_tools and switches the remaining tool imports to their direct submodules so the module-level import no longer touches browser.py. 3. agents/openhands/std_to_sft.py lazy-loads scripts.html_to_axtree.HTMLToAXTree behind get_generate_axtree(); it is only constructed when a WebObservation event is actually seen. process_row also threads the existing --is_web CLI flag through to get_system_message(codeact_enable_browsing=is_web) so non-web datasets actually get a non-web system prompt. 4. tests/test_openhands_sft_role_preservation.py loosens its fake get_system_message to '*args, **kwargs' so the new keyword argument used by std_to_sft.py does not break the fake. 5. A new regression test tests/test_optional_browser.py installs a meta_path finder that raises ModuleNotFoundError for any browsergym* import, then asserts that agents.openhands.system_prompt.tools imports cleanly (with BrowserTool is None) and that get_system_message(codeact_enable_browsing=False) returns a prompt that does not advertise BrowserTool. This change was previously duplicated inside two unrelated dataset PRs (#193 CodeScout and #197 jupyter-agent). Lifting it into its own PR removes the duplication and lets those PRs revert to dataset-only diffs. This pull request was prepared by an AI agent (OpenHands) on behalf of the user. Co-authored-by: openhands <openhands@all-hands.dev>
The previous tip of this PR carried a copy of the 'make OpenHands browser tools optional' refactor in agents/openhands/system_prompt/tools/__init__.py, agents/openhands/system_prompt/system.py, agents/openhands/std_to_sft.py, and the tests/test_openhands_sft_role_preservation.py fake. The same diff was duplicated on #193 (CodeScout). That refactor has been extracted to PR #213 ('Make OpenHands browser tools optional for non-web datasets'). Reset those four files to their main-branch state so this PR contains only jupyter-agent dataset changes (datasets/jupyter-agent-dataset/* + README.md + agents/openhands/DATASETS.md catalog row). Once #213 lands and this branch is rebased onto main, the lazy-import semantics will reappear via that PR. Co-authored-by: openhands <openhands@all-hands.dev>
The previous tip of this PR carried a copy of the 'make OpenHands browser tools optional' refactor in agents/openhands/system_prompt/tools/__init__.py, agents/openhands/system_prompt/system.py, agents/openhands/std_to_sft.py, and the tests/test_openhands_sft_role_preservation.py fake. The same diff was duplicated on #197 (jupyter-agent). That refactor has been extracted to PR #213 ('Make OpenHands browser tools optional for non-web datasets'). Reset those four files to their main-branch state so this PR contains only CodeScout dataset changes (datasets/codescout/* + agents/openhands/DATASETS.md catalog row). Once #213 lands and this branch is rebased onto main, the lazy-import semantics will reappear via that PR. Co-authored-by: openhands <openhands@all-hands.dev>
| return self | ||
| return None | ||
|
|
||
| def load_module(self, name): # pragma: no cover - exercised via import machinery |
There was a problem hiding this comment.
🟡 Suggestion: load_module IS exercised by the import machinery when find_module returns self — that's exactly the path that raises ModuleNotFoundError and exercises the try/except in __init__.py. The # pragma: no cover annotation incorrectly excludes a covered (and critical) line from coverage. Remove it.
| def load_module(self, name): # pragma: no cover - exercised via import machinery | |
| def load_module(self, name): |
| generate_axtree = None | ||
|
|
||
|
|
||
| def get_generate_axtree(): |
There was a problem hiding this comment.
🟠 Important (PR description): The PR description's Validation section only shows pytest output. Per the project's evidence policy, unit tests alone do not count as proof that the change works. Please add an Evidence section showing an actual end-to-end invocation — e.g. running std_to_sft.py on a non-web dataset in an environment without browsergym installed, with the resulting output pasted. A link to the originating OpenHands conversation (https://app.all-hands.dev/conversations/{id}) would also satisfy this requirement.
The previous version of tests/test_optional_browser.py reloaded
agents.openhands.system_prompt.tools by monkeypatching sys.meta_path
in-process. That fails in CI because the workflow's requirements.txt
does not install litellm, and reloading the tools package triggers
litellm imports from each tool module's top level (bash.py, finish.py,
etc.).
Two changes:
1. Run the import under a subprocess so the meta_path finder is the
only entry on the fresh interpreter's import path. This avoids
cross-test contamination with any tools modules that may already be
cached in the parent's sys.modules.
2. Add a pytest.importorskip('litellm') guard. The optional-browser
path is only reachable when litellm is installed (the tool modules
import it unconditionally); in environments without litellm the
import chain is broken before the BrowserTool try/except is even
reached, so a regression test there would always fail for an
unrelated reason.
Co-authored-by: openhands <openhands@all-hands.dev>
|
Addressed both review comments: 1. 2. Evidence section per project policy — added a full Please re-review. This comment was posted by an AI agent (OpenHands) on behalf of the user. |
There was a problem hiding this comment.
🟡 Acceptable — The lazy-import refactor is clean and correct. The unresolved evidence thread on std_to_sft.py:28 still needs to be addressed before merge: the Validation section only shows pytest output, which does not satisfy the project's evidence policy. Please add an Evidence section with an actual end-to-end invocation (e.g. running std_to_sft.py --is_web=no through the pipeline on a non-web dataset) and its output.
One minor new note on the test blocker below.
This review was generated by an AI agent (OpenHands) on behalf of the user.
| def find_module(self, name, path=None): | ||
| if name.startswith("browsergym"): | ||
| return self | ||
| return None | ||
|
|
||
| def load_module(self, name): | ||
| raise ModuleNotFoundError(f"No module named {name!r}", name=name) |
There was a problem hiding this comment.
🟡 Suggestion: _BlockBrowserGym uses the deprecated PEP 302 find_module/load_module interface (deprecated since Python 3.4). It still works in Python 3.12, but the modern find_spec/exec_module protocol is forward-compatible with future Python versions:
def find_spec(self, fullname, path, target=None):
if fullname.startswith("browsergym"):
import importlib.machinery
return importlib.machinery.ModuleSpec(fullname, self)
return None
def create_module(self, spec):
return None
def exec_module(self, module):
raise ModuleNotFoundError(f"No module named {module.__name__!r}", name=module.__name__)Address inline review on #213: replace the legacy PEP 302 find_module/load_module pair with the modern PEP 451 find_spec/create_module/exec_module trio. The legacy interface is deprecated since Python 3.4 and may be removed in a future release; the new interface is what the import machinery has used internally since 3.4 and is forward-compatible. Also moves the sanity check that the finder fires into the test body and updates the module docstring to reference the new protocol. The test still passes locally with the same exit codes and assertion output; behavior is unchanged. Co-authored-by: openhands <openhands@all-hands.dev>
|
Addressed the new review: PEP 451 modernization ( Evidence (PR description) — note that the Re-running the driver locally with the updated PEP 451 finder produces the same output: Re-applying This comment was posted by an AI agent (OpenHands) on behalf of the user. |
|
Addressed both points from the latest review:
Re-running the modernized driver locally: This comment was posted by an AI agent (OpenHands) on behalf of the user. |
There was a problem hiding this comment.
🟢 Good taste — All three previously unresolved threads are resolved in the current code:
- PEP 451 protocol:
test_optional_browser.pynow usesfind_spec/create_module/exec_modulethroughout; the deprecatedfind_module/load_moduleinterface is gone. # pragma: no cover: No such annotation exists in the current file.- Evidence section: The PR description now includes a full end-to-end
Evidencesection with a control run (showing the pre-fix import failure) and a treatment run (showing the converter producing a valid SFT record), satisfying the project's evidence policy.
The lazy-import refactor is clean and correct. No new issues found.
This review was generated by an AI agent (OpenHands) on behalf of the user.
Brings in the shared OpenHands SFT converter changes (lazy-load browser tools, preserve function_call/observation roles) from PR #213, which this branch previously depended on. Co-authored-by: openhands <openhands@all-hands.dev>
* Add jupyter-agent dataset converter Co-authored-by: openhands <openhands@all-hands.dev> * Tighten browser import fallback for jupyter-agent Use a precise optional-browser import fallback and keep the role-preservation test compatible with the branch's browsing flag.\n\nCo-authored-by: openhands <openhands@all-hands.dev> * Merge main, regenerate sample_std.json with schema_version 1.1.0 Co-authored-by: openhands <openhands@all-hands.dev> * Drop shared lazy-import refactor; depend on #213 The previous tip of this PR carried a copy of the 'make OpenHands browser tools optional' refactor in agents/openhands/system_prompt/tools/__init__.py, agents/openhands/system_prompt/system.py, agents/openhands/std_to_sft.py, and the tests/test_openhands_sft_role_preservation.py fake. The same diff was duplicated on #193 (CodeScout). That refactor has been extracted to PR #213 ('Make OpenHands browser tools optional for non-web datasets'). Reset those four files to their main-branch state so this PR contains only jupyter-agent dataset changes (datasets/jupyter-agent-dataset/* + README.md + agents/openhands/DATASETS.md catalog row). Once #213 lands and this branch is rebased onto main, the lazy-import semantics will reappear via that PR. Co-authored-by: openhands <openhands@all-hands.dev> --------- Co-authored-by: openhands <openhands@all-hands.dev>
* Integrate CodeScout dataset Co-authored-by: openhands <openhands@all-hands.dev> * Tighten optional browser import handling Only suppress missing browsergym dependencies when browser tools are unavailable, and keep the role-preservation test compatible with the branch's browsing flag.\n\nCo-authored-by: openhands <openhands@all-hands.dev> * Merge main, regenerate sample_std.json with schema_version 1.1.0 Co-authored-by: openhands <openhands@all-hands.dev> * Drop shared lazy-import refactor; depend on #213 The previous tip of this PR carried a copy of the 'make OpenHands browser tools optional' refactor in agents/openhands/system_prompt/tools/__init__.py, agents/openhands/system_prompt/system.py, agents/openhands/std_to_sft.py, and the tests/test_openhands_sft_role_preservation.py fake. The same diff was duplicated on #197 (jupyter-agent). That refactor has been extracted to PR #213 ('Make OpenHands browser tools optional for non-web datasets'). Reset those four files to their main-branch state so this PR contains only CodeScout dataset changes (datasets/codescout/* + agents/openhands/DATASETS.md catalog row). Once #213 lands and this branch is rebased onto main, the lazy-import semantics will reappear via that PR. Co-authored-by: openhands <openhands@all-hands.dev> * codescout: emit reward on final action, drop tools/reward_dict from details Address review feedback on PR #193: 1. The per-trajectory reward signal is now attached to the agent's final action (the localization_finish MessageAction) as content[-1].reward using the upstream multilevel_localization_f1_reward. CodeScout localization is single-step from a reward perspective: the agent inspects the repo, then emits one localization_finish action; that action's reward is the F1 score across file/module/entity granularity. The component sub-rewards (file/module/entity/multiturn) are not propagated since they aren't the canonical reward signal the standardized schema needs to carry. Care is taken to attach the reward to the last Action (walking content from the end), not to the last content item, because some raw rollouts include a trailing tool-response observation that echoes the locations. The reward semantically belongs on the agent's action that produced the locations, not on the environment's echo. 2. details no longer contains 'tools' (the per-instance OpenAI tool schema is dataset-wide infrastructure, not a per-trajectory detail) or 'reward_dict' (subsumed by the per-step reward field above). The trajectory's details dict now contains only fields that vary by trajectory: source_dataset, source_config, source_split, row_id, and (when present) instance_id, step, rollout_number. 3. Regenerated sample_std.json and sample_sft.json. Updated README to document the reward mapping. Co-authored-by: openhands <openhands@all-hands.dev> * codescout: drop unnecessary HF datasets import workaround The repo root has no datasets/__init__.py, so 'from datasets import load_dataset' resolves to the installed HuggingFace datasets package even with the repo root on PYTHONPATH (namespace-package fallthrough). Every other dataset extractor uses the plain form; codescout was diverging unnecessarily. Co-authored-by: openhands <openhands@all-hands.dev> * codescout: fix tool-call/observation pairing when id is None The storage side used `tool_call_id or f"tool_{index}"` as the fallback key while the lookup used `tool_call.id or ""`. When an upstream message lacked a tool_call_id, storage created a unique positional key while the lookup queried "", silently missing the match. The observation then fell through to the unmatched-fallback loop and appeared after all paired action/observation pairs instead of immediately after its producing action. Aligning both sides on "" lets a single None-id pair still pair in place (the only case observed). The committed CodeScout samples always carry tool_call ids, so sample_std.json regenerates byte- identically; this only affects future or pathological data. Verified with a constructed None-id record: observation now appears immediately after its code_action, not at the end of the trajectory. Co-authored-by: openhands <openhands@all-hands.dev> * codescout: drop trailing tool-response echo after localization_finish Real OpenHands sessions terminate immediately on finish, but the raw rollouts include a tool-response message after localization_finish that just repeats the submitted locations. Dropping that echo aligns the standardized trajectory with real agent behavior and prevents SFT generation from emitting a phantom observation turn after the finish action. The locations are already captured in the finish MessageAction content, so no information is lost. Updated README to document the design decision. Regenerated sample_std.json and sample_sft.json; sample_std.json shrinks by one TextObservation per trajectory that had a finish-echo pair (2 of 3 sample trajectories). Co-authored-by: openhands <openhands@all-hands.dev> * codescout: document orphaned-tool-message defensive branch The 'tool' message branch in the top-level loop only fires for malformed inputs where a tool message is not immediately preceded by an assistant message with tool_calls. Well-formed CodeScout rollouts pre-consume tool messages via tool_observations() in the assistant branch and skip past them with index = next_index. Added a comment explaining what cases this branch is meant to cover so future maintainers don't think it's dead code. Co-authored-by: openhands <openhands@all-hands.dev> * codescout: warn on tool_call_id key collision If multiple tool messages share the same tool_call_id (or all share the None fallback key ""), the OrderedDict storage loses all but the last and the lookup then incorrectly pairs the surviving observation with every same-keyed tool_call. Not observed in either CodeScout source, but the warning ensures the latent bug surfaces loudly if it ever appears in the full 58.9K extraction. Co-authored-by: openhands <openhands@all-hands.dev> * codescout: allow null "arguments" in tool_call.function Some raw rows have `"arguments": null` instead of an empty string, which previously raised a Pydantic ValidationError and silently dropped those trajectories from the full extraction. The downstream `parse_arguments()` already treats None as an empty arguments object, so making the field Optional[str] = None is the minimal fix. Co-authored-by: openhands <openhands@all-hands.dev> --------- Co-authored-by: openhands <openhands@all-hands.dev>
Summary
Extract the lazy-import refactor that was previously duplicated inside PRs #193 (CodeScout) and #197 (jupyter-agent) into its own change so those PRs can revert to dataset-only diffs.
Motivation
Non-web datasets (CodeScout, jupyter-agent, etc.) currently cannot run
agents/openhands/std_to_sft.pyon environments that do not havebrowsergyminstalled, because:agents/openhands/system_prompt/tools/__init__.pydoes an unconditionalfrom .browser import BrowserToolandbrowser.pyimportsbrowsergymat module load.agents/openhands/std_to_sft.pyconstructsHTMLToAXTree(dataset)at module load even when the dataset has noWebObservationevents.The fix is to defer browser-related imports until they are actually needed.
Changes
agents/openhands/system_prompt/tools/__init__.py— Wrap theBrowserToolre-export intry/except ModuleNotFoundError. The handler only swallows the error when the missing module isbrowsergym(or a submodule); any other ImportError still propagates.BrowserToolis bound toNonewhenbrowsergymis unavailable.agents/openhands/system_prompt/system.py— Switch the top-level tool imports from the package__init__to their direct submodules so module load no longer touchesbrowser.py. Deferfrom agents.openhands.system_prompt.tools.browser import BrowserToolto inside theif codeact_enable_browsing:branch ofget_tools.agents/openhands/std_to_sft.py— Lazy-loadscripts.html_to_axtree.HTMLToAXTreebehind aget_generate_axtree()helper; it is only constructed when aWebObservationevent is actually encountered. Also thread the existing--is_webCLI flag intoget_system_message(codeact_enable_browsing=is_web)so non-web datasets actually get a non-web system prompt (today the defaultTrueis always used).tests/test_openhands_sft_role_preservation.py— Loosen the fakeget_system_messageto*args, **kwargsto accept the new keyword argument.tests/test_optional_browser.py(new) — Regression test (skipped whenlitellmis absent) that installs asys.meta_pathfinder which raisesModuleNotFoundErrorfor anybrowsergym*import, then asserts (a)agents.openhands.system_prompt.toolsimports cleanly withBrowserTool is Noneand (b)get_system_message(codeact_enable_browsing=False)returns a prompt that does not advertiseBrowserTool.Validation
python -m pytest tests/→ 183 passed, 12 skipped, 4 warnings.Evidence — end-to-end conversion of a non-web dataset without browsergym
Driver script (full source below): installs a
sys.meta_pathfinder that raisesModuleNotFoundErrorfor anybrowsergym*import, setsMY_DATASET=codeactinstruct(a non-web dataset already in the repo), imports the productionagents.openhands.std_to_sftmodule, and callsmain_with_args(line, is_web=False, api_env=None)on one record fromdatasets/codeactinstruct/sample_std.json.Control run — on
main(without this PR)→ The pipeline fails to import; converter is unusable.
Treatment run — on this branch (
refactor-optional-browser)→ The pipeline runs to completion, returns a well-formed SFT record (verified via
json.loads+ structural assertions onconversations/system), and the system prompt does not advertiseBrowserTool(asserted in the driver).Driver script
Follow-up
Once this is merged, PRs #193 and #197 will be rebased onto
mainto drop their copies of these four files; their diffs should then contain only their respectivedatasets/directory plus theREADME.md/agents/openhands/DATASETS.mdcatalog entries (already done — see #197 and #193).This PR was prepared by an AI agent (OpenHands) on behalf of the user. Originating conversation context is available to the requester.