Portable S3Mem evidence-harness plugin for long-horizon agent trajectories.
This repository packages the reusable core of S3Mem as a trajectory-to-evidence harness rather than a benchmark-specific answerer. The goal is to make S3Mem usable from external agent systems such as OpenClaw or other custom loops that already have their own planner, policy, or LLM answerer.
S3Mem Harness converts a trajectory into structured scene-event memories and returns a compact evidence bundle for a downstream system.
Core stages:

1. Structured write
   - Convert each trajectory step into a structured episodic memory unit.
   - Parse visible objects, relations, events, actions, location, and inventory.
2. Anchor-sensitive retrieval
   - Retrieve candidate memories with lexical and dense-style signals.
   - Promote query-aligned anchor steps when structured question metadata is available.
3. Budget-aware evidence packing
   - Preserve decisive anchor steps and their local neighborhoods.
   - Return a compact evidence bundle under a fixed token budget.
4. Downstream consumption
   - The harness returns evidence; your planner, answerer, or agent decides how to use it.
This design keeps the plugin portable across agent systems.
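The budget-aware packing stage can be illustrated with a minimal sketch (this is not the harness's actual implementation; the function name and the whitespace token estimate are illustrative): anchor steps are packed first, then their immediate neighbors, until the token budget is exhausted.

```python
# Illustrative sketch of budget-aware evidence packing (not the actual
# S3Mem Harness implementation): anchor steps are packed first, then
# their +/-1 neighbors, until a fixed token budget is exhausted.

def pack_evidence(steps, anchor_ids, token_budget):
    """steps: list of (step_id, text) pairs; anchor_ids: decisive step ids."""
    by_id = dict(steps)

    # Anchors first, then immediate neighbors, preserving priority order.
    candidates = list(anchor_ids)
    for a in anchor_ids:
        for n in (a - 1, a + 1):
            if n in by_id and n not in candidates:
                candidates.append(n)

    selected, used = [], 0
    for sid in candidates:
        cost = len(by_id[sid].split())  # crude whitespace token estimate
        if used + cost > token_budget:
            continue
        selected.append(sid)
        used += cost
    return sorted(selected)
```

With a tight budget only the anchor survives; with more headroom the local neighborhood is kept as well.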
This repository is not the full research evaluation codebase. It intentionally excludes benchmark-specific heuristic answerers and training scripts that were tied to Crafter, Jericho, ATM-Bench, and other paper experiments.
Instead, this plugin exposes the reusable memory interface:
- trajectory ingestion
- structured memory serialization
- retrieval
- reranking
- compact evidence packing
Repository layout:

```
s3mem-harness/
├── .codex-plugin/plugin.json
├── INTEGRATION.md
├── examples/
│   ├── openclaw_trajectory.json
│   ├── openclaw_real_trace_excerpt.json
│   ├── openclaw_real_question.json
│   ├── openclaw_integration_demo.py
│   └── question.json
├── src/s3mem_harness/
│   ├── __init__.py
│   ├── adapters.py
│   ├── cli.py
│   ├── harness.py
│   ├── retrieval.py
│   └── types.py
└── tests/
    └── test_harness.py
```
Generic adapter: use this when your system can already emit normalized step dictionaries.
Expected step shape:
```json
{
  "step_id": 3,
  "observation": {
    "text": "Moved to the hallway.",
    "location": "hallway",
    "inventory": ["brass_key"],
    "visible_objects": [{"category": "door"}],
    "relations": [{"src": "agent", "dst": "door", "relation": "near"}],
    "action": "MOVE",
    "event": {"event_type": "move", "arguments": {"target": "hallway"}}
  }
}
```

OpenClaw adapter: use this when your logs look more like a typical agent runtime trace:
```json
{
  "step": 7,
  "observation": "You are in the lab.",
  "action": "LOOK",
  "info": {
    "location": "lab",
    "inventory": ["badge"],
    "objects": ["desk"]
  }
}
```

The adapter normalizes common OpenClaw-like fields:
- `step` / `step_id`
- `observation` / `obs` / `message`
- `action`
- `state` / `info`
- `inventory`
- `objects`
- `relations`
- `location`
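For illustration, the field mapping above could be sketched along these lines (a hypothetical helper, not the packaged `OpenClawTrajectoryAdapter`):

```python
# Hypothetical sketch of the field normalization described above; the
# packaged OpenClaw adapter is the authoritative implementation.

def normalize_step(raw):
    # observation may arrive under several names, as text or as a dict
    obs = raw.get("observation") or raw.get("obs") or raw.get("message") or ""
    state = raw.get("state") or raw.get("info") or {}
    return {
        "step_id": raw.get("step_id", raw.get("step")),
        "observation": {
            "text": obs if isinstance(obs, str) else obs.get("text", ""),
            "location": state.get("location"),
            "inventory": state.get("inventory", []),
            "visible_objects": [{"category": o} for o in state.get("objects", [])],
            "relations": state.get("relations", []),
            "action": raw.get("action"),
        },
    }
```

Applied to the runtime-trace example above, this yields the normalized step shape shown earlier.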
Install:

```bash
cd s3mem-harness
python -m pip install -e .[dev]
```

Quickstart:

```python
from s3mem_harness import S3MemHarness

harness = S3MemHarness()
harness.ingest_trajectory(
    steps=my_trajectory,
    episode_id="episode_001",
    adapter="openclaw",  # or "generic"
)

result = harness.query(
    {
        "question": "What happened one step after obtaining the brass key?",
        "metadata": {
            "answer_type": "action_after_gain_item",
            "item": "brass_key",
            "occurrence": "first",
            "delta": 1
        }
    },
    mode="s3mem",
    top_k=24,
    token_budget=768,
)

print(result.bundle.compressed_text)
print(result.bundle.selected_steps)
```

Persist and reload memories:

```python
from s3mem_harness import S3MemHarness

harness = S3MemHarness()
harness.ingest_trajectory(steps, episode_id="episode_001", adapter="generic")
harness.save_jsonl("memory.jsonl")

other = S3MemHarness()
other.load_jsonl("memory.jsonl")
result = other.query("Where did the agent go after taking the key?")
```

One-shot CLI:

```bash
s3mem-harness one-shot \
  --trajectory examples/openclaw_trajectory.json \
  --question examples/question.json \
  --adapter openclaw \
  --mode s3mem
```

The repository also includes a real OpenClaw-compatible excerpt derived from an actual ALFWorld handcoded-expert rollout.
Files:
- examples/openclaw_real_trace_excerpt.json
- examples/openclaw_real_question.json
Command:
```bash
s3mem-harness one-shot \
  --trajectory examples/openclaw_real_trace_excerpt.json \
  --question examples/openclaw_real_question.json \
  --adapter openclaw \
  --mode s3mem
```

The real excerpt is derived from:

- benchmark: ALFWorld
- rollout source: real_text_expert_rollout
- policy: handcoded_expert
- task family: pick_two_obj_and_place
This keeps the public sample realistic without depending on the full original evaluation repository.
To see a complete trajectory-ingest -> evidence-bundle -> downstream-prompt flow:
```bash
python examples/openclaw_integration_demo.py
```

This script demonstrates:

- loading a real OpenClaw-compatible trace excerpt
- ingesting it with OpenClawTrajectoryAdapter
- querying through S3MemHarness
- constructing a downstream prompt payload for an external LLM / planner
```bash
s3mem-harness index \
  --trajectory examples/openclaw_trajectory.json \
  --adapter openclaw \
  --output build/memory.jsonl

s3mem-harness query \
  --memory-jsonl build/memory.jsonl \
  --question examples/question.json \
  --mode s3mem
```

The harness supports three modes:

- s3mem: structured retrieval + reranking + budget-aware evidence packing
- graph_no_reader: structured memory text without the full S3Mem evidence harness behavior
- vanilla_rag: plain summary-based retrieval
These modes are useful for integration tests and apples-to-apples harness comparisons.
The harness returns a QueryResult with:
- bundle.compressed_text
- bundle.selected_steps
- bundle.retrieved_steps
- bundle.support_objects
- bundle.support_relations
- bundle.evidence_chain
This makes it easy to wire S3Mem into downstream LLM prompts, planners, tools, or custom answerers.
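One way to do that wiring is sketched below; `bundle` is a plain dict standing in for `result.bundle`, and `build_prompt` is a hypothetical helper, not part of the harness API. The field names mirror the QueryResult fields listed above.

```python
# Sketch of wiring an evidence bundle into a downstream LLM prompt.
# `bundle` is a plain dict standing in for result.bundle; `build_prompt`
# is an illustrative helper, not part of the s3mem_harness API.

def build_prompt(question, bundle):
    lines = [
        "Answer using only the evidence below.",
        "",
        "Evidence:",
        bundle["compressed_text"],
        "",
        "Selected steps: " + ", ".join(str(s) for s in bundle["selected_steps"]),
        "",
        "Question: " + question,
    ]
    return "\n".join(lines)
```

The resulting string can be handed to whatever answerer your system already uses; the harness deliberately stops at the evidence boundary.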
For a fuller integration walkthrough, see INTEGRATION.md.
S3Mem in the paper contains both:
- a reusable memory core
- benchmark-specific answer-time logic
For external systems, the reusable part is the memory core. That is what this plugin extracts and packages. The plugin is therefore a harness:
- it does not force a particular planner
- it does not require a specific benchmark
- it does not require the original paper’s heuristic answer layer
This is the correct compatibility boundary for systems such as OpenClaw.
Local tests included in this repository:
- anchor-aligned retrieval on a toy trajectory
- OpenClaw adapter normalization
- CLI one-shot smoke test
Run:
```bash
pytest -q
```

The real excerpt can also be used as a public integration sample for OpenClaw-style logs.
Best fit:
- OpenClaw-like action/observation logs
- custom game agents
- embodied or text agents with step-structured traces
- evaluation harnesses that want compact evidence instead of raw long context
Less ideal fit:
- archive-style document QA with no trajectory semantics
- applications that only need generic long-context summarization
MIT