
feat(evaluators): add built-in budget evaluator for per-agent cost tracking#144

Open
amabito wants to merge 3 commits into agentcontrol:main from amabito:feat/budget-evaluator

Conversation

@amabito

@amabito amabito commented Mar 21, 2026

Summary

Scope

  • User-facing/API changes:

    • "budget" evaluator registered alongside regex/list/json/sql
    • BudgetStore protocol + InMemoryBudgetStore (dict + threading.Lock)
    • BudgetEvaluatorConfig: limits list, optional pricing table, path configs
  • Internal changes:

    • evaluators/builtin/src/agent_control_evaluators/budget/ -- 4 files, ~650 LOC
    • evaluators/builtin/tests/budget/test_budget.py -- 63 tests
  • Out of scope:

    • No Postgres store, no DB tables, no new dependencies

Risk and Rollout

  • Risk level: low -- new evaluator, no changes to existing code. 230 existing tests untouched.
  • Rollback plan: revert PR

Testing

  • Added or updated automated tests (63 tests incl. thread safety, NaN/Inf, scope injection, double-count)
  • Ran pytest tests/ -- 293 passed
  • Manually verified behavior

Checklist

Contributor

@lan17 lan17 left a comment


Thank you for working on this! It looks overall in the right direction to me, but I think we can improve the design a bit to make it more universal.

Could you also implement this as a contrib evaluator so it doesn't become a built-in evaluator right away? We will then put it to use, and once it's in a production-ready stage we can move it to builtin.

scope: dict[str, str] = Field(default_factory=dict)
per: str | None = None
window: Literal["daily", "weekly", "monthly"] | None = None
limit_usd: float | None = None
Contributor


It seems like this is the wrong abstraction for budgeting. What if the budget is in a different currency? Why do we need separate fields for tokens vs USD?

It might be better to use an integer for the limit and then define a "currency" enum which could be USD, tokens, Euros, etc.

I don't think there's a use case for floating point for USD or Euros, no?

Author


Good point. I'll switch to an integer limit + a Currency enum (USD, EUR, tokens, etc.). Float was unnecessary — cents-level precision is handled by using integer cents anyway.
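A minimal sketch of the integer-limit plus Currency enum shape being discussed (names and the helper function are illustrative, not the final API):

```python
from enum import Enum


class Currency(str, Enum):
    """Unit a budget limit is denominated in."""
    USD = "usd"
    EUR = "eur"
    TOKENS = "tokens"


def format_limit(limit: int, currency: Currency) -> str:
    """Render an integer limit; monetary limits are minor units (cents)."""
    if currency is Currency.TOKENS:
        return f"{limit} tokens"
    # Integer cents avoid float rounding: 1000 -> "10.00 USD"
    return f"{limit / 100:.2f} {currency.value.upper()}"
```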


scope: dict[str, str] = Field(default_factory=dict)
per: str | None = None
window: Literal["daily", "weekly", "monthly"] | None = None
Contributor


What if we want hourly, or half an hour?

Maybe it's better to define "window" as an integer in seconds, or minutes? That way you can express whatever window you want.

Author


Agreed. Will change to window_seconds: int. I'll keep a few named constants (DAILY = 86400 etc.) as convenience but the field itself will be raw seconds.
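Roughly what the constants and window arithmetic could look like (a sketch; the `period_index` helper is mine, not the PR's):

```python
# Convenience constants; the config field itself stays raw seconds.
HOURLY = 3_600
DAILY = 86_400
WEEKLY = 7 * DAILY


def period_index(now_epoch: float, window_seconds: int) -> int:
    """Index of the fixed window containing `now_epoch` (seconds since epoch)."""
    return int(now_epoch) // window_seconds
```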

on the first breach.

Attributes:
scope: Static scope dimensions that must match for this rule
Contributor


Worth it to give some examples here for scope?

Author


Sure, will add. Something like {"agent": "summarizer", "channel": "slack"}.

limits: list[BudgetLimitRule] = Field(min_length=1)
pricing: dict[str, dict[str, float]] | None = None
token_path: str | None = None
cost_path: str | None = None
Contributor


Shouldn't this evaluator be computing cost in USD based on model and token count?

It doesn't seem like an LLM step should be passing it down here. I may be wrong on this.

Author


My thinking was: the caller already has cost from the LLM response, so passing it avoids maintaining a pricing table. But I see the argument — if the evaluator owns cost computation, the contract is simpler and the caller can't lie about cost. One question: should the evaluator maintain its own pricing table, or pull from an external source (e.g. LiteLLM's model cost map)?

Contributor


I believe you already include a pricing table in the evaluator config, so for version 1 of this evaluator we can just rely on that?

Author


Makes sense. I'll have the evaluator compute cost from the config pricing table + model + token counts. That lets me drop cost_path entirely — just need token_path and a new model_path to extract the model name from the input. If the model isn't in the pricing table, fail closed.
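A sketch of the fail-closed cost computation being described, assuming per-1k-token rates keyed `input_per_1k`/`output_per_1k` in the config pricing table (the function name and the model name in the usage are illustrative):

```python
def estimate_cost_usd(
    model: str,
    input_tokens: int,
    output_tokens: int,
    pricing: dict[str, dict[str, float]],
) -> float:
    """Compute USD cost from the config pricing table; fail closed on unknown models."""
    rates = pricing.get(model)
    if rates is None:
        # Unknown model: refuse rather than silently under-counting spend.
        raise KeyError(f"no pricing entry for model {model!r}")
    return (
        input_tokens * rates["input_per_1k"]
        + output_tokens * rates["output_per_1k"]
    ) / 1000.0
```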

return max(ratios) if ratios else 0.0


class InMemoryBudgetStore:
Contributor


nit: This should go into a separate file so that the interface for the store and its implementation are separate.

Author


Will split into store.py (protocol) and memory_store.py (InMemoryBudgetStore).

"""Atomically record usage and return current budget state.

Args:
scope_key: Composite scope identifier (e.g. "channel=slack|user_id=u1").
Contributor


This doc should go into the interface docs instead of here.

Author


Yep, will move to the protocol definition.

def record_and_check(
self,
scope_key: str,
period_key: str,
Contributor


Why does this need to be passed in? Shouldn't the implementation figure this out on its own based on the current time?

Author


Fair. The store should derive the period from window_seconds + current time internally. I was passing it in for testability but I can inject a clock instead.

input_tokens: int,
output_tokens: int,
cost_usd: float,
limit_usd: float | None = None,
Contributor


Why do we need to pass in limit here? Can't we instantiate the store with the BudgetEvaluatorConfig or something similar so it knows what limits are already?

If we want to share the store between many different kinds of limits and keys, can't we just pass in BudgetLimitRule here instead?

Author


Makes sense. I'll have the store accept BudgetLimitRule at registration time so it knows its own limits. record_and_check just takes usage data.

# ---------------------------------------------------------------------------


class TestInMemoryBudgetStore:
Contributor


Tests should follow Given/When/Then behavioral comment style.

Author


Will restructure. Quick example to confirm this is what you mean:

def test_single_record_under_limit(self):
    # Given: store with a $10 daily limit
    # When: record $3 of usage
    # Then: not breached, ratio ~0.3

Attributes:
scope: Static scope dimensions that must match for this rule
to apply. Empty dict = global rule.
per: If set, the limit is applied independently for each unique
Contributor


I don't quite understand this. Can't we just handle separate budgets by having multiple rules with different scope dicts?

Author


For static scopes, agreed — multiple rules work. My concern is the dynamic case: "each user gets $10/day" where users aren't known at config time. With per, one rule covers all future users. Without it, you'd need to generate rules on the fly. Would a group_by field work? e.g. group_by: "user_id" means "apply this limit independently per distinct user_id value." Open to other approaches if you have something in mind.

@amabito
Author

amabito commented Mar 24, 2026

@lan17 Thanks for the review. Moving to contrib — no objection, will restructure. Responded to each thread inline. Aiming for R2 within a week.

@lan17
Contributor

lan17 commented Mar 30, 2026

> @lan17 Thanks for the review. Moving to contrib — no objection, will restructure. Responded to each thread inline. Aiming for R2 within a week.

Any updates @amabito? Would it be ok with you if we take over this PR to get it over the finish line? This feature is going to be super useful to the whole project!

@amabito
Author

amabito commented Mar 31, 2026

@lan17 Hey — update is mostly ready, was waiting on your reply on the cost computation thread before pushing. Got that answer now so I'll push the revision shortly.

…cking

New contrib evaluator "budget" that tracks cumulative token/cost usage
per agent, channel, user. Configurable time windows via window_seconds.

Design per reviewer feedback:
- Contrib evaluator (not builtin) for production hardening
- Integer limit + Currency enum (USD/EUR/tokens)
- window_seconds (int) instead of named windows
- group_by for dynamic per-user/per-channel budgets
- Evaluator owns cost computation from pricing table
- BudgetStore protocol + InMemoryBudgetStore (dict + Lock)
- Store derives period keys internally, injectable clock

Addresses agentcontrol#130.

55 tests (incl. thread safety, NaN/Inf, scope injection, double-count).
@amabito amabito force-pushed the feat/budget-evaluator branch from fa414d7 to a1345b9 Compare March 31, 2026 00:29
@codecov

codecov bot commented Apr 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Contributor

@lan17 lan17 left a comment


i think there are still a couple architectural issues to sort out before this is ready. the main ones for me are: budget state currently lives on a cached evaluator instance, per-call rounding materially overstates spend for small requests, and the currency model doesn't line up with what the evaluator actually stores. i also don't think the BudgetStore abstraction is quite a clean swap point yet.


def __init__(self, config: BudgetEvaluatorConfig) -> None:
super().__init__(config)
self._store = InMemoryBudgetStore(rules=config.limits)
Contributor


i think this runs into a fundamental mismatch with the evaluator framework. evaluator instances are cached globally by (name, config) and reused across requests, so putting InMemoryBudgetStore on self means different controls with the same config will share budget state. that leaks usage across controls/selectors in a way that's going to be very hard to reason about. i don't think the fix is to bake control_id into config, but i do think this needs some unique engine-provided evaluation key / control-instance identity so the store can namespace state outside the cached evaluator instance.

Author


Also found that the LRU cache in _factory.py can evict the evaluator instance, which silently destroys the store and resets all counters. So this is worse than I thought.

Moving the store out of the evaluator into a module-level registry keyed by {evaluator_name}:{config_hash} with a threading.Lock.
Evaluator stays stateless, store survives eviction. Identical config = shared pool, which I think is the right default.

Re: engine-provided identity -- agree that's the right long-term direction. circuit_breaker has the same self._store issue. I think an EvaluatorContext param on evaluate() would cover both, but that touches all evaluator signatures so probably better scoped as its own change.
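A sketch of the module-level registry shape being described (get-or-create under a lock; the function name and `factory` parameter are illustrative):

```python
import threading

# Module-level registry: stores outlive cached evaluator instances, so LRU
# eviction of the evaluator no longer silently resets budget counters.
_STORES: dict[str, object] = {}
_STORES_LOCK = threading.Lock()


def get_or_create_store(evaluator_name: str, config_hash: str, factory):
    """Return the shared store for this (name, config_hash) pair, creating it once."""
    key = f"{evaluator_name}:{config_hash}"
    with _STORES_LOCK:
        if key not in _STORES:
            _STORES[key] = factory()
        return _STORES[key]
```

Identical config hashes map to the same key, which is what gives the "identical config = shared pool" default mentioned above.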

cost = (input_tokens * input_rate + output_tokens * output_rate) / 1000.0
if not math.isfinite(cost) or cost < 0:
return 0
return math.ceil(cost)
Contributor


using ceil here means every non-zero call costs at least 1 minor unit. for example, 1 token at 30 cents / 1k becomes 1 whole cent instead of 0.03, so a stream of small calls will burn through budget way too fast. i think we need higher-precision accumulation here and should only round for display / external reporting, not on every call.

Author


Yeah the math is bad. 1 token at $0.03/1k = 0.003 cents; ceil makes it 1 cent, a 333x overcount.

Switching _estimate_cost() to return float, _Bucket accumulates float, ceil only at snapshot time. Protocol cost param goes from int to float too.
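The overcount is easy to demonstrate; a sketch contrasting per-call ceil with accumulate-then-ceil (function names are mine, for illustration):

```python
import math


def spent_cents_per_call_ceil(costs_cents: list[float]) -> int:
    """Old behavior: every non-zero call costs at least 1 minor unit."""
    return sum(math.ceil(c) for c in costs_cents)


def spent_cents_accumulated(costs_cents: list[float]) -> int:
    """Fixed behavior: accumulate floats, round only at snapshot time."""
    return math.ceil(sum(costs_cents))
```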

if val is not None:
model = str(val)

cost = _estimate_cost(model, input_tokens, output_tokens, self.config.pricing)
Contributor


i don't think the currency story is coherent yet. we compute a single cost value from the pricing table and then feed that same integer into every matching rule, regardless of whether the rule is USD, EUR, or TOKENS. so EUR rules are effectively checked against USD-denominated spend, and TOKENS doesn't really map to this cost path at all. i'd either constrain this to USD for now or add explicit per-unit accounting / conversion before shipping it.

Author


Constraining to USD for v1. currency=EUR will raise ValueError.

Token limits are already orthogonal -- limit_tokens works independently in _compute_utilization() and the exceeded check.
TOKENS rules just use limit_tokens, cost path doesn't apply.



@runtime_checkable
class BudgetStore(Protocol):
Contributor


i like adding a BudgetStore protocol, but right now it doesn't feel like a clean abstraction boundary yet. important behavior assumptions like constructor-time rules, clock source, and internal period derivation live outside the protocol, and the evaluator still directly instantiates InMemoryBudgetStore. so in practice a redis/postgres backend still requires changing evaluator construction, not just swapping implementations. i'd either make store injection/provider selection part of the design now or keep this concrete until that boundary is clearer.

Author


Keeping concrete for now. Protocol stays as record_and_check only, InMemoryBudgetStore stays as the only implementation. Store injection makes more sense to design alongside the EvaluatorContext work where the store lifecycle gets clearer.

@amabito
Author

amabito commented Apr 2, 2026

@lan17 Agreed on all four. Working on R3 now, responded to each inline thread with specifics.

…cking

R3 addressing lan17 review:
- Evaluator is stateless; budget state lives in module-level registry
  (fixes LRU eviction silent reset + cross-control state leakage)
- Cost accumulated as float, no per-call ceil (fixes 333x overcount)
- Currency simplified to USD only, token limits orthogonal
- Store kept concrete per reviewer suggestion
- 67 tests incl. thread safety, cache eviction, rounding boundary
@amabito amabito requested a review from lan17 April 8, 2026 02:19
Contributor

@lan17 lan17 left a comment


Thanks for sticking with this @amabito — the iterative improvements from R1 through R3 are really solid and this is clearly on the right track. The module-level store registry and float accumulation fixes were exactly the right calls.

I've left a few more comments below, mostly around the API surface (limit model, pricing config, store interface). The biggest ones are probably the limit/limit_unit redesign and making BudgetStore an async ABC. Let me know if any of these are unclear — happy to discuss further.

scope: dict[str, str] = Field(default_factory=dict)
group_by: str | None = None
window_seconds: int | None = None
limit: int | None = None
Contributor


Redesign limit model: limit + limit_unit instead of separate fields

Consider consolidating limit and limit_tokens into a single limit: int + limit_unit design:

limit: int
limit_unit: Literal["usd_cents", "tokens"] = "usd_cents"

Benefits:

  • One limit per rule, one unit — unambiguous semantics
  • Extensible to other currencies later (e.g. "eur_cents") without new fields
  • If you need both a cost limit and a token limit on the same scope, write two rules — more explicit than two thresholds on one rule
  • Simplifies _compute_utilization and the exceeded check — each rule has exactly one threshold
  • Makes the unit self-documenting in config, which was a concern from R1

If you need both cost and token limits on the same scope, you write two rules. That's more explicit than one rule with two thresholds and makes the "which limit was breached" logic trivial.

Author


looked at memory_store.py:60-72 and 159-166 -- limit and limit_tokens are evaluated independently within a single rule, both feeding utilization = max(cost_ratio, token_ratio) on one shared bucket. splitting to two rules is not a semantic drop-in: it changes the snapshot shape and breached_rules metadata structure.

one alternative that keeps the readability goal without losing the OR-breach semantic: rename limit -> limit_cents (explicit unit in the name) and keep limit_tokens as-is. ambiguity gone, no structural change.

happy to go either way, but wanted to flag this before ripping out the dual-limit path. which direction do you want?

Contributor


I think single limit is fine. To implement dual limit we could just easily have 2 budget evaluators with different configs, right?

Contributor


fair point about the OR-breach semantic. i still think single-limit-per-rule is the cleaner design long term — two rules with the same scope share a bucket anyway, so the only real difference is metadata shape. but i hear you that it's a structural change with downstream effects on snapshot consumers.

let's go with limit + limit_unit: Literal["usd_cents", "tokens"] = "usd_cents" for the cleaner extensibility story. if someone needs both cost and token enforcement on the same scope they write two rules — the metadata shows two breached_rules entries instead of one, which is arguably more debuggable anyway.

"""

limits: list[BudgetLimitRule] = Field(min_length=1)
pricing: dict[str, dict[str, float]] | None = None
Contributor


Use a Pydantic model for pricing rates

The raw nested dict is stringly-typed. A Pydantic model gives you config-time validation and IDE support:

class ModelPricing(EvaluatorConfig):
    input_per_1k: float = 0.0
    output_per_1k: float = 0.0

class BudgetEvaluatorConfig(EvaluatorConfig):
    pricing: dict[str, ModelPricing] | None = None

Also: if any rule uses limit_unit="usd_cents", pricing should be required — validate this at config time rather than silently treating cost as 0 at runtime.

Author


agreed. current rates.get("input_per_1k", 0.0) silently zeroes on typos -- config-time validation catches that.

the "required when usd_cents rules exist" validator depends on the limit model redesign landing first (it needs something to check against). so I'll sequence this after the limit change.

return v


class BudgetEvaluatorConfig(EvaluatorConfig):
Contributor


Add explicit budget_id field to replace config-hash store key

The current _config_key in evaluator.py hashes the serialized config to key the store registry. This means identical configs silently share budget state, and differently-ordered limits get normalized via sorting (but evaluation order still differs).

An explicit required field is cleaner:

budget_id: str  # unique identifier for this budget pool

Store key becomes budget:{budget_id}. This gives:

  • Explicit sharing/isolation — controlled by the user, not by config coincidence
  • No sorting ambiguity: _config_key and its sorting logic can be dropped entirely
  • Debuggable — you can grep logs for the budget_id

Author


on board with dropping _config_key entirely. the sort-normalization in there is a workaround for the hash approach, not load-bearing, and the three TestStoreRegistry tests that lock in the current implicit-sharing semantic can be rewritten around explicit ids.

one thing to pin down before I rip it out: required or optional with a default? optional-with-uuid-default reintroduces implicit keying through the back door (anyone who forgets to set it gets a per-instance unique pool, no sharing). required is a harder break but the semantic is clean. which way?

Contributor


required with a default like budget_id: str = "default" sounds right. covers the single-budget case without friction and avoids the uuid back door.

just make sure the docs are clear that two evaluator instances with the same budget_id share a budget pool (accumulated spend, snapshots, everything). different budget_id = fully isolated.

return 0.0
rates = pricing.get(model)
if not rates:
return 0.0
Contributor


Unknown model silently gets cost=0, escaping cost-based budget limits

If a model isn't in the pricing table, cost is silently treated as 0. For a budget enforcement tool this is risky — a new model deployed without updating the pricing config bypasses all cost limits.

Suggestion: add a config option:

unknown_model_behavior: Literal["warn", "block"] = "warn"
  • "warn": log a warning, treat cost as 0 (current behavior + visibility)
  • "block": return matched=True — unknown model can't be budgeted, fail closed

At minimum, the current silent fallback should emit a log warning so operators notice the gap.

Author


agreed this is a fail-open hole. one pushback on the default though: this is a budget enforcer, not a monitoring tool, so warn means a new model silently undercounts cost while traffic keeps flowing. for a tool whose job is to stop spend, fail-closed feels like the right default.

proposal: default to block, but only trigger when (a) the matched rule has a cost-based limit set and (b) pricing is non-null. token-only rules and configs without any pricing table stay unaffected, so existing token-only users don't get broken by the change.

ok with block as default, or do you want warn kept for backcompat reasons?

Contributor


agreed, block as default makes more sense for a budget enforcer. your scoping is right too — only trigger when the rule is cost-based and pricing is configured. token-only rules stay unaffected.

good thinking on this one. the whole R4 revision is solid work — the TTL prune design, the async ABC with MRO enforcement, and the defensive guards are all well thought through. appreciate the back and forth on these design questions.

val = _extract_by_path(data, token_path)
if isinstance(val, int) and not isinstance(val, bool) and val >= 0:
return 0, val
if isinstance(val, dict):
Contributor


Nit: token attribution when only total is available

When token_path resolves to a single int, this attributes the entire count to output_tokens, returning (0, total). If the pricing table has different input vs output rates, cost will be skewed toward the output rate. Worth a brief comment here noting this is intentional best-effort behavior.

Author


yeah, it's a deliberate conservative choice -- output rates are usually higher than input, so all-to-output over-estimates rather than under-estimates. will add a one-line comment on 112 noting the intent.
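A self-contained sketch of the attribution rule for reference (the `"input"`/`"output"` dict key names are illustrative, not necessarily the PR's):

```python
def extract_tokens(val: object) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) from whatever token_path resolved to."""
    if isinstance(val, bool):
        # bool is a subclass of int; never treat True/False as a count.
        return 0, 0
    if isinstance(val, int) and val >= 0:
        # Only a total is available: attribute it all to output. Output rates
        # are usually higher than input, so this over- rather than
        # under-estimates cost (deliberately conservative for a budget tool).
        return 0, val
    if isinstance(val, dict):
        return int(val.get("input", 0)), int(val.get("output", 0))
    return 0, 0
```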

self._rules = rules
self._clock = clock
self._lock = threading.Lock()
self._buckets: dict[tuple[str, str], _Bucket] = {}
Contributor


Stale period buckets accumulate unboundedly

_buckets grows a new entry every time window per scope key. With window_seconds=3600 and 100 scope keys, that's 2,400 new buckets/day. Old period buckets are never read again but never cleaned up.

The max_buckets cap triggers fail-closed but doesn't distinguish stale entries from active ones — a long-running process will eventually hit it from history alone. Consider adding TTL-based cleanup: when creating a new period bucket, prune entries whose period key is older than 2 * window_seconds. This keeps memory bounded by active scopes rather than history length.

Author


good catch. plan is a rollover-scan TTL prune:

  • _parse_period_key("P86400:19800") -> (86400, 19800)
  • _last_pruned_period: dict[int, int] tracks last pruned index per window size
  • on new bucket creation with non-empty period key, if _last_pruned_period[window] != current_index, scan _buckets, drop entries with same window and bucket_index < current - 1, update the tracker

O(n) scan but amortized O(1) per record, and under expected scope counts (<100) the scan is microseconds in Python. a heap adds invalidation complexity because of reset() and isn't worth it at this scale.

max_buckets stays as a hard backstop for high-cardinality group_by explosions (e.g. per-user keys) that TTL can't protect against -- will document this in the store docstring.
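A sketch of the rollover-scan prune, assuming buckets are keyed by (scope_key, period_key) and period keys use the "P86400:19800" shape from the plan above (the exact structures here are mine, for illustration):

```python
def _parse_period_key(period_key: str) -> tuple[int, int]:
    """'P86400:19800' -> (window_seconds, bucket_index)."""
    window, index = period_key[1:].split(":")
    return int(window), int(index)


def prune_stale(
    buckets: dict[tuple[str, str], object],
    window_seconds: int,
    current_index: int,
) -> None:
    """Drop buckets of this window size older than the previous period."""
    stale = []
    for key in buckets:
        window, index = _parse_period_key(key[1])
        # Keep current and previous period; older ones are never read again.
        if window == window_seconds and index < current_index - 1:
            stale.append(key)
    for key in stale:
        del buckets[key]
```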



@runtime_checkable
class BudgetStore(Protocol):
Contributor


BudgetStore should be an ABC with async methods

Two changes:

1. Protocol → ABC: The evaluator framework uses ABCs (Evaluator is an ABC). Budget stores have shared behavior worth inheriting (input validation, logging), and backends are a finite explicitly-discovered set. ABC is the more consistent and discoverable choice.

2. Sync → Async: record_and_check is synchronous but called from an async evaluate. A Redis or Postgres backend needs async I/O — the current sync interface would force run_in_executor. Make this async from the start:

class BudgetStore(ABC):
    @abstractmethod
    async def record_and_check(
        self,
        scope: dict[str, str],
        input_tokens: int,
        output_tokens: int,
        cost: float,
    ) -> list[BudgetSnapshot]: ...

For InMemoryBudgetStore, the async methods are thin wrappers around the existing sync logic.

Also: the cost parameter accepts negative values (tested in test_negative_cost_reduces_spend), but _estimate_cost never produces negative cost. Consider validating cost >= 0 here, or explicitly documenting when negative cost is expected (credits/refunds?).

Author


on Protocol -> async ABC: agreed. Evaluator is already an ABC so this matches the framework shape. keeping threading.Lock internally -- the critical section has no await, so the public async def record_and_check wraps a sync helper under the existing lock. that keeps test_thread_safety (test_budget.py:165) intact and doesn't require reworking InMemoryBudgetStore's concurrency model.

on cost >= 0 validation: pushback. test_negative_cost_reduces_spend (test_budget.py:655) is intentionally exercising the refund semantic. round_spent() (store.py:47) already floors the displayed spent to 0, but the internal accumulator can go negative so a subsequent positive charge is offset correctly. and _estimate_cost() (evaluator.py:154-155) already guards cost < 0 -> 0.0, so negative cost can't reach the store through the normal evaluate path today -- it's only reachable through direct store injection (used by that one test).

plan: document the refund semantic in the ABC docstring, no runtime validation. if you'd rather remove the refund path entirely that's a bigger call and I'd want your go-ahead before doing it.
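roughly the shape I have in mind for the async facade (sketch only -- the dict layout and snapshot shape here are placeholders, not the real store internals):

```python
import asyncio
import threading

class InMemoryBudgetStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._spend: dict[tuple, float] = {}

    def _record_and_check_sync(self, scope: dict[str, str],
                               input_tokens: int, output_tokens: int,
                               cost: float) -> list[dict]:
        # CPU-bound critical section with no await inside, so a plain
        # threading.Lock is enough and the existing sync tests keep working
        with self._lock:
            key = tuple(sorted(scope.items()))
            self._spend[key] = self._spend.get(key, 0.0) + cost
            return [{"scope": scope, "spent": self._spend[key]}]

    async def record_and_check(self, scope: dict[str, str],
                               input_tokens: int, output_tokens: int,
                               cost: float) -> list[dict]:
        # async facade over the sync helper; for the in-memory backend the
        # work is short enough that run_in_executor would be overkill
        return self._record_and_check_sync(scope, input_tokens,
                                           output_tokens, cost)
```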

"BudgetStore",
"InMemoryBudgetStore",
"clear_budget_stores",
]
Contributor

clear_budget_stores() is a testing utility — remove it from __all__. Tests can import it directly from the submodule.

Author

yeah, will drop from __all__. from agent_control_evaluator_budget.budget.evaluator import clear_budget_stores still works for tests, so no test breakage.

pyproject.toml Outdated
allowed_tags = ["feat", "fix", "perf", "chore", "docs", "style", "refactor", "test", "ci"]
patch_tags = ["fix", "perf", "chore", "refactor"]

[dependency-groups]
Contributor

This adds [dependency-groups] to the root pyproject.toml. The budget evaluator already has its own dev dependencies in its own pyproject.toml. This root-level change appears unrelated — please revert.

Author

reverting the [dependency-groups] block. also noticed this commit accidentally regressed version = "7.3.1" to "7.3.0" in the same file -- will restore that too.

@@ -0,0 +1,3 @@
# Budget Evaluator
Contributor

The README needs more content before merge. At minimum:

  • A complete config example (limits, pricing, metadata_paths)
  • Explanation of scope and group_by semantics
  • How the pricing table works
  • Important caveat: InMemoryBudgetStore is single-process only — budget state is not shared across workers/pods and is lost on restart. Users need to understand this limitation before relying on it for enforcement.

Author

current file is ~3 lines, yeah. rewrite plan for R4:

  • install (pip install agent-control-evaluator-budget)
  • quickstart: minimal yaml config + python snippet that actually runs
  • config reference: all BudgetEvaluatorConfig fields, types, defaults
  • scope semantics (exact dict match, empty dict = global bucket)
  • group_by (per-value sub-buckets, one per unique value)
  • pricing table format and unit (cents)
  • critical caveat: InMemoryBudgetStore is single-process only, no persistence, state lost on restart. not suitable for multi-worker deploys -- swap in a Redis/Postgres backend for that.
  • window constants reference (WINDOW_HOURLY/DAILY/WEEKLY/MONTHLY)
  • clear_budget_stores() note: testing utility, don't call in production

landing with R4.

@amabito
Author

amabito commented Apr 10, 2026

@lan17 pulling three open questions out of the inline threads so they don't get lost:

  1. limit model shape. splitting limit + limit_tokens into two separate rules loses the single-rule dual-ceiling semantic (where both feed max(cost_ratio, token_ratio) on the same bucket, and breached_rules metadata is one entry). alternative that keeps the semantic: rename limit -> limit_cents and leave limit_tokens in place. which do you prefer?

  2. budget_id: required or optional? required gives a clean explicit-sharing semantic but breaks existing configs. optional with a uuid default is non-breaking but any user who forgets to set it gets a unique per-instance pool, which is basically the hash approach through the back door. leaning required, want to confirm.

  3. unknown_model_behavior default. warn is fail-open -- a new model silently undercounts cost while traffic flows through. for a budget enforcer, block (scoped to cost-limit rules with non-null pricing, so token-only users aren't affected) feels like the right fail-closed default. ok with block?

will hold the config-layer work (limit shape, budget_id, ModelPricing validator, unknown-model default) until these land. in the meantime planning to move forward on the mechanical items: root pyproject.toml revert, clear_budget_stores out of __all__, token attribution comment, async ABC + threading.Lock wrapping for BudgetStore, stale period bucket TTL prune, and README rewrite. let me know if any of that should wait too.

…ards

Respond to lan17's R3 review on PR agentcontrol#144 with the mechanical items that
do not depend on pending config-layer decisions (limit model, budget_id,
unknown_model_behavior).

Changes:
- Migrate BudgetStore from Protocol to async ABC with __init_subclass__
  guard that walks the MRO to reject sync overrides at class creation
- InMemoryBudgetStore: async wrapper around sync helper, threading.Lock
  retained for CPU-bound critical section
- TTL prune for stale period buckets on rollover, runs before max_buckets
  capacity check so rollover at capacity reclaims space
- Monotonic prune watermark (rejects backwards clock)
- _compute_utilization low-side clamp to [0.0, 1.0] (refund semantic)
- Defensive guards: NaN/Inf cost and clock coerced to 0.0, negative
  token counts clamped to 0
- Revert root pyproject.toml (remove unrelated [dependency-groups],
  restore version 7.3.1)
- Remove clear_budget_stores from __all__ (testing utility)
- Document token attribution intent (single int -> output-only)

Tests: 67 -> 91 (24 new: async migration, TTL prune coverage, adversarial
guards, ABC contract enforcement)
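the `__init_subclass__` guard from the first bullet looks roughly like this (sketch; only the MRO walk is shown, and the signature/names track the thread above rather than the actual diff):

```python
import inspect
from abc import ABC, abstractmethod

class BudgetStore(ABC):
    @abstractmethod
    async def record_and_check(self, scope, input_tokens,
                               output_tokens, cost): ...

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # walk the MRO so a sync record_and_check inherited from a mixin
        # is rejected too, not just one defined directly on the subclass
        for klass in cls.__mro__:
            impl = klass.__dict__.get("record_and_check")
            if impl is None or getattr(impl, "__isabstractmethod__", False):
                continue
            if not inspect.iscoroutinefunction(impl):
                raise TypeError(
                    f"{cls.__name__}.record_and_check must be 'async def' "
                    f"(sync override found on {klass.__name__})"
                )
            break  # nearest concrete override wins; stop there
```

the check fires at class creation time, so a bad backend fails on import rather than at first evaluate call.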
@amabito
Author

amabito commented Apr 10, 2026

@lan17 pushed R4 (b189e85). here's what landed and what's still waiting on your answers from the inline threads.

landed in R4:

  • BudgetStore is now an async ABC. InMemoryBudgetStore wraps a sync helper under threading.Lock -- same concurrency model, just honest about the interface for future Redis/Postgres backends.
  • __init_subclass__ walks the MRO and rejects sync overrides of record_and_check at class creation time, including mixin-inherited ones.
  • TTL prune for stale period buckets. runs before the max_buckets capacity check on rollover so long-running processes don't permanently fail-closed from history alone. max_buckets stays as a backstop for high-cardinality group_by explosions.
  • defensive guards: NaN/Inf cost and clock coerced to 0.0, negative token counts clamped to 0, utilization clamped to [0.0, 1.0] for refund scenarios.
  • root pyproject.toml reverted (version back to 7.3.1, [dependency-groups] removed).
  • clear_budget_stores dropped from __all__ (still importable from the evaluator submodule for tests).
  • token attribution comment added.
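the defensive guards reduce to a few pure helpers, roughly like this (sketch -- function names and the zero-limit edge case are my assumptions, not the PR's code):

```python
import math

def sanitize_cost(cost: float) -> float:
    # NaN/Inf would poison the running accumulator forever, so coerce to 0.0
    if not math.isfinite(cost):
        return 0.0
    return cost

def sanitize_tokens(tokens: int) -> int:
    # negative token counts are caller bugs; clamp rather than propagate
    return max(0, tokens)

def compute_utilization(spent: float, limit: float) -> float:
    if limit <= 0:
        return 1.0  # assumed edge case: non-positive limit reads as exhausted
    # low-side clamp covers the refund semantic (spent can go negative)
    return min(1.0, max(0.0, spent / limit))
```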

67 -> 91 tests. all pass, ruff clean.

waiting on your call (3 open questions from the earlier comment):

  1. limit model shape (limit_cents + limit_tokens vs limit + limit_unit)
  2. budget_id required or optional
  3. unknown_model_behavior default (block vs warn)

README rewrite is staged but holding until the config shape is finalized so the examples are accurate. lmk if any of the R4 changes need adjustment.

@amabito
Author

amabito commented Apr 10, 2026

also wanted to say -- the R1 through R3 review rounds genuinely improved the design. the store registry namespace issue and the float accumulation fix from R2 were both real production bugs, and the async ABC + TTL prune from R3 set up the interface correctly for distributed backends from the start. this is way better than what I would have shipped on the first pass. appreciate the thorough reviews.
