fix(langchain): handle Anthropic cache_creation nested-dict in _parse_usage_model #1698
ASAD-BE18 wants to merge 6 commits into
…_usage_model
Anthropic's extended prompt-caching API returns cache_creation as a nested
dict keyed by cache tier (e.g. {"ephemeral_1h_input_tokens": 500,
"ephemeral_5m_input_tokens": 0}) inside the LLM output usage object.
Previously this value was passed through to UpdateGenerationBody.usageDetails
unchanged. On v2.x (final filter: not None/not str) the nested dict
survived, causing a Pydantic ValidationError ("value is not a valid integer")
that was silently swallowed by the ingestion queue — dropping the entire
generation end() event and leaving every trace with endTime=null and
input/output=0. On v4.x (final filter: isinstance(v, int)) the nested dict
was silently discarded, losing all cache-creation token data.
Fix: before the final int-filter, pop cache_creation, flatten each tier
value into a cache_creation_{tier_key} entry, and set the aggregated
cache_creation_input_tokens total via setdefault so the legacy field is
not overwritten if already present.
Closes langfuse#1697
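The flattening step described in this commit message can be sketched in isolation as follows. This is a minimal illustration, not the code from the PR: the function name flatten_cache_creation is hypothetical, and the zero-total check follows the PR description rather than this commit message.

```python
from typing import Any, Dict


def flatten_cache_creation(usage_model: Dict[str, Any]) -> Dict[str, int]:
    """Hypothetical stand-alone version of the flattening step."""
    usage_model = dict(usage_model)  # work on a copy, leave the caller's dict alone
    cache_creation = usage_model.get("cache_creation")
    if isinstance(cache_creation, dict):
        usage_model.pop("cache_creation")
        total = 0
        for tier_key, tier_value in cache_creation.items():
            if isinstance(tier_value, int):
                # e.g. cache_creation_ephemeral_1h_input_tokens
                usage_model[f"cache_creation_{tier_key}"] = tier_value
                total += tier_value
        if total > 0:
            # setdefault keeps a legacy scalar if one is already present
            usage_model.setdefault("cache_creation_input_tokens", total)
    # Final filter (v4.x style): keep only int values
    return {k: v for k, v in usage_model.items() if isinstance(v, int)}


example = {
    "input_tokens": 9454,
    "output_tokens": 380,
    "cache_read_input_tokens": 0,
    "cache_creation": {
        "ephemeral_1h_input_tokens": 500,
        "ephemeral_5m_input_tokens": 0,
    },
}
flattened = flatten_cache_creation(example)
```

With the example payload, flattened holds one cache_creation_{tier_key} entry per tier plus the aggregate cache_creation_input_tokens, and no nested dict remains.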
@claude review
LGTM — small, well-scoped fix that flattens Anthropic's nested cache_creation dict; production logic is correct and the inline nit is a test-quality improvement, not a blocker.
Extended reasoning...
Overview
This PR adds a small block inside _parse_usage_model in langfuse/langchain/CallbackHandler.py to handle Anthropic's extended prompt-caching API, where cache_creation arrives as a nested dict (per-tier token counts) rather than a scalar. The block pops the dict, flattens each tier into cache_creation_{tier_key} (mirroring the existing input_token_details pattern), and uses setdefault to expose an aggregate cache_creation_input_tokens without clobbering a pre-existing scalar. Three new unit tests cover the flatten case, all-zeros case, and legacy-field preservation; existing tier tests are unchanged.
Security risks
None. The change is a pure dict-transformation with isinstance guards; no I/O, auth, crypto, or untrusted-input parsing surface is touched.
Level of scrutiny
Low. This is a self-contained parsing helper change behind isinstance(usage_model, dict) and isinstance(usage_model["cache_creation"], dict) guards, with isinstance(int) per-tier filtering that matches the function's existing defensive style. The production logic correctly handles the three relevant cases (flatten, all-zeros aggregate suppression, legacy-field preservation).
Other factors
The single finding is a nit about test_anthropic_cache_creation_legacy_field_not_overwritten — the legacy value (300) equals the tier sum (300), so the test would also pass under a regressed plain-assignment implementation. The production code is correct; this is purely a test-discrimination improvement worth picking up but not blocking. The maintainer explicitly requested review and there are no outstanding human-reviewer comments.
The legacy value (300) equalled the tier sum (200+100=300), so the assertion would pass even if setdefault were replaced with a plain assignment. Use 999 to make the test discriminating.
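To see why 999 makes the test discriminating, the two candidate implementations can be compared side by side. The helper names below are hypothetical and exist only for this illustration; they are not part of the PR.

```python
from typing import Dict


def aggregate_with_setdefault(usage: Dict[str, int], total: int) -> Dict[str, int]:
    """Intended behaviour: keep an existing legacy value."""
    usage = dict(usage)
    usage.setdefault("cache_creation_input_tokens", total)
    return usage


def aggregate_with_assignment(usage: Dict[str, int], total: int) -> Dict[str, int]:
    """Regressed behaviour: always overwrite the legacy value."""
    usage = dict(usage)
    usage["cache_creation_input_tokens"] = total
    return usage


tier_total = 200 + 100  # sum of the per-tier values used in the test

# Legacy value equal to the tier sum: both implementations agree,
# so an assertion on the result cannot detect the regression.
weak_fixture = {"cache_creation_input_tokens": 300}
weak_same = (
    aggregate_with_setdefault(weak_fixture, tier_total)
    == aggregate_with_assignment(weak_fixture, tier_total)
)

# Legacy value different from the tier sum: the results diverge.
strong_fixture = {"cache_creation_input_tokens": 999}
kept = aggregate_with_setdefault(strong_fixture, tier_total)
overwritten = aggregate_with_assignment(strong_fixture, tier_total)
```

Only the second fixture separates the two behaviours: setdefault keeps 999, while plain assignment replaces it with 300.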
Closes #1697
What changed
_parse_usage_model in langfuse/langchain/CallbackHandler.py now handles Anthropic's cache_creation field when it arrives as a nested dict (tiered prompt-caching API, langchain-anthropic ≥ 0.3.x).

Root cause
Anthropic's extended prompt-caching API includes a cache_creation key in the LLM output usage dict:

    {
        "input_tokens": 9454,
        "output_tokens": 380,
        "cache_read_input_tokens": 0,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 500,
            "ephemeral_5m_input_tokens": 0,
        },
    }

v2.x (≤ 2.60.10): The old final filter (v is not None and not isinstance(v, str)) let the nested dict pass through into UpdateGenerationBody.usageDetails, which is typed as Union[Dict[str, int], ...]. Pydantic raised a ValidationError ("value is not a valid integer") that was swallowed by the ingestion queue, silently dropping the entire generation end() event. Every trace had endTime=null and input=0 / output=0.

v4.x (HEAD before this PR): The isinstance(v, int) filter correctly prevented the crash, but the cache_creation dict was silently discarded, losing all cache-creation token data.

Fix
Before the final isinstance(v, int) filter, pop cache_creation and flatten each per-tier value into a cache_creation_{tier_key} entry (matching the pattern already used for input_token_details). If the total is non-zero, also set cache_creation_input_tokens via setdefault so the legacy scalar field is not overwritten if already present.

Tests added (tests/unit/test_parse_usage_model.py)
- test_anthropic_cache_creation_dict_flattened
- test_anthropic_cache_creation_all_zeros_no_aggregate
- test_anthropic_cache_creation_legacy_field_not_overwritten: setdefault preserves existing cache_creation_input_tokens value

Existing test_standard_tier_input_token_details and test_priority_tier_not_subtracted still pass unchanged.

Verification
All pass. No live-provider tests are needed since the fix is unit-testable.
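The root cause described above can be reproduced without any provider by applying the two final filters to the example payload. This is only a sketch of the filter expressions quoted in the description; the variable names are illustrative, and the surrounding function and the Pydantic model are deliberately left out.

```python
usage = {
    "input_tokens": 9454,
    "output_tokens": 380,
    "cache_read_input_tokens": 0,
    "cache_creation": {
        "ephemeral_1h_input_tokens": 500,
        "ephemeral_5m_input_tokens": 0,
    },
}

# v2.x final filter: drops only None and str values, so the nested dict survives
# and later reaches a field typed as a mapping of str to int.
v2_filtered = {
    k: v for k, v in usage.items() if v is not None and not isinstance(v, str)
}

# v4.x final filter: keeps only int values, so the nested dict is dropped
# together with the per-tier counts it carried.
v4_filtered = {k: v for k, v in usage.items() if isinstance(v, int)}

nested_survives_v2 = isinstance(v2_filtered.get("cache_creation"), dict)
cache_creation_keys_v4 = [k for k in v4_filtered if k.startswith("cache_creation")]
```

Under the v2.x filter the nested dict is still present, which is the value the description says failed validation; under the v4.x filter no cache_creation key of any kind remains.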
Greptile Summary
This PR fixes _parse_usage_model in langfuse/langchain/CallbackHandler.py to correctly handle the Anthropic extended prompt-caching API, where cache_creation arrives as a nested dict of per-tier token counts rather than a scalar integer.
- The fix pops cache_creation when it is a dict, flattens each tier into a cache_creation_{tier_key} key, and uses setdefault to set the legacy aggregate cache_creation_input_tokens only when the total is non-zero, preserving any pre-existing scalar value.

Confidence Score: 5/5
Safe to merge — the change is narrowly scoped to a single parsing helper, all existing tests still pass, and three new unit tests cover the key scenarios introduced by this fix.
The fix is a well-contained addition inside an already-guarded if isinstance(usage_model, dict) block, the logic is straightforward dict-flattening with no side effects on other fields, and the setdefault correctly prevents clobbering an existing scalar. The one theoretical gap — a non-integer tier value being silently skipped — mirrors how the rest of the function already handles unexpected types via the final isinstance(v, int) filter.
No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_parse_usage_model called] --> B{has __dict__?}
    B -- yes --> C[usage equals usage.__dict__]
    B -- no --> D[usage_model equals usage.copy]
    C --> D
    D --> E{OpenAI format?}
    E -- yes --> F[return usage_model as-is]
    E -- no --> G[Apply conversion_list key renames]
    G --> H[Flatten input_token_details]
    H --> I[Flatten output_token_details]
    I --> J[Flatten Vertex AI token details]
    J --> N{cache_creation present AND is dict?}
    N -- no --> P
    N -- yes --> O[Pop cache_creation dict and flatten tier keys]
    O --> O2{total greater than 0?}
    O2 -- yes --> O3[setdefault cache_creation_input_tokens equals total]
    O2 -- no --> P
    O3 --> P[Final filter: keep only int values]
    P --> Q{usage_model empty?}
    Q -- yes --> R[return None]
    Q -- no --> S[return usage_model]

Reviews (1): Last reviewed commit: "fix(langchain): handle Anthropic cache_c..."