Component: com.azure.cosmos.implementation.changefeed.epkversion.PartitionProcessorImpl (and pkversion equivalent)
Summary
When a CosmosException is logged at WARN level in the ChangeFeedProcessor's partition processing loop, the logger invokes CosmosException.getMessage() to render the exception. This method serializes the full CosmosDiagnostics payload as JSON, including request timelines, region contacts, retry history, and metadata. Under heavy database connection load (e.g. thousands of connections), this can result in multi-MB log lines.
Where it occurs
PartitionProcessorImpl.run() → onErrorResume handler (the error classification/recovery path in the per-partition change feed polling loop).
This path is triggered on every CosmosException from createDocumentChangeFeedQuery, including transient errors (429 throttles, timeouts, connectivity issues), which are expected and frequent under load. Since each leased partition runs its own processor instance, a busy ChangeFeedProcessor can hit this path concurrently across many partitions.
Root cause
CosmosException.getMessage() (line 263 of CosmosException.java) builds a JSON object containing the full CosmosDiagnostics:
    public String getMessage() {
        ObjectNode messageNode = mapper.createObjectNode();
        messageNode.put("innerErrorMessage", innerErrorMessage());
        if (cosmosDiagnostics != null) {
            cosmosDiagnostics.fillCosmosDiagnostics(messageNode, null);
        }
        return mapper.writeValueAsString(messageNode);
    }
When the exception is passed to logger.warn(message, exception), the logging backend behind SLF4J renders the throwable via toString() → getMessage(), producing the potentially massive payload.
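The toString() → getMessage() chain above can be illustrated with a minimal, self-contained sketch. HeavyMessageException is a hypothetical stand-in, not the SDK's CosmosException, and the payload string is made up. The sketch only relies on the documented behaviour of java.lang.Throwable: the default toString() appends getLocalizedMessage(), which in turn defaults to getMessage(), so rendering a throwable pays the full cost of whatever getMessage() builds.

```java
// Sketch only: HeavyMessageException is a hypothetical stand-in for CosmosException.
final class ToStringChainDemo {

    static class HeavyMessageException extends RuntimeException {
        // Counts how often the expensive message was built.
        int getMessageCalls = 0;
        private final String diagnostics;

        HeavyMessageException(String diagnostics) {
            this.diagnostics = diagnostics;
        }

        @Override
        public String getMessage() {
            getMessageCalls++;
            // Mimics embedding a large diagnostics payload in the message.
            return "{\"innerErrorMessage\":\"boom\",\"diagnostics\":\"" + diagnostics + "\"}";
        }
    }

    public static void main(String[] args) {
        HeavyMessageException ex = new HeavyMessageException("request timeline, retries, ...");
        // Throwable.toString() returns "<class name>: <getLocalizedMessage()>",
        // and getLocalizedMessage() defaults to getMessage().
        String rendered = ex.toString();
        System.out.println(rendered.length() + " chars, getMessage() calls: " + ex.getMessageCalls);
    }
}
```

Any code path that turns the exception into text (string concatenation, a "{}" placeholder, or a logger rendering the throwable argument) goes through the same chain.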
Impact
- Massive individual log lines (we've seen them exceed 30 MB)
- Risk of log ingestion pipeline throttling or rejection
- Increased log storage costs
- Potential memory pressure from repeated serialization of diagnostics on transient errors in a hot loop
Possible fix
Split the existing log statement into two logging tiers:
- WARN level: Log with CosmosException.getShortMessage() (lightweight, no diagnostics)
- DEBUG level: Log the full exception (with diagnostics + stack trace) for deep troubleshooting
This preserves the log message signature for pattern matching while eliminating the excessive payload at default log levels.
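The two-tier idea can be sketched as follows. This is an illustration under stated assumptions, not the SDK's code: FakeCosmosException and RecordingLog are hypothetical stand-ins for CosmosException and the SLF4J logger, the log message text and the leaseToken parameter are invented for the example, and the DEBUG tier here records only the rendered exception text rather than a stack trace. The point it shows is that the WARN tier never touches getMessage(), while the full payload is built only when DEBUG is enabled.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: every type here is a hypothetical stand-in, not SDK code.
final class TwoTierLoggingSketch {

    /** Stand-in for CosmosException: cheap short message, expensive full message. */
    static class FakeCosmosException extends RuntimeException {
        int getMessageCalls = 0;
        private final String shortMessage;
        private final String diagnosticsJson;

        FakeCosmosException(String shortMessage, String diagnosticsJson) {
            this.shortMessage = shortMessage;
            this.diagnosticsJson = diagnosticsJson;
        }

        String getShortMessage() {
            return shortMessage;
        }

        @Override
        public String getMessage() {
            getMessageCalls++;
            return "{\"innerErrorMessage\":\"" + shortMessage
                + "\",\"diagnostics\":" + diagnosticsJson + "}";
        }
    }

    /** Stand-in for a logger; records rendered lines so the effect is observable. */
    static class RecordingLog {
        final List<String> lines = new ArrayList<>();
        private final boolean debugEnabled;

        RecordingLog(boolean debugEnabled) {
            this.debugEnabled = debugEnabled;
        }

        boolean isDebugEnabled() {
            return debugEnabled;
        }

        void warn(String message) {
            lines.add("WARN " + message);
        }

        void debug(String message, Throwable t) {
            if (debugEnabled) {
                // Rendering the throwable is what triggers getMessage().
                lines.add("DEBUG " + message + " " + t);
            }
        }
    }

    static void logFailure(RecordingLog log, String leaseToken, FakeCosmosException ex) {
        // WARN tier: short message only, no diagnostics serialization.
        log.warn("Lease " + leaseToken + ": change feed request failed: " + ex.getShortMessage());
        // DEBUG tier: full exception, built only when someone asked for it.
        if (log.isDebugEnabled()) {
            log.debug("Lease " + leaseToken + ": full exception", ex);
        }
    }

    public static void main(String[] args) {
        FakeCosmosException ex = new FakeCosmosException("429 TooManyRequests", "{\"retries\":3}");
        RecordingLog log = new RecordingLog(false);
        logFailure(log, "lease-0", ex);
        log.lines.forEach(System.out::println);
    }
}
```

With DEBUG disabled, the only line emitted is the short WARN message and getMessage() is never called. Enabling DEBUG adds a second line carrying the full payload.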