
[cosmos] ChangeFeedProcessor: CosmosException logging can produce multi-MB log lines due to full diagnostics serialization #49320

@jaxpod70

Component: com.azure.cosmos.implementation.changefeed.epkversion.PartitionProcessorImpl (and pkversion equivalent)

Summary
When a CosmosException is logged at WARN level in the ChangeFeedProcessor's partition processing loop, the logger invokes CosmosException.getMessage() to render the exception. This method serializes the full CosmosDiagnostics payload as JSON, including request timelines, regions contacted, retry history, and metadata. Under heavy database connection load (on the order of thousands of connections), this can produce multi-MB log lines.

Where it occurs
PartitionProcessorImpl.run() → onErrorResume handler (the error classification/recovery path in the per-partition change feed polling loop).

This path is triggered on every CosmosException from createDocumentChangeFeedQuery, including transient errors (429 throttles, timeouts, connectivity issues) which are expected and frequent under load. Since each leased partition runs its own processor instance, a busy ChangeFeedProcessor can hit this path concurrently across many partitions.

Root cause

CosmosException.getMessage() (line 263 of CosmosException.java) builds a JSON object containing the full CosmosDiagnostics:

 public String getMessage() {
     ObjectNode messageNode = mapper.createObjectNode();
     messageNode.put("innerErrorMessage", innerErrorMessage());
     if (cosmosDiagnostics != null) {
         cosmosDiagnostics.fillCosmosDiagnostics(messageNode, null);
     }
     return mapper.writeValueAsString(messageNode);
 }

When the exception is passed to logger.warn(message, exception), SLF4J calls toString() → getMessage(), producing the potentially massive payload.
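
The rendering path can be reproduced with a JDK-only sketch. HeavyException and RenderDemo are hypothetical stand-ins, not SDK types, and the sketch assumes the default Throwable.toString() behavior (toString() calls getLocalizedMessage(), which calls getMessage()). printStackTrace is used only to approximate what a logging backend does when handed a Throwable.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an exception whose getMessage() serializes a large payload.
class HeavyException extends RuntimeException {
    static final AtomicInteger RENDER_COUNT = new AtomicInteger();
    private final int diagnosticsEntries;

    HeavyException(int diagnosticsEntries) {
        this.diagnosticsEntries = diagnosticsEntries;
    }

    @Override
    public String getMessage() {
        // Count every call so the demo can show that rendering triggers serialization.
        RENDER_COUNT.incrementAndGet();
        StringBuilder sb = new StringBuilder("{\"innerErrorMessage\":\"boom\",\"diagnostics\":[");
        for (int i = 0; i < diagnosticsEntries; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"attempt\":").append(i).append('}');
        }
        return sb.append("]}").toString();
    }
}

class RenderDemo {
    // Approximates a logging backend rendering a Throwable (header line plus stack trace).
    static String renderLikeLogger(Throwable t) {
        StringWriter out = new StringWriter();
        t.printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    public static void main(String[] args) {
        String rendered = renderLikeLogger(new HeavyException(100_000));
        System.out.println("rendered chars: " + rendered.length());
        System.out.println("getMessage() calls: " + HeavyException.RENDER_COUNT.get());
    }
}
```

The caller never invokes getMessage() directly. Passing the exception to the renderer is enough to pay the full serialization cost.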

Impact

  • Massive individual log lines (we have seen some exceed 30 MB)
  • Risk of log ingestion pipeline throttling or rejection
  • Increased log storage costs
  • Potential memory pressure from repeated serialization of diagnostics on transient errors in a hot loop

Possible fix

Split the existing log statement into two tiers:

  • WARN level: Log with CosmosException.getShortMessage() (lightweight, no diagnostics)
  • DEBUG level: Log the full exception (with diagnostics + stack trace) for deep troubleshooting

This preserves the log message signature for pattern matching while eliminating the excessive payload at default log levels.
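
A minimal sketch of the two-tier idea, under stated assumptions: DiagnosticException stands in for CosmosException, TieredLog is a hypothetical logger shape loosely modelled on SLF4J (not a real SLF4J type), and the message text and leaseToken parameter are illustrative only. RecordingLog is an in-memory logger used only to try the pattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CosmosException; only the two accessors matter here.
class DiagnosticException extends RuntimeException {
    private final String shortMessage;
    private final String fullMessage;

    DiagnosticException(String shortMessage, String fullMessage) {
        this.shortMessage = shortMessage;
        this.fullMessage = fullMessage;
    }

    // Lightweight text without diagnostics, mirroring the getShortMessage() named above.
    String getShortMessage() {
        return shortMessage;
    }

    // Expensive text that includes diagnostics.
    @Override
    public String getMessage() {
        return fullMessage;
    }
}

// Hypothetical minimal logger interface, not part of any library.
interface TieredLog {
    boolean isDebugEnabled();
    void warn(String line);
    void debug(String line, Throwable t);
}

class TieredLogging {
    static void logFailure(TieredLog log, String leaseToken, DiagnosticException e) {
        // WARN tier: short message only, so the diagnostics are never serialized here.
        log.warn("Lease " + leaseToken + ": change feed request failed: " + e.getShortMessage());
        // DEBUG tier: pass the exception itself, guarded so rendering happens only when enabled.
        if (log.isDebugEnabled()) {
            log.debug("Lease " + leaseToken + ": full exception details", e);
        }
    }
}

// In-memory logger used to exercise the sketch.
class RecordingLog implements TieredLog {
    final List<String> lines = new ArrayList<>();
    private final boolean debugEnabled;

    RecordingLog(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    @Override
    public boolean isDebugEnabled() {
        return debugEnabled;
    }

    @Override
    public void warn(String line) {
        lines.add("WARN " + line);
    }

    @Override
    public void debug(String line, Throwable t) {
        lines.add("DEBUG " + line + " " + t.getMessage());
    }
}
```

With DEBUG disabled, each failure emits one WARN line containing the short message only. Enabling DEBUG adds a second line carrying the full payload.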

Metadata

Labels: Client, Cosmos, Service Attention, customer-reported, question
