[cosmos] ChangeFeedProcessor: CosmosException logging can produce multi-MB log lines due to full diagnostics serialization

**Component**: com.azure.cosmos.implementation.changefeed.epkversion.PartitionProcessorImpl (and pkversion equivalent)

**Summary**
When a CosmosException is logged at WARN level in the ChangeFeedProcessor's partition processing loop, the logger invokes CosmosException.getMessage() to render the exception. This method serializes the full CosmosDiagnostics payload as JSON, including request timelines, region contacts, retry history, and metadata. Under heavy database connection load (e.g. thousands of connections), this can result in multi-MB log lines.

**Where it occurs**
PartitionProcessorImpl.run() → onErrorResume handler (the error classification/recovery path in the per-partition change feed polling loop).

This path is triggered on every CosmosException from createDocumentChangeFeedQuery, including transient errors (429 throttles, timeouts, connectivity issues) which are expected and frequent under load. Since each leased partition runs its own processor instance, a busy ChangeFeedProcessor can hit this path concurrently across many partitions.

**Root cause**

CosmosException.getMessage() (line 263 of CosmosException.java) builds a JSON object containing the full CosmosDiagnostics:

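```java
// Excerpt; the checked-exception handling around writeValueAsString is omitted.
@Override
public String getMessage() {
    ObjectNode messageNode = mapper.createObjectNode();
    messageNode.put("innerErrorMessage", innerErrorMessage());
    if (cosmosDiagnostics != null) {
        // Adds the full diagnostics (timelines, regions, retries) to the message JSON.
        cosmosDiagnostics.fillCosmosDiagnostics(messageNode, null);
    }
    return mapper.writeValueAsString(messageNode);
}
```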

When the exception is passed to logger.warn(message, exception), the logging backend behind SLF4J renders the throwable (toString() → getMessage()), producing the potentially massive payload.
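
The toString() → getMessage() chain is standard java.lang.Throwable behavior and can be reproduced without the SDK. The sketch below is only an illustration: `HeavyException` and the payload size are hypothetical stand-ins for an exception whose getMessage() returns a large string, and are not part of azure-cosmos.

```java
// Hypothetical stand-in for an exception with a large, expensive getMessage().
// It is not part of the Cosmos SDK.
class HeavyException extends RuntimeException {
    private final int payloadChars;

    HeavyException(int payloadChars) {
        super("short reason");
        this.payloadChars = payloadChars;
    }

    @Override
    public String getMessage() {
        // Mimics serializing a large diagnostics payload on every call.
        return "{\"diagnostics\":\"" + "x".repeat(payloadChars) + "\"}";
    }
}

class ToStringDemo {
    public static void main(String[] args) {
        HeavyException e = new HeavyException(1_000_000);
        // Throwable.toString() returns the class name plus getLocalizedMessage(),
        // which by default delegates to getMessage().
        String rendered = e.toString();
        System.out.println("rendered length: " + rendered.length());
    }
}
```

Any code path that renders the throwable as text, including a logger that receives it as an argument, pays the full serialization cost and emits the full payload.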

**Impact**

 - Massive individual log lines (we've seen them exceed 30 MB)
 - Risk of log ingestion pipeline throttling or rejection
 - Increased log storage costs
 - Potential memory pressure from repeated serialization of diagnostics on transient errors in a hot loop

**Possible fix**

Split the existing log statement into two logging tiers:

 - WARN level: Log with CosmosException.getShortMessage() (lightweight, no diagnostics)
 - DEBUG level: Log the full exception (with diagnostics + stack trace) for deep troubleshooting

This preserves the log message signature for pattern matching while eliminating the excessive payload at default log levels.
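
A minimal sketch of the two-tier idea, under stated assumptions: it uses java.util.logging (WARNING and FINE standing in for SLF4J's WARN and DEBUG) so that it runs without dependencies, and `FakeCosmosException`, `TieredLogging`, `warnLine`, and the message text are hypothetical names rather than the SDK's actual code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for CosmosException: a cheap short message
// and an expensive full message that includes diagnostics.
class FakeCosmosException extends RuntimeException {
    private final String shortMessage;
    private final int diagnosticsChars;

    FakeCosmosException(String shortMessage, int diagnosticsChars) {
        super(shortMessage);
        this.shortMessage = shortMessage;
        this.diagnosticsChars = diagnosticsChars;
    }

    String getShortMessage() {
        return shortMessage;
    }

    @Override
    public String getMessage() {
        // Mimics the full diagnostics serialization.
        return shortMessage + " diagnostics=" + "x".repeat(diagnosticsChars);
    }
}

class TieredLogging {
    private static final Logger LOGGER = Logger.getLogger(TieredLogging.class.getName());

    // Builds the always-on line from the short message only (illustrative wording).
    static String warnLine(String partition, FakeCosmosException e) {
        return "Change feed processing failed for " + partition + ": " + e.getShortMessage();
    }

    static void logFailure(String partition, FakeCosmosException e) {
        // Tier 1: bounded size, no diagnostics, no throwable argument.
        LOGGER.log(Level.WARNING, warnLine(partition, e));
        // Tier 2: full exception only when fine-grained logging is enabled,
        // so getMessage() is never called at the default level.
        if (LOGGER.isLoggable(Level.FINE)) {
            LOGGER.log(Level.FINE, "Full exception for " + partition, e);
        }
    }

    public static void main(String[] args) {
        logFailure("partition-0", new FakeCosmosException("429 TooManyRequests", 1_000_000));
    }
}
```

The level guard around the second call is a design choice in this sketch: it keeps the expensive rendering off the hot path unless someone has explicitly enabled the more verbose level.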

[cosmos] ChangeFeedProcessor: CosmosException logging can produce multi-MB log lines due to full diagnostics serialization #49320
