
SOLR-18189: Skip indexing duplicate docs via a content hash URP #4263

Open
fhuaulme wants to merge 6 commits into apache:main from fhuaulme:jira/SOLR-18189

Conversation

@fhuaulme

@fhuaulme fhuaulme commented Apr 3, 2026

https://issues.apache.org/jira/browse/SOLR-18189

Description

Solr may receive near-identical documents as updates. Depending on the use case, these updates may be unnecessary, so discarding them would be beneficial.

Solution

Content hash detection aims to improve update efficiency by identifying and bypassing redundant document updates. When the computed hash shows that content is unchanged, the URP chain short-circuits the update, reducing CPU and I/O overhead on nodes by avoiding the work done in downstream URPs.
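The core idea can be sketched as follows. This is a minimal illustration only, not the PR's actual code: the class name, the use of SHA-256, and the plain `Map` standing in for `SolrInputDocument` are all assumptions for the sake of a self-contained example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Map;
import java.util.TreeMap;

public class ContentHashSketch {
  /** Computes a stable base64 hash over field names and values, sorted by field name. */
  public static String computeHash(Map<String, Object> fields) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      // Sort by field name so the hash is insensitive to field-addition order
      new TreeMap<>(fields).forEach(
          (name, value) -> {
            md.update(name.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0x1f); // separator so "ab"+"c" != "a"+"bc"
            md.update(String.valueOf(value).getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0x1e);
          });
      return Base64.getEncoder().encodeToString(md.digest());
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // SHA-256 is always available
    }
  }

  /** True if the new doc hashes to the stored hash, i.e. the update can be skipped. */
  public static boolean isDuplicate(String storedHash, Map<String, Object> newDoc) {
    return storedHash != null && storedHash.equals(computeHash(newDoc));
  }
}
```

An unchanged document produces the same hash regardless of field order, so the URP can drop it before any downstream processor runs.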

Tests

  • Unit tests added for URP and URP factory (namely org.apache.solr.update.processor.ContentHashVersionProcessorTest and org.apache.solr.update.processor.ContentHashVersionProcessorFactoryTest).

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • I have added a changelog entry for my change

Contributor

@dsmiley dsmiley left a comment

Partial review...

import java.util.function.Predicate;

/**
* An implementation of {@link UpdateRequestProcessor} which computes a hash of selected doc values, and uses this hash
Contributor

"doc values" here could be confusing... as Lucene/Solr has "docValues" and this URP in fact uses that technical mechanism.

Author

I took a look at DocValues and I'm not sure it applies to this URP. Is there a wording you prefer? (like "field values"?)

Contributor

Yes, "field values".

This PR is using Lucene docValues, but it's not plainly clear to non-Lucene/Solr experts.
org.apache.solr.handler.component.RealTimeGetComponent#getInputDocument fetches the document from the updateLog or the index -- and if it does fetch it from the index, the values will come from org.apache.solr.search.SolrDocumentFetcher that will in turn fetch it from DocValues if docValues=true in the schema. SDF javadocs say so.

return true; // Doc not found
}

private DocFoundAndOldUserAndSolrVersions getOldUserVersionsFromStored(BytesRef indexedDocId)
Contributor

IMO we should not support this. It's not unreasonable to demand that someone using this feature enable docValues on the hash field.

}

public String computeDocHash(SolrInputDocument doc) {
List<String> docIncludedFieldNames =
Contributor

This results in a double navigation of the fields of a document in order to make the hash insensitive to the order in which fields were added. But in practice, a given end-to-end pipeline is going to be deterministic. Hmm. I suppose it's no big deal. To improve, you could instead stream SolrInputField, sorting by name, and get the value from them without another lookup. You could also do the whole thing as a stream pipeline without needlessly producing a List. It'd end in a forEach to process the value.
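The single-pass shape suggested here could look roughly like this. It is a sketch, not the PR's code: a plain `Map` stands in for `SolrInputDocument`, and each sorted entry plays the role of a `SolrInputField`, so no second per-name lookup is needed.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SinglePassHash {
  /**
   * Streams (name, value) entries sorted by field name in one pass, instead of
   * collecting names into a List and then re-looking-up each value by name.
   */
  public static String signature(Map<String, Object> doc) {
    return new TreeMap<>(doc).entrySet().stream()
        .map(e -> e.getKey() + "=" + e.getValue()) // value travels with its name
        .collect(Collectors.joining(";"));
  }
}
```

The real implementation would feed each entry into the digest (ending in a forEach, as suggested) rather than joining strings, but the one-pass structure is the same.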

Author

Good point, I'm changing this to remove the double loop on the doc's fields and use Stream APIs.

}
}

private static class DocFoundAndOldUserAndSolrVersions {
Contributor

should this be a record? Or simply an Optional of the oldHash?

Author

Good point. I'm going to change this to Optional<String>
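The Optional-based shape the author settles on might look like the following sketch. Names are hypothetical, and a `Map` stands in for the real index/tlog lookup, which in the PR goes through docValues.

```java
import java.util.Map;
import java.util.Optional;

public class OldHashLookup {
  // Stand-in for the index: docId -> stored content hash
  private final Map<String, String> storedHashes;

  public OldHashLookup(Map<String, String> storedHashes) {
    this.storedHashes = storedHashes;
  }

  /**
   * Empty Optional means "doc not found", replacing the dedicated
   * DocFoundAndOldUserAndSolrVersions holder class with a standard type.
   */
  public Optional<String> oldHash(String docId) {
    return Optional.ofNullable(storedHashes.get(docId));
  }
}
```

Collapsing the holder into `Optional<String>` removes a class and lets callers use `map`/`filter` directly on the result.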

.encodeToString(signature); // Makes a base64 hash out of signature value
}

interface OldDocProvider {
Contributor

realistically, what impls are we going to have?

Author

The last remaining use case for this interface is to ease unit testing of the URP. Another possibility would be to use an actual URP chain in the test and add actual docs; do you recommend adding a new test URP chain for these tests?

Contributor

I would prefer a URP chain. Perhaps look at DocBasedVersionConstraintsProcessor's tests for inspiration, as that is this URP's closest cousin. More like a twin brother -- very similar.

} finally {
rsp.addToLog("numAddsExisting", sameCount + differentCount + unknownCount);
rsp.addToLog("numAddsExistingWithIdentical", sameCount);
rsp.addToLog("numAddsExistingUnknown", unknownCount);
Contributor

RE "numAdds" -- let's use a "contentHash" prefix so as to associate these with this URP, and not confuse it with DocBasedVersionConstraintsProcessor, which ought to include similar logs but doesn't yet do so.

IMO adding a new doc shouldn't add any log. I would expect one of these to incorporate the word "drop" or "discard" to reflect an action taken. I understand this URP can be configured to not take that action, albeit that would be in an exploratory / trial situation that ought to be temporary (I imagine).

this.discardSameDocuments = discardSameDocuments;
}

private boolean validateHash(BytesRef indexedDocId, String newHash) throws IOException {
Contributor

Perhaps we should align with a similar name in DocBasedVersionConstraintsProcessor, and rename to hashIsAcceptable ?

private static final char SEPARATOR = ','; // Separator for included/excluded fields
static final String CONTENT_HASH_ENABLED_PARAM = "contentHashEnabled";
private List<String> includeFields =
Collections.singletonList("*"); // Included fields defaults to 'all'
Contributor

modern Java prefers List.of
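For example (the field name here mirrors the snippet above; the comparison is just to show the two forms are equivalent in content):

```java
import java.util.Collections;
import java.util.List;

public class Defaults {
  // Older style: a singleton wrapper from java.util.Collections
  static final List<String> INCLUDE_FIELDS_OLD = Collections.singletonList("*");
  // Modern style: List.of -- immutable, concise, and uniform for 0..n elements
  static final List<String> INCLUDE_FIELDS = List.of("*");
}
```

One behavioral difference worth knowing: `List.of` rejects null elements, while `Collections.singletonList(null)` is allowed.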

Collections.singletonList("*"); // Included fields defaults to 'all'
private List<String> excludeFields = new ArrayList<>();
// No excluded field by default, yet hashFieldName is excluded by default
private String hashFieldName =
Contributor

arguably we should insist this be provided rather than some auto-magic name

Author

I made this parameter mandatory.

private boolean validateHash(BytesRef indexedDocId, String newHash) throws IOException {
assert null != indexedDocId;

var docFoundAndOldUserVersions = getOldUserVersionsFromStored(indexedDocId);
Contributor

This is one possible algorithm -- lookup hash from docValues of a doc that we got by first looking up the current doc by ID (a terms index lookup). But this is ~twice as expensive as doing a terms index lookup of the new hash, assuming we find no doc with that hash. If we do find one, then it's similar, as we then need to fetch the ID via docValues to see if it's the same doc. But we're about to drop the doc then, which isn't the common path. Even "need" there is debatable; it'd be extremely unlikely to find a match by content hash to a different doc ID.

Basically: should the content hash field have indexed=true XOR docValues=true? We don't need to support both (IMO) -- we can choose what's most efficient for this use-case.
I understand you may not have time to implement another algorithm.

Just an FYI: the postingsFormat (of the terms index) is configurable and there are a number of lovely implementations, including a bloom filter one, which would make a lot of sense for a "lookup" / exact-match-only use-case. Such a choice is up to the user to configure in the schema if they wish.

Contributor

Oh wait... docs that have been added but are not yet visible via the IndexSearcher will not have their content hashes searchable. Albeit in such a situation, the "worst" thing that would happen is we index a doc that could have been dropped/discarded -- a missed optimization opportunity.

A terms lookup on the hash should be much faster than the RTG + double-docValues lookups though.

Contributor

Could emulate the approach of DocBasedVersionConstraintsProcessor by first consulting RealTimeGetComponent.getInputDocumentFromTlog (which is really quite cheap, especially, when empty). If there, we have the hash. If not, do the terms index lookup into the index via realtime searcher.

Author

Could emulate the approach of DocBasedVersionConstraintsProcessor by first consulting RealTimeGetComponent.getInputDocumentFromTlog (which is really quite cheap, especially, when empty). If there, we have the hash. If not, do the terms index lookup into the index via realtime searcher.

This is what org.apache.solr.handler.component.RealTimeGetComponent#getInputDocument is doing (it first checks in tlog).

@dsmiley dsmiley changed the title from "SOLR-18189: Improve tracking of duplicate Solr docs with content hash" to "SOLR-18189: Skip indexing duplicate docs via a content hash URP" on Apr 7, 2026
=URP=
* Remove OldDocProvider internal interface
* Rework responses params (to include URP name)
* Rework hash computation to prevent successive loops over fields
* Hash field needs to be explicitly specified in config
* Removes per request enablement of URP

=Javadoc/naming=
* Take into account doc values/field values in javadoc
* Prefer "drop" over "discard" in doc/variables/method names
* Removes remaining Collections.singletonList()

=Unit test=
* Adds utility method in SolrTestCaseJ4 to get update response
* Rework unit test to use actual URP chain, remove mocks
* Improve test coverage for URP factory parameters (excluded fields)
* Improve test coverage for URP (drop, log modes, multi-valued fields)
=URP=
* Remove while(true), move to for-i loop
@fhuaulme fhuaulme requested a review from dsmiley April 10, 2026 16:25
@dsmiley
Contributor

dsmiley commented Apr 11, 2026

I replaced the String/base64 usage with bytes & BinaryField.

I asked Claude: Compare this feature to DocBasedVersionConstraintsProcessor in terms of implementation. As a code reviewer, are we missing something here?


  1. Missing isNotLeader check (significant)

DocBasedVersionConstraintsProcessor.processAdd immediately bails on non-leader nodes:
if (isNotLeader(cmd)) {
  super.processAdd(cmd);
  return;
}
ContentHashVersionProcessor has no such guard. In a distributed cluster, every replica would also call RealTimeGetComponent.getInputDocument on every update — wasteful and potentially wrong during REPLAY/PEER_SYNC phases.

  2. Double ++i in retry loop (bug)

Current code:
for (int i = 0; ; i++) { // i++ here
  try {
    ...
  } catch (SolrException e) {
    if (e.code() != 409) throw e;
    ++i; // second increment — wrong
  }
}
DocBasedVersionConstraintsProcessor uses continue instead, letting only the loop header increment. The logOverlyFailedRetries mask (i & 255) == 255 fires at the wrong cadence.
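The continue-based shape, where only the for-header increments, could be sketched like this. Names are hypothetical, and a small unchecked exception with a code stands in for SolrException:

```java
public class RetryLoopSketch {
  /** Stand-in for SolrException carrying an HTTP-like status code. */
  static class ConflictException extends RuntimeException {
    final int code;
    ConflictException(int code) { this.code = code; }
  }

  /** The retried operation; throws ConflictException(409) on a version conflict. */
  interface Op {
    void run(int attempt);
  }

  /**
   * Retries op on 409 conflicts. Only the loop header increments i; the catch
   * block just continues, so there is no double increment. (Unbounded here for
   * brevity, mirroring the original; real code would also log overly-frequent
   * retries, e.g. when (i & 255) == 255.)
   */
  public static int runWithRetries(Op op) {
    for (int i = 0; ; i++) {
      try {
        op.run(i);
        return i; // succeeded on attempt i
      } catch (ConflictException e) {
        if (e.code != 409) throw e;
        continue; // let the loop header do the single increment
      }
    }
  }
}
```

With a manual `++i` in the catch, an op succeeding on the third try would report attempt 4 instead of 2, and any mask-based log cadence would drift.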

  3. Minor: hashField not from actual schema

DocBasedVersionConstraintsProcessor does schema.getField(...) so the SchemaField reflects actual configured properties. ContentHashVersionProcessor constructs new SchemaField(hashFieldName, new BinaryField()) with default properties. Fine for now since only .getName() is used, but if someone later uses hashField for type dispatch they'll get wrong results.

Contributor

@dsmiley dsmiley left a comment

DocBasedVersionConstraintsProcessor insists it's before DistributedUpdateProcessor and I'm sure it has to do with the optimistic concurrency related to its use of _version_ (DUP is where that's processed / honored). This means it needs the isLeader check as well.

I think a content hash URP is most easily/clearly placed someplace after DUP.
I suspect a content hash URP could usefully be on either side... but if it is placed before DUP then it needs the leader check similarly to DBVCP. If it's after, then I don't see a leader check being needed. So let's do that?

Furthermore, DUP resolves partial updates (internally called "atomic updates") to complete documents; we don't want this URP seeing a partial update. All the more reason to place this after DUP.

This matter is something to document so people know where to place this in their chain.
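A placement note might look like the following solrconfig.xml sketch. The chain name and parameter shapes here are illustrative only; `hashFieldName` and `dropSameDocuments` mirror names seen elsewhere in this PR, but the final configuration API may differ.

```xml
<updateRequestProcessorChain name="content-hash-dedup">
  <!-- DUP resolves atomic/partial updates into complete documents -->
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <!-- placed AFTER DUP: sees only full docs, and no leader check is needed -->
  <processor class="solr.ContentHashVersionProcessorFactory">
    <str name="hashFieldName">content_hash_</str>
    <bool name="dropSameDocuments">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```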

import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.util.plugin.SolrCoreAware;

/** Factory for {@link ContentHashVersionProcessor} instances. */
Contributor

This cool new URP should be documented here: solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc which will basically be a link to these javadocs. But these javadocs are looking sparse...

Contributor

Another thing worth documenting is that this URP should be configured to exclude fields updated with in-place updates.

super.processAdd(cmd);
return;
} catch (SolrException e) {
if (e.code() != 409) {
Contributor

Why would we get a 409? I'm guessing you were taking inspiration from DocBasedVersionConstraintsProcessor but that guy is using optimistic concurrency by setting the internal version, which this new URP here doesn't do. And we don't need to, as we are merely avoiding a redundant operation (an add of a doc that is identical to one already present) whereas DBVCP is potentially adding a document with a new version conditionally based on that it saw an existing doc with an old version.

rsp.addToLog("contentHash.duplicatesDetected", sameCount);
}
rsp.addToLog("contentHash.changed", differentCount);
} else {
Contributor

It's not uncommon for update batches to have no documents -- the ones that merely send a "commit", for example, or maybe just deletes. IMO we shouldn't bother logging if there were no documents in the batch.

Comment on lines +121 to +127
if (dropSameDocuments) {
rsp.addToLog("contentHash.duplicatesDropped", sameCount);
rsp.addToLog("contentHash.duplicatesDetected", sameCount);
} else {
rsp.addToLog("contentHash.duplicatesDropped", 0);
rsp.addToLog("contentHash.duplicatesDetected", sameCount);
}
Contributor

I propose that we either log duplicatesDropped or duplicatesDetected as appropriate. Doing both doesn't add information.

rsp.addToLog("contentHash.duplicatesDropped", 0);
rsp.addToLog("contentHash.duplicatesDetected", sameCount);
}
rsp.addToLog("contentHash.changed", differentCount);
Contributor

In your opinion, is this useful?

.forEach(
fieldName -> {
sig.add(fieldName);
Object o = doc.getFieldValue(fieldName);
Contributor

If we streamed SolrInputField instead of the field names, we wouldn't have to do this lookup. It's also not a big deal.

}
}

void setDropSameDocuments(boolean dropSameDocuments) {
Contributor

who calls this? A setter is surprising.
