
SOLR-18189: Skip indexing duplicate docs via a content hash URP #4263

Open
fhuaulme wants to merge 6 commits into apache:main from fhuaulme:jira/SOLR-18189

Conversation

@fhuaulme

@fhuaulme fhuaulme commented Apr 3, 2026

https://issues.apache.org/jira/browse/SOLR-18189

Description

Solr may receive near-identical documents as updates. Depending on the use case, these updates may be unnecessary, so discarding them would be beneficial.

Solution

Content hash detection aims to improve update efficiency by identifying and bypassing redundant document updates. When the computed hash shows that content is unchanged, the URP chain short-circuits the update, reducing CPU and I/O overhead on nodes by avoiding the work done in downstream URPs.
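The core idea can be sketched as follows. This is a minimal illustration only, not the PR's actual code: the class name, the use of SHA-256, and the plain `Map` standing in for `SolrInputDocument` are all assumptions for the sake of a self-contained example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Map;
import java.util.TreeMap;

public class ContentHashSketch {
  /** Computes a stable base64 hash over field names and values, sorted by field name. */
  public static String computeHash(Map<String, Object> fields) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      // Sort by field name so the hash is insensitive to field-addition order
      new TreeMap<>(fields).forEach(
          (name, value) -> {
            md.update(name.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0x1f); // separator so "ab"+"c" != "a"+"bc"
            md.update(String.valueOf(value).getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0x1e);
          });
      return Base64.getEncoder().encodeToString(md.digest());
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // SHA-256 is always available
    }
  }

  /** True if the new doc hashes to the stored hash, i.e. the update can be skipped. */
  public static boolean isDuplicate(String storedHash, Map<String, Object> newDoc) {
    return storedHash != null && storedHash.equals(computeHash(newDoc));
  }
}
```

An unchanged document produces the same hash regardless of field order, so the URP can drop it before any downstream processor runs.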

Tests

  • Unit tests added for URP and URP factory (namely org.apache.solr.update.processor.ContentHashVersionProcessorTest and org.apache.solr.update.processor.ContentHashVersionProcessorFactoryTest).

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • I have added a changelog entry for my change

Contributor

@dsmiley dsmiley left a comment

Partial review...

import java.util.function.Predicate;

/**
* An implementation of {@link UpdateRequestProcessor} which computes a hash of selected doc values, and uses this hash
Contributor

"doc values" here could be confusing... as Lucene/Solr has "docValues" and this URP in fact uses that technical mechanism.

Author

I took a look at DocValues and I'm not sure it applies to this URP. Is there a wording you prefer? (like "field values"?)

Contributor

Yes, "field values".

This PR is using Lucene docValues, but it's not plainly clear to non-Lucene/Solr experts.
org.apache.solr.handler.component.RealTimeGetComponent#getInputDocument fetches the document from the updateLog or the index -- and if it does fetch it from the index, the values will come from org.apache.solr.search.SolrDocumentFetcher that will in turn fetch it from DocValues if docValues=true in the schema. SDF javadocs say so.

return true; // Doc not found
}

private DocFoundAndOldUserAndSolrVersions getOldUserVersionsFromStored(BytesRef indexedDocId)
Contributor

IMO we should not support this. It's not unreasonable to demand that someone using this feature enable docValues on the hash field.

}

public String computeDocHash(SolrInputDocument doc) {
List<String> docIncludedFieldNames =
Contributor

This results in a double navigation of the fields of a document in order to make the hash insensitive to the order in which fields were added. But in practice, a given end-to-end pipeline is going to be deterministic. Hmm. I suppose it's no big deal. To improve, you could instead stream SolrInputField, sorting by name, and get the value from them without another lookup. You could also do the whole thing as a stream pipeline without needlessly producing a List. It'd end in a forEach to process the value.
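The single-pass shape suggested here could look roughly like this. It is a sketch, not the PR's code: a plain `Map` stands in for `SolrInputDocument`, and each sorted entry plays the role of a `SolrInputField`, so no second per-name lookup is needed.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SinglePassHash {
  /**
   * Streams (name, value) entries sorted by field name in one pass, instead of
   * collecting names into a List and then re-looking-up each value by name.
   */
  public static String signature(Map<String, Object> doc) {
    return new TreeMap<>(doc).entrySet().stream()
        .map(e -> e.getKey() + "=" + e.getValue()) // value travels with its name
        .collect(Collectors.joining(";"));
  }
}
```

The real implementation would feed each entry into the digest (ending in a forEach, as suggested) rather than joining strings, but the one-pass structure is the same.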

Author

Good point, I'm changing this to remove the double loop on the doc's fields and use Stream APIs.

}
}

private static class DocFoundAndOldUserAndSolrVersions {
Contributor

should this be a record? Or simply an Optional of the oldHash?

Author

Good point. I'm going to change this to Optional<String>
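The Optional-based shape the author settles on might look like the following sketch. Names are hypothetical, and a `Map` stands in for the real index/tlog lookup, which in the PR goes through docValues.

```java
import java.util.Map;
import java.util.Optional;

public class OldHashLookup {
  // Stand-in for the index: docId -> stored content hash
  private final Map<String, String> storedHashes;

  public OldHashLookup(Map<String, String> storedHashes) {
    this.storedHashes = storedHashes;
  }

  /**
   * Empty Optional means "doc not found", replacing the dedicated
   * DocFoundAndOldUserAndSolrVersions holder class with a standard type.
   */
  public Optional<String> oldHash(String docId) {
    return Optional.ofNullable(storedHashes.get(docId));
  }
}
```

Collapsing the holder into `Optional<String>` removes a class and lets callers use `map`/`filter` directly on the result.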

.encodeToString(signature); // Makes a base64 hash out of signature value
}

interface OldDocProvider {
Contributor

realistically, what impls are we going to have?

Author

The last remaining use case for this interface is to ease unit testing of the URP. Another possibility would be to use an actual URP chain in the test and add actual docs; do you recommend adding a new test URP chain for these tests?

Contributor

I would prefer a URP chain. Perhaps look at DocBasedVersionConstraintsProcessor's tests for inspiration, as that is this URP's closest cousin. More like a twin brother -- very similar.

} finally {
rsp.addToLog("numAddsExisting", sameCount + differentCount + unknownCount);
rsp.addToLog("numAddsExistingWithIdentical", sameCount);
rsp.addToLog("numAddsExistingUnknown", unknownCount);
Contributor

RE "numAdds" -- let's use a "contentHash" prefix so as to associate these with this URP, and not confuse it with DocBasedVersionConstraintsProcessor, which ought to include similar logs but doesn't yet do so.

IMO adding a new doc shouldn't add any log. I would expect one of these to incorporate the word "drop" or "discard" to reflect an action taken. I understand this URP can be configured to not take that action, albeit that would be in an exploratory / trial situation that ought to be temporary (I imagine).

this.discardSameDocuments = discardSameDocuments;
}

private boolean validateHash(BytesRef indexedDocId, String newHash) throws IOException {
Contributor

Perhaps we should align with a similar name in DocBasedVersionConstraintsProcessor, and rename to hashIsAcceptable ?

private static final char SEPARATOR = ','; // Separator for included/excluded fields
static final String CONTENT_HASH_ENABLED_PARAM = "contentHashEnabled";
private List<String> includeFields =
Collections.singletonList("*"); // Included fields defaults to 'all'
Contributor

modern Java prefers List.of
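For example (the field name here mirrors the snippet above; the comparison is just to show the two forms are equivalent in content):

```java
import java.util.Collections;
import java.util.List;

public class Defaults {
  // Older style: a singleton wrapper from java.util.Collections
  static final List<String> INCLUDE_FIELDS_OLD = Collections.singletonList("*");
  // Modern style: List.of -- immutable, concise, and uniform for 0..n elements
  static final List<String> INCLUDE_FIELDS = List.of("*");
}
```

One behavioral difference worth knowing: `List.of` rejects null elements, while `Collections.singletonList(null)` is allowed.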

Collections.singletonList("*"); // Included fields defaults to 'all'
private List<String> excludeFields = new ArrayList<>();
// No excluded field by default, yet hashFieldName is excluded by default
private String hashFieldName =
Contributor

arguably we should insist this be provided rather than some auto-magic name

Author

I made this parameter mandatory.

private boolean validateHash(BytesRef indexedDocId, String newHash) throws IOException {
assert null != indexedDocId;

var docFoundAndOldUserVersions = getOldUserVersionsFromStored(indexedDocId);
Contributor

This is one possible algorithm -- lookup hash from docValues of a doc that we got by first looking up the current doc by ID (a terms index lookup). But this is ~twice as expensive as doing a terms index lookup of the new hash, assuming we find no doc with that hash. If we do find one, then it's similar, as we then need to fetch the ID via docValues to see if it's the same doc. But we're about to drop the doc then, which isn't the common path. Even "need" there is debatable; it'd be extremely unlikely to find a match by content hash to a different doc ID.

Basically: should the content hash field have indexed=true XOR docValues=true? We don't need to support both (IMO) -- we can choose what's most efficient for this use-case.
I understand you may not have time to implement another algorithm.

Just an FYI: the postingsFormat (of the terms index) is configurable and there are a number of lovely implementations, including a bloom filter one, which would make a lot of sense for a "lookup" / exact-match-only use-case. Such a choice is up to the user to configure in the schema if they wish.

Contributor

Oh wait... docs that have been added but are not yet visible via the IndexSearcher will not have their content hashes searchable. Albeit in such a situation, the "worst" thing that would happen is we index a doc that could have been dropped/discarded -- a missed optimization opportunity.

A terms lookup on the hash should be much faster than the RTG + double-docValues lookups though.

Contributor

Could emulate the approach of DocBasedVersionConstraintsProcessor by first consulting RealTimeGetComponent.getInputDocumentFromTlog (which is really quite cheap, especially, when empty). If there, we have the hash. If not, do the terms index lookup into the index via realtime searcher.

Author

Could emulate the approach of DocBasedVersionConstraintsProcessor by first consulting RealTimeGetComponent.getInputDocumentFromTlog (which is really quite cheap, especially, when empty). If there, we have the hash. If not, do the terms index lookup into the index via realtime searcher.

This is what org.apache.solr.handler.component.RealTimeGetComponent#getInputDocument is doing (it first checks in tlog).

@dsmiley dsmiley changed the title from "SOLR-18189: Improve tracking of duplicate Solr docs with content hash" to "SOLR-18189: Skip indexing duplicate docs via a content hash URP" on Apr 7, 2026
=URP=
* Remove OldDocProvider internal interface
* Rework responses params (to include URP name)
* Rework hash computation to prevent successive loops over fields
* Hash field needs to be explicitly specified in config
* Removes per request enablement of URP

=Javadoc/naming=
* Take into account doc values/field values in javadoc
* Prefer "drop" over "discard" in doc/variables/method names
* Removes remaining Collections.singletonList()

=Unit test=
* Adds utility method in SolrTestCaseJ4 to get update response
* Rework unit test to use actual URP chain, remove mocks
* Improve test coverage for URP factory parameters (excluded fields)
* Improve test coverage for URP (drop, log modes, multi-valued fields)
=URP=
* Remove while(true), move to for-i loop
@fhuaulme fhuaulme requested a review from dsmiley April 10, 2026 16:25
@dsmiley
Contributor

dsmiley commented Apr 11, 2026

I replaced the String/base64 usage with bytes & BinaryField.

I asked Claude: Compare this feature to DocBasedVersionConstraintsProcessor in terms of implementation. As a code reviewer, are we missing something here?


  1. Missing isNotLeader check (significant)

DocBasedVersionConstraintsProcessor.processAdd immediately bails on non-leader nodes:
if (isNotLeader(cmd)) {
  super.processAdd(cmd);
  return;
}
ContentHashVersionProcessor has no such guard. In a distributed cluster, every replica would also call RealTimeGetComponent.getInputDocument on every update — wasteful and potentially wrong during REPLAY/PEER_SYNC phases.

  2. Double ++i in retry loop (bug)

Current code:
for (int i = 0; ; i++) { // i++ here
  try {
    ...
  } catch (SolrException e) {
    if (e.code() != 409) throw e;
    ++i; // second increment — wrong
  }
}
DocBasedVersionConstraintsProcessor uses continue instead, letting only the loop header increment. The logOverlyFailedRetries mask (i & 255) == 255 fires at the wrong cadence.
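The continue-based shape, where only the for-header increments, could be sketched like this. Names are hypothetical, and a small unchecked exception with a code stands in for SolrException:

```java
public class RetryLoopSketch {
  /** Stand-in for SolrException carrying an HTTP-like status code. */
  static class ConflictException extends RuntimeException {
    final int code;
    ConflictException(int code) { this.code = code; }
  }

  /** The retried operation; throws ConflictException(409) on a version conflict. */
  interface Op {
    void run(int attempt);
  }

  /**
   * Retries op on 409 conflicts. Only the loop header increments i; the catch
   * block just continues, so there is no double increment. (Unbounded here for
   * brevity, mirroring the original; real code would also log overly-frequent
   * retries, e.g. when (i & 255) == 255.)
   */
  public static int runWithRetries(Op op) {
    for (int i = 0; ; i++) {
      try {
        op.run(i);
        return i; // succeeded on attempt i
      } catch (ConflictException e) {
        if (e.code != 409) throw e;
        continue; // let the loop header do the single increment
      }
    }
  }
}
```

With a manual `++i` in the catch, an op succeeding on the third try would report attempt 4 instead of 2, and any mask-based log cadence would drift.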

  3. Minor: hashField not from actual schema

DocBasedVersionConstraintsProcessor does schema.getField(...) so the SchemaField reflects actual configured properties. ContentHashVersionProcessor constructs new SchemaField(hashFieldName, new BinaryField()) with default properties. Fine for now since only .getName() is used, but if someone later uses hashField for type dispatch they'll get wrong results.

Contributor

@dsmiley dsmiley left a comment

DocBasedVersionConstraintsProcessor insists it's before DistributedUpdateProcessor and I'm sure it has to do with the optimistic concurrency related to its use of _version_ (DUP is where that's processed / honored). This means it needs the isLeader check as well.

I think a content hash URP is most easily/clearly placed someplace after DUP.
I suspect a content hash URP could usefully be on either side... but if it is placed before DUP then it needs the leader check similarly to DBVCP. If it's after, then I don't see a leader check being needed. So let's do that?

Furthermore, DUP resolves partial updates (internally called "atomic updates") to complete documents; we don't want this URP seeing a partial update. All the more reason to place this after DUP.

This matter is something to document so people know where to place this in their chain.
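A placement note might look like the following solrconfig.xml sketch. The chain name and parameter shapes here are illustrative only; `hashFieldName` and `dropSameDocuments` mirror names seen elsewhere in this PR, but the final configuration API may differ.

```xml
<updateRequestProcessorChain name="content-hash-dedup">
  <!-- DUP resolves atomic/partial updates into complete documents -->
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <!-- placed AFTER DUP: sees only full docs, and no leader check is needed -->
  <processor class="solr.ContentHashVersionProcessorFactory">
    <str name="hashFieldName">content_hash_</str>
    <bool name="dropSameDocuments">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```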

import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.util.plugin.SolrCoreAware;

/** Factory for {@link ContentHashVersionProcessor} instances. */
Contributor

This cool new URP should be documented here: solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc which will basically be a link to these javadocs. But these javadocs are looking sparse...

Contributor

Another thing worth documenting is that this URP should be configured to exclude fields updated with in-place updates.

super.processAdd(cmd);
return;
} catch (SolrException e) {
if (e.code() != 409) {
Contributor

Why would we get a 409? I'm guessing you were taking inspiration from DocBasedVersionConstraintsProcessor but that guy is using optimistic concurrency by setting the internal version, which this new URP here doesn't do. And we don't need to, as we are merely avoiding a redundant operation (an add of a doc that is identical to one already present) whereas DBVCP is potentially adding a document with a new version conditionally based on that it saw an existing doc with an old version.

rsp.addToLog("contentHash.duplicatesDetected", sameCount);
}
rsp.addToLog("contentHash.changed", differentCount);
} else {
Contributor

It's not uncommon for update batches to have no documents -- the ones that merely send a "commit", for example, or maybe just deletes. IMO we shouldn't bother logging if there were no documents in the batch.

Comment on lines +121 to +127
if (dropSameDocuments) {
rsp.addToLog("contentHash.duplicatesDropped", sameCount);
rsp.addToLog("contentHash.duplicatesDetected", sameCount);
} else {
rsp.addToLog("contentHash.duplicatesDropped", 0);
rsp.addToLog("contentHash.duplicatesDetected", sameCount);
}
Contributor

I propose that we either log duplicatesDropped or duplicatesDetected as appropriate. Doing both doesn't add information.

rsp.addToLog("contentHash.duplicatesDropped", 0);
rsp.addToLog("contentHash.duplicatesDetected", sameCount);
}
rsp.addToLog("contentHash.changed", differentCount);
Contributor

In your opinion, is this useful?

.forEach(
fieldName -> {
sig.add(fieldName);
Object o = doc.getFieldValue(fieldName);
Contributor

If we streamed SolrInputField instead of the field names, we wouldn't have to do this lookup. It's also not a big deal.

}
}

void setDropSameDocuments(boolean dropSameDocuments) {
Contributor

who calls this? A setter is surprising.
