NIFI-16028 - Support nested field paths in PutElasticsearchJson#11355
NIFI-16028 - Support nested field paths in PutElasticsearchJson#11355agturley wants to merge 2 commits into
Conversation
| .name("Identifier Field") | ||
| .description(""" | ||
| The name of the field within each document to use as the Elasticsearch document ID. \ | ||
| A nested field can be referenced with a "/"-delimited path (e.g. "@metadata/id"). \ |
There was a problem hiding this comment.
Could you rebase this branch on the latest main? The validation gate is failing on "branch not up to date", and because of that the full build and test matrix was skipped, so we do not have a green CI signal yet.
| /** | ||
| * Whether {@code field} is a {@code /}-delimited nested path rather than a flat field name. | ||
| */ | ||
| private static boolean isNestedPath(final String field) { |
There was a problem hiding this comment.
A field key that literally contains a slash is now always read as a nested path. Should we note this limitation in the property descriptions so users with such field names are not surprised?
There was a problem hiding this comment.
I've altered it to allow users to escape a backslash to allow a backslash in the path
| } | ||
|
|
||
| @Test | ||
| void testNestedIndexFieldKeepsParentWithRemainingSiblings() { |
There was a problem hiding this comment.
Could we add a test with a 3 level path (for example a/b/c) to lock in the recursive pruning of multiple empty ancestor objects?
| } | ||
|
|
||
| @Test | ||
| void testNestedIdentifierFieldNdjsonPrunesEmptyParent() { |
There was a problem hiding this comment.
Should we add a nested identifier test for an Update, Delete, or Upsert operation, so the resolveId nested path is covered as well?
There was a problem hiding this comment.
Added testNestedIdentifierFieldUpdateOperationResolvesAndPrunes — it covers the resolveId nested path via an Update op (@metadata/id), verifying the nested ID resolves and the empty parent is pruned. Also added an escaped-slash test on the same path.
Allow the Index Field, Identifier Field, and Timestamp Field properties to reference a nested field with a "/"-delimited path (e.g. "@metadata/index"). A value with no "/" behaves exactly as before (top-level lookup), so existing flows are unaffected. When the corresponding Retain property is false, the extracted leaf is removed and any ancestor object left empty is pruned recursively, so a transient envelope like "@metadata" is not indexed once its fields are consumed; parents that still have other fields are kept. A field name that literally contains a "/" can be addressed by escaping the slash as "\/" (and a literal backslash as "\\"); a backslash not followed by "/" or "\" is left untouched, so existing field names are unaffected. The escaping rule and worked nesting/escaping examples are documented in the Index, Identifier, and Timestamp Field property descriptions. Applies across NDJSON, JSON Array, and Single JSON, and all index operations. The NDJSON raw-bytes fast path is bypassed only when a nested path is configured, since the streaming scan only matches flat field names; the path classification and decoded flat names are resolved once per onTrigger rather than per record. Builds on NIFI-15985.
|
Do you think the description updates I provided are overkill? |
| .name("Identifier Field") | ||
| .description(""" | ||
| The name of the field within each document to use as the Elasticsearch document ID. \ | ||
| A nested field can be referenced with a "/"-delimited path: for a document \ |
There was a problem hiding this comment.
These three properties shipped in released 2.10.0 with literal field name semantics, so an existing flow that has a slash in the configured value will now be read as a nested path after an upgrade. I am not an experienced Elasticsearch user and I do not know how common a slash is in these field names, and I agree the likelihood of a problematic value is low, but I think we should still handle this case safely with a property migration that escapes any existing slash or backslash, and add test coverage for that migration. Could we add that here?
There was a problem hiding this comment.
So I started on this and ran into something I want to flag before going further.
The problem is migrateProperties runs on every flow import and restart, and there's no way to tell a 2.10 literal field name like @metadata/id apart from a 2.11 nested path @metadata/id — they're the exact same string. So if I just escape slashes, it also escapes legit nested paths and breaks them. I actually hit this on my test cluster: imported a normal 2.11 flow and every nested-path case quietly stopped working because the migration had rewritten @metadata/id to @metadata\/id on import.
The only way I found to make it safe is to rename the three properties (Identifier Field → Identifier Field Path, etc). Then the old name only exists in pre-2.11 flows, so I can escape values found under the old name and leave the new ones alone. That works, but it means renaming three properties everyone's already using.
This is feeling like a lot for the case we're protecting against — a 2.10 flow that happens to have a literal / in the field name, and slashes aren't really common in ES field names to begin with. So I'm inclined to skip the migration and just document it: on upgrade, a field name with a literal / gets read as a nested path now, and you can escape it as / if you want the old behavior.
Let me know if you'd rather I do the rename though, happy to if you think the upgrade guarantee is worth it.
There was a problem hiding this comment.
@ChrisSamo632 - thoughts on this? You're way more familiar with ES so your feedback would be appreciated if you have the chance
There was a problem hiding this comment.
I'm not aware of any restrictions around the use of / characters in field names in Elasticsearch or json in general, although I'd imagine the occurrence is relatively infrequent
Trying to think of examples, something like house_name/number could be the sort of thing someone might use, but I'd typically opt for an alternative in such a case as I'd see the / as unusual (and potentially problematic in some systems/parsers)
For this instance, would a workable option be to add a processor parameter that let's the user configure how a / should be treated in field names for this processor? The default could be to have the character treated as a literal, with a new mode that enables the nested object parsing?
Alternatively, would . be a better option as it matches the Elasticsearch nested query parsing, but then is further from the nifi record path parsing, which / emulates? The same concern would exist, although use of . in elasticsearch field names has been discouraged for a long time as it can confuse the Elasticsearch query parser - therefore, the chances of people using . litterals on field names seems lower
| .name("Index Field") | ||
| .description(""" | ||
| The name of the field within each document to use as the Elasticsearch index name. \ | ||
| A nested field can be referenced with a "/"-delimited path: for a document \ |
There was a problem hiding this comment.
The escape explanation is repeated in all three property descriptions. Could we move the full explanation into an additionalDetails.md for the processor and have each property description point to the processor documentation for the details instead?
| * metadata, e.g. {@code @metadata}, should not be indexed once its fields are extracted). | ||
| * Does nothing when the path is absent. A path with no slash is a direct top-level removal. | ||
| */ | ||
| private static void removeAtPath(final Map<String, Object> root, final String path) { |
There was a problem hiding this comment.
The Map and ObjectNode versions of removeAtPath duplicate the same descend and prune loop. Could we factor the shared walk into one helper so the two cannot drift over time?
…dditionalDetails, add pass-through tests
Summary
NIFI-16028
Allow the Index Field, Identifier Field, and Timestamp Field properties to reference a nested
field with a "/"-delimited path (e.g. "@metadata/index"). A value with no "/" behaves exactly as
before (top-level lookup), so existing flows are unaffected.
When the corresponding Retain property is false, the extracted leaf is removed and any ancestor
object left empty is pruned, so a transient envelope like "@metadata" is not indexed once its
fields are consumed; parents that still have other fields are kept.
Applies across NDJSON, JSON Array, and Single JSON, and all index operations. The NDJSON
raw-bytes fast path is bypassed only when a nested path is configured, since the streaming scan
only matches flat field names.
Builds on NIFI-15985.
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
Pull Request Tracking
NIFI-00000NIFI-00000VerifiedstatusPull Request Formatting
mainbranchVerification
Please indicate the verification steps performed prior to pull request creation.
Build
./mvnw clean install -P contrib-checkLicensing
LICENSEandNOTICEfilesDocumentation