Skip to content

NIFI-16028 - Support nested field paths in PutElasticsearchJson#11355

Open
agturley wants to merge 2 commits into
apache:mainfrom
agturley:NIFI-16028
Open

NIFI-16028 - Support nested field paths in PutElasticsearchJson#11355
agturley wants to merge 2 commits into
apache:mainfrom
agturley:NIFI-16028

Conversation

@agturley

@agturley agturley commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

NIFI-16028

Allow the Index Field, Identifier Field, and Timestamp Field properties to reference a nested
field with a "/"-delimited path (e.g. "@metadata/index"). A value with no "/" behaves exactly as
before (top-level lookup), so existing flows are unaffected.

When the corresponding Retain property is false, the extracted leaf is removed and any ancestor
object left empty is pruned, so a transient envelope like "@metadata" is not indexed once its
fields are consumed; parents that still have other fields are kept.

Applies across NDJSON, JSON Array, and Single JSON, and all index operations. The NDJSON
raw-bytes fast path is bypassed only when a nested path is configured, since the streaming scan
only matches flat field names.

Builds on NIFI-15985.

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000
  • Pull request contains commits signed with a registered key indicating Verified status

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using ./mvnw clean install -P contrib-check
    • JDK 21
    • JDK 25

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

.name("Identifier Field")
.description("""
The name of the field within each document to use as the Elasticsearch document ID. \
A nested field can be referenced with a "/"-delimited path (e.g. "@metadata/id"). \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you rebase this branch on the latest main? The validation gate is failing on "branch not up to date", and because of that the full build and test matrix was skipped, so we do not have a green CI signal yet.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/**
* Whether {@code field} is a {@code /}-delimited nested path rather than a flat field name.
*/
private static boolean isNestedPath(final String field) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A field key that literally contains a slash is now always read as a nested path. Should we note this limitation in the property descriptions so users with such field names are not surprised?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've altered it to allow users to escape a backslash to allow a backslash in the path

}

@Test
void testNestedIndexFieldKeepsParentWithRemainingSiblings() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a test with a 3 level path (for example a/b/c) to lock in the recursive pruning of multiple empty ancestor objects?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

@Test
void testNestedIdentifierFieldNdjsonPrunesEmptyParent() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a nested identifier test for an Update, Delete, or Upsert operation, so the resolveId nested path is covered as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added testNestedIdentifierFieldUpdateOperationResolvesAndPrunes — it covers the resolveId nested path via an Update op (@metadata/id), verifying the nested ID resolves and the empty parent is pruned. Also added an escaped-slash test on the same path.

Allow the Index Field, Identifier Field, and Timestamp Field properties to
reference a nested field with a "/"-delimited path (e.g. "@metadata/index").
A value with no "/" behaves exactly as before (top-level lookup), so existing
flows are unaffected.

When the corresponding Retain property is false, the extracted leaf is removed
and any ancestor object left empty is pruned recursively, so a transient
envelope like "@metadata" is not indexed once its fields are consumed; parents
that still have other fields are kept.

A field name that literally contains a "/" can be addressed by escaping the
slash as "\/" (and a literal backslash as "\\"); a backslash not followed by
"/" or "\" is left untouched, so existing field names are unaffected. The
escaping rule and worked nesting/escaping examples are documented in the
Index, Identifier, and Timestamp Field property descriptions.

Applies across NDJSON, JSON Array, and Single JSON, and all index operations.
The NDJSON raw-bytes fast path is bypassed only when a nested path is
configured, since the streaming scan only matches flat field names; the path
classification and decoded flat names are resolved once per onTrigger rather
than per record.

Builds on NIFI-15985.
@agturley

agturley commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Do you think the description updates I provided are overkill?

.name("Identifier Field")
.description("""
The name of the field within each document to use as the Elasticsearch document ID. \
A nested field can be referenced with a "/"-delimited path: for a document \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These three properties shipped in released 2.10.0 with literal field name semantics, so an existing flow that has a slash in the configured value will now be read as a nested path after an upgrade. I am not an experienced Elasticsearch user and I do not know how common a slash is in these field names, and I agree the likelihood of a problematic value is low, but I think we should still handle this case safely with a property migration that escapes any existing slash or backslash, and add test coverage for that migration. Could we add that here?

@agturley agturley Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I started on this and ran into something I want to flag before going further.

The problem is migrateProperties runs on every flow import and restart, and there's no way to tell a 2.10 literal field name like @metadata/id apart from a 2.11 nested path @metadata/id — they're the exact same string. So if I just escape slashes, it also escapes legit nested paths and breaks them. I actually hit this on my test cluster: imported a normal 2.11 flow and every nested-path case quietly stopped working because the migration had rewritten @metadata/id to @metadata\/id on import.

The only way I found to make it safe is to rename the three properties (Identifier Field → Identifier Field Path, etc). Then the old name only exists in pre-2.11 flows, so I can escape values found under the old name and leave the new ones alone. That works, but it means renaming three properties everyone's already using.

This is feeling like a lot for the case we're protecting against — a 2.10 flow that happens to have a literal / in the field name, and slashes aren't really common in ES field names to begin with. So I'm inclined to skip the migration and just document it: on upgrade, a field name with a literal / gets read as a nested path now, and you can escape it as / if you want the old behavior.

Let me know if you'd rather I do the rename though, happy to if you think the upgrade guarantee is worth it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChrisSamo632 - thoughts on this? You're way more familiar with ES so your feedback would be appreciated if you have the chance

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of any restrictions around the use of / characters in field names in Elasticsearch or json in general, although I'd imagine the occurrence is relatively infrequent

Trying to think of examples, something like house_name/number could be the sort of thing someone might use, but I'd typically opt for an alternative in such a case as I'd see the / as unusual (and potentially problematic in some systems/parsers)

For this instance, would a workable option be to add a processor parameter that let's the user configure how a / should be treated in field names for this processor? The default could be to have the character treated as a literal, with a new mode that enables the nested object parsing?

Alternatively, would . be a better option as it matches the Elasticsearch nested query parsing, but then is further from the nifi record path parsing, which / emulates? The same concern would exist, although use of . in elasticsearch field names has been discouraged for a long time as it can confuse the Elasticsearch query parser - therefore, the chances of people using . litterals on field names seems lower

.name("Index Field")
.description("""
The name of the field within each document to use as the Elasticsearch index name. \
A nested field can be referenced with a "/"-delimited path: for a document \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The escape explanation is repeated in all three property descriptions. Could we move the full explanation into an additionalDetails.md for the processor and have each property description point to the processor documentation for the details instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, done

* metadata, e.g. {@code @metadata}, should not be indexed once its fields are extracted).
* Does nothing when the path is absent. A path with no slash is a direct top-level removal.
*/
private static void removeAtPath(final Map<String, Object> root, final String path) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Map and ObjectNode versions of removeAtPath duplicate the same descend and prune loop. Could we factor the shared walk into one helper so the two cannot drift over time?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@agturley agturley requested a review from pvillard31 June 23, 2026 00:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants