Nested sourced_from support by ellisandrews-toast · Pull Request #1216 · block/elasticgraph

ellisandrews-toast · 2026-05-28T14:31:32Z

TODO

…ough self-update target

myronmarston

Thanks for putting this spike up. I don't have any major concerns--the architecture and approach is sound. I left some comments throughout.

Looking forward to seeing individual PRs spun off this!

myronmarston · 2026-05-28T20:06:51Z

On past projects, I've found it useful to update the schema artifacts structure (e.g. to add new fields like you are here) as a stand-alone PR before anything uses the new fields. Consider doing that for your PR stack.

myronmarston · 2026-05-28T20:30:57Z

+          @parent_relationship_config = {
+            parent_type_name: parent_type_name,
+            parent_relationship_name: parent_relationship_name
+          }


Let's define a Data class for this config. A hash is quick-and-dirty but is more prone to silently ignoring mispelled keys, etc--data_hash[:mispelled] returns nil whereas data_object.mispelled throws an exception.

The data class could be defined within this relationship class, e.g.:

class Relationship < DelegateClass(Field) # ... ParentRef = ::Data.define(:type_ref, :relationship_name) end

Also, instead of just storing the parent type name as a string, can we store it as a TypeRef? While we always accept types as strings via the APIs called by EG users, we generally convert them to type refs early on because type refs are more capable than strings :).

Implemented!

myronmarston · 2026-05-28T20:31:47Z


+        # @return [Hash, nil] configuration for parent relationship in a nested sourced_from chain
+        # @private
+        attr_reader :parent_relationship_config


Can we call this parent_relationship_ref? To me it's less configuration and more just a reference to the parent relationship.

myronmarston · 2026-05-28T20:39:55Z

The Results class is pretty large and I'd like to keep it from getting larger. It'd be nice to put your logic into another class which Results can delegate to. Actually, maybe we should preemptively extract identify_extra_update_targets_by_object_type_name into a separate class?

Maybe we can move that into SchemaUpdateTargetResolver which sits besides https://github.com/ellisandrews-toast/elasticgraph/blob/f5b4e7bce0b55bb86254898d83a336f52be2358d/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/indexing/update_target_resolver.rb ?

Moved this into a class called SourcedUpdateTargetsResolver. That class finds all sourced_from relationships (both top-level and nested), resolves them into UpdateTarget objects.

myronmarston · 2026-05-28T20:41:06Z

+
+      # Identifies update targets for sourced_from fields on non-indexed embedded types
+      # that use parent_relationship chains.
+      def identify_nested_sourced_update_targets(object_type, extra_update_targets_by_type_name, errors)


Can this logic be moved into https://github.com/ellisandrews-toast/elasticgraph/blob/f5b4e7bce0b55bb86254898d83a336f52be2358d/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/indexing/update_target_resolver.rb? IIRC that's the main spot where the update target logic for a single object type lives...which seems to be what you're working with here.

Moved this into the new SourcedUpdateTargetsResolver that delegates to both UpdateTargetResolver (for top-level) and NestedUpdateTargetResolver (for nested).

myronmarston · 2026-05-28T20:56:13Z

+            end
+
+            # Find the parent type
+            parent_type = @schema_def_state.object_types_by_name[config[:parent_type_name]]


Instead of looking up the type like this, if you use a type ref you can just do type_ref.as_object_type.

Done — storing a TypeRef on ParentRef and using ref.type_ref.as_object_type for lookup.

myronmarston · 2026-05-28T20:58:20Z

+
+// Splits a composite nested element key into a list of parts.
+List splitNestedElementKey(String nestedElementKey) {
+  return Arrays.asList(nestedElementKey.splitOnToken(":"));


In another comment I mentioned the danger of assuming : won't be in any keys. Since we can pass arbitrary JSON in our parameters, can we just pass this as a list?

Or, if you are encoding/decoding a list of strings for storage in a map key can you encode it as a JSON string (being careful to apply some "canonical" formatting)?

Per a previous comment above, I struggled to come up with a good solution here (but I bet we can find one). Implemented a naive length:value encoding for now that technically works.

Tried using Lists as map keys (which works in-memory in Painless), but it breaks when the document is serialized to JSON for storage and reloaded — JSON object keys must be strings.

Can spend more time on this aspect later.

myronmarston · 2026-05-28T21:26:33Z

+      parts.add(segment.get("object"));
+    }
+  }
+  return String.join(":", parts);


In another comment I mentioned the danger of assuming : won't be in any keys. We should figure out a way to encode nested keys that avoids making assumptions about what could be in the key values...

Yes, agreed. Mentioned some thoughts and current workaround in other comments.

myronmarston · 2026-05-28T21:28:03Z

+        "because this element was previously sourced from a different event (" + previousSourceIds + "). " +
+        "Each nested element can only be sourced from one source document."
+      );
+    }


/nit can we unify these two exceptions messages into one message that makes sense for both cases?

myronmarston · 2026-05-28T21:29:45Z

+}
+
+// Applies nested sourced data from the __nested_sourced_data buffer to matched nested elements.
+// Reads path config from the document itself — no external params needed.


That's interesting--dose that mean we are storing some static path config on each document? Seems like that would be inefficient to store the same path config on billions of docs rather than pass it in as params...

As discussed on Zoom, was doing this because it was awkward to thread this static config through to the script the "normal" way.

I decided to instead store this on the IndexDefinition (the per-index runtime metadata) and pass it to the painless script as a param (we are already doing this kind of thing for __counts). I think this makes sense because the path config is static configuration that describes the structure of nested relationships within an index — it's the same for every document in that index, and needs to be known by any update event targeting that index. The script reads it from params.nestedSourcedPaths at execution time. The __nested_sourced_data structure on the document now only stores the actual sourced field values (the data that varies per nested element), not the path navigation config.

myronmarston · 2026-06-01T16:05:22Z

+            [nil, errors]
+          else
+            # Register the path config on the destination index so it's available at runtime.
+            resolved_chain.root_indexed_type.index_def.register_nested_sourced_paths(relationship.name, nested_sourced_paths)


Ideally, a method either computes a result and returns it or mutates some state, but doesn't do both. I find it makes harder to reason about code when a method which looks like it just computes and returns a result also has some hidden side effect.

Since resolve primarily returns a computed value, can we find a simple way to move this registration into the caller? If it can't easily be done in the caller based on the returned value, one potential solution is to use yield here:

yield resolved_chain.root_indexed_type, relationship, nested_sourced_paths

Then the caller can do the registration in a passed block:

result = resolve do |indexed_type, relationship, nested_sourced_paths| indexed_type.index_def.register_nested_sourced_paths(relationship.name, nested_sourced_paths) end

That at least makes the side effect visible at the callsite so it's not a hidden surprise.

Just pushed a commit implementing this way with the yield.

myronmarston · 2026-06-01T16:10:25Z

+          # so it's important we avoid it.
          return initial_params unless update_target.for_normal_indexing?

+          initial_params["nestedSourcedPaths"] = destination_index_def.nested_sourced_paths.transform_values { |segments| segments.map(&:to_dumpable_hash) }


The transform_values { |segments| segments.map(&:to_dumpable_hash) } computation is being re-done for every single ingested event even though it produces the same result every time. It would be nice for the param form to be directly available off of destination_index_def so we don't need to recompute here.

Also to_dumpable_hash is specifically for dumping a runtime metadata object as YAML. The representation we want in YAML isn't necessarily the representation we want to pass to the painless script in params. Can we add to_painless_param to the runtime metadata objects and use that here? to_painless_param can just be an alias for to_dumpable_hash. (You can even use alias_method for it).

Just pushed a commit addressing these 2 things (computing the script params once and storing it on DatastoreCore::IndexDefinition, and doing that via adding the to_painless_param alias method)

ellisandrews-toast added 2 commits May 28, 2026 10:17

Implement nested sourced_from support

8ca4bd1

Store nested sourced path config in document instead of threading thr…

5c503ba

…ough self-update target

myronmarston reviewed May 28, 2026

View reviewed changes

Address PR feedback: refactor nested sourced_from implementation

df5f1c4

ellisandrews-toast force-pushed the nested-sourced-from branch from 0d96e36 to df5f1c4 Compare May 29, 2026 23:25

myronmarston reviewed Jun 1, 2026

View reviewed changes

Address PR feedback: separate concerns in nested sourced_from

a1a49f6

Conversation

ellisandrews-toast commented May 28, 2026

Uh oh!

myronmarston left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants