Skip to content

Nested sourced_from support#1216

Draft
ellisandrews-toast wants to merge 4 commits into
block:mainfrom
ellisandrews-toast:nested-sourced-from
Draft

Nested sourced_from support#1216
ellisandrews-toast wants to merge 4 commits into
block:mainfrom
ellisandrews-toast:nested-sourced-from

Conversation

@ellisandrews-toast

Copy link
Copy Markdown
Collaborator

TODO

@myronmarston myronmarston left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this spike up. I don't have any major concerns--the architecture and approach is sound. I left some comments throughout.

Looking forward to seeing individual PRs spun off this!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On past projects, I've found it useful to update the schema artifacts structure (e.g. to add new fields like you are here) as a stand-alone PR before anything uses the new fields. Consider doing that for your PR stack.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@parent_relationship_config = {
parent_type_name: parent_type_name,
parent_relationship_name: parent_relationship_name
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's define a Data class for this config. A hash is quick-and-dirty but is more prone to silently ignoring mispelled keys, etc--data_hash[:mispelled] returns nil whereas data_object.mispelled throws an exception.

The data class could be defined within this relationship class, e.g.:

class Relationship < DelegateClass(Field)
  # ...

  ParentRef = ::Data.define(:type_ref, :relationship_name)
end

Also, instead of just storing the parent type name as a string, can we store it as a TypeRef? While we always accept types as strings via the APIs called by EG users, we generally convert them to type refs early on because type refs are more capable than strings :).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented!


# @return [Hash, nil] configuration for parent relationship in a nested sourced_from chain
# @private
attr_reader :parent_relationship_config

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this parent_relationship_ref? To me it's less configuration and more just a reference to the parent relationship.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Results class is pretty large and I'd like to keep it from getting larger. It'd be nice to put your logic into another class which Results can delegate to. Actually, maybe we should preemptively extract identify_extra_update_targets_by_object_type_name into a separate class?

Maybe we can move that into SchemaUpdateTargetResolver which sits besides https://github.com/ellisandrews-toast/elasticgraph/blob/f5b4e7bce0b55bb86254898d83a336f52be2358d/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/indexing/update_target_resolver.rb ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this into a class called SourcedUpdateTargetsResolver. That class finds all sourced_from relationships (both top-level and nested), resolves them into UpdateTarget objects.


# Identifies update targets for sourced_from fields on non-indexed embedded types
# that use parent_relationship chains.
def identify_nested_sourced_update_targets(object_type, extra_update_targets_by_type_name, errors)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this logic be moved into https://github.com/ellisandrews-toast/elasticgraph/blob/f5b4e7bce0b55bb86254898d83a336f52be2358d/elasticgraph-schema_definition/lib/elastic_graph/schema_definition/indexing/update_target_resolver.rb? IIRC that's the main spot where the update target logic for a single object type lives...which seems to be what you're working with here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this into the new SourcedUpdateTargetsResolver that delegates to both UpdateTargetResolver (for top-level) and NestedUpdateTargetResolver (for nested).

end

# Find the parent type
parent_type = @schema_def_state.object_types_by_name[config[:parent_type_name]]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of looking up the type like this, if you use a type ref you can just do type_ref.as_object_type.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — storing a TypeRef on ParentRef and using ref.type_ref.as_object_type for lookup.


// Splits a composite nested element key into a list of parts.
List splitNestedElementKey(String nestedElementKey) {
return Arrays.asList(nestedElementKey.splitOnToken(":"));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In another comment I mentioned the danger of assuming : won't be in any keys. Since we can pass arbitrary JSON in our parameters, can we just pass this as a list?

Or, if you are encoding/decoding a list of strings for storage in a map key can you encode it as a JSON string (being careful to apply some "canonical" formatting)?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per a previous comment above, I struggled to come up with a good solution here (but I bet we can find one). Implemented a naive length:value encoding for now that technically works.

Tried using Lists as map keys (which works in-memory in Painless), but it breaks when the document is serialized to JSON for storage and reloaded — JSON object keys must be strings.

Can spend more time on this aspect later.

parts.add(segment.get("object"));
}
}
return String.join(":", parts);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In another comment I mentioned the danger of assuming : won't be in any keys. We should figure out a way to encode nested keys that avoids making assumptions about what could be in the key values...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed. Mentioned some thoughts and current workaround in other comments.

"because this element was previously sourced from a different event (" + previousSourceIds + "). " +
"Each nested element can only be sourced from one source document."
);
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/nit can we unify these two exceptions messages into one message that makes sense for both cases?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

// Applies nested sourced data from the __nested_sourced_data buffer to matched nested elements.
// Reads path config from the document itself — no external params needed.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting--dose that mean we are storing some static path config on each document? Seems like that would be inefficient to store the same path config on billions of docs rather than pass it in as params...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on Zoom, was doing this because it was awkward to thread this static config through to the script the "normal" way.

I decided to instead store this on the IndexDefinition (the per-index runtime metadata) and pass it to the painless script as a param (we are already doing this kind of thing for __counts). I think this makes sense because the path config is static configuration that describes the structure of nested relationships within an index — it's the same for every document in that index, and needs to be known by any update event targeting that index. The script reads it from params.nestedSourcedPaths at execution time. The __nested_sourced_data structure on the document now only stores the actual sourced field values (the data that varies per nested element), not the path navigation config.

[nil, errors]
else
# Register the path config on the destination index so it's available at runtime.
resolved_chain.root_indexed_type.index_def.register_nested_sourced_paths(relationship.name, nested_sourced_paths)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, a method either computes a result and returns it or mutates some state, but doesn't do both. I find it makes harder to reason about code when a method which looks like it just computes and returns a result also has some hidden side effect.

Since resolve primarily returns a computed value, can we find a simple way to move this registration into the caller? If it can't easily be done in the caller based on the returned value, one potential solution is to use yield here:

yield resolved_chain.root_indexed_type, relationship, nested_sourced_paths

Then the caller can do the registration in a passed block:

result = resolve do |indexed_type, relationship, nested_sourced_paths|
  indexed_type.index_def.register_nested_sourced_paths(relationship.name, nested_sourced_paths)
end

That at least makes the side effect visible at the callsite so it's not a hidden surprise.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed a commit implementing this way with the yield.

# so it's important we avoid it.
return initial_params unless update_target.for_normal_indexing?

initial_params["nestedSourcedPaths"] = destination_index_def.nested_sourced_paths.transform_values { |segments| segments.map(&:to_dumpable_hash) }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The transform_values { |segments| segments.map(&:to_dumpable_hash) } computation is being re-done for every single ingested event even though it produces the same result every time. It would be nice for the param form to be directly available off of destination_index_def so we don't need to recompute here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to_dumpable_hash is specifically for dumping a runtime metadata object as YAML. The representation we want in YAML isn't necessarily the representation we want to pass to the painless script in params. Can we add to_painless_param to the runtime metadata objects and use that here? to_painless_param can just be an alias for to_dumpable_hash. (You can even use alias_method for it).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed a commit addressing these 2 things (computing the script params once and storing it on DatastoreCore::IndexDefinition, and doing that via adding the to_painless_param alias method)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants