feat(llm): expand graph extraction service APIs by LRriver · Pull Request #361 · apache/hugegraph-ai

LRriver · 2026-06-09T08:04:12Z

Summary

Extend the graph extraction API introduced in #351 (feat(llm): add /graph/extract API for programmatic graph extraction) with service-backed synchronous extraction, async jobs, graph import, and extract-and-import endpoints.
Add content_type/content support for raw text and pre-split chunks, request-bounded chunk parallelism, metadata, and structured error responses.
Harden schema validation, request-local HugeGraph client configuration, import result reporting, LLM config compatibility, and route registration.

Relation to #351

This builds on the initial synchronous POST /graph/extract endpoint from #351. The deprecated texts alias remains accepted: a string maps to content_type=text, and a list maps to pre-split chunks. Multi-document extraction remains caller-managed through multiple API requests instead of hidden batch semantics in the synchronous endpoint.
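The alias mapping described above can be sketched roughly as follows. This is only an illustration, not the PR's request model: `normalize_content` is a hypothetical helper, the `"chunks"` value for `content_type` is an assumed name for the pre-split mode, and giving `content` precedence over `texts` is also an assumption.

```python
from typing import Any, Dict, List, Optional, Union


def normalize_content(
    content_type: Optional[str] = None,
    content: Union[str, List[str], None] = None,
    texts: Union[str, List[str], None] = None,
) -> Dict[str, Any]:
    """Resolve the deprecated `texts` alias into content_type/content (hypothetical helper)."""
    if content is not None:
        # Assumption: the new fields win when both are supplied.
        if content_type is None:
            raise ValueError("content_type is required when content is given")
        return {"content_type": content_type, "content": content}
    if isinstance(texts, str):
        # A string maps to raw text.
        return {"content_type": "text", "content": texts}
    if isinstance(texts, list):
        # A list maps to pre-split chunks; "chunks" is an assumed value name.
        return {"content_type": "chunks", "content": list(texts)}
    raise ValueError("either content or texts must be provided")
```

A caller with several documents would invoke this once per request, since multi-document batching stays on the caller's side.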

Write API Safety

/graph/import and /graph/extract-and-import require write_to_graph=true.
Inline schema writes require client_config.graph so the target graph is explicit in the request and response metadata.
Property-graph import payloads are validated at the request boundary before reaching HugeGraph.
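A minimal sketch of the write guard these rules imply, assuming a plain-dict `client_config`. `check_write_request`, `WriteNotAllowedError`, and the `inline_schema` flag are hypothetical names for illustration; the PR's actual request models may enforce this differently.

```python
from typing import Any, Dict, Optional


class WriteNotAllowedError(ValueError):
    """Hypothetical error raised when a write request is not explicit enough."""


def check_write_request(
    write_to_graph: bool,
    client_config: Optional[Dict[str, Any]],
    inline_schema: bool,
) -> None:
    # Write endpoints must opt in explicitly.
    if write_to_graph is not True:
        raise WriteNotAllowedError("write_to_graph=true is required")
    # Inline schema writes must name the target graph in the request.
    if inline_schema:
        graph = (client_config or {}).get("graph")
        if not isinstance(graph, str) or not graph.strip():
            raise WriteNotAllowedError("client_config.graph is required for inline schema writes")
```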

Job Endpoint Notes

/graph/extract/jobs uses an in-memory, process-local job store.
Jobs and results are lost on service restart and are not shared across multiple API worker processes.
Cancellation only applies before a queued job starts; it cannot interrupt an active LLM call.
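The cancellation rule can be illustrated with a small process-local store. This is a sketch under assumptions, not the code in graph_extract_jobs.py: the class and method names are hypothetical, TTL and worker threads are left out, and only `SUCCEEDED` and a pending state are named in this thread (the other status strings are assumed).

```python
import threading
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Job:
    job_id: str
    status: str = "PENDING"  # assumed status names except SUCCEEDED
    result: Any = None


class InMemoryJobStore:
    """Process-local store: contents vanish on restart and are not shared across workers."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: Dict[str, Job] = {}

    def create(self) -> Job:
        job = Job(job_id=uuid.uuid4().hex)
        with self._lock:
            self._jobs[job.job_id] = job
        return job

    def cancel(self, job_id: str) -> bool:
        # Cancellation only succeeds while the job is still queued.
        with self._lock:
            job = self._jobs[job_id]
            if job.status != "PENDING":
                return False
            job.status = "CANCELLED"
            return True

    def run(self, job_id: str, work: Callable[[], Any]) -> None:
        with self._lock:
            job = self._jobs[job_id]
            if job.status != "PENDING":
                return  # cancelled before start: skip the work entirely
            job.status = "RUNNING"
        try:
            result = work()  # an in-flight call here cannot be interrupted
        except Exception:
            with self._lock:
                job.status = "FAILED"
            return
        with self._lock:
            job.result = result
            job.status = "SUCCEEDED"
```

Because the dictionary lives in one process, a deployment with several API workers would need an external store to make job status visible everywhere.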

Tests

uv run ruff format --check .
uv run ruff check .
uv run pytest hugegraph-llm/src/tests/api/test_graph_extract_api.py hugegraph-llm/src/tests/api/test_graph_import_api.py hugegraph-llm/src/tests/api/test_graph_extract_jobs.py -q
SKIP_EXTERNAL_SERVICES=true uv run pytest hugegraph-llm/src/tests/api -v --tb=short
SKIP_EXTERNAL_SERVICES=true uv run pytest hugegraph-llm/src/tests/config/ hugegraph-llm/src/tests/document/ hugegraph-llm/src/tests/operators/ hugegraph-llm/src/tests/models/ hugegraph-llm/src/tests/indices/ hugegraph-llm/src/tests/test_utils.py -v --tb=short

Review

Addressed review feedback in 7850be7 and 0888546.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a new graph extraction/import service layer and expands the FastAPI surface area with job-based extraction and import endpoints, while also making extraction behavior more configurable (chunk handling, split types, and parallel chunk processing) and improving robustness around malformed LLM output and import result reporting.

Changes:

Added GraphExtractService/GraphImportService plus request/response model updates, including redaction and schema normalization.
Added async-style job endpoints (/graph/extract/jobs/*) with an in-memory job store, plus new /graph/import and /graph/extract-and-import routes.
Updated extraction and import flows/operators to support configurable split types, pre-split chunks, parallel chunk extraction, and structured import stats.

Reviewed changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 7 comments.


File	Description
hugegraph-llm/src/tests/utils/test_graph_index_utils.py	Adds regression coverage for `extract_graph()` scheduler call shape.
hugegraph-llm/src/tests/operators/llm_op/test_property_graph_extract.py	Adds coverage for parent edgelabel schemas, parallel chunk ordering, serial fallback, and malformed JSON handling.
hugegraph-llm/src/tests/operators/llm_op/test_info_extract.py	Adds coverage for regex extraction with schema “shape” normalization.
hugegraph-llm/src/tests/operators/hugegraph_op/test_commit_to_hugegraph_load_into_graph.py	Updates expectations to “continue + report import_result” rather than raise on create failures.
hugegraph-llm/src/tests/operators/hugegraph_op/test_commit_to_hugegraph.py	Adds broader import behavior coverage including counts, id mapping, and normalized extraction inputs.
hugegraph-llm/src/tests/operators/document_op/test_chunk_split.py	Adds paragraph boundary behavior tests for short paragraphs.
hugegraph-llm/src/tests/nodes/test_request_graph_config.py	Tests request-scoped graph config propagation into nodes/operators.
hugegraph-llm/src/tests/nodes/test_extract_node.py	Ensures ExtractNode uses extract-LLM config and wires max-parallel-chunks.
hugegraph-llm/src/tests/nodes/test_base_node.py	Adds coverage that unexpected operator exceptions become error statuses.
hugegraph-llm/src/tests/models/llms/test_init_llm.py	Adds coverage for extract-LLM config fallback behavior across providers.
hugegraph-llm/src/tests/flows/test_graph_extract_flow.py	Tests split/content-type defaults and state reset semantics.
hugegraph-llm/src/tests/document/test_graph_extract_configurable_split.py	Extends flow post-deal expectations to include max_parallel_chunks in output.
hugegraph-llm/src/tests/api/test_graph_import_api.py	Adds coverage for import + extract-and-import endpoints, client_config behavior, and embedding updates.
hugegraph-llm/src/tests/api/test_graph_extract_jobs.py	Adds end-to-end coverage for job creation, execution, cancellation, expiry, and concurrency.
hugegraph-llm/src/tests/api/test_graph_extract_api.py	Refactors API tests around service layer, adds concurrency isolation and structured error expectations.
hugegraph-llm/src/hugegraph_llm/utils/hugegraph_utils.py	Adds request-scoped `graph_config` support to client creation.
hugegraph-llm/src/hugegraph_llm/state/ai_state.py	Extends workflow input/state with `content_type`, `max_parallel_chunks`, and `graph_config`.
hugegraph-llm/src/hugegraph_llm/services/graph_extract_service.py	Introduces extraction/import services, schema normalization, redaction, and flow JSON validation.
hugegraph-llm/src/hugegraph_llm/services/graph_extract_jobs.py	Adds an in-memory job store with TTL, queueing, worker threads, and status transitions.
hugegraph-llm/src/hugegraph_llm/services/__init__.py	Adds services package marker.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py	Adds parallel chunk extraction and explicit malformed-JSON failure path.
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py	Adds schema shape normalization helper for regex extraction.
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py	Makes semantic index building respect request-scoped graph config.
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py	Adds `graph_config` support while retaining full “connection unit” behavior.
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/commit_to_hugegraph.py	Adds request-scoped graph config and structured import_result counts/errors; continues on create failures.
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py	Adds paragraph-boundary splitting that preserves explicit paragraph breaks.
hugegraph-llm/src/hugegraph_llm/nodes/llm_node/extract_info.py	Switches to extract-LLM configuration and wires max-parallel-chunks into operator.
hugegraph-llm/src/hugegraph_llm/nodes/index_node/build_semantic_index.py	Passes request-scoped graph config into semantic index operator.
hugegraph-llm/src/hugegraph_llm/nodes/hugegraph_node/schema.py	Plumbs request graph_config into SchemaManager when not using full connection dict.
hugegraph-llm/src/hugegraph_llm/nodes/hugegraph_node/fetch_graph_data.py	Uses request-scoped graph_config for HugeGraph client creation.
hugegraph-llm/src/hugegraph_llm/nodes/hugegraph_node/commit_to_hugegraph.py	Creates Commit2Graph with request-scoped graph_config.
hugegraph-llm/src/hugegraph_llm/nodes/document_node/chunk_split.py	Skips splitting when content is already chunks; sets `context["chunks"]` directly.
hugegraph-llm/src/hugegraph_llm/nodes/base_node.py	Broadens exception handling to convert unexpected operator exceptions into error statuses.
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py	Adds extract-LLM fallback rules (e.g., reuse chat config when extract config missing).
hugegraph-llm/src/hugegraph_llm/flows/update_vid_embeddings.py	Adds graph_config plumbing into the flow input.
hugegraph-llm/src/hugegraph_llm/flows/import_graph_data.py	Adds graph_config plumbing into the import flow input.
hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py	Adds content_type/max_parallel_chunks parameters and enriches post-deal output.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py	Adjusts route registration order with new graph endpoints.
hugegraph-llm/src/hugegraph_llm/config/llm_config.py	Adds global defaults/limits for graph-extract parallel chunk calls.
hugegraph-llm/src/hugegraph_llm/api/models/graph_extract_responses.py	Adds typed error/job/import response models.
hugegraph-llm/src/hugegraph_llm/api/models/graph_extract_requests.py	Redesigns request contract around `content_type`/`content`, adds parallelism validation and import request models.
hugegraph-llm/src/hugegraph_llm/api/graph_extract_api.py	Adds job endpoints, import endpoints, structured error semantics, and request validation wrapping.


LRriver · 2026-06-09T08:40:13Z

+        try:
+            max_parallel_chunks = max(1, int(context.get("max_parallel_chunks") or self.max_parallel_chunks))
+        except (TypeError, ValueError):
+            max_parallel_chunks = max(1, self.max_parallel_chunks)
+        chunk_count = len(chunks)
+        worker_count = min(max_parallel_chunks, chunk_count)
+        context["max_parallel_chunks"] = worker_count
+        if worker_count <= 1:
+            proceeded_chunks = [self.extract_property_graph_by_llm(schema, chunk) for chunk in chunks]
+        else:
+            with ThreadPoolExecutor(max_workers=worker_count) as executor:
+                proceeded_chunks = list(
+                    executor.map(lambda chunk: self.extract_property_graph_by_llm(schema, chunk), chunks)
+                )


Fixed in 7850be7. Empty chunks now return without LLM calls and keep max_parallel_chunks metadata positive, with regression coverage.
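A standalone sketch of the behavior described in this reply, assuming the empty-chunk case reports a parallelism of 1 (the reply only says the metadata stays positive). `extract_chunks` and its parameters are hypothetical and stand in for the operator method in the diff above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence, Tuple


def extract_chunks(
    chunks: Sequence[str],
    extract_one: Callable[[str], Any],
    requested: Any,
    default: int = 1,
) -> Tuple[List[Any], int]:
    """Return (results in chunk order, worker count actually used)."""
    try:
        limit = max(1, int(requested or default))
    except (TypeError, ValueError):
        limit = max(1, default)
    if not chunks:
        # No LLM calls for empty input; keep the reported value positive (assumed to be 1).
        return [], 1
    worker_count = min(limit, len(chunks))
    if worker_count <= 1:
        return [extract_one(chunk) for chunk in chunks], worker_count
    with ThreadPoolExecutor(max_workers=worker_count) as executor:
        # executor.map yields results in input order, so chunk ordering is preserved.
        return list(executor.map(extract_one, chunks)), worker_count
```

Without the empty guard, `min(limit, 0)` would report a worker count of zero, which is the metadata problem the fix addresses.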

LRriver · 2026-06-09T08:40:15Z

+            with ThreadPoolExecutor(max_workers=worker_count) as executor:
+                proceeded_chunks = list(
+                    executor.map(lambda chunk: self.extract_property_graph_by_llm(schema, chunk), chunks)
+                )


Not changed in this pass. The current LLM wrappers used here do not maintain per-request mutable response buffers in PropertyGraphExtract; serializing extract_property_graph_by_llm behind one lock would effectively disable the chunk-level parallelism this API is adding. The API also bounds per-request parallelism by config and request. If a future provider wrapper proves non-thread-safe, the better fix would be provider-local isolation rather than a global lock in the extraction operator.

LRriver · 2026-06-09T08:40:17Z

+    @router.post("/graph/import", status_code=status.HTTP_200_OK)
+    def graph_import_api(req: GraphImportRequest) -> GraphImportResponse:


Fixed in 7850be7. Added response_model declarations for job, import, and extract-and-import endpoints, with route registration coverage.

LRriver · 2026-06-09T08:40:20Z

+    @router.post("/graph/extract-and-import", status_code=status.HTTP_200_OK)
+    def graph_extract_and_import_api(req: GraphExtractAndImportRequest) -> GraphExtractAndImportResponse:


Fixed in 7850be7. Added response_model declarations for job, import, and extract-and-import endpoints, with route registration coverage.

LRriver · 2026-06-09T08:40:22Z

+        if not vertices and not edges and not triples:
            log.critical("(Loading) Both vertices and edges are empty. Please check the input data again.")
            raise ValueError("Both vertices and edges input are empty.")


Fixed in 7850be7. Empty input messages now mention vertices, edges, and triples. Schema-free mode rejects vertices or edges, and schema mode rejects triples so mixed inputs are not silently dropped. Added regression coverage.
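The mode checks described here could look roughly like the following. `check_import_inputs` and the `schema_free` flag are hypothetical names, and the error messages are illustrative rather than the ones in Commit2Graph.

```python
from typing import Any, Optional, Sequence


def check_import_inputs(
    vertices: Optional[Sequence[Any]],
    edges: Optional[Sequence[Any]],
    triples: Optional[Sequence[Any]],
    schema_free: bool,
) -> None:
    # Nothing to import at all.
    if not vertices and not edges and not triples:
        raise ValueError("vertices, edges, and triples are all empty")
    # Schema-free mode works on triples, so property-graph inputs are rejected.
    if schema_free and (vertices or edges):
        raise ValueError("schema-free mode does not accept vertices or edges")
    # Schema mode works on vertices/edges, so triples are rejected.
    if not schema_free and triples:
        raise ValueError("schema mode does not accept triples")
```

Rejecting the mismatched input up front is what keeps mixed payloads from being partially imported without notice.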

LRriver · 2026-06-09T08:40:24Z

+            if not vertices and not edges:
+                log.critical("(Loading) Both vertices and edges are empty. Please check the input data again.")
+                raise ValueError("Both vertices and edges input are empty.")


Fixed in 7850be7. Empty input messages now mention vertices, edges, and triples. Schema-free mode rejects vertices or edges, and schema mode rejects triples so mixed inputs are not silently dropped. Added regression coverage.

LRriver · 2026-06-09T08:40:27Z

+            except RequestValidationError as exc:
+                return JSONResponse(
+                    status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
+                    content={
+                        "detail": _error(
+                            "GRAPH_EXTRACT_VALIDATION_ERROR",
+                            str(exc),
+                            "request",
+                        )
+                    },
+                )


Fixed in 7850be7. Validation responses now use sanitized loc, msg, and type summaries from exc.errors() and omit raw input values. Added coverage for password and URL not being echoed.
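A sketch of the sanitizing step, assuming the error entries have the usual `loc`/`msg`/`type`/`input` keys returned by `exc.errors()`. `summarize_validation_errors` is a hypothetical helper; it works on plain dicts so it runs without FastAPI installed.

```python
from typing import Any, Dict, Iterable, List


def summarize_validation_errors(errors: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only loc, msg, and type; drop `input` and any other raw request values."""
    summaries = []
    for err in errors:
        summaries.append(
            {
                "loc": [str(part) for part in err.get("loc", ())],
                "msg": str(err.get("msg", "")),
                "type": str(err.get("type", "")),
            }
        )
    return summaries
```

This only removes fields that carry raw input. If a validator's own message embeds a submitted value, that message would need separate scrubbing.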

LRriver · 2026-06-22T07:31:43Z

Update pushed in 9cecb9b:

Sanitized update_vid_embeddings warning output so exception details are not returned to clients.
Added regression coverage for sensitive exception text not appearing in warnings.
Split graph route registration tests into test_graph_extract_routes.py; test_graph_extract_api.py is now below the 600-line guideline.

Local verification:

uv run ruff format --check .
uv run ruff check .
SKIP_EXTERNAL_SERVICES=true uv run pytest hugegraph-llm/src/tests/api -v --tb=short
SKIP_EXTERNAL_SERVICES=true uv run pytest hugegraph-llm/src/tests -m "not integration and not hugegraph and not smoke and not external" --cov=hugegraph_llm --cov-fail-under=34 --cov-report=term --cov-report=xml:llm-unit-contract.xml -v --tb=short --durations=20

imbajin

I found one blocking import/schema issue and several API contract / coverage issues that should be addressed before merge. Also, this PR removes the non-blocking ty-check workflow (uv run ty check hugegraph-llm/src hugegraph-python-client/src) without an equivalent replacement; please keep or replace that type-check signal.

imbajin · 2026-06-29T13:31:26Z

+                "id": "person:Tom Hanks",
+                "label": "person",
+                "properties": {"name": "Tom Hanks", "age": 67},
+            },


‼️ This test covers CUSTOMIZE_STRING only by calling load_into_graph() directly, so it bypasses the real import flow that first calls init_schema_if_need(). That production path still creates every vertex label with usePrimaryKeyId().primaryKeys(...), while load_into_graph() later writes explicit ids for id_strategy == "CUSTOMIZE_STRING". An inline schema declaring custom string ids can therefore create a primary-key schema before the custom-id write path runs. Please make init_schema_if_need() branch on id_strategy and call useCustomizeStringId() for CUSTOMIZE_STRING, and add an end-to-end Commit2Graph.run() or import-flow test for this schema.

Fixed in 10524d2. init_schema_if_need() now branches on id_strategy and uses useCustomizeStringId() for CUSTOMIZE_STRING vertex labels instead of creating a primary-key schema first. Added a Commit2Graph.run() regression test so the production init_schema_if_need() path is covered before load_into_graph().

imbajin · 2026-06-29T13:31:26Z

+    if not isinstance(vertices, list) or not isinstance(edges, list):
+        raise FlowOutputValidationError("property graph result must contain list vertices and edges")
+    for vertex in vertices:
+        if not isinstance(vertex, dict) or "label" not in vertex or "properties" not in vertex:


⚠️ The /graph/extract workflow-output contract is weaker than the import request contract. This only checks that each vertex/edge is a dict and has the required keys, but it does not verify that label/outV/outVLabel/inV/inVLabel are non-empty strings or that properties is an object. A scheduler result like {"vertices":[{"label":"person","properties":null}],"edges":[]} would be returned as a successful extract response even though /graph/import rejects the same property-graph payload. Please align this validator with GraphImportRequest.validate_data() and add a contract test for malformed workflow output.

Fixed in 10524d2. GraphExtractService now validates workflow property-graph output against the import contract: vertex label must be a non-empty string, vertex properties must be an object, edge label/outV/outVLabel/inV/inVLabel must be non-empty strings, and edge properties must be an object. Added endpoint/service regression coverage where the scheduler returns properties=null and /graph/extract responds with GRAPH_EXTRACT_INVALID_FLOW_OUTPUT.
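The tightened contract can be sketched as below. `FlowOutputValidationError` is redefined locally as a stand-in, `validate_property_graph` is a hypothetical name, and treating whitespace-only strings as empty is an assumption.

```python
from typing import Any


class FlowOutputValidationError(ValueError):
    """Local stand-in for the error type named in the review thread."""


EDGE_STRING_KEYS = ("label", "outV", "outVLabel", "inV", "inVLabel")


def _non_empty_str(value: Any) -> bool:
    # Assumption: whitespace-only strings count as empty.
    return isinstance(value, str) and value.strip() != ""


def validate_property_graph(result: Any) -> None:
    if not isinstance(result, dict):
        raise FlowOutputValidationError("property graph result must be an object")
    vertices, edges = result.get("vertices"), result.get("edges")
    if not isinstance(vertices, list) or not isinstance(edges, list):
        raise FlowOutputValidationError("property graph result must contain list vertices and edges")
    for vertex in vertices:
        if not isinstance(vertex, dict) or not _non_empty_str(vertex.get("label")):
            raise FlowOutputValidationError("vertex label must be a non-empty string")
        if not isinstance(vertex.get("properties"), dict):
            raise FlowOutputValidationError("vertex properties must be an object")
    for edge in edges:
        if not isinstance(edge, dict):
            raise FlowOutputValidationError("edge must be an object")
        for key in EDGE_STRING_KEYS:
            if not _non_empty_str(edge.get(key)):
                raise FlowOutputValidationError(f"edge {key} must be a non-empty string")
        if not isinstance(edge.get("properties"), dict):
            raise FlowOutputValidationError("edge properties must be an object")
```

With this check, the `properties: null` payload from the review comment is rejected instead of being returned as a successful extraction.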

imbajin · 2026-06-29T13:31:26Z

+                return JSONResponse(
+                    status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
+                    content={
+                        "detail": _error(


⚠️ This route class is installed while registering extract, job, import, and extract-and-import endpoints, but every request-validation failure is hard-coded as GRAPH_EXTRACT_VALIDATION_ERROR. Invalid /graph/import bodies therefore return an extract-specific code/phase, making client error handling ambiguous. Please choose the validation error code/phase based on the route, or use separate handlers for extract and import routes, and cover invalid import payloads in endpoint tests.

Fixed in 10524d2. GraphExtractAPIRoute now maps request-validation errors by route path. Invalid /graph/import bodies return GRAPH_IMPORT_VALIDATION_ERROR with phase=import instead of the extract-specific validation code. Added endpoint coverage that an invalid import payload is rejected before GraphImportService is called.
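One way to picture the path-based mapping is a small lookup table. This is an assumption-heavy sketch: `validation_error_meta` is hypothetical, the `"request"` phase for extract routes is inferred from the third `_error(...)` argument in the diff, and the thread does not say how /graph/extract-and-import is mapped, so it falls through to the default here.

```python
from typing import Dict, Tuple

# Default taken from the original handler; the "request" phase is an inference.
_DEFAULT: Tuple[str, str] = ("GRAPH_EXTRACT_VALIDATION_ERROR", "request")

# Only the /graph/import mapping is confirmed by the reply above.
_BY_PATH: Dict[str, Tuple[str, str]] = {
    "/graph/import": ("GRAPH_IMPORT_VALIDATION_ERROR", "import"),
}


def validation_error_meta(path: str) -> Dict[str, str]:
    """Pick the validation error code and phase for a request path."""
    code, phase = _BY_PATH.get(path.rstrip("/"), _DEFAULT)
    return {"code": code, "phase": phase}
```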

imbajin · 2026-06-29T13:31:26Z

+            # TODO: transform to Enum first (better in earlier step)
+            data_type = property_label["data_type"]
+            cardinality = property_label["cardinality"]
+            if not self._check_property_data_type(data_type, cardinality, value):


⚠️ The operator-level property type validator does not match the public request validator. The request model explicitly excludes bool for BYTE/INT/LONG, but this path uses isinstance(value, int), so Python accepts True/False as integer values when named-schema or internal workflow paths bypass the request-side inline-schema check. Please share one validator between GraphImportRequest and Commit2Graph, and add coverage for INT=True plus the intended FLOAT/DOUBLE behavior for JSON integer values.

Fixed in 10524d2. Property value validation is now shared in hugegraph_llm.utils.schema_property and used by both GraphImportRequest and Commit2Graph. Integer types exclude bool, and FLOAT/DOUBLE accept JSON numeric int/float values while still excluding bool. Added request and operator coverage for INT=True, DOUBLE=1, and the named/internal import path that bypasses request-side inline-schema validation.
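The bool pitfall arises because `bool` is a subclass of `int` in Python, so `isinstance(True, int)` is true. A minimal sketch of a shared check for the two type families discussed here follows; `is_valid_number` is a hypothetical name, and cardinality plus the remaining data types are deliberately left out.

```python
from typing import Any

INTEGER_TYPES = ("BYTE", "INT", "LONG")
FLOAT_TYPES = ("FLOAT", "DOUBLE")


def is_valid_number(data_type: str, value: Any) -> bool:
    """Check a scalar against an integer or floating-point data type."""
    if data_type not in INTEGER_TYPES + FLOAT_TYPES:
        raise ValueError(f"unsupported data type in this sketch: {data_type}")
    # bool is a subclass of int, so it must be excluded before any int check.
    if isinstance(value, bool):
        return False
    if data_type in INTEGER_TYPES:
        return isinstance(value, int)
    # FLOAT/DOUBLE accept JSON integers as well as floats.
    return isinstance(value, (int, float))
```

Keeping one function like this in a shared module is what lets the request model and the operator agree on cases such as INT=True and DOUBLE=1.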

imbajin · 2026-06-29T13:31:26Z

+    }
+
+
+def _client(service=None, job_store=None, run_jobs_inline=True):


⚠️ The production default for graph_extract_http_api(...) is run_jobs_inline=None, which submits work through jobs.submit_job() and daemon workers, but this helper defaults to inline execution and the tests only exercise inline mode or a deliberately non-running pending mode. Please add a route-level test that omits run_jobs_inline, creates a job, polls until it reaches a terminal state, and verifies result retrieval so the public async worker path is covered.

Fixed in 10524d2. Added a route-level test for the production default run_jobs_inline=None path: it creates a job, lets the background worker execute it, polls until SUCCEEDED, then retrieves the result from /graph/extract/jobs/{job_id}/result.
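The polling step in such a test might be shaped like this. `poll_until_terminal` is a hypothetical helper; `get_status` stands in for a call that fetches the job status from the API, and the terminal status names other than `SUCCEEDED` are assumed.

```python
import time
from typing import Callable

# SUCCEEDED comes from the thread; the other terminal names are assumptions.
TERMINAL_STATUSES = frozenset({"SUCCEEDED", "FAILED", "CANCELLED"})


def poll_until_terminal(
    get_status: Callable[[], str],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> str:
    """Call get_status repeatedly until it returns a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout} seconds")
        time.sleep(interval)
```

A bounded timeout keeps the test from hanging if the background worker never picks up the job.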

LRriver · 2026-06-29T13:57:19Z

Updated in 10524d2 after rebasing extract_api onto the latest origin/main.

Addressed the latest review items:

/graph/extract now validates workflow output against the property-graph import contract before returning success. Malformed vertices/edges now fail with GRAPH_EXTRACT_INVALID_FLOW_OUTPUT.
GraphImportRequest and Commit2Graph now share one property value validator. Integer types reject bool; FLOAT/DOUBLE accept JSON numeric int/float values while rejecting bool.
Commit2Graph.init_schema_if_need() now uses useCustomizeStringId() for CUSTOMIZE_STRING vertex labels.
/graph/import request-validation errors now return GRAPH_IMPORT_VALIDATION_ERROR with phase=import.
Added coverage for the default background-worker job route path (run_jobs_inline=None).

Workflow/type-check signal:

The branch is rebased onto current origin/main; git diff origin/main...HEAD -- .github is empty, so this PR no longer removes the workflow changes from main.
I ran uv run --extra dev ty check hugegraph-llm/src hugegraph-python-client/src; it executes now, but reports the existing repository-wide baseline (719 diagnostics) across unrelated files and test fixtures. I did not fold that broad type cleanup into this PR.

Local verification:

uv run ruff format --check .
uv run ruff check .
SKIP_EXTERNAL_SERVICES=true uv run pytest hugegraph-llm/src/tests/api hugegraph-llm/src/tests/operators -v --tb=short (329 passed)

github-actions Bot and others added 3 commits June 2, 2026 16:14

sync: preserve local workflow files

7bf6688

Merge remote-tracking branch 'upstream/main'

da27bf5

Merge branch 'apache:main' into main

800b150

Copilot AI review requested due to automatic review settings June 9, 2026 08:04

dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. enhancement New feature or request labels Jun 9, 2026

Copilot AI reviewed Jun 9, 2026


github-actions Bot added the llm label Jun 9, 2026

github-actions Bot added 3 commits June 15, 2026 11:55

Merge remote-tracking branch 'upstream/main'

81d89e4

Merge remote-tracking branch 'upstream/main'

af4186a

Merge remote-tracking branch 'upstream/main'

f41baa2

LRriver force-pushed the extract_api branch 2 times, most recently from 00dee19 to e761776 Compare June 22, 2026 06:56

github-actions Bot added 3 commits June 23, 2026 06:46

Merge remote-tracking branch 'upstream/main'

b8d4b40

Merge remote-tracking branch 'upstream/main'

1a98cef

sync: preserve local workflow files

f3ba75d

imbajin reviewed Jun 29, 2026


LRriver and others added 12 commits June 29, 2026 21:50

feat(llm): add graph extraction service APIs

e1b23ad

test(llm): cover graph extraction service APIs

acda6ce

feat(llm): support graph extract content modes

5090f0c

test(llm): cover graph extract content modes

184bc73

fix(llm): harden graph extract runtime parsing

e297b3d

test(llm): cover graph extract API edge cases

402ba2e

fix(llm): return workflow node errors safely

6cedb5c

fix(llm): keep paragraph chunk boundaries

c435f5c

fix(llm): use extract LLM for graph extraction

3bb0c95

fix(llm): fall back extract LLM config to chat settings

7a6aff6

fix(llm): harden graph extract API follow-up

7aa8a50

fix(llm): address graph extract API review feedback

93c7c84

LRriver added 10 commits June 29, 2026 21:50

fix(llm): harden graph extract review follow-up

a2847c5

fix(llm): align graph import validation message

aac86e6

fix(llm): guard graph import service writes

cd9985d

fix(llm): sanitize graph extract error surfaces

9a044f3

test(llm): assert graph routes via openapi contract

f1c54ab

fix(llm): sanitize import warning messages

0c95388

fix(llm): avoid raw import payload errors

a8b18a4

fix(llm): validate graph import edge endpoints

a903a18

fix(llm): validate graph import properties

79a3d87

fix(llm): harden graph import contracts

10524d2

LRriver force-pushed the extract_api branch from 79ddbe9 to 10524d2 Compare June 29, 2026 13:55

3 participants
