Skip to content

perf(query): reuse Memgraph connection in semantic search tools (#505)#545

Merged
vitali87 merged 2 commits into
vitali87:mainfrom
ChetanyaRathi:perf/reuse-memgraph-conn-semantic-search
Jun 29, 2026
Merged

perf(query): reuse Memgraph connection in semantic search tools (#505)#545
vitali87 merged 2 commits into
vitali87:mainfrom
ChetanyaRathi:perf/reuse-memgraph-conn-semantic-search

Conversation

@ChetanyaRathi

Copy link
Copy Markdown
Contributor

Summary

  • Reuse the orchestrator's existing MemgraphIngestor in semantic_code_search and get_function_source_code instead of opening a new Memgraph connection on every call.
  • Switch the private ingestor._execute_query() calls to the public fetch_all() API.
  • Thread the ingestor through the create_semantic_search_tool / create_get_function_source_tool factories and their call sites in main.py and mcp/tools.py. Behavior is unchanged — fetch_all(query, params) is the public wrapper around _execute_query(query, params).

Type of Change

  • Bug fix
  • New feature
  • Performance improvement
  • Refactoring (no functional changes)
  • Documentation
  • CI/CD or tooling
  • Dependencies

Related Issues

Closes #505

Test Plan

  • Added test_semantic_code_search_reuses_injected_ingestor, which asserts the injected ingestor's fetch_all() is called and _execute_query() is never called — a regression guard against re-introducing a per-call connection.
  • Migrated the existing test_semantic_search.py suite to pass a mock ingestor directly rather than patching MemgraphIngestor.
  • Ran uv run pytest -m "not integration" locally: all 4019 non-integration tests pass. The only failures are the Java-oracle tests (test_java_*_oracle.py), which shell out to javac and fail solely because a JDK isn't installed in my local environment — unrelated to this change.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the semantic search and function source tools to accept an injected ingestor instance rather than instantiating MemgraphIngestor internally. This change decouples the tools from configuration settings and simplifies database query execution by utilizing fetch_all instead of _execute_query. Corresponding updates were made to the MCP tools, main initialization, and test suites. Feedback on the changes suggests using QueryProtocol instead of the concrete MemgraphIngestor class for type annotations to resolve type mismatches with static analysis tools, and using .get() with fallbacks instead of direct dictionary access when parsing query results to prevent potential KeyError exceptions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread codebase_rag/tools/semantic_search.py Outdated
Comment on lines +53 to +68
result = results_map[node_id]
result_type = result["type"]
type_str = (
result_type[0]
if isinstance(result_type, list) and result_type
else cs.SEMANTIC_TYPE_UNKNOWN
)
formatted_results.append(
SemanticSearchResult(
node_id=node_id,
qualified_name=str(result["qualified_name"]),
name=str(result["name"]),
type=type_str,
score=round(score, 3),
)
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent potential KeyError exceptions at runtime, it is safer to use .get() when retrieving fields from the query results. This is especially important for dynamic graph databases where some properties (like type, qualified_name, or name) might occasionally be missing or malformed on certain nodes.

Using .get() with appropriate fallbacks also aligns with how other fields (like path, start_line, and end_line) are defensively retrieved in get_function_source_code later in this file.

Suggested change
result = results_map[node_id]
result_type = result["type"]
type_str = (
result_type[0]
if isinstance(result_type, list) and result_type
else cs.SEMANTIC_TYPE_UNKNOWN
)
formatted_results.append(
SemanticSearchResult(
node_id=node_id,
qualified_name=str(result["qualified_name"]),
name=str(result["name"]),
type=type_str,
score=round(score, 3),
)
)
result = results_map[node_id]
result_type = result.get("type")
type_str = (
result_type[0]
if isinstance(result_type, list) and result_type
else cs.SEMANTIC_TYPE_UNKNOWN
)
formatted_results.append(
SemanticSearchResult(
node_id=node_id,
qualified_name=str(result.get("qualified_name", "")),
name=str(result.get("name", "")),
type=type_str,
score=round(score, 3),
)
)

@greptile-apps

greptile-apps Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR reuses the existing Memgraph ingestor for semantic search tools. The main changes are:

  • Injects the orchestrator or MCP registry ingestor into semantic search tool factories.
  • Replaces private graph query calls with the public fetch_all() API.
  • Runs semantic search and source lookup work off the async event loop.
  • Updates semantic search tests to use injected mock ingestors.

Confidence Score: 5/5

The change is narrowly scoped to dependency injection and public query API usage for semantic search tooling.

No correctness issues were identified in the reviewed changes, and the updated tests cover the intended ingestor reuse behavior.

T-Rex T-Rex Logs

What T-Rex did

  • Compared the base and head code paths for semantic_code_search and create_semantic_search_tool, and both runs returned the same two formatted results and the same tool response; the base path used MemgraphIngestor twice with two _execute_query calls and the head path used fetch_all twice.
  • Verified that both direct and tool paths produced the same expected source and exited with code 0, and that the logs captured the exact script content along with the command and working directory.
  • Observed the call-site wiring change: base code called create_semantic_search_tool() and create_get_function_source_tool() with no arguments, while head code called these factories with an ingestor argument (and similar self.ingestor variants), with the factory definitions still showing empty raw_args/args; runtime import was blocked by missing loguru, but the AST/regex probe confirmed the signature changes without starting external services.

View all artifacts

T-Rex Ran code and verified through T-Rex

Comments Outside Diff (1)

  1. General comment

    P1 Semantic search factory call sites pass ingestor to zero-argument factories

    • Bug
      • At head, codebase_rag/main.py calls both semantic factory functions with ingestor, and codebase_rag/mcp/tools.py calls them with self.ingestor. However, codebase_rag/tools/semantic_search.py still defines create_semantic_search_tool() and create_get_function_source_tool() with no parameters. These changed call sites will raise TypeError: ... takes 0 positional arguments but 1 was given when semantic dependencies are available or orchestrator initialization reaches these tools.
    • Cause
      • The call sites were updated to thread the existing ingestor instance, but the semantic tool factory signatures and implementations at the checked head commit were not updated to accept and use that ingestor.
    • Fix
      • Update create_semantic_search_tool and create_get_function_source_tool to accept an ingestor/query protocol parameter and ensure their inner tool functions use that instance instead of any per-call/global connection path. Alternatively, revert the call-site argument change, but that would not satisfy the intended shared-ingestor contract.

    T-Rex Ran code and verified through T-Rex

Reviews (2): Last reviewed commit: "Address PR review: QueryProtocol, defens..." | Re-trigger Greptile

Comment thread codebase_rag/mcp/tools.py
)

self._semantic_search_tool = create_semantic_search_tool()
self._semantic_search_tool = create_semantic_search_tool(self.ingestor)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Shared Connection Concurrent Reads

This captures the registry's long-lived ingestor for semantic search, while other MCP handlers use the same ingestor from worker threads. A semantic search request can now run fetch_all() on the event-loop thread at the same time as list_projects() or another handler runs on the same mgclient connection, and _get_cursor() only locks cursor creation, not query execution. Concurrent MCP requests can corrupt or fail graph reads because the shared connection is not serialized.

Prompt To Fix With AI
This is a comment left during a code review.
Path: codebase_rag/mcp/tools.py
Line: 96

Comment:
**Shared Connection Concurrent Reads**

This captures the registry's long-lived ingestor for semantic search, while other MCP handlers use the same ingestor from worker threads. A semantic search request can now run `fetch_all()` on the event-loop thread at the same time as `list_projects()` or another handler runs on the same mgclient connection, and `_get_cursor()` only locks cursor creation, not query execution. Concurrent MCP requests can corrupt or fail graph reads because the shared connection is not serialized.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread codebase_rag/mcp/tools.py
self._shell_command_tool,
self._directory_lister_tool,
create_get_function_source_tool(),
create_get_function_source_tool(self.ingestor),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Agent Tool Shares Connection

The source-code tool captured by the long-lived agent now uses the same registry ingestor as the rest of the MCP server. When ask_agent invokes this tool while another MCP request is reading or updating the graph through self.ingestor, source lookup can execute on the same connection without the registry lock. That can make source retrieval fail intermittently or break the shared Memgraph session.

Artifacts

Repro: focused shared-ingestor concurrency harness

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: execution log showing concurrent use of the shared ingestor

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: codebase_rag/mcp/tools.py
Line: 347

Comment:
**Agent Tool Shares Connection**

The source-code tool captured by the long-lived agent now uses the same registry ingestor as the rest of the MCP server. When `ask_agent` invokes this tool while another MCP request is reading or updating the graph through `self.ingestor`, source lookup can execute on the same connection without the registry lock. That can make source retrieval fail intermittently or break the shared Memgraph session.

How can I resolve this? If you propose a fix, please make it concise.

…load

Use QueryProtocol for ingestor typing to match main.py and other tools.
Parse graph result fields with .get() fallbacks. Offload ingestor I/O via
asyncio.to_thread in async tool wrappers, matching query_code_graph pattern.

@vitali87 vitali87 left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChetanyaRathi thank you for the PR. Please clear all greptile and gemini comments until the greptile Confidence Score becomes 5.
I will review after.

@ChetanyaRathi

Copy link
Copy Markdown
Contributor Author

@greptileai please re-review. The previous pass ran against the first commit (c74e94e); the latest commit (f95bd55) addresses the open feedback:

  • Type hints: the ingestor params now use QueryProtocol in semantic_search.py and both tool factories, matching how main.py already annotates it (resolves the mypy/pyright mismatch).
  • Defensive parsing: the semantic-search result mapping uses .get() with fallbacks instead of direct key access.
  • Threading: the blocking fetch_all() calls in the semantic-search and source-lookup tools are offloaded via asyncio.to_thread, so they no longer run on the event loop.

On the two "shared connection" P1s, I believe these are false positives: every query on the shared ingestor is serialized by self._conn_lock in graph_service.py. The yield in _get_cursor() sits inside the with self._conn_lock: block, so the lock is held across execute() and the fetch, not just cursor creation. fetch_all() / execute_write() / list_projects() all route through it, so concurrent calls queue on the lock rather than colliding on the connection. Happy to walk through it inline if the re-run still flags them.

@ChetanyaRathi

Copy link
Copy Markdown
Contributor Author

@vitali87 Greptile is now at 5/5 on the latest commit (f95bd55), and I've resolved the open review threads, also ready for your review whenever you get a chance. Thank you!

@vitali87 vitali87 merged commit 32e50f5 into vitali87:main Jun 29, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

perf(query): semantic search creates new Memgraph connection per call instead of reusing

2 participants