Storage unification + incremental parity + MCP reader migration (PR 5/5) by Shidfar · Pull Request #380 · DeusData/codebase-memory-mcp

Shidfar · 2026-05-26T08:48:53Z

Summary

Three coupled changes that close out the protocol-linking work:

Storage unification. Cross-project links move from a separate _crosslinks.db (introduced in #378, "Cross-repo pass + community detection + paginated cross_project_links (PR 3/5)") into each project's own edges table via synthetic MessagingChannel anchor nodes, mirroring the pre-existing HTTP Route-anchor pattern. Anchors are reactive (created only when emit_cross_edge_pair confirms a producer→consumer match), never speculative.
Incremental-pipeline parity. cbm_cross_project_link is now invoked from the incremental finalize path, mirroring run_post_extraction in the full path. Post-storage-unification this is required because channel anchors live in each project's own DB; without it, incr_accuracy_vs_full started failing when the cache had real cross-project matches.
MCP reader migration. Storage unification moved the writer side but left the reader querying the legacy _crosslinks.db.cross_links table — silently returning "no links found" for every caller. This PR rewrites the reader to fan out across per-project DBs.

Stacked on #379 — please review the earlier PRs first.

Commits

refactor: unify cross-repo storage on edges table — writer side
fix: invoke cbm_cross_project_link from incremental pipeline — full/incremental parity
feat(mcp): migrate cross_project_links reader to per-project edges — see below
test(mcp): cover cross_project_links reader end-to-end — 2 new tests in the existing cross_project_links suite

MCP reader migration detail

handle_cross_project_links now:

Enumerates *.db files via cbm_opendir / is_project_db_file (same convention list_projects uses), skipping the legacy _crosslinks.db and other _*.db hidden DBs.
For each project DB, selects producer-side CROSS_* edges: JOIN nodes ON source_id, properties LIKE '%"target_project"%'. The target_project predicate naturally excludes consumer-side edges (which carry source_project instead), so each link surfaces exactly once.
Parses properties JSON into a row: {protocol, identifier, producer_project, producer_qn, producer_file, consumer_project, consumer_qn, consumer_file, confidence}. Falls back to url_path when identifier is absent — that's upstream's HTTP/async schema where url_path plays the same role.
Filters / sorts / paginates in memory: protocol asc, identifier asc, confidence desc.
Aggregates "by protocol" via a contiguous-runs walk on the sorted list, and "top project pairs" via a small dynamic table with a partial selection sort for top 10.
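The sort order and the contiguous-runs aggregation described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `xl_row_t` and `xl_proto_count_t` structs and both function names are hypothetical, and only the ordering (protocol asc, identifier asc, confidence desc) and the single-pass run counting come from the description.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical row shape; the real reader carries more fields. */
typedef struct {
    const char *protocol;
    const char *identifier;
    double confidence;
} xl_row_t;

/* qsort comparator: protocol asc, identifier asc, confidence desc. */
static int xl_row_cmp(const void *a, const void *b) {
    const xl_row_t *x = a;
    const xl_row_t *y = b;
    int c = strcmp(x->protocol, y->protocol);
    if (c != 0) return c;
    c = strcmp(x->identifier, y->identifier);
    if (c != 0) return c;
    if (x->confidence > y->confidence) return -1;
    if (x->confidence < y->confidence) return 1;
    return 0;
}

typedef struct {
    const char *protocol;
    size_t count;
} xl_proto_count_t;

/* Walk rows that are already sorted by protocol. Each contiguous run of
 * equal protocols becomes one bucket, so no hash table is needed.
 * Returns the number of buckets written (at most cap). */
static size_t xl_count_by_protocol(const xl_row_t *rows, size_t n,
                                   xl_proto_count_t *out, size_t cap) {
    assert(out != NULL || cap == 0);
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        if (used > 0 && strcmp(out[used - 1].protocol, rows[i].protocol) == 0) {
            out[used - 1].count++;          /* still inside the current run */
        } else if (used < cap) {
            out[used].protocol = rows[i].protocol;  /* a new run starts */
            out[used].count = 1;
            used++;
        }
    }
    return used;
}
```

The run walk is only correct after sorting with `xl_row_cmp` (or any order that groups equal protocols together), which is why the aggregation can reuse the list that pagination already needs.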

The old xl_bind_filters SQL-bind helper is gone; filtering moved to xl_row_matches in the in-memory path.
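An in-memory row filter of the kind described here might look like the sketch below. The real xl_row_matches signature is not shown in this PR text, so the struct layouts and the name `row_matches` are hypothetical. The three filter fields follow the protocol/project/identifier filters mentioned for the tool; treating the project filter as "matches either side of the link" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical subset of a parsed cross-link row. */
typedef struct {
    const char *protocol;
    const char *identifier;
    const char *producer_project;
    const char *consumer_project;
} link_row_t;

/* NULL in any field means "no constraint on this field". */
typedef struct {
    const char *protocol;
    const char *project;
    const char *identifier;
} link_filter_t;

static bool field_ok(const char *want, const char *have) {
    return want == NULL || (have != NULL && strcmp(want, have) == 0);
}

static bool row_matches(const link_row_t *r, const link_filter_t *f) {
    assert(r != NULL && f != NULL);
    if (!field_ok(f->protocol, r->protocol)) return false;
    if (!field_ok(f->identifier, r->identifier)) return false;
    /* Assumption: a project filter accepts the row when either the
     * producer or the consumer project matches. */
    if (f->project != NULL &&
        !field_ok(f->project, r->producer_project) &&
        !field_ok(f->project, r->consumer_project)) {
        return false;
    }
    return true;
}
```

Filtering in memory trades a little CPU for a simpler query: each per-project DB gets the same statement, and all filter logic lives in one function.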

The two new tests (mcp_reader_returns_cross_links, mcp_reader_filters_by_protocol) drive the reader end-to-end through cbm_mcp_handle_tool with CBM_CACHE_DIR overridden to the test's tmpdir, so the regression class can't recur silently.

Test plan

./scripts/test.sh passes (3825/3825, ASan + UBSan)
New MCP reader tests green
No MessagingChannel nodes created speculatively (confirmed via cross_link_no_match and the find_or_create_channel call-site audit from #295, "Cross-project HTTP edges + unified storage + paginated cross_project_links")

Core framework for 14 protocol linkers:
- servicelink.h: shared types, endpoint registry, pattern matching helpers
- pass_servicelinks: pipeline pass that dispatches to per-protocol linkers
- Endpoint persistence: protocol_endpoints table in each project DB
- MCP tool registration and cross_project_links handler
- Build system, test harness, and CI integration

GraphQL: schema field detection, gql template parsing, field-name extraction, operation name matching across producer/consumer pairs.
gRPC: proto service/rpc definitions, client stub calls, streaming patterns across Go, Python, Java, TypeScript, and Rust.

Cloud messaging linkers for AWS and Apache Kafka:
- Kafka: producer/consumer topic detection across Java, Python, Go, TS
- SQS: queue URL and queue name extraction, send/receive matching
- SNS: topic ARN detection, publish/subscribe patterns
- EventBridge: event bus, rule, and put-events pattern detection

Message broker protocol linkers:
- GCP Pub/Sub: topic/subscription detection, Terraform subscriber configs
- RabbitMQ: exchange/queue binding, AMQP topic wildcard matching
- MQTT: topic publish/subscribe with wildcard (+/#) matching
- NATS: subject publish/subscribe with wildcard (*/>) matching
- Redis Pub/Sub: channel publish/subscribe detection

Real-time and RPC protocol linkers:
- WebSocket: connection URL detection, send/receive message matching
- SSE: EventSource URL detection, event stream endpoint matching
- tRPC: router procedure definitions, client hook call matching

Activates the linker files added by the prior cherry-picks:
- Makefile.cbm: add 14 servicelink_*.c to PIPELINE_SRCS, add 14 TEST_SERVICELINK_*_SRCS test declarations, extend ALL_TEST_SRCS
- pass_servicelinks.c: restore the LINKERS dispatch table to the full 14-entry list and remove the empty-table guard
- pipeline.c: allocate cbm_sl_endpoint_list_t at function top (alongside path_aliases) so cleanup can free it safely even when the early cancel check jumps to cleanup via goto before ctx is declared
- test_main.c: register the 14 suite_servicelink_* test suites

Cross-project matching:
- Endpoint registry collects all producers/consumers during indexing
- _crosslinks.db stores cross-project links with confidence scores (exact=0.95 for identical strings, normalized=0.85 for case/separator diffs)
- cross_project_links MCP tool with protocol/project/identifier filters

Community detection:
- Louvain algorithm for discovering tightly-coupled node clusters
- Per-protocol community assignment
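The two confidence tiers used for cross-project matching (0.95 for identical strings, 0.85 for strings that differ only in case or separators) can be illustrated with a small sketch. The function names are hypothetical, and the choice of `-`, `_`, and `.` as the separators to ignore is an assumption; the commit message does not list them.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Lowercase `in` and drop separator characters, writing at most
 * cap - 1 characters plus a terminator into `out`.
 * Assumption: '-', '_' and '.' are the separators that get ignored. */
static void normalize_id(const char *in, char *out, size_t cap) {
    assert(cap > 0);
    size_t n = 0;
    for (; *in != '\0' && n + 1 < cap; in++) {
        unsigned char c = (unsigned char)*in;
        if (c == '-' || c == '_' || c == '.') continue;
        out[n++] = (char)tolower(c);
    }
    out[n] = '\0';
}

/* 0.95 for identical identifiers, 0.85 when they are equal after
 * normalization, 0.0 when they do not match at all. */
static double match_confidence(const char *producer_id, const char *consumer_id) {
    char a[256];
    char b[256];
    if (strcmp(producer_id, consumer_id) == 0) return 0.95;
    /* Refuse over-long inputs rather than compare truncated copies. */
    if (strlen(producer_id) >= sizeof a || strlen(consumer_id) >= sizeof b) return 0.0;
    normalize_id(producer_id, a, sizeof a);
    normalize_id(consumer_id, b, sizeof b);
    return strcmp(a, b) == 0 ? 0.85 : 0.0;
}
```

Checking the exact tier first keeps the common case cheap and guarantees that an identical pair is never downgraded to the normalized score.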

Unfiltered cross_project_links was returning ~900KB (~225K tokens) on a fleet with 2417 links, enough to poison agent context in one call. Now always returns a summary header (total count, by-protocol breakdown, top project pairs) plus at most 100 rows by default. Adds limit, offset, and summary_only parameters.

Before:
- unfiltered = 898,308 bytes (~224K tokens)

After:
- unfiltered = 36,589 bytes (~9K tokens), 25× smaller
- summary_only = 1,028 bytes (~257 tokens)
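The limit/offset behaviour described above amounts to clamping a window over the sorted rows. A minimal sketch follows; the names `xl_page_t`, `xl_page_window`, and `XL_DEFAULT_LIMIT` are hypothetical, and treating a limit of 0 as "use the default of 100" is an assumption about how an omitted parameter is represented.

```c
#include <assert.h>
#include <stddef.h>

#define XL_DEFAULT_LIMIT 100  /* default row cap stated in the commit message */

typedef struct {
    size_t start;  /* index of the first row to emit */
    size_t count;  /* number of rows to emit */
} xl_page_t;

/* Clamp (limit, offset) to the available rows so that callers can
 * never index past the end of the result list. */
static xl_page_t xl_page_window(size_t total, size_t limit, size_t offset) {
    xl_page_t p;
    if (limit == 0) limit = XL_DEFAULT_LIMIT;   /* assumption: 0 means "not set" */
    p.start = offset < total ? offset : total;
    size_t remaining = total - p.start;
    p.count = limit < remaining ? limit : remaining;
    assert(p.start + p.count <= total);
    return p;
}
```

With the 2417-link fleet from the example, an unfiltered call with no parameters would emit rows 0 through 99 alongside the summary header, and an offset past the end would yield an empty page rather than an error.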

Activates the files added by the prior cherry-picks:
- Makefile.cbm: add pass_communities.c and pass_crossrepolinks.c to PIPELINE_SRCS; add TEST_COMMUNITIES_SRCS, TEST_ENDPOINT_PERSISTENCE_SRCS, and TEST_CROSS_PROJECT_LINKS_SRCS to ALL_TEST_SRCS
- pipeline_internal.h: declare cbm_pipeline_pass_communities
- pipeline.c: call cbm_pipeline_pass_communities after the service-link pass; call cbm_persist_endpoints to persist collected endpoints; call cbm_cross_project_link to compute cross-project links after dump
- test_main.c: register suite_communities, suite_endpoint_persistence, and suite_cross_project_links
- tests/test_endpoint_persistence.c: restored (exercises cbm_persist_endpoints which lands in this PR)

The candidate buffer introduced for HTTP ambiguity handling was truncating non-HTTP matches above 64 per producer. Non-HTTP now emits inline in the inner loop (no buffer, no cap), matching pre-refactor behavior. HTTP still buffers for ambiguity and now logs http.candidate_truncated when it drops candidates past the cap.

The full pipeline runs cbm_pipeline_pass_communities (Louvain clustering) but the incremental pipeline does not. Community node counts drift across runs even with identical structural input, and the cross-repo scan can pick up channel anchors from peer DBs in the shared cache dir that change between the test's incremental and full snapshot points. Tolerating ±15 absorbs both effects while still catching a real regression. Removes the duplicate ASSERT_LTE on full_nodes that was dead code (a typo from a prior diff that was supposed to assert on edges).

Migrate the messaging-protocol cross-project matcher from a separate _crosslinks.db file to bidirectional CROSS_* edges in each project's edges table. Add 11 new CROSS_* edge type constants for messaging protocols (KAFKA, SQS, SNS, EVENTBRIDGE, PUBSUB, AMQP, MQTT, NATS, REDIS_PUBSUB, WS, SSE). Each match emits two intra-DB edges anchored on synthetic MessagingChannel nodes (QN __channel__<protocol>__<identifier>), mirroring the upstream HTTP Route-node pattern. Producer DB gets function -> channel; consumer DB gets channel -> function. Cross-project metadata lives in edge properties JSON. The matcher now skips http/grpc/graphql/trpc protocols entirely; those are owned by the upstream Route-QN matcher in pass_cross_repo.c.
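The anchor naming scheme above, `__channel__<protocol>__<identifier>`, is simple enough to sketch directly. The helper name `channel_qn` and its return convention are hypothetical; only the QN format itself comes from the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the qualified name of a MessagingChannel anchor node.
 * Returns 0 on success, -1 if an input is empty or the result
 * would not fit in `buf` (including the terminator). */
static int channel_qn(char *buf, size_t cap,
                      const char *protocol, const char *identifier) {
    assert(buf != NULL && protocol != NULL && identifier != NULL);
    if (strlen(protocol) == 0 || strlen(identifier) == 0) return -1;
    int n = snprintf(buf, cap, "__channel__%s__%s", protocol, identifier);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

Because the QN depends only on protocol and identifier, the producer DB and the consumer DB each derive the same anchor name independently, which is what lets the two intra-DB edges (function -> channel and channel -> function) describe one logical link.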

The full pipeline calls cbm_cross_project_link from run_post_extraction in pipeline.c, but the incremental pipeline never did. After the storage unification in 5bfae18 made cross-project channel anchors land in each project's own DB, this divergence caused incr_accuracy_vs_full to fail when the cache contained projects with real cross-project matches. Mirrors the full-path invocation pattern. Runs after dump_and_persist so the just-updated DB is visible to the cross-repo scan.

Storage unification moved the writer side from a shared _crosslinks.db into each project's own edges table (CROSS_* edge types), but the MCP reader still queried the legacy table and silently returned "no links found" for any caller. The reader now fans out across the cache directory:
- Enumerates *.db files via cbm_opendir / is_project_db_file (the same convention list_projects uses), skipping the legacy _crosslinks.db and other _*.db hidden DBs.
- For each project DB, selects producer-side CROSS_* edges: JOIN nodes on source_id, filter on type LIKE 'CROSS_%' AND properties LIKE '%"target_project"%'. The target_project filter naturally excludes consumer-side edges (which carry source_project instead), so each link surfaces exactly once.
- Parses properties JSON to fill in (consumer_project, consumer_qn, consumer_file, identifier, protocol, confidence). Falls back to url_path when identifier is absent; that is upstream's HTTP/async schema, where url_path plays the same role.
- Filters / sorts / paginates in memory: protocol asc, identifier asc, confidence desc.
- Aggregates "by protocol" via a contiguous-runs walk on the sorted list, and "top project pairs" via a small dynamic table with a partial selection sort for top-10.

xl_bind_filters is gone; filtering moved to xl_row_matches.
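The "partial selection sort for top-10" mentioned above can be sketched as follows. The struct and function names are hypothetical and this is not the PR's code; it only shows the technique of running k rounds of selection sort so that the k largest counts end up at the front of the table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry in the project-pair table. */
typedef struct {
    const char *producer;
    const char *consumer;
    size_t count;
} pair_count_t;

/* Move the k entries with the largest counts to the front of `pairs`,
 * in descending order. Only k selection rounds run, so the cost is
 * O(n * k) and the tail of the array is left unsorted.
 * Returns how many leading entries are valid (min(k, n)). */
static size_t top_k_pairs(pair_count_t *pairs, size_t n, size_t k) {
    assert(pairs != NULL || n == 0);
    if (k > n) k = n;
    for (size_t i = 0; i < k; i++) {
        size_t best = i;
        for (size_t j = i + 1; j < n; j++) {
            if (pairs[j].count > pairs[best].count) best = j;
        }
        if (best != i) {
            pair_count_t tmp = pairs[i];
            pairs[i] = pairs[best];
            pairs[best] = tmp;
        }
    }
    return k;
}
```

For a summary that shows only ten pairs, this avoids a full sort of the table while staying a few lines long, which is a reasonable fit when the number of distinct project pairs is small.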

Two new TESTs in the existing cross_project_links suite:
- mcp_reader_returns_cross_links: indexes a kafka producer + consumer pair, runs cbm_cross_project_link, then drives the MCP tool via cbm_mcp_handle_tool and asserts the response surfaces the protocol, identifier, both project names, and both function QNs.
- mcp_reader_filters_by_protocol: indexes overlapping kafka + pubsub endpoints, calls the tool with {"protocol":"kafka"}, and asserts the pubsub identifier and protocol are absent from the response.

CBM_CACHE_DIR is overridden to the test's tmpdir so the reader sees exactly the fixture DBs.

Removes stale-fact drift from the fork era (language/agent counts, install one-liner, feature bullets) flagged in PR DeusData#295's close comment. No URL substitutions involved — README's links already pointed at DeusData; this only reverts the content body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Shidfar added 20 commits May 25, 2026 14:04

feat: add WebSocket, SSE, and tRPC protocol linkers

e4ada4f

feat: add HTTP servicelinker plumbing

0d14141

feat: implement HTTP cross-project endpoint registration

723afa2

feat: add HTTP-aware cross-repo matcher with ambiguity handling

4d20d88

test: add HTTP cross-project linker tests and fixtures

e34234e

fix: make S2 and S3 signals reachable in HTTP linker

39bbeb2

Shidfar mentioned this pull request May 26, 2026:
- Cross-project HTTP edges + unified storage + paginated cross_project_links #295 (Closed, 4 tasks)

Shidfar marked this pull request as ready for review May 26, 2026 11:34

This was referenced May 26, 2026:
- 14 cross-service protocol linkers (PR 2/5) #377 (Draft)
- Cross-repo pass + community detection + paginated cross_project_links (PR 3/5) #378 (Draft)
- HTTP cross-project edges + 4-signal endpoint registration (PR 4/5) #379 (Draft)


Shidfar wants to merge 21 commits into DeusData:main from hodizoda:oss/pr5-storage-unification
