
Storage unification + incremental parity + MCP reader migration (PR 5/5)#380

Open
Shidfar wants to merge 21 commits into DeusData:main from hodizoda:oss/pr5-storage-unification

Conversation

@Shidfar Shidfar commented May 26, 2026

Summary

Three coupled changes that close out the protocol-linking work:

  1. Storage unification. Cross-project links move from a separate _crosslinks.db (introduced in #378, "Cross-repo pass + community detection + paginated cross_project_links (PR 3/5)") into each project's own edges table via synthetic MessagingChannel anchor nodes, mirroring the pre-existing HTTP Route-anchor pattern. Anchors are reactive, never speculative: one is created only when emit_cross_edge_pair confirms a producer→consumer match.
  2. Incremental-pipeline parity. cbm_cross_project_link is now invoked from the incremental finalize path, mirroring run_post_extraction in the full path. This is required after storage unification because channel anchors live in each project's own DB; without the call, incr_accuracy_vs_full fails whenever the cache holds real cross-project matches.
  3. MCP reader migration. Storage unification moved the writer side but left the reader querying the legacy _crosslinks.db.cross_links table — silently returning "no links found" for every caller. This PR rewrites the reader to fan out across per-project DBs.

Stacked on #379 — please review the earlier PRs first.

Commits

  1. refactor: unify cross-repo storage on edges table — writer side
  2. fix: invoke cbm_cross_project_link from incremental pipeline — full/incremental parity
  3. feat(mcp): migrate cross_project_links reader to per-project edges — see below
  4. test(mcp): cover cross_project_links reader end-to-end — 2 new tests in the existing cross_project_links suite

MCP reader migration detail

handle_cross_project_links now:

  • Enumerates *.db files via cbm_opendir / is_project_db_file (same convention list_projects uses), skipping the legacy _crosslinks.db and other _*.db hidden DBs.
  • For each project DB, selects producer-side CROSS_* edges: JOIN nodes ON source_id, filtered on type LIKE 'CROSS_%' AND properties LIKE '%"target_project"%'. The target_project predicate naturally excludes consumer-side edges (which carry source_project instead), so each link surfaces exactly once.
  • Parses properties JSON into a row: {protocol, identifier, producer_project, producer_qn, producer_file, consumer_project, consumer_qn, consumer_file, confidence}. Falls back to url_path when identifier is absent — that's upstream's HTTP/async schema where url_path plays the same role.
  • Filters / sorts / paginates in memory: protocol asc, identifier asc, confidence desc.
  • Aggregates "by protocol" via a contiguous-runs walk on the sorted list, and "top project pairs" via a small dynamic table with a partial selection sort for top 10.
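
As an illustration of the sort order and the contiguous-runs "by protocol" aggregation described in the list above, here is a minimal sketch. The `xl_row_t` and `xl_proto_count_t` types and both function names are hypothetical stand-ins; the reader's real structs are not shown in this PR text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical row type: only the three sort keys are modelled. */
typedef struct {
    const char *protocol;
    const char *identifier;
    double confidence;
} xl_row_t;

/* Hypothetical per-protocol counter. */
typedef struct {
    const char *protocol;
    size_t count;
} xl_proto_count_t;

/* Sort order from the PR text: protocol asc, identifier asc, confidence desc. */
static int xl_row_cmp(const void *a, const void *b) {
    const xl_row_t *x = a;
    const xl_row_t *y = b;
    int c = strcmp(x->protocol, y->protocol);
    if (c != 0) return c;
    c = strcmp(x->identifier, y->identifier);
    if (c != 0) return c;
    if (x->confidence > y->confidence) return -1;
    if (x->confidence < y->confidence) return 1;
    return 0;
}

/* Sorts rows in place, then counts rows per protocol by walking contiguous
 * runs. Writes at most cap entries to out and returns how many were written. */
static size_t xl_count_by_protocol(xl_row_t *rows, size_t n,
                                   xl_proto_count_t *out, size_t cap) {
    assert(rows != NULL || n == 0);
    if (n == 0) return 0;
    qsort(rows, n, sizeof rows[0], xl_row_cmp);
    size_t runs = 0;
    for (size_t i = 0; i < n; i++) {
        /* Same protocol as the current run: extend it. */
        if (runs > 0 && strcmp(out[runs - 1].protocol, rows[i].protocol) == 0) {
            out[runs - 1].count++;
            continue;
        }
        if (runs == cap) break; /* no room for another run */
        out[runs].protocol = rows[i].protocol;
        out[runs].count = 1;
        runs++;
    }
    return runs;
}
```

Because the list is already sorted by protocol, each protocol occupies one contiguous run, so the aggregation needs no hash table and no second sort.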

The old xl_bind_filters SQL-bind helper is gone; filtering moved to xl_row_matches in the in-memory path.
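
For readers unfamiliar with the in-memory approach, a row predicate of this kind might look roughly like the sketch below. It is not the actual xl_row_matches: the struct layouts, the NULL-means-no-filter convention, exact matching on protocol, substring matching on identifier, and matching a project against either side of the link are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical row: only the fields the filters look at. */
typedef struct {
    const char *protocol;
    const char *identifier;
    const char *producer_project;
    const char *consumer_project;
} link_row_t;

/* Hypothetical filter set; a NULL field means "do not filter on this". */
typedef struct {
    const char *protocol;   /* assumed: exact match */
    const char *project;    /* assumed: matches producer OR consumer */
    const char *identifier; /* assumed: substring match */
} link_filter_t;

static bool row_matches_sketch(const link_row_t *r, const link_filter_t *f) {
    assert(r != NULL && f != NULL);
    if (f->protocol && strcmp(r->protocol, f->protocol) != 0) {
        return false;
    }
    if (f->identifier && strstr(r->identifier, f->identifier) == NULL) {
        return false;
    }
    if (f->project &&
        strcmp(r->producer_project, f->project) != 0 &&
        strcmp(r->consumer_project, f->project) != 0) {
        return false;
    }
    return true;
}
```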

The two new tests (mcp_reader_returns_cross_links, mcp_reader_filters_by_protocol) drive the reader end-to-end through cbm_mcp_handle_tool with CBM_CACHE_DIR overridden to the test's tmpdir, so the regression class can't recur silently.

Test plan

Shidfar added 20 commits May 25, 2026 14:04
Core framework for 14 protocol linkers:
- servicelink.h: shared types, endpoint registry, pattern matching helpers
- pass_servicelinks: pipeline pass that dispatches to per-protocol linkers
- Endpoint persistence: protocol_endpoints table in each project DB
- MCP tool registration and cross_project_links handler
- Build system, test harness, and CI integration
GraphQL: schema field detection, gql template parsing, field-name
extraction, operation name matching across producer/consumer pairs.
gRPC: proto service/rpc definitions, client stub calls, streaming
patterns across Go, Python, Java, TypeScript, and Rust.
Cloud messaging linkers for AWS and Apache Kafka:
- Kafka: producer/consumer topic detection across Java, Python, Go, TS
- SQS: queue URL and queue name extraction, send/receive matching
- SNS: topic ARN detection, publish/subscribe patterns
- EventBridge: event bus, rule, and put-events pattern detection
Message broker protocol linkers:
- GCP Pub/Sub: topic/subscription detection, Terraform subscriber configs
- RabbitMQ: exchange/queue binding, AMQP topic wildcard matching
- MQTT: topic publish/subscribe with wildcard (+/#) matching
- NATS: subject publish/subscribe with wildcard (*/>) matching
- Redis Pub/Sub: channel publish/subscribe detection
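
To make the MQTT wildcard semantics mentioned above concrete, here is a small sketch of topic-filter matching: `+` matches exactly one level and `#` matches the remaining levels, including the parent level. The function name is hypothetical and this is not the linker's code; it assumes a well-formed filter and ignores the special handling of topics that start with `$`.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if an MQTT topic name matches a topic filter.
 * Assumes the filter is well-formed: '+' occupies a whole level and
 * '#' appears only as the last character. */
static bool mqtt_topic_matches(const char *filter, const char *topic) {
    assert(filter != 0 && topic != 0);
    while (*filter) {
        if (*filter == '#') {
            /* Multi-level wildcard: matches everything that remains. */
            return filter[1] == '\0';
        }
        if (*filter == '+') {
            /* Single-level wildcard: skip one (possibly empty) topic level. */
            while (*topic && *topic != '/') topic++;
            filter++;
            continue;
        }
        /* "a/#" also matches the parent level "a". */
        if (*filter == '/' && *topic == '\0' &&
            filter[1] == '#' && filter[2] == '\0') {
            return true;
        }
        if (*filter != *topic) return false;
        filter++;
        topic++;
    }
    return *topic == '\0';
}
```

NATS uses different characters (`*` for one token, `>` for the tail, `.` as separator) and `>` requires at least one trailing token, so it would need its own variant rather than a parameter swap.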
Real-time and RPC protocol linkers:
- WebSocket: connection URL detection, send/receive message matching
- SSE: EventSource URL detection, event stream endpoint matching
- tRPC: router procedure definitions, client hook call matching
Activates the linker files added by the prior cherry-picks:

- Makefile.cbm: add 14 servicelink_*.c to PIPELINE_SRCS, add 14
  TEST_SERVICELINK_*_SRCS test declarations, extend ALL_TEST_SRCS
- pass_servicelinks.c: restore the LINKERS dispatch table to the
  full 14-entry list and remove the empty-table guard
- pipeline.c: allocate cbm_sl_endpoint_list_t at function top
  (alongside path_aliases) so cleanup can free it safely even when
  the early cancel check jumps (via goto) to cleanup before ctx is declared
- test_main.c: register the 14 suite_servicelink_* test suites
Cross-project matching:
- Endpoint registry collects all producers/consumers during indexing
- _crosslinks.db stores cross-project links with confidence scores
  (exact=0.95 for identical strings, normalized=0.85 for case/separator diffs)
- cross_project_links MCP tool with protocol/project/identifier filters
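
The two confidence tiers above can be sketched as follows. The scores 0.95 and 0.85 come from this commit message; the function names and the exact set of separator characters ('-', '_', '.') are assumptions, since the message only says "case/separator diffs".

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Assumed separator set; the real matcher may treat other characters too. */
static int is_sep(char c) {
    return c == '-' || c == '_' || c == '.';
}

/* Compares two identifiers while ignoring case and separators. */
static int equal_normalized(const char *a, const char *b) {
    for (;;) {
        while (is_sep(*a)) a++;
        while (is_sep(*b)) b++;
        if (*a == '\0' || *b == '\0') {
            return *a == *b; /* equal only if both ended together */
        }
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return 0;
        }
        a++;
        b++;
    }
}

/* Returns 0.95 for identical strings, 0.85 for a normalized match,
 * and 0.0 (no link) otherwise. */
static double match_confidence(const char *producer_id, const char *consumer_id) {
    assert(producer_id != NULL && consumer_id != NULL);
    if (strcmp(producer_id, consumer_id) == 0) return 0.95;
    if (equal_normalized(producer_id, consumer_id)) return 0.85;
    return 0.0;
}
```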

Community detection:
- Louvain algorithm for discovering tightly-coupled node clusters
- Per-protocol community assignment
Unfiltered cross_project_links was returning ~900KB (~225K tokens) on
a fleet with 2417 links — enough to poison agent context in one call.

Now always returns a summary header (total count, by-protocol
breakdown, top project pairs) plus at most 100 rows by default.
Adds limit, offset, and summary_only parameters.

Before: unfiltered = 898,308 bytes (~224K tokens)
After:  unfiltered = 36,589 bytes (~9K tokens), 25× smaller
        summary_only = 1,028 bytes (~257 tokens)
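
The pagination behaviour described above can be sketched as a small window computation. The default of 100 rows is from this commit message; the type and function names are hypothetical, and treating a non-positive limit as "use the default" and a negative offset as zero are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result: which slice of the sorted rows to emit. */
typedef struct {
    size_t start;
    size_t count;
} page_window_t;

enum { DEFAULT_LIMIT = 100 };

/* Computes the row window for a request. The summary header is assumed
 * to be emitted separately in every case. */
static page_window_t page_window(size_t total, long limit, long offset,
                                 int summary_only) {
    page_window_t w = {0, 0};
    if (summary_only) return w;              /* header only, no rows */
    if (limit <= 0) limit = DEFAULT_LIMIT;   /* assumed fallback */
    if (offset < 0) offset = 0;              /* assumed clamp */
    if ((size_t)offset >= total) {
        w.start = total;                     /* past the end: empty page */
        return w;
    }
    w.start = (size_t)offset;
    size_t remaining = total - w.start;
    w.count = remaining < (size_t)limit ? remaining : (size_t)limit;
    return w;
}
```

With the 2417-link fleet mentioned above, an unfiltered call with no parameters would yield rows 0 through 99, and a call with offset 2400 would yield the final 17.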
Activates the files added by the prior cherry-picks:

- Makefile.cbm: add pass_communities.c and pass_crossrepolinks.c to
  PIPELINE_SRCS; add TEST_COMMUNITIES_SRCS,
  TEST_ENDPOINT_PERSISTENCE_SRCS, and TEST_CROSS_PROJECT_LINKS_SRCS
  to ALL_TEST_SRCS
- pipeline_internal.h: declare cbm_pipeline_pass_communities
- pipeline.c: call cbm_pipeline_pass_communities after the
  service-link pass; call cbm_persist_endpoints to persist collected
  endpoints; call cbm_cross_project_link to compute cross-project
  links after dump
- test_main.c: register suite_communities, suite_endpoint_persistence,
  and suite_cross_project_links
- tests/test_endpoint_persistence.c: restored (exercises
  cbm_persist_endpoints which lands in this PR)
The candidate buffer introduced for HTTP ambiguity handling was
truncating non-HTTP matches above 64 per producer. Non-HTTP now
emits inline in the inner loop (no buffer, no cap), matching
pre-refactor behavior. HTTP still buffers for ambiguity and now
logs http.candidate_truncated when it drops candidates past the cap.
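
A capped candidate buffer that counts what it drops, in the spirit of the HTTP path described above, might look like this sketch. The cap of 64 and the http.candidate_truncated log key are from the commit message; the struct, the function names, the integer candidate ids, and the log line format are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum { CAND_CAP = 64 };

/* Hypothetical buffer: candidates are modelled as plain ints. */
typedef struct {
    int ids[CAND_CAP];
    size_t len;
    size_t dropped;
} cand_buf_t;

/* Appends a candidate, or counts it as dropped once the cap is reached. */
static bool cand_push(cand_buf_t *b, int id) {
    assert(b != NULL);
    if (b->len >= (size_t)CAND_CAP) {
        b->dropped++;
        return false;
    }
    b->ids[b->len++] = id;
    return true;
}

/* Emits one log line per producer if anything was dropped. */
static void cand_report(const cand_buf_t *b) {
    if (b->dropped > 0) {
        fprintf(stderr, "http.candidate_truncated dropped=%zu cap=%d\n",
                b->dropped, (int)CAND_CAP);
    }
}
```

Non-HTTP matches would bypass a buffer like this entirely and be emitted straight from the inner loop, which is why they no longer hit the cap.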
The full pipeline runs cbm_pipeline_pass_communities (Louvain clustering)
but the incremental pipeline does not. Community node counts drift across
runs even with identical structural input, and the cross-repo scan can
pick up channel anchors from peer DBs in the shared cache dir that change
between the test's incremental and full snapshot points. Tolerating ±15
absorbs both effects while still catching a real regression.

Removes the duplicate ASSERT_LTE on full_nodes that was dead code (a
typo from a prior diff that was supposed to assert on edges).
Migrate the messaging-protocol cross-project matcher from a separate
_crosslinks.db file to bidirectional CROSS_* edges in each project's
edges table. Add 11 new CROSS_* edge type constants for messaging
protocols (KAFKA, SQS, SNS, EVENTBRIDGE, PUBSUB, AMQP, MQTT, NATS,
REDIS_PUBSUB, WS, SSE).

Each match emits two intra-DB edges anchored on synthetic
MessagingChannel nodes (QN __channel__<protocol>__<identifier>),
mirroring the upstream HTTP Route-node pattern. Producer DB gets
function -> channel; consumer DB gets channel -> function. Cross-project
metadata lives in edge properties JSON.
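
As a small illustration of the anchor naming scheme above, the qualified name can be built with snprintf. The `__channel__<protocol>__<identifier>` layout is from this commit message; the helper name and its error convention are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Writes "__channel__<protocol>__<identifier>" into buf.
 * Returns the length written, or -1 if buf is too small. */
static int channel_qn(char *buf, size_t cap,
                      const char *protocol, const char *identifier) {
    assert(buf != NULL && protocol != NULL && identifier != NULL);
    int n = snprintf(buf, cap, "__channel__%s__%s", protocol, identifier);
    return (n >= 0 && (size_t)n < cap) ? n : -1;
}
```

Because the QN depends only on protocol and identifier, the producer's and the consumer's DB each derive the same anchor name independently, which is what lets the two intra-DB edges refer to one logical channel.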

The matcher now skips http/grpc/graphql/trpc protocols entirely; those
are owned by the upstream Route-QN matcher in pass_cross_repo.c.
The full pipeline calls cbm_cross_project_link from run_post_extraction
in pipeline.c, but the incremental pipeline never did. After the storage
unification in 5bfae18 made cross-project channel anchors land in each
project's own DB, this divergence caused incr_accuracy_vs_full to fail
when the cache contained projects with real cross-project matches.

Mirrors the full-path invocation pattern. Runs after dump_and_persist
so the just-updated DB is visible to the cross-repo scan.
Storage unification moved the writer side from a shared
_crosslinks.db into each project's own edges table (CROSS_*
edge types), but the MCP reader still queried the legacy
table and silently returned "no links found" for any caller.

The reader now fans out across the cache directory:

- Enumerates *.db files via cbm_opendir / is_project_db_file
  (the same convention list_projects uses), skipping the
  legacy _crosslinks.db and other _*.db hidden DBs.
- For each project DB, selects producer-side CROSS_* edges:
  JOIN nodes on source_id, filter on type LIKE 'CROSS_%' AND
  properties LIKE '%"target_project"%'. The target_project
  filter naturally excludes consumer-side edges (which carry
  source_project instead), so each link surfaces exactly once.
- Parses properties JSON to fill in (consumer_project,
  consumer_qn, consumer_file, identifier, protocol, confidence).
  Falls back to url_path when identifier is absent — that's
  upstream's HTTP/async schema where url_path plays the same role.
- Filters / sorts / paginates in memory: protocol asc,
  identifier asc, confidence desc.
- Aggregates "by protocol" via a contiguous-runs walk on the
  sorted list, and "top project pairs" via a small dynamic
  table with a partial selection sort for top-10.
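
The top-10 step in the last bullet can be sketched as a partial selection sort: only the first k slots are put in order, which is cheap when k is small relative to the table. The `pair_count_t` type and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry in the project-pair table. */
typedef struct {
    const char *producer;
    const char *consumer;
    size_t count;
} pair_count_t;

/* Moves the k entries with the largest counts to the front, in descending
 * order. Entries after position k are left in unspecified order.
 * Returns the number of entries actually ranked (min(k, n)). */
static size_t top_k_pairs(pair_count_t *pairs, size_t n, size_t k) {
    assert(pairs != NULL || n == 0);
    if (k > n) k = n;
    for (size_t i = 0; i < k; i++) {
        size_t best = i;
        for (size_t j = i + 1; j < n; j++) {
            if (pairs[j].count > pairs[best].count) best = j;
        }
        if (best != i) {
            pair_count_t tmp = pairs[i];
            pairs[i] = pairs[best];
            pairs[best] = tmp;
        }
    }
    return k;
}
```

The cost is O(k·n) comparisons, so for k = 10 it stays linear in the number of distinct project pairs.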

xl_bind_filters is gone; filtering moved to xl_row_matches.
Two new TESTs in the existing cross_project_links suite:

- mcp_reader_returns_cross_links: indexes a kafka producer + consumer
  pair, runs cbm_cross_project_link, then drives the MCP tool via
  cbm_mcp_handle_tool and asserts the response surfaces the protocol,
  identifier, both project names, and both function QNs.

- mcp_reader_filters_by_protocol: indexes overlapping kafka + pubsub
  endpoints, calls the tool with {"protocol":"kafka"}, and asserts
  the pubsub identifier and protocol are absent from the response.

CBM_CACHE_DIR is overridden to the test's tmpdir so the reader sees
exactly the fixture DBs.
Removes stale-fact drift from the fork era (language/agent counts,
install one-liner, feature bullets) flagged in PR DeusData#295's close comment.
No URL substitutions involved — README's links already pointed at
DeusData; this only reverts the content body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>