Stale ~lineage entries cause spurious semantic-check failures; redeclare should overwrite #1454

@dimitri-yatsenko

Problem

During the Built-On demo prep (May 18–19, 2026) on a Lakebase / PostgreSQL backend
we hit:

DataJointError: Cannot join on attribute 'image_id': different lineages
(blob_detection.image.image_id vs None). Use .proj() to rename one of the attributes.

on the very first populate() of a dj.Imported / dj.Computed table whose PK was
fully FK-inherited from a single parent. The semantic-check is correct in principle:
both sides of a populate antijoin should agree on the source-table lineage of the
inherited PK attribute. After a clean schema drop + redeclare on the same workspace,
the bug went away — Image.proj().heading["image_id"].lineage came back equal to
ImageSpec.proj().heading["image_id"].lineage, both 'blob_detection.image_spec.image_id'.

This pattern is concerning because:

  • The user can't tell from the error message whether the issue is their schema
    (e.g., a real lineage mismatch the semantic-check exists to catch) or a stale
    internal record they have no obvious way to inspect.
  • The recovery (schema.rebuild_lineage() or full drop+redeclare) is undocumented
    in the error.
  • Originally I (incorrectly) diagnosed this as a missing fix in the populate-antijoin
    code path and opened #1453 ("fix: Disable semantic_check on populate antijoin
    (parallels #1383)") to disable semantic_check; I closed it once we found that the
    lineage table itself was the issue. The right fix lives in declaration /
    rebuild logic, not in the antijoin.

Speculation: what could leave stale entries in ~lineage?

These are speculations, not verified causes — flagged for investigation:

  1. DJ version skew across the table's history. A table is first declared under
    a DJ version whose declaration logic writes a lineage row, then later the user
    upgrades DJ — the new code expects different lineage-string formatting (e.g.,
    the pre-6506badb PostgreSQL-double-quote situation we hit in January) and the
    live row no longer matches what proj() computes at query time. The table
    itself looks healthy via information_schema; only ~lineage is stale.

  2. Partial declare on PostgreSQL where the lineage insert fails silently or
    gets rolled back independently of the DDL. Lakebase + PG 17 may have quirks here
    that MySQL doesn't.

  3. Schema-definition changes that don't re-touch the table. Today's @schema
    decorator path may skip re-writing ~lineage if the table already exists
    (is_declared == True). If the in-code definition diverges from what's stored
    — e.g., a FK retargeting that doesn't change column types and so passes the
    schema-equivalence check — ~lineage keeps the old parent linkage forever.

  4. Manual DROP TABLE outside DataJoint that leaves orphan ~lineage rows
    pointing at a no-longer-existent table.

  5. Multiple schemas with overlapping table names during dev (we drop+redeclare
    many times during this kind of work). If the ~lineage cleanup keys by table
    identity in a way that differs from the redeclare's identity, ghost rows can
    accumulate.

Proposed direction

Even without pinning the root cause, the user-visible failure is bad enough to
warrant defensive declaration logic:

Short term (declaration robustness)

  • On every @schema decoration of a table (whether is_declared is True or False),
    clear and re-insert the table's rows in ~lineage from the current FK
    definition. Make declaration idempotent for ~lineage, the same way it is for
    the actual table structure.
  • Add a fast post-declare consistency check (or invariant assertion under
    safemode=True): for each PK attribute, the lineage recorded in ~lineage
    matches the lineage that Table.proj() would compute. If not, log a warning
    pointing the user at schema.rebuild_lineage().
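
A minimal sketch of the two short-term items, using an in-memory SQLite table as a stand-in for ~lineage. The table layout (table_name, attribute_name, lineage) and the helper names rewrite_lineage and find_stale are assumptions made for illustration; they are not DataJoint's actual schema or API.

```python
import sqlite3


def rewrite_lineage(conn, table_name, pk_lineage):
    """Replace every lineage row for table_name with pk_lineage (attr -> lineage)."""
    with conn:  # one transaction: the delete and the inserts commit or roll back together
        conn.execute("DELETE FROM lineage WHERE table_name = ?", (table_name,))
        conn.executemany(
            "INSERT INTO lineage (table_name, attribute_name, lineage) VALUES (?, ?, ?)",
            [(table_name, attr, lin) for attr, lin in sorted(pk_lineage.items())],
        )


def find_stale(conn, table_name, expected):
    """Return the attributes whose stored lineage differs from the expected one."""
    stored = dict(
        conn.execute(
            "SELECT attribute_name, lineage FROM lineage WHERE table_name = ?",
            (table_name,),
        )
    )
    return sorted(attr for attr in expected if stored.get(attr) != expected[attr])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineage (table_name TEXT, attribute_name TEXT, lineage TEXT, "
    "PRIMARY KEY (table_name, attribute_name))"
)
# Simulate a stale row: the lineage was never written for this attribute.
conn.execute("INSERT INTO lineage VALUES ('image', 'image_id', NULL)")
conn.commit()

expected = {"image_id": "blob_detection.image_spec.image_id"}
stale_before = find_stale(conn, "image", expected)

rewrite_lineage(conn, "image", expected)
rewrite_lineage(conn, "image", expected)  # running it twice leaves the same single row
stale_after = find_stale(conn, "image", expected)
row_count = conn.execute("SELECT COUNT(*) FROM lineage").fetchone()[0]
```

Doing the delete and the insert in one transaction is what would make the rewrite safe to run on every decoration: a failure midway leaves the previous rows in place rather than an empty lineage.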

Medium term (better error)

When the semantic-check on the populate antijoin fails with a None lineage on
one side — which is by construction either a freshly-declared-but-not-yet-saved
table or a stale row — surface a tailored error:

Cannot join on attribute 'X': lineage missing on <table_name>. This usually
indicates a stale `~lineage` entry from an older DataJoint version or an
incomplete redeclare. Run:
    schema.rebuild_lineage()
to recompute from current FK definitions.
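
One way the message selection could look, as a sketch only. The helper name lineage_error_message and the (table_name, lineage) tuple shape are hypothetical; how DataJoint carries this information internally at the point of the check is not shown in this issue.

```python
def lineage_error_message(attr, left, right):
    """Build the join-failure text from two (table_name, lineage) pairs.

    Hypothetical helper: the name and argument shape are assumptions for
    illustration, not DataJoint's internal API.
    """
    missing = [table for table, lineage in (left, right) if lineage is None]
    if missing:
        # A None lineage suggests a stale or not-yet-written ~lineage row.
        return (
            f"Cannot join on attribute '{attr}': lineage missing on "
            f"{', '.join(missing)}. This usually indicates a stale `~lineage` "
            "entry from an older DataJoint version or an incomplete redeclare. "
            "Run:\n    schema.rebuild_lineage()\n"
            "to recompute from current FK definitions."
        )
    # Both lineages are known and differ: keep the existing generic message.
    return (
        f"Cannot join on attribute '{attr}': different lineages "
        f"({left[1]} vs {right[1]}). Use .proj() to rename one of the attributes."
    )


stale_msg = lineage_error_message(
    "image_id",
    ("parent_table", "blob_detection.image.image_id"),
    ("child_table", None),
)
real_msg = lineage_error_message(
    "image_id",
    ("parent_table", "a.b.image_id"),
    ("child_table", "c.d.image_id"),
)
```

Keeping the generic message for the case where both lineages are known preserves the signal for genuine mismatches, which is the case the semantic check exists to catch.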

Long term (schema versioning)

Tag every ~lineage row with the DJ minor version that wrote it. On read,
if the row's version is older than a known floor (e.g., the version that
introduced clean PG-quote stripping), trigger an in-place rebuild for that
row before returning it. This keeps the cost off the hot path for current-version
schemas while making upgrades self-healing.
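
A rough sketch of the read-time floor check. Everything here is assumed for illustration: the dj_version field, the read_lineage and minor_version helpers, and the version numbers, which are placeholders rather than real DataJoint releases.

```python
def minor_version(version):
    """'1.2.7' -> (1, 2); only major and minor take part in the comparison."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def read_lineage(row, floor, current, recompute):
    """Return (lineage, row), rebuilding the row first if its tag is too old.

    row: dict with 'lineage' and 'dj_version' (None when the row is untagged).
    recompute: callable returning the lineage derived from current FK definitions.
    """
    tag = row.get("dj_version")
    if tag is None or minor_version(tag) < minor_version(floor):
        # Untagged or pre-floor row: rebuild it and stamp it with the current version.
        row = {**row, "lineage": recompute(), "dj_version": current}
    return row["lineage"], row


calls = []


def recompute():
    calls.append(1)  # count rebuilds so the example can show the hot path is skipped
    return "db.parent.attr"


old_row = {"lineage": '"db"."parent".attr', "dj_version": "1.0.2"}
new_row = {"lineage": "db.parent.attr", "dj_version": "1.3.0"}

healed, healed_row = read_lineage(old_row, floor="1.2.0", current="1.3.0", recompute=recompute)
fresh, _ = read_lineage(new_row, floor="1.2.0", current="1.3.0", recompute=recompute)
```

Rows at or above the floor never call recompute, so current-version schemas pay only for one tuple comparison per read.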

Repro vector (for whoever picks this up)

We don't have a deterministic reproducer because the failing state on the workspace
is gone (cleaned via drop+redeclare). But a likely synthetic repro:

  1. Apply 6506badb in reverse (or check out master before that SHA) and declare
    a schema with an FK against PostgreSQL.
  2. Inspect ~lineage — entries should carry double-quotes inside the lineage
    string ("db"."table".attr rather than db.table.attr).
  3. Upgrade DJ to master, don't drop the schema. Try populate() on a child
    table — semantic-check failure should reappear.
  4. Confirm schema.rebuild_lineage() clears it; confirm a full schema redeclare
    clears it; confirm that just re-running @schema(Table) without recomputing
    the row does not clear it.
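
To illustrate why step 3 is expected to fail, here is a small sketch comparing a quoted lineage string against the unquoted form. The helper strip_pg_quotes is hypothetical and is not the normalization DataJoint applies; the two string formats are the ones described in step 2.

```python
def strip_pg_quotes(lineage):
    """Drop PostgreSQL identifier quotes: '"db"."table".attr' -> 'db.table.attr'.

    Assumes identifiers contain no embedded double quotes.
    """
    return lineage.replace('"', "")


# What a pre-fix declare may have stored, per step 2 above.
stored = '"blob_detection"."image_spec".image_id'
# The unquoted form expected after the upgrade.
computed = "blob_detection.image_spec.image_id"

raw_match = stored == computed                            # plain string comparison
normalized_match = strip_pg_quotes(stored) == computed    # comparison after stripping quotes
```

A plain string comparison treats the two forms as different lineages even though they name the same attribute, which is consistent with the table looking healthy everywhere except in ~lineage.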

What was not the cause (closing #1453's misdirection)

The populate antijoin's semantic-check at
(self._jobs_to_do(restrictions) - self.proj()) is the right design. Disabling
it (as #1453 proposed) would silence the entire class of correctness bugs that
#1405 added it to catch. This issue is specifically about making the
lineage table trustworthy, not about loosening the check that reads it.
