fix(dataset): resolve Blob v2 external URIs and robustly clean failed writes in add_columns#7152

Open
yyzhao2025 wants to merge 2 commits into
lance-format:mainfrom
yyzhao2025:yyzhao2025/add_columns_blobv2

@yyzhao2025
Contributor

Description

This PR addresses two critical functional gaps in the add_columns pipeline when working with Blob v2 datasets. It ensures that the schema evolution path achieves feature parity with the main write path regarding external URI resolution, and it introduces a robust, leak-free cleanup mechanism for failed mutations.

1. Blob v2 External URI Resolution

Context:
In Blob v2, external blobs (where data resides outside the dataset's native storage) rely on an ExternalBaseResolver to construct the correct absolute URIs from relative paths stored in the dataset manifest.
Previously, add_columns opened its writer through the Updater with a bare open_writer call, which had no context from which to initialize this resolver. As a result, appending or mutating Blob v2 columns of the external kind failed to resolve URIs.
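
To illustrate the kind of lookup the resolver performs, here is a minimal sketch. It assumes each registered base is keyed by a numeric id and that resolution is a plain path join; `BaseResolverSketch`, `register`, and `resolve` are hypothetical names for illustration and are not Lance's actual `ExternalBaseResolver` API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a resolver over registered base paths.
/// This is NOT Lance's `ExternalBaseResolver`; it only illustrates the idea.
#[derive(Debug, Default)]
pub struct BaseResolverSketch {
    bases: HashMap<u32, String>,
}

impl BaseResolverSketch {
    /// Register a base URI under an id (assumed keying scheme).
    pub fn register(&mut self, base_id: u32, base_uri: &str) {
        // Store the base without a trailing slash so joins stay uniform.
        self.bases
            .insert(base_id, base_uri.trim_end_matches('/').to_string());
    }

    /// Join a registered base with a relative path.
    /// Returns `None` when the base id is unknown, which is effectively
    /// the situation of a writer that was opened without a resolver.
    pub fn resolve(&self, base_id: u32, relative: &str) -> Option<String> {
        let base = self.bases.get(&base_id)?;
        Some(format!("{}/{}", base, relative.trim_start_matches('/')))
    }
}
```

A writer holding such a resolver can turn a stored relative path into an absolute URI; a writer without one has nothing to look the base up in.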

Fix:

  • Introduced open_update_writer in write.rs. When the Updater creates a new writer, it now checks if storage_version >= V2_2 and if the schema contains blob_v2 fields.
  • If so, it constructs the ExternalBaseResolver using the dataset's registered base paths and passes it to the writer alongside the source_store_registry.
  • This lets inputs such as BatchUDF or RecordBatchReader write external Blob v2 data during schema evolution, matching the behavior of write_fragments_internal.
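
The gating condition described in the list above can be sketched as follows. The types here are simplified, hypothetical stand-ins (`StorageVersion`, `FieldSketch`, and `needs_external_resolver` do not appear in the PR); only the rule itself, storage_version >= V2_2 plus at least one blob_v2 field, comes from the description.

```rust
/// Hypothetical, simplified storage versions.
/// Deriving `Ord` makes comparison follow declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum StorageVersion {
    V2_0,
    V2_1,
    V2_2,
}

/// Hypothetical field descriptor; the real schema type is richer.
#[derive(Debug, Clone)]
pub struct FieldSketch {
    pub name: String,
    pub is_blob_v2: bool,
}

/// Returns true when a writer opened for an update would need the
/// external-base resolver: new enough storage AND a blob_v2 field present.
pub fn needs_external_resolver(version: StorageVersion, fields: &[FieldSketch]) -> bool {
    version >= StorageVersion::V2_2 && fields.iter().any(|f| f.is_blob_v2)
}
```

Keeping the check as a pure function over the version and schema means writers for datasets that cannot contain external blobs skip resolver construction entirely.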

2. Comprehensive Cleanup for Failed Writes

Context:
The add_columns operation is complex and can fail mid-flight for several reasons: UDF execution panics, stream ingestion errors, checkpoint lookup/insert failures, or schema merge conflicts. Previously, these failures propagated immediately via the ? operator, so no cleanup ran and orphaned .lance data files and their corresponding Blob v2 sidecar directories were left behind.

Fix & Architectural Guarantees:
This PR overhauls the error handling and cleanup lifecycle in add_columns_impl and add_columns_from_stream.

  • Multi-stage Cleanup:
    We replaced the concurrent try_collect streams with sequential processing wrapped in scoped error handling. When a failure occurs, the pipeline now runs a two-stage cleanup:

    1. updater.cleanup_unfinished_writer(): Cleans up the currently active, unfinished data file that hasn't been finalized into a Fragment yet.
    2. cleanup_new_column_data_files(): Cleans up any fully written but uncommitted fragments generated in the current run.
  • Strict Safety Constraint 1: Preservation of External Data
    The underlying cleanup_data_fragments logic now strictly checks base_id.is_none(). This guarantees that cleanup operations never attempt to delete files that belong to an external base, preventing catastrophic deletion of user-managed source data.

  • Strict Safety Constraint 2: Checkpoint Ownership
    For long-running UDFs, fragments are incrementally saved to a UDFCheckpointStore. Once a fragment has been successfully inserted into the checkpoint, it is removed (via pop()) from the local fragments_to_cleanup list. This ensures that if a subsequent step fails, we do not physically delete data files that the checkpoint relies on for resumption.
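
The cleanup rules above can be sketched in one small simulation. Everything here is hypothetical scaffolding (`DataFileSketch`, `Failure`, `add_columns_sketch`, and the `.partial` naming are assumptions, not code from the PR): deletions are only recorded as paths, the checkpoint is a plain `Vec`, and checkpoint insert failures are not modeled.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct DataFileSketch {
    pub path: String,
    /// `Some(_)` marks a file that lives under an external base.
    pub base_id: Option<u32>,
}

#[derive(Debug)]
pub struct Failure {
    pub message: String,
    /// Paths the cleanup would delete (simulated, nothing touches disk).
    pub deleted: Vec<String>,
}

/// Safety constraint 1: only files in the dataset's own storage are deletable.
pub fn deletable_paths(files: &[DataFileSketch]) -> Vec<String> {
    files
        .iter()
        .filter(|f| f.base_id.is_none())
        .map(|f| f.path.clone())
        .collect()
}

/// Sequentially "write" one file per input, cleaning up on the first failure.
pub fn add_columns_sketch<W>(
    inputs: &[&str],
    mut checkpoint: Option<&mut Vec<DataFileSketch>>,
    mut write: W,
) -> Result<Vec<DataFileSketch>, Failure>
where
    W: FnMut(&str) -> Result<DataFileSketch, String>,
{
    let mut done = Vec::new();
    let mut to_cleanup: Vec<DataFileSketch> = Vec::new();
    for &input in inputs {
        match write(input) {
            Ok(file) => {
                to_cleanup.push(file.clone());
                // Safety constraint 2: once the checkpoint owns the file,
                // drop it from the local cleanup list.
                if let Some(store) = checkpoint.as_deref_mut() {
                    store.push(file.clone());
                    to_cleanup.pop();
                }
                done.push(file);
            }
            Err(message) => {
                // Stage 1: the unfinished file of the active writer.
                let mut deleted = vec![format!("{input}.partial")];
                // Stage 2: finished but uncommitted files from this run.
                deleted.extend(deletable_paths(&to_cleanup));
                return Err(Failure { message, deleted });
            }
        }
    }
    Ok(done)
}
```

In this sketch, a failure on the third input deletes the partial file and any local files from earlier inputs, while files under an external base and files already handed to the checkpoint are left alone.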

Related Issues

Closes #7075

Scope & Follow-up

This PR tightly scopes the fixes to the add_columns pipeline.
Note on alter_columns: While alter_columns shares the newly improved add_columns_impl machinery (and thus benefits from the unfinished writer cleanup), its final apply_commit stage currently drops the fragments_to_cleanup vector. A commit failure there could still leave orphaned files. To keep this PR focused and reviewable, the commit-failure cleanup for alter_columns will be addressed in a separate, dedicated follow-up PR.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Testing Performed

  • Rust Integration Tests: Added parameterized tests covering inline, packed, dedicated, and external cleanup scenarios. Exercised stream errors, UDF panics, and checkpoint lookup/insert failures, asserting that no orphan files remain and that no external files are deleted.
  • Python Extension Tests: Verified add_columns with RecordBatchReader and BatchUDF across all Blob v2 kinds. Asserted bit-for-bit data integrity upon reading back the mutated dataset.
  • Passed all formatting (cargo fmt), linting (cargo clippy), and existing test suites.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the A-python (Python bindings) and bug (Something isn't working) labels Jun 8, 2026
@yyzhao2025 yyzhao2025 force-pushed the yyzhao2025/add_columns_blobv2 branch from df5072c to 89bd69a Compare June 8, 2026 13:31
@codecov
codecov Bot commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 88.30769% with 76 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/schema_evolution.rs | 88.10% | 39 Missing and 32 partials ⚠️ |
| rust/lance/src/dataset/write.rs | 86.66% | 3 Missing and 1 partial ⚠️ |
| rust/lance/src/dataset/fragment.rs | 0.00% | 1 Missing ⚠️ |

Labels

A-python (Python bindings), bug (Something isn't working)

Development

Successfully merging this pull request may close these issues.

add_columns should support Blob v2 external URIs and clean up failed writes

1 participant