fix: decline native V1 scans on object_store-unsupported filesystem schemes by schenksj · Pull Request #4525 · apache/datafusion-comet

schenksj · 2026-05-30T00:49:07Z

Which issue does this PR close?

Closes #4520.

Rationale for this change

Comet's native readers go through object_store, which only understands a fixed set of URL schemes. When a scan's path uses a custom Hadoop FileSystem scheme (e.g. registered via spark.hadoop.fs.<scheme>.impl), the native reader fails at execution with Generic URL error: Unable to recognise URL "..." — there is no graceful recovery once native execution has started. This was surfaced by Delta tables opened with custom filesystem options (DeltaTable.forPath(spark, path, fsOptions)), where Delta reads its internal _delta_log/*.checkpoint.parquet via ordinary V1 parquet scans that Comet then claimed and crashed on, but it reproduces for any V1 parquet scan on such a scheme.

What changes are included in this PR?

CometScanRule declines a V1 native scan when its root-path scheme isn't natively readable, so Spark's Hadoop-FS-aware reader handles it. Rather than hardcode the object_store-supported scheme set in the planner (a mirror that drifts), the answer comes from the native layer itself: a new NativeBase.isObjectStoreSchemeSupported JNI method backed by object_store's own ObjectStoreScheme::parse — the same path prepare_object_store_with_configs dispatches through. The user's libhdfs scheme config (spark.hadoop.fs.comet.libhdfs.schemes) is unioned in on the JVM side; results are cached per scheme; and if native can't be consulted the scheme is assumed supported rather than over-restricting.
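The decision flow described above can be sketched in a self-contained form. This is only an illustration, not the PR's code: `SchemeGate`, `nativeProbe`, and `shouldDecline` are hypothetical names, `nativeProbe` stands in for the JNI call `NativeBase.isObjectStoreSchemeSupported`, and treating a scheme-less path as readable is an assumption of the sketch.

```scala
import java.lang.{Boolean => JBoolean}
import java.net.URI
import java.util.Locale
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}
import scala.util.control.NonFatal

// Illustrative sketch only: SchemeGate and nativeProbe are hypothetical names.
// nativeProbe stands in for the JNI call NativeBase.isObjectStoreSchemeSupported.
final class SchemeGate(libhdfsSchemes: Set[String], nativeProbe: String => Boolean) {

  // Per-scheme memo so a repeated scheme does not trigger another probe.
  private val cache = new ConcurrentHashMap[String, JBoolean]()

  // Fail open: if the probe cannot be consulted, assume the scheme is supported.
  private def probeOrAssumeSupported(scheme: String): Boolean =
    try nativeProbe(scheme)
    catch {
      case _: LinkageError => true // e.g. the native library is not loaded
      case NonFatal(_) => true
    }

  def isNativelyReadable(uri: URI): Boolean =
    Option(uri.getScheme).map(_.toLowerCase(Locale.ROOT)) match {
      // Assumption of this sketch: a scheme-less path is not gated.
      case None => true
      // Schemes configured for libhdfs are unioned in on the JVM side.
      case Some(s) if libhdfsSchemes.contains(s) => true
      case Some(s) =>
        cache
          .computeIfAbsent(
            s,
            new JFunction[String, JBoolean] {
              override def apply(k: String): JBoolean =
                JBoolean.valueOf(probeOrAssumeSupported(k))
            })
          .booleanValue()
    }

  // A planner rule would decline the native scan when this returns true.
  def shouldDecline(uri: URI): Boolean = !isNativelyReadable(uri)
}
```

Note that this sketch also caches the fail-open answer; whether the real gate does so is not stated in the PR description.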

How are these changes tested?

CometScanSchemeFallbackSuite registers FakeHDFSFileSystem for a fake:// scheme (not routed through libhdfs) and applies CometScanRule to the scan's physical plan. It asserts the scan falls back to Spark (no CometScanExec). The test fails without the gate (Comet claims the fake:// scan) and passes with it. The libhdfs-scheme regression guard (ParquetReadFromFakeHadoopFsSuite) continues to engage Comet for configured libhdfs schemes.

…chemes

Comet's native readers go through object_store, which only understands a fixed set of URL schemes. A custom Hadoop FileSystem (e.g. registered via spark.hadoop.fs.<scheme>.impl) crashes the native reader at execution with "Generic URL error: Unable to recognise URL", with no graceful recovery. Decline such scans at planning time so Spark's Hadoop-FS-aware reader handles them.

Whether object_store recognizes a scheme is answered by the native layer itself (NativeBase.isObjectStoreSchemeSupported, backed by object_store's ObjectStoreScheme::parse -- the same path prepare_object_store_with_configs uses) rather than a hardcoded list, so the planner can't drift from object_store's actual support. The user's libhdfs scheme config (spark.hadoop.fs.comet.libhdfs.schemes) is unioned in on the JVM side; results are cached per scheme; if native can't be consulted the scheme is assumed supported rather than over-restricting.

Adds CometScanSchemeFallbackSuite, which asserts a `fake://` scan falls back to Spark; it fails without the gate (Comet claims the scan) and passes with it.

Closes apache#4520

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

check-suites.py requires every *Suite.scala to appear in both pr_build_linux.yml and pr_build_macos.yml. Add the new CometScanSchemeFallbackSuite alongside its sibling org.apache.comet.rules suites.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

andygrove · 2026-05-30T20:43:22Z

+  // Per-scheme memo of `NativeBase.isObjectStoreSchemeSupported`. The answer depends only on the
+  // URL scheme, so we cache by scheme and never re-cross the JNI boundary for a repeated scheme.
+  private val schemeSupportCache =
+    new java.util.concurrent.ConcurrentHashMap[String, java.lang.Boolean]()


Please add imports rather than using fully qualified class names.

Done. Added imports for java.net.URI, java.util.Locale, java.util.concurrent.ConcurrentHashMap, and java.lang.Boolean (aliased JBoolean) and dropped the fully-qualified references. Also fixed a leftover withInfo call in the same code (renamed to withFallbackReason in #4508) that was breaking compilation after merging main.
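As a small illustration of the import style described in this reply, the following sketch declares the cache from the diff hunk using imported names and the `JBoolean` alias. The enclosing object `ImportStyleExample` and the helper `normalizedScheme` are hypothetical and exist only to make the snippet self-contained.

```scala
import java.lang.{Boolean => JBoolean}
import java.net.URI
import java.util.Locale
import java.util.concurrent.ConcurrentHashMap

object ImportStyleExample {
  // Same shape as the declaration in the diff hunk, without fully qualified names.
  // The JBoolean alias avoids a clash with scala.Boolean.
  val schemeSupportCache = new ConcurrentHashMap[String, JBoolean]()

  // Hypothetical helper: lower-cased URI scheme, or None for a scheme-less path.
  def normalizedScheme(path: String): Option[String] =
    Option(new URI(path).getScheme).map(_.toLowerCase(Locale.ROOT))
}
```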

…la 2.12

SQLTestUtils.withSQLConf returns Unit on Spark 3.5 but a value on Spark 4.x, so assigning its block result to `val sparkPlan: SparkPlan` failed to compile under the spark-3.5 profile (type mismatch: found Unit, required SparkPlan). Capture the plan via a var assigned inside the block, which is cross-version-safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
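The capture pattern this commit describes can be shown without Spark. In the sketch below, `withConfUnit` and `withConfValue` are hypothetical stand-ins for the two helper shapes the commit message mentions (one returning Unit, one returning the block's value), and `buildPlan` is a placeholder. Assigning to a `var` inside the block compiles against either shape, whereas `val plan: String = withConfUnit(...) { ... }` would fail with a type mismatch.

```scala
object CrossVersionCapture {
  // Hypothetical stand-ins for the two helper shapes: neither is Spark's real API.
  def withConfUnit(pairs: (String, String)*)(f: => Unit): Unit = f
  def withConfValue[T](pairs: (String, String)*)(f: => T): T = f

  // Placeholder for whatever the test actually builds inside the block.
  def buildPlan(): String = "plan"

  // Works when the helper returns Unit.
  def captureViaUnitHelper(): String = {
    var plan: String = null
    withConfUnit("some.key" -> "some.value") { plan = buildPlan() }
    plan
  }

  // The identical call shape also works when the helper returns the block's value.
  def captureViaValueHelper(): String = {
    var plan: String = null
    withConfValue("some.key" -> "some.value") { plan = buildPlan() }
    plan
  }
}
```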

Address review feedback: import java.lang.Boolean (as JBoolean), java.net.URI, java.util.Locale and java.util.concurrent.ConcurrentHashMap rather than referencing them with fully-qualified class names in the newly-added scheme-gating code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rename)

The unsupported-scheme fallback still called withInfo, the old name of withFallbackReason (renamed in apache#4508). It was the only remaining old-name call in the file and broke compilation after merging main; rename it to match the rest of CometScanRule.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

andygrove · 2026-05-31T13:52:05Z

Nice approach overall. Sourcing "can the native reader handle this scheme?" from object_store's own ObjectStoreScheme::parse instead of a hardcoded list is the right call, and for every scheme that flows through parse_url natively the gate is exact by construction.

I think there is one gap worth a look before merge: plain hdfs://.

On the native side, is_hdfs_scheme in parquet_support.rs returns true for scheme == "hdfs" whenever fs.comet.libhdfs.schemes is unset, and create_hdfs_object_store is compiled into the default build (default = ["hdfs-opendal"] in native/core/Cargo.toml). So the native reader handles hdfs:// out of the box.

On the JVM side, COMET_LIBHDFS_SCHEMES has no default, so when it is unset libhdfsSchemes is empty. For hdfs, the decline condition then reduces to !isNativelyReadableScheme(uri), and object_store has no hdfs scheme, so that helper returns false. The net effect is that a plain hdfs:// V1 scan gets declined and falls back to Spark, even though native could read it.

I built the branch with a default native library and probed the gate's helper directly to confirm:

isNativelyReadableScheme(hdfs://namenode:8020/...) = false   // declined
isNativelyReadableScheme(s3a://bucket/key)         = true    // ok
isNativelyReadableScheme(file:/tmp/data)           = true    // ok

So s3a and file stay consistent (they bypass parse_url natively but object_store recognizes them anyway). Only hdfs diverges, and it diverges in the default HDFS configuration rather than an exotic one, so this looks like a silent fallback regression for HDFS users.

Would it make sense to mirror the native default on the JVM, so the two stay in lockstep?

val libhdfsSchemes: Set[String] = COMET_LIBHDFS_SCHEMES.get() match {
  case Some(s) => s.split(",").map(_.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).toSet
  case None    => Set("hdfs") // native is_hdfs_scheme defaults to `scheme == "hdfs"` when unset
}

A test asserting that an hdfs:// root path with the config unset is still claimed by Comet would lock this in, alongside the existing fake:// decline case.
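To make the effect of the suggested default concrete, here is a self-contained sketch of the decline decision. It is a model, not the PR's code: `LibhdfsDefault`, `declined`, and the `conf` parameter (modelling the optional `spark.hadoop.fs.comet.libhdfs.schemes` value) are hypothetical, and the small `objectStoreRecognized` set is hardcoded purely for illustration using the two schemes from the probe output above; the PR itself asks the native layer instead.

```scala
import java.util.Locale

object LibhdfsDefault {
  // conf models the optional libhdfs schemes config value (None when unset).
  def libhdfsSchemes(conf: Option[String]): Set[String] = conf match {
    case Some(s) =>
      s.split(",").map(_.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).toSet
    // Mirrors the native default described above: hdfs is readable when unset.
    case None => Set("hdfs")
  }

  // Illustration only: stands in for the answer object_store would give.
  private val objectStoreRecognized = Set("file", "s3a")

  // A scan is declined when neither libhdfs nor object_store covers the scheme.
  def declined(scheme: String, conf: Option[String]): Boolean = {
    val s = scheme.toLowerCase(Locale.ROOT)
    !libhdfsSchemes(conf).contains(s) && !objectStoreRecognized.contains(s)
  }
}
```

With the config unset, `declined("hdfs", None)` is false and `declined("fake", None)` is true; with the previous empty-set default, `hdfs` would have been declined as well.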

One smaller thing: is the V2 BatchScanExec path susceptible to the same Unable to recognise URL failure on custom schemes, or is this intentionally V1-only? A note or follow-up issue would help.

Disclosure: I used Claude Code to help review this PR, including building the branch and running the scheme-gate probe above.

Address review feedback on apache#4525.

When `spark.hadoop.fs.comet.libhdfs.schemes` is unset, the scheme gate now defaults `libhdfsSchemes` to `Set("hdfs")` rather than the empty set, mirroring the native default: `is_hdfs_scheme` (parquet_support.rs) treats `hdfs` as natively readable when the config is unset, and `create_hdfs_object_store` is in the default build (`default = ["hdfs-opendal"]`). Previously a plain `hdfs://` V1 scan was declined and silently fell back to Spark in the default HDFS configuration even though native could read it. `s3a`/`file` are unaffected (object_store recognizes them via `parse_url`); an explicit config value still takes over verbatim.

Test: add `native scan claims hdfs:// when libhdfs.schemes is unset` to CometScanSchemeFallbackSuite, alongside the existing `fake://` decline case. It backs the `hdfs` scheme with a local FS (FakeHdfsSchemeFileSystem) so an `hdfs://` path is exercised without a live cluster, then asserts CometScanRule claims the scan. Verified RED (fails with `Set.empty`: scan falls back to Spark) -> GREEN (passes with `Set("hdfs")`) on Spark 3.5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

schenksj · 2026-05-31T18:32:17Z

Thanks @andygrove — and for building the branch and probing the gate directly; that hdfs divergence is a real silent-fallback regression and worth closing before merge.

hdfs:// default. Adopted your suggestion: when spark.hadoop.fs.comet.libhdfs.schemes is unset, libhdfsSchemes now defaults to Set("hdfs"), mirroring the native is_hdfs_scheme default (scheme == "hdfs" when the config is unset) and the default build (default = ["hdfs-opendal"]). So a plain hdfs:// V1 scan stays claimed by Comet instead of silently falling back. s3a/file are unaffected (object_store recognizes them via parse_url), and an explicit config still takes over verbatim.

Test. Added the test "native scan claims hdfs:// when libhdfs.schemes is unset" to CometScanSchemeFallbackSuite, alongside the existing fake:// decline case. It backs the hdfs scheme with a local FS (FakeHdfsSchemeFileSystem, a RawLocalFileSystem reporting getScheme = "hdfs") so an hdfs:// path is exercised without a live cluster, then applies CometScanRule to the plan and asserts the scan is claimed (a CometScanExec, no leftover FileSourceScanExec). It's a real guard: it fails with the old case None => Set.empty (hdfs declined) and passes with the Set("hdfs") default.

V2 BatchScanExec. Intentionally V1-only here — the gate lives in the FileSourceScanExec path (nativeScan). The V2 native paths Comet currently claims are CSV-V2 and Iceberg; Iceberg resolves IO through its own FileIO rather than the V1 rootPaths → parse_url route, so it doesn't hit the same Unable to recognise URL. If/when a native Parquet-V2 scan lands it should get a parallel scheme gate — happy to file a follow-up issue to track that.

Also fixed a latent compile break the gate carried: the decline branch still called withInfo, which #4508 renamed to withFallbackReason — updated to match.

schenksj and others added 3 commits May 29, 2026 20:44

Merge branch 'main' into fix/scheme-gate-object-store

8c1c260


schenksj and others added 3 commits May 30, 2026 17:45

schenksj wants to merge 7 commits into apache:main from schenksj:fix/scheme-gate-object-store
