fix: decline native V1 scans on object_store-unsupported filesystem schemes #4525

Open
schenksj wants to merge 7 commits into apache:main from schenksj:fix/scheme-gate-object-store
Conversation

@schenksj

Which issue does this PR close?

Closes #4520.

Rationale for this change

Comet's native readers go through object_store, which only understands a fixed set of URL schemes. When a scan's path uses a custom Hadoop FileSystem scheme (e.g. one registered via spark.hadoop.fs.<scheme>.impl), the native reader fails at execution with Generic URL error: Unable to recognise URL "...", and there is no graceful recovery once native execution has started. The problem surfaced with Delta tables opened with custom filesystem options (DeltaTable.forPath(spark, path, fsOptions)): Delta reads its internal _delta_log/*.checkpoint.parquet files via ordinary V1 parquet scans, which Comet claimed and then crashed on. It reproduces for any V1 parquet scan on such a scheme.

What changes are included in this PR?

CometScanRule declines a V1 native scan when its root-path scheme isn't natively readable, so Spark's Hadoop-FS-aware reader handles it. Rather than hardcode the object_store-supported scheme set in the planner (a mirror that drifts), the answer comes from the native layer itself: a new NativeBase.isObjectStoreSchemeSupported JNI method backed by object_store's own ObjectStoreScheme::parse — the same path prepare_object_store_with_configs dispatches through. The user's libhdfs scheme config (spark.hadoop.fs.comet.libhdfs.schemes) is unioned in on the JVM side; results are cached per scheme; and if native can't be consulted the scheme is assumed supported rather than over-restricting.
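
The gating logic described above can be sketched in isolation. This is a minimal illustration, not the PR's code: `SchemeGateSketch`, `shouldDecline`, and the `nativeProbe` function are hypothetical names, with `nativeProbe` standing in for the JNI call `NativeBase.isObjectStoreSchemeSupported` (whose real signature is not shown here). The exception types treated as "native can't be consulted" and the handling of scheme-less paths are likewise assumptions.

```scala
import java.lang.{Boolean => JBoolean}
import java.net.URI
import java.util.Locale
import java.util.concurrent.ConcurrentHashMap

// nativeProbe is a hypothetical stand-in for NativeBase.isObjectStoreSchemeSupported.
final class SchemeGateSketch(nativeProbe: String => Boolean, libhdfsSchemes: Set[String]) {

  // Per-scheme memo: the answer depends only on the scheme, so ask native once per scheme.
  private val cache = new ConcurrentHashMap[String, JBoolean]()

  private def probe(scheme: String): JBoolean =
    try JBoolean.valueOf(nativeProbe(scheme))
    catch {
      // Assumption: an unavailable native library surfaces as one of these.
      // In that case the scheme is assumed supported rather than over-restricting.
      case _: UnsatisfiedLinkError | _: RuntimeException => JBoolean.TRUE
    }

  /** True when the scan should be left to Spark's Hadoop-FS-aware reader. */
  def shouldDecline(uri: URI): Boolean =
    Option(uri.getScheme).map(_.toLowerCase(Locale.ROOT)) match {
      case None => false // assumption: scheme-less paths are not gated here
      case Some(s) if libhdfsSchemes.contains(s) => false // user-configured libhdfs scheme
      case Some(s) => !cache.computeIfAbsent(s, k => probe(k)).booleanValue()
    }
}
```

Taking the probe as a constructor argument keeps the sketch testable without a native library; the planner-side check then reduces to a single `shouldDecline(rootPathUri)` call.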

How are these changes tested?

CometScanSchemeFallbackSuite registers FakeHDFSFileSystem for a fake:// scheme (not routed through libhdfs) and applies CometScanRule to the scan's physical plan. It asserts the scan falls back to Spark (no CometScanExec). The test fails without the gate (Comet claims the fake:// scan) and passes with it. The libhdfs-scheme regression guard (ParquetReadFromFakeHadoopFsSuite) continues to engage Comet for configured libhdfs schemes.

schenksj and others added 3 commits May 29, 2026 20:44
…chemes

Comet's native readers go through object_store, which only understands a fixed set
of URL schemes. A custom Hadoop FileSystem (e.g. registered via
spark.hadoop.fs.<scheme>.impl) crashes the native reader at execution with
"Generic URL error: Unable to recognise URL", with no graceful recovery. Decline
such scans at planning time so Spark's Hadoop-FS-aware reader handles them.

Whether object_store recognizes a scheme is answered by the native layer itself
(NativeBase.isObjectStoreSchemeSupported, backed by object_store's
ObjectStoreScheme::parse -- the same path prepare_object_store_with_configs uses)
rather than a hardcoded list, so the planner can't drift from object_store's actual
support. The user's libhdfs scheme config (spark.hadoop.fs.comet.libhdfs.schemes) is
unioned in on the JVM side; results are cached per scheme; if native can't be
consulted the scheme is assumed supported rather than over-restricting.

Adds CometScanSchemeFallbackSuite, which asserts a `fake://` scan falls back to
Spark; it fails without the gate (Comet claims the scan) and passes with it.

Closes apache#4520

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

check-suites.py requires every *Suite.scala to appear in both
pr_build_linux.yml and pr_build_macos.yml. Add the new
CometScanSchemeFallbackSuite alongside its sibling
org.apache.comet.rules suites.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

// Per-scheme memo of `NativeBase.isObjectStoreSchemeSupported`. The answer depends only on the
// URL scheme, so we cache by scheme and never re-cross the JNI boundary for a repeated scheme.
private val schemeSupportCache =
  new java.util.concurrent.ConcurrentHashMap[String, java.lang.Boolean]()
Member

please add imports rather than use fully qualified class names

Author

Done. Added imports for java.net.URI, java.util.Locale, java.util.concurrent.ConcurrentHashMap, and java.lang.Boolean (aliased JBoolean) and dropped the fully-qualified references. Also fixed a leftover withInfo call in the same code (renamed to withFallbackReason in #4508) that was breaking compilation after merging main.

schenksj and others added 3 commits May 30, 2026 17:45
…la 2.12

SQLTestUtils.withSQLConf returns Unit on Spark 3.5 but a value on Spark 4.x, so
assigning its block result to `val sparkPlan: SparkPlan` failed to compile under
the spark-3.5 profile (type mismatch: found Unit, required SparkPlan). Capture
the plan via a var assigned inside the block, which is cross-version-safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
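
The cross-version capture described in the commit above can be illustrated with a small sketch. `withConfUnit` is a hypothetical stand-in for a `withSQLConf` variant that returns Unit, and a plain String stands in for the SparkPlan; neither is the real Spark API.

```scala
object PlanCaptureSketch {
  // Hypothetical stand-in for a helper that runs a block under temporary
  // settings and returns Unit, so the block's value is discarded.
  def withConfUnit(pairs: (String, String)*)(body: => Any): Unit = {
    body
    ()
  }

  def capture(): String = {
    // `val plan: String = withConfUnit(...) { ... }` would not compile here
    // (found Unit, required String), so assign to a var inside the block.
    var plan: String = null
    withConfUnit("some.key" -> "some.value") {
      plan = "scan"
    }
    plan
  }
}
```

Because the var is assigned as a side effect, the same source compiles whether the helper returns Unit or the block's value.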
Address review feedback: import java.lang.Boolean (as JBoolean),
java.net.URI, java.util.Locale and java.util.concurrent.ConcurrentHashMap
rather than referencing them with fully-qualified class names in the
newly-added scheme-gating code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rename)

The unsupported-scheme fallback still called withInfo, the old name of
withFallbackReason (renamed in apache#4508). It was the only remaining old-name
call in the file and broke compilation after merging main; rename it to
match the rest of CometScanRule.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@andygrove
Member

Nice approach overall. Sourcing "can the native reader handle this scheme?" from object_store's own ObjectStoreScheme::parse instead of a hardcoded list is the right call, and for every scheme that flows through parse_url natively the gate is exact by construction.

I think there is one gap worth a look before merge: plain hdfs://.

On the native side, is_hdfs_scheme in parquet_support.rs returns true for scheme == "hdfs" whenever fs.comet.libhdfs.schemes is unset, and create_hdfs_object_store is compiled into the default build (default = ["hdfs-opendal"] in native/core/Cargo.toml). So the native reader handles hdfs:// out of the box.

On the JVM side, COMET_LIBHDFS_SCHEMES has no default, so when it is unset libhdfsSchemes is empty. For hdfs, the decline condition then reduces to !isNativelyReadableScheme(uri), and object_store has no hdfs scheme, so that helper returns false. The net effect is that a plain hdfs:// V1 scan gets declined and falls back to Spark, even though native could read it.

I built the branch with a default native library and probed the gate's helper directly to confirm:

isNativelyReadableScheme(hdfs://namenode:8020/...) = false   // declined
isNativelyReadableScheme(s3a://bucket/key)         = true    // ok
isNativelyReadableScheme(file:/tmp/data)           = true    // ok

So s3a and file stay consistent (they bypass parse_url natively but object_store recognizes them anyway). Only hdfs diverges, and it diverges in the default HDFS configuration rather than an exotic one, so this looks like a silent fallback regression for HDFS users.

Would it make sense to mirror the native default on the JVM, so the two stay in lockstep?

val libhdfsSchemes: Set[String] = COMET_LIBHDFS_SCHEMES.get() match {
  case Some(s) => s.split(",").map(_.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).toSet
  case None    => Set("hdfs") // native is_hdfs_scheme defaults to `scheme == "hdfs"` when unset
}

A test asserting that an hdfs:// root path with the config unset is still claimed by Comet would lock this in, alongside the existing fake:// decline case.
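
The suggested default can be exercised in isolation with a small sketch. `LibhdfsSchemesSketch.resolve` is a hypothetical helper, and its `Option[String]` argument models the result of `COMET_LIBHDFS_SCHEMES.get()` as used in the snippet above.

```scala
import java.util.Locale

object LibhdfsSchemesSketch {
  // configValue is None when spark.hadoop.fs.comet.libhdfs.schemes is unset.
  def resolve(configValue: Option[String]): Set[String] = configValue match {
    // An explicit value is taken verbatim: split, trim, lower-case, drop empty entries.
    case Some(s) =>
      s.split(",").map(_.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).toSet
    // Unset: mirror the native default, which treats "hdfs" as readable.
    case None => Set("hdfs")
  }
}
```

Note that under this sketch an explicitly configured list that omits hdfs no longer implies it; only the unset case falls back to `Set("hdfs")`.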

One smaller thing: is the V2 BatchScanExec path susceptible to the same Unable to recognise URL failure on custom schemes, or is this intentionally V1-only? A note or follow-up issue would help.

Disclosure: I used Claude Code to help review this PR, including building the branch and running the scheme-gate probe above.

Address review feedback on apache#4525. When `spark.hadoop.fs.comet.libhdfs.schemes`
is unset, the scheme gate now defaults `libhdfsSchemes` to `Set("hdfs")` rather
than the empty set, mirroring the native default: `is_hdfs_scheme`
(parquet_support.rs) treats `hdfs` as natively readable when the config is unset,
and `create_hdfs_object_store` is in the default build (`default = ["hdfs-opendal"]`).

Previously a plain `hdfs://` V1 scan was declined and silently fell back to Spark
in the default HDFS configuration even though native could read it. `s3a`/`file`
are unaffected (object_store recognizes them via `parse_url`); an explicit config
value still takes over verbatim.

Test: add `native scan claims hdfs:// when libhdfs.schemes is unset` to
CometScanSchemeFallbackSuite, alongside the existing `fake://` decline case. It
backs the `hdfs` scheme with a local FS (FakeHdfsSchemeFileSystem) so an `hdfs://`
path is exercised without a live cluster, then asserts CometScanRule claims the
scan. Verified RED (fails with `Set.empty`: scan falls back to Spark) -> GREEN
(passes with `Set("hdfs")`) on Spark 3.5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@schenksj
Author

Thanks @andygrove — and for building the branch and probing the gate directly; that hdfs divergence is a real silent-fallback regression and worth closing before merge.

hdfs:// default. Adopted your suggestion: when spark.hadoop.fs.comet.libhdfs.schemes is unset, libhdfsSchemes now defaults to Set("hdfs"), mirroring the native is_hdfs_scheme default (scheme == "hdfs" when the config is unset) and the default ["hdfs-opendal"] build. So a plain hdfs:// V1 scan stays claimed by Comet instead of silently falling back. s3a/file are unaffected (object_store recognizes them via parse_url), and an explicit config still takes over verbatim.

Test. Added native scan claims hdfs:// when libhdfs.schemes is unset to CometScanSchemeFallbackSuite, alongside the existing fake:// decline case. It backs the hdfs scheme with a local FS (FakeHdfsSchemeFileSystem, RawLocalFileSystem reporting getScheme = "hdfs") so an hdfs:// path is exercised without a live cluster, then applies CometScanRule to the plan and asserts the scan is claimed (a CometScanExec, no leftover FileSourceScanExec). It's a real guard: it fails with the old case None => Set.empty (hdfs declined) and passes with the Set("hdfs") default.

V2 BatchScanExec. Intentionally V1-only here — the gate lives in the FileSourceScanExec path (nativeScan). The V2 native paths Comet currently claims are CSV-V2 and Iceberg; Iceberg resolves IO through its own FileIO rather than the V1 rootPaths → parse_url route, so it doesn't hit the same Unable to recognise URL. If/when a native Parquet-V2 scan lands it should get a parallel scheme gate — happy to file a follow-up issue to track that.

Also fixed a latent compile break the gate carried: the decline branch still called withInfo, which #4508 renamed to withFallbackReason — updated to match.

Development

Successfully merging this pull request may close these issues.

CometScanRule: decline native V1 scans on object_store-unsupported filesystem schemes (fall back to Spark)