Cloud storage support (Azure, S3, GCS) #1087
SamirMoustafa wants to merge 33 commits into scverse:main from …
Conversation
Patch `da.to_zarr` so that `ome_zarr`'s `**kwargs` are forwarded as `zarr_array_kwargs`, avoiding a `FutureWarning` while keeping behavior correct.
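A minimal sketch of that patching pattern; the `zarr_array_kwargs` parameter name comes from the commit message, while the set of passthrough keys is an assumption for illustration, not dask's documented API:

```python
import functools

import dask.array as da

_original_to_zarr = da.to_zarr

# Kwargs that da.to_zarr handles itself (assumed set, for illustration);
# everything else is treated as a zarr array-creation kwarg.
_TO_ZARR_KEYS = {"component", "storage_options", "overwrite", "region", "compute", "return_stored"}


@functools.wraps(_original_to_zarr)
def _patched_to_zarr(arr, url, *args, **kwargs):
    passthrough = {k: v for k, v in kwargs.items() if k in _TO_ZARR_KEYS}
    array_kwargs = {k: v for k, v in kwargs.items() if k not in _TO_ZARR_KEYS}
    if array_kwargs:
        # forward ome_zarr's extra kwargs under zarr_array_kwargs instead of
        # passing them bare, which triggers the FutureWarning
        passthrough["zarr_array_kwargs"] = array_kwargs
    return _original_to_zarr(arr, url, *args, **passthrough)


da.to_zarr = _patched_to_zarr
```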
- `_FsspecStoreRoot`, `_get_store_root` for path-like store roots (local + fsspec)
- `_storage_options_from_fs` for parquet writes to Azure/S3/GCS
- `_remote_zarr_store_exists`, `_ensure_async_fs` for `UPath`/`FsspecStore`
- Extend `_resolve_zarr_store` for `UPath` and `_FsspecStoreRoot` with async fs
- `_backed_elements_contained_in_path`, `_is_element_self_contained` accept `UPath`
- `path` and `_path` accept `Path | UPath`; setter allows `UPath` (see the sketch after this list)
- `write()` accepts `file_path: str | Path | UPath | None` (`None` uses `path`)
- `_validate_can_safely_write_to_path` handles `UPath` and remote store existence
- `_write_element` accepts `Path | UPath`; skip local subfolder checks for `UPath`
- `__repr__` and `_get_groups_for_element` use `path` without forcing `Path()`
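A minimal sketch of the setter behavior these bullets describe, assuming the coercion rule shown in the diff further down (URL scheme means remote); the class here is a simplified stand-in, not the PR's exact implementation:

```python
from pathlib import Path

from upath import UPath


class _PathMixin:
    """Simplified sketch of SpatialData's path handling with UPath support."""

    _path: Path | UPath | None = None

    @property
    def path(self) -> Path | UPath | None:
        return self._path

    @path.setter
    def path(self, value: str | Path | UPath | None) -> None:
        if value is None or isinstance(value, (UPath, Path)):
            self._path = value
        elif isinstance(value, str):
            # strings with a URL scheme become remote UPaths, others local Paths
            self._path = UPath(value) if "://" in value else Path(value)
        else:
            raise TypeError(f"unsupported path type: {type(value)}")
```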
…table, zarr
- Resolve store via `_resolve_zarr_store` in read paths (points, shapes, raster, table)
- Use `_get_store_root` for parquet paths; read/write parquet with `storage_options` for fsspec
- `io_shapes`: upload parquet to Azure/S3/GCS via temp file when path is `_FsspecStoreRoot`
- `io_zarr`: `_get_store_root`, `UPath` in `_get_groups_for_element` and `_write_consolidated_metadata`; set `sdata.path` to `UPath` when store is remote
- `pyproject.toml`: `adlfs`, `gcsfs`, `moto[server]`, `pytest-timeout` in test extras
- `Dockerfile.emulators`: moto, Azurite, fake-gcs-server for `tests/io/remote_storage/`
… emulator config
- `full_sdata` fixture: two regions for table categorical (avoids 404 on remote read)
- `tests/io/remote_storage/conftest.py`: bucket/container creation, resilient async shutdown
- `tests/io/remote_storage/test_remote_storage.py`: parametrized Azure/S3/GCS roundtrip and write tests
- Added "dimension_separator" to the frozenset of internal keys that should not be passed to zarr.Group.create_array(), ensuring compatibility with various zarr versions. - Updated test to set region labels for full_sdata table, allowing the test_set_table_annotates_spatialelement to succeed without errors.
- Updated the `test_subset` function to exclude labels and poly from the default table, ensuring accurate subset validation.
- Enhanced `test_validate_table_in_spatialdata` to assert that both regions (labels2d and poly) are correctly annotated in the table.
- Adjusted `test_labels_table_joins` to restrict the table to labels2d, ensuring the join returns the expected results.
…Linux
- Added steps to build and run storage emulators (S3, Azure, GCS) using Docker, specifically for the Ubuntu environment.
- Implemented a wait mechanism to ensure emulators are ready before running tests.
- Adjusted test execution to skip remote storage tests on non-Linux platforms.
- Wrapped the fsspec async sync function to prevent the RuntimeError "Loop is not running" during process exit when using remote storage (Azure, S3, GCS); see the sketch below.
- Ensured compatibility with async session management in the `_utils` module.
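A hedged sketch of such a wrapper; the matched error message follows the commit above, and whether swallowing it is safe in all cases is the PR's call, not established here:

```python
import fsspec.asyn

_original_sync = fsspec.asyn.sync


def _safe_sync(loop, func, *args, **kwargs):
    """Wrap fsspec.asyn.sync to tolerate a closed IO loop at interpreter exit."""
    try:
        return _original_sync(loop, func, *args, **kwargs)
    except RuntimeError as e:
        # At process exit the fsspec event loop may already have stopped;
        # ignore only that specific failure so real errors still propagate.
        if "loop is not running" in str(e).lower():
            return None
        raise


fsspec.asyn.sync = _safe_sync
```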
Codecov Report
❌ Patch coverage is …

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1087      +/-   ##
==========================================
- Coverage   91.93%   91.74%   -0.19%
==========================================
  Files          51       51
  Lines        7772     8008     +236
==========================================
+ Hits         7145     7347     +202
- Misses        627      661      +34
```
…cols and improving storage options handling for parquet files.
… or UPath, and add tests to verify correct coercion of string paths to appropriate types.
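A hypothetical sketch of such a coercion test; the rule mirrors the `"://"` check in the diff below, and the test name and parameters are illustrative:

```python
from pathlib import Path

import pytest
from upath import UPath


@pytest.mark.parametrize(
    ("raw", "expected_type"),
    [
        ("data.zarr", Path),  # plain relative path stays local
        ("/tmp/data.zarr", Path),
        ("s3://bucket/data.zarr", UPath),  # URL scheme -> remote UPath
        ("gs://bucket/data.zarr", UPath),
    ],
)
def test_string_path_coercion(raw: str, expected_type: type) -> None:
    coerced = UPath(raw) if "://" in raw else Path(raw)
    assert isinstance(coerced, expected_type)
```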
…g for unsupported protocols in storage options, and add test cases to validate new functionality and ensure compatibility with cloud object store protocols.
```python
if isinstance(store, UPath):
    sdata.path = store
elif isinstance(store, str):
    sdata.path = UPath(store) if "://" in store else Path(store)
elif isinstance(store, Path):
    sdata.path = store
elif isinstance(store, zarr.Group):
    if isinstance(resolved_store, LocalStore):
        sdata.path = Path(resolved_store.root)
    elif isinstance(resolved_store, FsspecStore):
        sdata.path = UPath(str(_FsspecStoreRoot(resolved_store)))
    else:
        ...
```
Reading from a remote `zarr.Group` still loses the original storage options when `sdata.path` is reconstructed. In the `zarr.Group` branch you rebuild the path from `str(_FsspecStoreRoot(...))`, which throws away the original fs config. That means a later `write()` / `write_element()` on the returned SpatialData can no longer round-trip to Azure/GCS unless the credentials happen to come from the global environment.
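One hypothetical way to address this, assuming the fsspec filesystem still carries its creation kwargs in `fs.storage_options`; the helper name is invented here for illustration, and `_FsspecStoreRoot` is the PR's own class:

```python
from upath import UPath


def _upath_from_fsspec_store(resolved_store) -> UPath:
    """Rebuild sdata.path without dropping the original fs credentials.

    fsspec filesystems keep the kwargs they were constructed with in
    ``fs.storage_options``; forwarding them to UPath lets a later
    write() reconnect to Azure/GCS with the same configuration.
    """
    fs = resolved_store.fs
    opts = dict(getattr(fs, "storage_options", {}) or {})
    return UPath(str(_FsspecStoreRoot(resolved_store)), **opts)
```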
Here I also added suggestions for SamirMoustafa#1 so that we can explicitly use a zarr store. Gets rid of the …
…load configurations, and refining test execution conditions for different operating systems.
```python
import json
import os
from typing import Any


def _parse_fsspec_remote_path(path: _FsspecStoreRoot) -> tuple[str, str]:
    """Return (bucket_or_container, blob_key) from an fsspec store path."""
    remote = str(path)
    if "://" in remote:
        remote = remote.split("://", 1)[1]
    parts = remote.split("/", 1)
    bucket_or_container = parts[0]
    blob_key = parts[1] if len(parts) > 1 else ""
    return bucket_or_container, blob_key


def _upload_parquet_to_azure(tmp_path: str, bucket: str, key: str, fs: Any) -> None:
    from azure.storage.blob import BlobServiceClient

    client = BlobServiceClient.from_connection_string(fs.connection_string)
    blob_client = client.get_blob_client(container=bucket, blob=key)
    with open(tmp_path, "rb") as f:
        blob_client.upload_blob(f, overwrite=True)


def _upload_parquet_to_s3(tmp_path: str, bucket: str, key: str, fs: Any) -> None:
    import boto3

    endpoint = getattr(fs, "endpoint_url", None) or os.environ.get("AWS_ENDPOINT_URL")
    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=getattr(fs, "key", None) or os.environ.get("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=getattr(fs, "secret", None) or os.environ.get("AWS_SECRET_ACCESS_KEY"),
        region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-1"),
    )
    s3.upload_file(tmp_path, bucket, key)


def _upload_parquet_to_fsspec(path: _FsspecStoreRoot, tmp_path: str) -> None:
    """Upload local parquet file to remote fsspec store using sync APIs to avoid event-loop issues."""
    fs = path._store.fs
    bucket, key = _parse_fsspec_remote_path(path)
    fs_name = type(fs).__name__
    if fs_name == "AzureBlobFileSystem" and getattr(fs, "connection_string", None):
        _upload_parquet_to_azure(tmp_path, bucket, key, fs)
    elif fs_name in ("S3FileSystem", "MotoS3FS"):
        _upload_parquet_to_s3(tmp_path, bucket, key, fs)
    elif fs_name == "GCSFileSystem":
        import fsspec

        # rebuild the GCS filesystem in synchronous mode so the upload does not
        # reuse the async event loop held by the zarr store
        fs_dict = json.loads(fs.to_json())
        fs_dict["asynchronous"] = False
        sync_fs = fsspec.AbstractFileSystem.from_json(json.dumps(fs_dict))
        sync_fs.put_file(tmp_path, path._path)
    else:
        fs.put(tmp_path, str(path))
```
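For reference, a hypothetical call sequence for these helpers; the store root and local path are illustrative:

```python
# store_root points at e.g. "s3://my-bucket/data.zarr/points/points.parquet"
bucket, key = _parse_fsspec_remote_path(store_root)
# -> ("my-bucket", "data.zarr/points/points.parquet")
_upload_parquet_to_fsspec(store_root, "/tmp/points.parquet")
```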
We shouldn't need these parquet sidecars with per-cloud authentication. If we can have a fully zarr-native SpatialData, there is no need to re-extract Azure/S3/GCS credentials for the parquet APIs, and there is less of a distinction between "Zarr backing" and a "non-Zarr sidecar next to the Zarr group".
Cloud storage support (Azure, S3, GCS)
Summary
Add read/write support for SpatialData on remote object storage via `UPath`, fixing the issue reported in #999 where `sd.read_zarr(UPath("s3://..."))` failed because `SpatialData.path` did not accept `UPath`. PR #971 ("add remote support") pursued the same goal but remains a draft, blocked on zarr v3/ome-zarr and async fsspec after dask unpinning. This PR delivers working remote support by fixing the path setter, wrapping fsspec in an async filesystem where required for current zarr, and testing Azure, S3, and GCS via Docker emulators. It also addresses #441 (private remote object storage): credentials go via `UPath` kwargs or via a pre-opened `zarr.Group` (e.g. `read_zarr(zarr.open("s3://...", storage_options={...}))`). Fixes #441.

Supported features
- `SpatialData.path` accepts `None`, `str`, `Path`, or `UPath` (enables remote-backed objects).
- `SpatialData.read(upath)` and `read_zarr(upath)` for Azure Blob (`az://`), S3 (`s3://`), and GCS (`gs://`) using universal-pathlib (`UPath`). For private stores, `read_zarr(zarr_group)` is also supported when the store is opened with `zarr.open(..., storage_options=...)`.
- `sdata.write(upath)` and element-level writes to the same backends; parquet (points/shapes) and zarr (raster/tables) written via fsspec with async filesystem support where required.
- Consolidated metadata (`zmetadata`) supported.

Testing
Remote storage is tested with Docker-based emulators (Azurite for Azure, moto for S3, fake-gcs-server for GCS). In CI we build `tests/io/remote_storage/Dockerfile.emulators`, start the emulators on Ubuntu, then run the full test suite including `tests/io/remote_storage/`. These remote-storage tests run only on Ubuntu (Linux), because they depend on Docker; on Windows and macOS we skip `tests/io/remote_storage/` and run the rest of the suite. To run the remote tests locally you need Docker and can start the emulators with the same image and ports (5000, 10000, 4443) as in the workflow.

Example (three providers)
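An illustrative round-trip for the three providers, given an in-memory `sdata`; bucket/container names, endpoints, and credential values are placeholders, not the PR's exact example:

```python
import spatialdata as sd
from upath import UPath

# S3 (endpoint_url shown for an emulator; omit it for real AWS)
s3_path = UPath("s3://my-bucket/data.zarr", endpoint_url="http://localhost:5000", anon=False)
sdata.write(s3_path)
sdata_s3 = sd.read_zarr(s3_path)

# Azure Blob Storage via a connection string
az_path = UPath("az://my-container/data.zarr", connection_string="<connection-string>")
sdata.write(az_path)
sdata_az = sd.read_zarr(az_path)

# GCS with a token and project
gs_path = UPath("gs://my-bucket/data.zarr", token="anon", project="my-project")
sdata.write(gs_path)
sdata_gs = sd.read_zarr(gs_path)
```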
Credentials and options are passed through `UPath` (e.g. `connection_string`, `endpoint_url`, `anon`, `token`, `project`) as supported by the underlying fsspec backend.

Release notes
- Read/write support for SpatialData on remote object storage via `UPath`.
- `SpatialData.path` now accepts `UPath` in addition to `str` and `Path`. Fixes initialization from remote stores (e.g. S3) as in #999. Fixes #441 (private remote object storage).